Add vLLM + LiteLLM support; rename script; add README

- Replace Ollama (disabled by default) with vLLM Docker container + LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434, polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to override config at runtime
- New `test` command: end-to-end inference test over WireGuard against vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM) from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
author: Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
committer: Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
commit: d8575832ae0022f94cd786b15f8b88de0bf18672 (patch)
tree: 75872514846cfddb1434281a59b6673344023ff7 /snippets/hyperstack
parent: 8dca92ea40b191b9de367197aac7e1f882ed3d43 (diff)
4 files changed, 1059 insertions, 17 deletions
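
The commit message says the vLLM setup polls until the model is loaded, with a 10-minute ceiling. The patch does this in a remote shell loop (120 attempts, 5 s apart). The same bounded-polling idea can be sketched in Ruby as below. `wait_until_ready` and its `sleeper` parameter are hypothetical names used only for illustration and are not part of hyperstack.rb; the readiness probe is passed in as a block so the sketch does not assume any particular HTTP client.

```ruby
# Hypothetical helper (not in hyperstack.rb): calls the given block up to
# `attempts` times, sleeping `interval` seconds after each failed probe.
# Returns the 1-based attempt number that succeeded, or nil on timeout.
def wait_until_ready(attempts: 120, interval: 5, sleeper: ->(s) { sleep(s) })
  attempts.times do |i|
    return i + 1 if yield

    sleeper.call(interval)
  end
  nil
end
```

With the defaults, the worst case is 120 × 5 s = 10 minutes, matching the timeout stated in the commit message. A caller would treat a `nil` return as fatal, as the shell script does with its final `curl` check.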
diff --git a/snippets/hyperstack/README.md b/snippets/hyperstack/README.md
new file mode 100644
index 0000000..e5cc7ea
--- /dev/null
+++ b/snippets/hyperstack/README.md
@@ -0,0 +1,157 @@
+# hyperstack
+
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+
+## Architecture
+
+```
+Claude Code (local)                    Hyperstack VM (A100 80GB)
+┌─────────────────┐                   ┌──────────────────────────────────┐
+│ claude CLI       │── Anthropic API ─▶│ LiteLLM proxy (:4000)           │
+│                  │   /v1/messages    │   Anthropic → OpenAI translation │
+│                  │   via WireGuard   │             │                    │
+└─────────────────┘                   │             ▼                    │
+                                      │ vLLM engine (:11434)            │
+OpenCode (local)                      │   bullpoint/Qwen3-Coder-Next-   │
+┌─────────────────┐                   │   AWQ-4bit (45 GB, MoE 80B)     │
+│ opencode         │── OpenAI API ────▶│   FlashAttention v2             │
+│                  │   /v1/chat/...    │   prefix caching                │
+└─────────────────┘                   └──────────────────────────────────┘
+```
+
+Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
+The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+
+## Prerequisites
+
+- Hyperstack account with API key in `~/.hyperstack`
+- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
+- WireGuard setup script: `wg1-setup.sh` (present in this directory)
+- Ruby with `toml-rb` gem: `bundle install`
+
+## Quickstart
+
+```bash
+# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
+ruby hyperstack.rb create
+
+# Verify everything is working
+ruby hyperstack.rb test
+
+# Use Claude Code against the self-hosted vLLM (through the LiteLLM proxy)
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+
+# Tear down
+ruby hyperstack.rb delete
+```
+
+## Using Claude Code with vLLM
+
+WireGuard (`wg1`) must be active before connecting.
+
+```bash
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+```
+
+If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
+
+```bash
+claude /logout
+```
+
+**Fish shell alias** (add to `~/.config/fish/config.fish`):
+
+```fish
+alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+  ANTHROPIC_API_KEY=sk-litellm-master \
+  claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+```
+
+**Available model aliases** — all map to the same vLLM model:
+
+| Alias | Use case |
+|-------|----------|
+| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
+| `claude-opus-4-20250514` | |
+| `claude-sonnet-4-20250514` | |
+| `claude-haiku-3-5-20241022` | |
+
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
+
+## Using OpenCode with vLLM
+
+OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed:
+
+```bash
+OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+OPENAI_API_KEY=EMPTY \
+opencode
+```
+
+Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+
+## CLI reference
+
+```
+ruby hyperstack.rb [--config path] <command> [options]
+
+Commands:
+  create   Deploy a new VM and run full provisioning
+  delete   Destroy the tracked VM
+  status   Show VM and WireGuard status
+  test     Run end-to-end inference tests (vLLM + LiteLLM)
+
+create options:
+  --replace          Delete existing tracked VM before creating
+  --dry-run          Print the plan without making changes
+  --vllm / --no-vllm    Override config: enable/disable vLLM+LiteLLM setup
+  --ollama / --no-ollama Override config: enable/disable Ollama setup
+```
+
+## Configuration
+
+Edit `hyperstack-vm.toml` to change defaults. Key sections:
+
+| Section | Purpose |
+|---------|---------|
+| `[vm]` | Flavor, image, environment name |
+| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
+| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
+| `[wireguard]` | Auto-setup script path |
+
+## Monitoring vLLM
+
+```bash
+# Live engine stats (throughput, KV cache, prefix cache hit rate)
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# Last 1 minute of stats
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# GPU stats (every 5 s)
+ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+
+# LiteLLM proxy log
+ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
+```
+
+Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+
+| Metric | Expected |
+|--------|----------|
+| Prefill throughput | 5,000–11,000 tok/s |
+| Decode throughput | 40–99 tok/s |
+| KV cache usage | 2–5% for typical sessions |
+| Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) |
+| Prefix cache hit (OpenCode) | >50% after warm-up |
+
+## Switching models
+
+Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config.
+
+See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
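
The README states that every Claude alias maps to the same vLLM model. The sketch below shows one way that mapping could be expressed as a LiteLLM-style config, built as a Ruby hash and serialised with the standard `yaml` library. The key names (`model_list`, `litellm_params`, `drop_params`, `master_key`) and the `hosted_vllm/` prefix are taken from the config the patch writes via heredoc. The helper name `litellm_config` and its keyword arguments are assumptions made for this example; the patch itself assembles the YAML as strings.

```ruby
require 'yaml'

# Hypothetical helper (not in hyperstack.rb): returns a config hash in which
# every alias is routed to the same backend model behind one api_base.
def litellm_config(aliases:, backend_model:, api_base:, master_key:)
  model_list = aliases.map do |name|
    {
      'model_name' => name,
      'litellm_params' => {
        'model' => "hosted_vllm/#{backend_model}",
        'api_base' => api_base,
        'api_key' => 'EMPTY'
      }
    }
  end

  {
    'model_list' => model_list,
    'litellm_settings' => { 'drop_params' => true },
    'general_settings' => { 'master_key' => master_key }
  }
end

config = litellm_config(
  aliases: %w[claude-opus-4-6-20260604 claude-sonnet-4-20250514],
  backend_model: 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
  api_base: 'http://localhost:11434/v1',
  master_key: 'sk-litellm-master'
)
puts config.to_yaml
```

Building the structure as data and dumping it leaves quoting and escaping to the YAML library, which may be preferable if alias names or keys ever contain characters that need escaping.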
diff --git a/snippets/hyperstack/hyperstack-vm.toml b/snippets/hyperstack/hyperstack-vm.toml
index 2d83b0f..0ea3cfc 100644
--- a/snippets/hyperstack/hyperstack-vm.toml
+++ b/snippets/hyperstack/hyperstack-vm.toml
@@ -31,7 +31,10 @@ connect_timeout_sec = 10
 [network]
 wireguard_udp_port = 56710
 wireguard_subnet = "192.168.3.0/24"
+# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
+# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
+litellm_port = 4000
 allowed_ssh_cidrs = ["0.0.0.0/0"]
 allowed_wireguard_cidrs = ["0.0.0.0/0"]
 
@@ -42,13 +45,36 @@ configure_ufw = true
 configure_ollama_host = false
 
 [ollama]
-install = true
+# Disabled in favour of vLLM; set install = true to switch back to Ollama.
+install = false
 models_dir = "/ephemeral/ollama/models"
 listen_host = "0.0.0.0:11434"
 gpu_overhead_mb = 2000
-num_parallel = 4
+num_parallel = 1
+context_length = 32768
 pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
 
+# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# Use --vllm / --no-vllm CLI flags to override install at runtime.
+[vllm]
+install = true
+model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+# HuggingFace model cache on ephemeral NVMe (fast; survives reboots on most providers).
+hug_cache_dir = "/ephemeral/hug"
+container_name = "vllm_qwen3"
+max_model_len = 262144
+gpu_memory_utilization = 0.92
+tensor_parallel_size = 1
+tool_call_parser = "qwen3_coder"
+# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
+litellm_master_key = "sk-litellm-master"
+litellm_claude_model_names = [
+  "claude-sonnet-4-20250514",
+  "claude-opus-4-20250514",
+  "claude-opus-4-6-20260604",
+  "claude-haiku-3-5-20241022"
+]
+
 [wireguard]
 auto_setup = true
 setup_script = "./wg1-setup.sh"
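
The TOML comments note that `--vllm` / `--no-vllm` override `install` at runtime. The precedence rule (an explicit CLI value wins, an absent flag falls back to the config) can be sketched as follows. `parse_install_flags` and `effective_flag` are hypothetical names for this example. The sketch also uses OptionParser's `--[no-]name` switch form, whereas the patch registers `--vllm` and `--no-vllm` as separate options; either approach yields the same three states (true, false, nil).

```ruby
require 'optparse'

# Hypothetical helper: parses the install overrides without mutating argv.
# A flag that is absent stays nil, meaning "use the config default".
def parse_install_flags(argv)
  flags = { vllm: nil, ollama: nil }
  parser = OptionParser.new do |opts|
    opts.on('--[no-]vllm', 'Enable/disable vLLM+LiteLLM setup') { |v| flags[:vllm] = v }
    opts.on('--[no-]ollama', 'Enable/disable Ollama setup') { |v| flags[:ollama] = v }
  end
  parser.parse(argv)
  flags
end

# Hypothetical helper: the CLI value wins unless it is nil.
def effective_flag(cli_value, config_value)
  cli_value.nil? ? config_value : cli_value
end
```

Keeping `nil` distinct from `false` is what lets `--no-vllm` switch off a backend that the config enables, while an invocation without the flag leaves the config untouched.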
diff --git a/snippets/hyperstack/hyperstack_vm.rb b/snippets/hyperstack/hyperstack.rb
index ac60da9..c84d013 100644
--- a/snippets/hyperstack/hyperstack_vm.rb
+++ b/snippets/hyperstack/hyperstack.rb
@@ -62,7 +62,8 @@ module HyperstackVM
       'network' => {
         'wireguard_udp_port' => 56_710,
         'wireguard_subnet' => '192.168.3.0/24',
-        'ollama_port' => 11_434,
+        'ollama_port' => 11_434,  # reused by vLLM for firewall compatibility
+        'litellm_port' => 4_000,
         'allowed_ssh_cidrs' => ['0.0.0.0/0'],
         'allowed_wireguard_cidrs' => ['0.0.0.0/0']
       },
@@ -73,13 +74,34 @@ module HyperstackVM
         'configure_ollama_host' => false
       },
       'ollama' => {
-        'install' => true,
+        # Disabled in favour of vLLM; set install: true to use Ollama instead.
+        'install' => false,
         'models_dir' => '/ephemeral/ollama/models',
         'listen_host' => '0.0.0.0:11434',
         'gpu_overhead_mb' => 2000,
-        'num_parallel' => 4,
+        'num_parallel' => 1,
+        'context_length' => 32_768,
         'pull_models' => ['qwen3-coder:30b', 'gpt-oss:20b', 'gpt-oss:120b', 'nemotron-3-super']
       },
+      'vllm' => {
+        # vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI chat completions.
+        'install' => true,
+        'model' => 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
+        'hug_cache_dir' => '/ephemeral/hug',
+        'container_name' => 'vllm_qwen3',
+        'max_model_len' => 262_144,
+        'gpu_memory_utilization' => 0.92,
+        'tensor_parallel_size' => 1,
+        'tool_call_parser' => 'qwen3_coder',
+        # LiteLLM maps each Claude model alias to the vLLM model; add new Anthropic IDs here.
+        'litellm_claude_model_names' => %w[
+          claude-sonnet-4-20250514
+          claude-opus-4-20250514
+          claude-opus-4-6-20260604
+          claude-haiku-3-5-20241022
+        ],
+        'litellm_master_key' => 'sk-litellm-master'
+      },
       'wireguard' => {
         'auto_setup' => true,
         'setup_script' => './wg1-setup.sh'
@@ -216,6 +238,19 @@ module HyperstackVM
       Integer(fetch('network', 'ollama_port'))
     end
 
+    def litellm_port
+      Integer(fetch('network', 'litellm_port'))
+    end
+
+    # Derives the VM's WireGuard IP as the first host in the subnet (network + 1).
+    # E.g. 192.168.3.0/24 → 192.168.3.1
+    def wireguard_gateway_ip
+      base = IPAddr.new(wireguard_subnet).to_s
+      parts = base.split('.').map(&:to_i)
+      parts[-1] += 1
+      parts.join('.')
+    end
+
     def allowed_ssh_cidrs
       Array(fetch('network', 'allowed_ssh_cidrs')).map(&:to_s)
     end
@@ -260,10 +295,58 @@ module HyperstackVM
       Integer(fetch('ollama', 'num_parallel'))
     end
 
+    # Maximum context length for Ollama inference; keeps KV cache bounded
+    # on single-GPU setups to avoid slow prefill at large context sizes.
+    def ollama_context_length
+      Integer(fetch('ollama', 'context_length'))
+    end
+
     def ollama_pull_models
       Array(fetch('ollama', 'pull_models')).map(&:to_s)
     end
 
+    def vllm_install_enabled?
+      truthy?(fetch('vllm', 'install'))
+    end
+
+    def vllm_model
+      fetch('vllm', 'model')
+    end
+
+    def vllm_hug_cache_dir
+      fetch('vllm', 'hug_cache_dir')
+    end
+
+    def vllm_container_name
+      fetch('vllm', 'container_name')
+    end
+
+    def vllm_max_model_len
+      Integer(fetch('vllm', 'max_model_len'))
+    end
+
+    def vllm_gpu_memory_utilization
+      Float(fetch('vllm', 'gpu_memory_utilization'))
+    end
+
+    def vllm_tensor_parallel_size
+      Integer(fetch('vllm', 'tensor_parallel_size'))
+    end
+
+    def vllm_tool_call_parser
+      fetch('vllm', 'tool_call_parser')
+    end
+
+    # Claude model aliases that LiteLLM maps to the vLLM model.
+    # Must match what Claude Code sends in the model field.
+    def litellm_claude_model_names
+      Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
+    end
+
+    def litellm_master_key
+      fetch('vllm', 'litellm_master_key')
+    end
+
     def local_client_checks_enabled?
       truthy?(fetch('local_client', 'check_wg1_service'))
     end
@@ -295,14 +378,17 @@ module HyperstackVM
         rules << firewall_rule('udp', wireguard_udp_port, cidr)
       end
 
+      # Port 11434: shared by Ollama and vLLM (WireGuard-subnet-restricted).
       rules << firewall_rule('tcp', ollama_port, wireguard_subnet)
+      # Port 4000: LiteLLM Anthropic-API proxy (WireGuard-subnet-restricted).
+      rules << firewall_rule('tcp', litellm_port, wireguard_subnet)
       rules.uniq
     end
 
     private
 
     def validate!
-      %w[auth hyperstack state vm ssh network bootstrap ollama wireguard local_client].each do |section|
+      %w[auth hyperstack state vm ssh network bootstrap ollama vllm wireguard local_client].each do |section|
         raise Error, "Missing config section [#{section}]" unless @data.key?(section)
       end
 
@@ -619,7 +705,10 @@ module HyperstackVM
       @out = out
     end
 
-    def create(replace: false, dry_run: false)
+    def create(replace: false, dry_run: false, install_vllm: nil, install_ollama: nil)
+      # CLI flags override config; nil means "use config default".
+      @effective_vllm = install_vllm.nil? ? @config.vllm_install_enabled? : install_vllm
+      @effective_ollama = install_ollama.nil? ? @config.ollama_install_enabled? : install_ollama
       existing_state = @state_store.load
       if existing_state && existing_state['vm_id']
         if replace
@@ -721,10 +810,36 @@ module HyperstackVM
       print_local_wireguard_summary(state&.dig('public_ip'))
     end
 
+    # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+    # Requires wg1 to be active and the VM to be fully provisioned.
+    def test
+      state = @state_store.load
+      raise Error, "No tracked VM state file found at #{@state_store.path}." if state.nil?
+
+      wg_ip = @config.wireguard_gateway_ip
+      info "Running end-to-end inference tests via WireGuard (#{wg_ip})..."
+
+      if @config.vllm_install_enabled?
+        test_vllm(wg_ip)
+        test_litellm(wg_ip)
+      end
+
+      if @config.ollama_install_enabled?
+        info "  Ollama test: connect via SSH and run 'ollama list' to verify models."
+      end
+
+      info 'All inference tests passed.'
+    end
+
     private
 
     def resumable_state?(state)
-      state['vm_id'] && (state['bootstrapped_at'].nil? || ollama_setup_needed?(state) || wireguard_setup_needed?(state))
+      state['vm_id'] && (
+        state['bootstrapped_at'].nil? ||
+        ollama_setup_needed?(state) ||
+        vllm_setup_needed?(state) ||
+        wireguard_setup_needed?(state)
+      )
     end
 
     def continue_create(state)
@@ -747,7 +862,7 @@ module HyperstackVM
       # Install Ollama binary and configure the service (fast), but defer
       # model pulls until after the WireGuard tunnel is up so that the user
       # can monitor progress over the tunnel.
-      if @config.ollama_install_enabled? && state['ollama_installed_at'].nil?
+      if effective_ollama? && state['ollama_installed_at'].nil?
         install_ollama_service(state['public_ip'])
         state['ollama_installed_at'] = Time.now.utc.iso8601
         @state_store.save(state)
@@ -759,7 +874,7 @@ module HyperstackVM
         @state_store.save(state)
       end
 
-      # Pull and verify models after the tunnel is established
+      # Pull and verify Ollama models after the tunnel is established.
       if ollama_setup_needed?(state)
         pull_ollama_models(state['public_ip'])
         state['ollama_setup_at'] = Time.now.utc.iso8601
@@ -768,6 +883,15 @@ module HyperstackVM
         @state_store.save(state)
       end
 
+      # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+      # the tunnel is up so that model-download progress is visible locally.
+      if vllm_setup_needed?(state)
+        setup_vllm_stack(state['public_ip'])
+        state['vllm_setup_at'] = Time.now.utc.iso8601
+        state['vllm_model'] = @config.vllm_model
+        @state_store.save(state)
+      end
+
       vm = @client.get_vm(vm_id)
       state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
       state['status'] = vm['status']
@@ -777,6 +901,12 @@ module HyperstackVM
 
       info "VM ready: #{state['public_ip']} (id=#{state['vm_id']})"
       print_local_wireguard_summary(state['public_ip'])
+      if effective_vllm?
+        wg_ip = @config.wireguard_gateway_ip
+        info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+        info "  vLLM:    http://#{wg_ip}:#{@config.ollama_port}/v1/models"
+        info "  LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
+      end
     end
 
     def build_create_payload(vm_name, resolved)
@@ -897,7 +1027,7 @@ module HyperstackVM
     end
 
     def ollama_setup_needed?(state)
-      return false unless @config.ollama_install_enabled?
+      return false unless effective_ollama?
       # Re-run setup if state has no record, or if desired models changed
       return true if state['ollama_setup_at'].nil?
 
@@ -1108,12 +1238,18 @@ module HyperstackVM
       else
         info 'Guest bootstrap is disabled in config.'
       end
-      if @config.ollama_install_enabled?
+      if effective_ollama?
         info "Ollama will be installed with models stored under #{@config.ollama_models_dir}"
         unless desired_ollama_models.empty?
           info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
         end
       end
+      if effective_vllm?
+        info "vLLM will be installed: #{@config.vllm_model}"
+        info "  Container: #{@config.vllm_container_name}, port #{@config.ollama_port}, max_model_len #{@config.vllm_max_model_len}"
+        info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
+        info "  Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
+      end
       if @config.wireguard_auto_setup?
         info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>"
       end
@@ -1139,6 +1275,10 @@ module HyperstackVM
           info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
         end
       end
+      if vllm_setup_needed?(state)
+        info "vLLM would be installed: #{@config.vllm_model}"
+        info "LiteLLM proxy would be installed on port #{@config.litellm_port}"
+      end
       if wireguard_setup_needed?(state)
         info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}"
       end
@@ -1197,7 +1337,10 @@ module HyperstackVM
         script << "sudo ufw allow #{@config.ssh_port}/tcp comment 'Allow SSH' >/dev/null 2>&1 || true"
         script << 'sudo ufw --force enable >/dev/null 2>&1 || true'
         script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
-        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Ollama via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+        # Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
+        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+        # Port 4000: LiteLLM proxy (Anthropic API → vLLM); open alongside the inference port.
+        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
       end
 
       if @config.configure_ollama_host?
@@ -1260,6 +1403,7 @@ module HyperstackVM
       script << "Environment=\"OLLAMA_MODELS=#{models_dir}\""
       script << "Environment=\"OLLAMA_GPU_OVERHEAD=#{@config.ollama_gpu_overhead_mb}\""
       script << "Environment=\"OLLAMA_NUM_PARALLEL=#{@config.ollama_num_parallel}\""
+      script << "Environment=\"OLLAMA_CONTEXT_LENGTH=#{@config.ollama_context_length}\""
       script << "Environment=\"OLLAMA_HOST=#{listen_host}\""
       script << 'OVERRIDE'
       script << 'sudo systemctl daemon-reload'
@@ -1302,6 +1446,225 @@ module HyperstackVM
       script.join("\n")
     end
 
+    # Returns the effective Ollama flag: CLI override if set, else config default.
+    def effective_ollama?
+      defined?(@effective_ollama) ? @effective_ollama : @config.ollama_install_enabled?
+    end
+
+    # Returns the effective vLLM flag: CLI override if set, else config default.
+    def effective_vllm?
+      defined?(@effective_vllm) ? @effective_vllm : @config.vllm_install_enabled?
+    end
+
+    def vllm_setup_needed?(state)
+      return false unless effective_vllm?
+      # Re-run if never set up, or if the configured model changed since last setup.
+      return true if state['vllm_setup_at'].nil?
+
+      state['vllm_model'] != @config.vllm_model
+    end
+
+    def setup_vllm_stack(host)
+      info "Setting up vLLM Docker container on #{host}..."
+      output, status = run_ssh_command_streaming(host, vllm_install_script)
+      raise Error, "vLLM install failed: #{output.strip}" unless status.success?
+
+      info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
+      output, status = run_ssh_command_streaming(host, litellm_install_script)
+      raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
+    end
+
+    # Generates the remote shell script that pulls the vLLM Docker image, starts
+    # the container, and polls until the model is fully loaded (up to 10 minutes
+    # to cover the first-run ~45 GB model download).
+    def vllm_install_script
+      model     = @config.vllm_model
+      cache_dir = @config.vllm_hug_cache_dir
+      container = @config.vllm_container_name
+      max_len   = @config.vllm_max_model_len
+      gpu_util  = @config.vllm_gpu_memory_utilization
+      tp_size   = @config.vllm_tensor_parallel_size
+      parser    = @config.vllm_tool_call_parser
+      port      = @config.ollama_port  # vLLM reuses the Ollama port for firewall compat
+
+      docker_run = [
+        'docker run -d',
+        '--gpus all', '--ipc=host', '--network host',
+        "--name #{Shellwords.escape(container)}",
+        '--restart always',
+        "-v #{Shellwords.escape(cache_dir)}:/root/.cache/huggingface",
+        'vllm/vllm-openai:latest',
+        "--model #{Shellwords.escape(model)}",
+        "--tensor-parallel-size #{tp_size}",
+        '--enable-auto-tool-choice',
+        "--tool-call-parser #{Shellwords.escape(parser)}",
+        '--enable-prefix-caching',
+        "--gpu-memory-utilization #{gpu_util}",
+        "--max-model-len #{max_len}",
+        '--host 0.0.0.0',
+        "--port #{port}"
+      ].join(' ')
+
+      script = []
+      script << 'set -euo pipefail'
+      script << "sudo mkdir -p #{Shellwords.escape(cache_dir)}"
+      script << "sudo chmod -R 0777 #{Shellwords.escape(cache_dir)}"
+      # Stop and remove any existing container so re-runs are idempotent.
+      script << "docker stop #{Shellwords.escape(container)} 2>/dev/null || true"
+      script << "docker rm #{Shellwords.escape(container)} 2>/dev/null || true"
+      script << 'docker pull vllm/vllm-openai:latest'
+      script << docker_run
+      # Poll until the model is loaded:
+      #   first run:    ~45 GB download (~2.5 min) + model load (~65 s) + CUDA graphs (~35 s) ≈ 4-5 min
+      #   warm restart: model load + CUDA graphs ≈ 100 s
+      # Timeout: 120 × 5 s = 10 minutes
+      script << 'echo "Waiting for vLLM to become ready (up to 10 min for first model download)..."'
+      script << "for i in $(seq 1 120); do"
+      script << "  if curl -sf http://localhost:#{port}/v1/models >/dev/null 2>&1; then echo vllm-ready; break; fi"
+      script << "  state=$(docker inspect --format='{{.State.Status}}' #{Shellwords.escape(container)} 2>/dev/null || echo unknown)"
+      script << '  echo "  vLLM not ready yet ($i/120, container=$state)..."'
+      script << '  sleep 5'
+      script << 'done'
+      script << "curl -sf http://localhost:#{port}/v1/models >/dev/null || { echo 'FATAL: vLLM did not become ready within 10 minutes'; exit 1; }"
+      script << 'echo vllm-install-ok'
+      script.join("\n")
+    end
+
+    # Generates the remote shell script that installs LiteLLM in a Python venv,
+    # writes a config mapping Claude model aliases to the vLLM endpoint, and
+    # starts the proxy as a systemd service on litellm_port.
+    def litellm_install_script
+      port        = @config.litellm_port
+      vllm_port   = @config.ollama_port
+      model       = @config.vllm_model
+      claude_names = @config.litellm_claude_model_names
+      master_key  = @config.litellm_master_key
+
+      # Build model_list YAML entries; each Claude alias maps to the vLLM model.
+      # "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions (not /v1/responses).
+      model_entries = claude_names.flat_map do |name|
+        [
+          "  - model_name: \"#{name}\"",
+          '    litellm_params:',
+          "      model: \"hosted_vllm/#{model}\"",
+          "      api_base: \"http://localhost:#{vllm_port}/v1\"",
+          '      api_key: "EMPTY"'
+        ]
+      end
+
+      script = []
+      script << 'set -euo pipefail'
+      script << 'sudo apt-get install -y python3.12-venv'
+      script << 'sudo mkdir -p /ephemeral/litellm-env'
+      script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
+      script << 'python3 -m venv /ephemeral/litellm-env'
+      script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
+
+      # Write litellm-config.yaml via heredoc; drop_params silently discards
+      # Claude-specific params (e.g. context_management) that vLLM ignores.
+      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
+      script << 'model_list:'
+      script.concat(model_entries)
+      script << ''
+      script << 'litellm_settings:'
+      script << '  drop_params: true'
+      script << ''
+      script << 'general_settings:'
+      script << "  master_key: \"#{master_key}\""
+      script << 'LITELLM_YAML'
+
+      # Write systemd unit via heredoc; restart on failure so transient crashes self-heal.
+      script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
+      script << '[Unit]'
+      script << 'Description=LiteLLM Proxy'
+      script << 'After=network.target docker.service'
+      script << 'Requires=docker.service'
+      script << ''
+      script << '[Service]'
+      script << 'Type=simple'
+      script << 'User=ubuntu'
+      script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
+      script << 'Restart=always'
+      script << 'RestartSec=5'
+      script << ''
+      script << '[Install]'
+      script << 'WantedBy=multi-user.target'
+      script << 'LITELLM_UNIT'
+
+      script << 'sudo systemctl daemon-reload'
+      script << 'sudo systemctl enable --now litellm'
+      script << 'sleep 5'
+      script << 'systemctl is-active --quiet litellm'
+      script << 'echo litellm-install-ok'
+      script.join("\n")
+    end
+
+    # Tests the vLLM OpenAI-compatible API: lists loaded models and runs a
+    # short inference request to confirm the model accepts requests.
+    def test_vllm(wg_ip)
+      port  = @config.ollama_port
+      model = @config.vllm_model
+
+      info "  Testing vLLM models list at http://#{wg_ip}:#{port}/v1/models..."
+      uri  = URI("http://#{wg_ip}:#{port}/v1/models")
+      resp = Net::HTTP.get_response(uri)
+      raise Error, "vLLM /v1/models returned HTTP #{resp.code}" unless resp.code == '200'
+
+      models = JSON.parse(resp.body).fetch('data', []).map { |m| m['id'] }
+      raise Error, "vLLM returned an empty model list (expected #{model})" if models.empty?
+
+      info "    Models loaded: #{models.join(', ')}"
+      info "  Testing vLLM inference..."
+      reply = vllm_chat(wg_ip, port, model, 'Say hello in five words.')
+      info "    vLLM response: #{reply}"
+    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError, Net::OpenTimeout => e
+      raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+    end
+
+    # Tests the LiteLLM proxy using the Anthropic Messages API format,
+    # which is what Claude Code sends when pointed at a custom base URL.
+    def test_litellm(wg_ip)
+      port  = @config.litellm_port
+      model = @config.litellm_claude_model_names.first
+      key   = @config.litellm_master_key
+
+      info "  Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..."
+      uri = URI("http://#{wg_ip}:#{port}/v1/messages")
+      req = Net::HTTP::Post.new(uri)
+      req['Content-Type'] = 'application/json'
+      req['x-api-key'] = key
+      req['anthropic-version'] = '2023-06-01'
+      req.body = JSON.generate(
+        'model' => model,
+        'max_tokens' => 50,
+        'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }]
+      )
+      resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+      raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200'
+
+      text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip
+      info "    LiteLLM response: #{text}"
+    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError, Net::OpenTimeout => e
+      raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+    end
+
+    # Sends a single OpenAI chat completion request and returns the reply text.
+    def vllm_chat(host, port, model, prompt)
+      uri = URI("http://#{host}:#{port}/v1/chat/completions")
+      req = Net::HTTP::Post.new(uri)
+      req['Content-Type'] = 'application/json'
+      req['Authorization'] = 'Bearer EMPTY'
+      req.body = JSON.generate(
+        'model' => model,
+        'messages' => [{ 'role' => 'user', 'content' => prompt }],
+        'max_tokens' => 50
+      )
+      resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+      raise Error, "vLLM inference returned HTTP #{resp.code}" unless resp.code == '200'
+
+      JSON.parse(resp.body).dig('choices', 0, 'message', 'content').to_s.strip
+    end
+
     def integer_or_nil(value)
       value.nil? ? nil : Integer(value)
     end
@@ -1347,7 +1710,7 @@ module HyperstackVM
       }
 
       global_parser = OptionParser.new do |opts|
-        opts.banner = 'Usage: ruby hyperstack_vm.rb [--config path] <create|delete|status> [options]'
+        opts.banner = 'Usage: ruby hyperstack.rb [--config path] <create|delete|status|test> [options]'
         opts.on('--config PATH', "Path to TOML config (default: #{global[:config_path]})") do |value|
           global[:config_path] = value
         end
@@ -1355,9 +1718,10 @@ module HyperstackVM
           puts opts
           puts
           puts 'Commands:'
-          puts '  create [--replace] [--dry-run]'
+          puts '  create [--replace] [--dry-run] [--vllm|--no-vllm] [--ollama|--no-ollama]'
           puts '  delete [--vm-id ID] [--dry-run]'
           puts '  status'
+          puts '  test'
           exit 0
         end
       end
@@ -1384,12 +1748,18 @@ module HyperstackVM
       when 'create'
         replace = false
         dry_run = false
+        install_vllm = nil
+        install_ollama = nil
         parser = OptionParser.new do |opts|
           opts.on('--replace', 'Delete the tracked VM before creating a new one') { replace = true }
           opts.on('--dry-run', 'Resolve config and print the create plan without creating a VM') { dry_run = true }
+          opts.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { install_vllm = true }
+          opts.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { install_vllm = false }
+          opts.on('--ollama', 'Enable Ollama setup (overrides config)') { install_ollama = true }
+          opts.on('--no-ollama', 'Disable Ollama setup (overrides config)') { install_ollama = false }
         end
         parser.parse!(@argv)
-        manager.create(replace: replace, dry_run: dry_run)
+        manager.create(replace: replace, dry_run: dry_run, install_vllm: install_vllm, install_ollama: install_ollama)
       when 'delete'
         vm_id = nil
         dry_run = false
@@ -1403,8 +1773,10 @@ module HyperstackVM
         manager.delete(vm_id: vm_id, dry_run: dry_run)
       when 'status'
         manager.status
+      when 'test'
+        manager.test
       else
-        raise Error, "Unknown command #{command.inspect}. Use create, delete, or status."
+        raise Error, "Unknown command #{command.inspect}. Use create, delete, status, or test."
       end
     end
   end
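Reviewer note on the `create` flags above: `install_vllm` and `install_ollama` start as `nil`, which reads as a tri-state where `nil` defers to the config file. How `manager.create` resolves that value is not visible in this hunk, so the following is only a minimal sketch under that assumption. `parse_backend_flags` and `effective_setting` are hypothetical names, and the sketch uses OptionParser's `--[no-]flag` form rather than the two separate `opts.on` calls per backend used in the patch.

```ruby
require 'optparse'

# Hypothetical helper (not in the patch): parse the backend flags into
# tri-state overrides. nil means "flag not given, defer to the config file".
def parse_backend_flags(argv)
  overrides = { vllm: nil, ollama: nil }
  parser = OptionParser.new do |opts|
    # --[no-]name yields true for --name and false for --no-name.
    opts.on('--[no-]vllm', 'Enable or disable vLLM+LiteLLM setup') { |v| overrides[:vllm] = v }
    opts.on('--[no-]ollama', 'Enable or disable Ollama setup') { |v| overrides[:ollama] = v }
  end
  parser.parse!(argv.dup) # dup so the caller's argv is left untouched
  overrides
end

# Hypothetical helper: an explicit CLI value wins, otherwise the config decides.
def effective_setting(cli_value, config_value)
  cli_value.nil? ? config_value : cli_value
end

if __FILE__ == $PROGRAM_NAME
  flags = parse_backend_flags(%w[--no-vllm --ollama])
  puts "vllm:   #{effective_setting(flags[:vllm], true)}"
  puts "ollama: #{effective_setting(flags[:ollama], false)}"
end
```

Keeping `nil` distinct from `false` is what lets `--no-vllm` switch off a backend that the config enables, while omitting the flag leaves the config value in charge.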
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+#   Claude Code (earth)                    Hyperstack VM (A100 80GB)
+#   ┌─────────────┐                       ┌──────────────────────────────┐
+#   │ claude CLI   │── Anthropic API ──>  │ LiteLLM proxy (:4000)       │
+#   │              │   /v1/messages        │   translates Anthropic →    │
+#   │              │   via WireGuard wg1   │   OpenAI chat completions   │
+#   └─────────────┘                       │         │                    │
+#                                         │         ▼                    │
+#   OpenCode (earth)                      │ vLLM engine (:11434)        │
+#   ┌─────────────┐                       │   /v1/chat/completions      │
+#   │ opencode     │── OpenAI API ──────> │   FlashAttention v2         │
+#   │              │   /v1/chat/completions│   prefix caching            │
+#   └─────────────┘                       │   bullpoint/Qwen3-Coder-    │
+#                                         │     Next-AWQ-4bit (45GB)    │
+#                                         └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+#   - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+#   - Block-level prefix caching: partial KV cache reuse even when prompt
+#     changes mid-sequence (Ollama requires exact prefix match from token 0)
+#   - Chunked prefill: can interleave prefill and decode
+#   - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+#   - Claude Code speaks Anthropic Messages API (/v1/messages) only
+#   - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
+#   - LiteLLM translates between them, mapping Claude model names to the
+#     actual vLLM model
+#
+# Model details:
+#   - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+#   - Architecture: MoE, 80B total params, 3B active per token
+#   - 512 experts, 10 activated + 1 shared per token
+#   - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+#   - Quantization: AWQ 4-bit, group size 32
+#   - Disk size: ~45GB (vs ~151GB at BF16)
+#   - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+#   - Context: 262,144 tokens (256k native)
+#   - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+#   - Minimum: 1x A100 80GB (PCIe or SXM)
+#   - VRAM breakdown at gpu_memory_utilization=0.92:
+#       Model weights:  ~45 GiB
+#       KV cache:       ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+#       CUDA graphs:    ~3 GiB
+#       Total:          ~75 GiB / 80 GiB
+#
+# Ports:
+#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+#   4000/tcp  - LiteLLM Anthropic-compatible proxy
+#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+#   sudo systemctl stop ollama
+#   sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
+# on some providers but not guaranteed — model will re-download if lost).
+#
+#   sudo mkdir -p /ephemeral/hug
+#   sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+#   docker pull vllm/vllm-openai:latest
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Flags explained:
+#   --tensor-parallel-size 1    Single GPU (use 2/4 for multi-GPU setups)
+#   --enable-auto-tool-choice   Enables function/tool calling
+#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
+#   --enable-prefix-caching     Block-level KV cache reuse across requests
+#   --gpu-memory-utilization 0.92   Use 92% of VRAM (rest for OS/overhead)
+#   --max-model-len 262144      Full 256k context window
+#   --port 11434                Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Quick inference test:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 marks the system Python as
+# externally managed, so a system-wide pip install is refused):
+#
+#   sudo apt-get install -y python3.12-venv
+#   sudo mkdir -p /ephemeral/litellm-env
+#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+#   python3 -m venv /ephemeral/litellm-env
+#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+#   model_list:
+#     - model_name: "claude-sonnet-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-6-20260604"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-3-5-haiku-20241022"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#
+#   litellm_settings:
+#     drop_params: true
+#
+#   general_settings:
+#     master_key: "sk-litellm-master"
+#   YAML
+#
+# Config notes:
+#   - model_name values must match what Claude Code sends (Claude model IDs)
+#   - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+#     (not /v1/responses which vLLM doesn't fully support for complex messages)
+#   - drop_params: true — silently drops Claude-specific parameters like
+#     context_management that vLLM doesn't understand
+#   - master_key is the API key clients must send
+#   - Add new model_name entries when Anthropic releases new model IDs
+#
+# Start LiteLLM:
+#
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 \
+#     --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup:
+#
+#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+#   [Unit]
+#   Description=LiteLLM Proxy
+#   After=network.target docker.service
+#   Requires=docker.service
+#
+#   [Service]
+#   Type=simple
+#   User=ubuntu
+#   ExecStart=/ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000
+#   Restart=always
+#   RestartSec=5
+#
+#   [Install]
+#   WantedBy=multi-user.target
+#   UNIT
+#
+#   sudo systemctl daemon-reload
+#   sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+#     comment 'vLLM via wg1'
+#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+#     comment 'LiteLLM proxy via wg1'
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#   ANTHROPIC_API_KEY=sk-litellm-master \
+#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#     ANTHROPIC_API_KEY=sk-litellm-master \
+#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+#   OPENAI_API_KEY=EMPTY \
+#   opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+#   - Avg prompt throughput:     prefill speed (tokens/s), higher = faster
+#   - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+#   - GPU KV cache usage:        % of KV cache memory in use (proportional to
+#                                 active context length vs max capacity)
+#   - Prefix cache hit rate:     % of prompt tokens served from cache (0% for
+#                                 Claude Code, higher for OpenCode)
+#   - Running/Waiting:           active and queued request counts
+#
+# Follow live (all stats):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+#               Avg generation throughput: 49.4 tokens/s,
+#               Running: 1 reqs, Waiting: 0 reqs,
+#               GPU KV cache usage: 4.6%,
+#               Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (client address, method, path, status):
+#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log ---
+#   tail -f /ephemeral/litellm.log    (when started with nohup)
+#   journalctl -u litellm -f          (when running as the systemd service)
+#
+# --- GPU hardware stats ---
+# Snapshot:
+#   nvidia-smi
+#
+# Continuous (every 5 seconds):
+#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+#     --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+#   Prefill throughput:   5,000-11,000 tok/s (bursts higher during batch prefill)
+#   Decode throughput:    40-99 tok/s (varies with output length per sample)
+#   KV cache usage:       0-5% for short conversations, grows with context
+#                         (100% = 298k tokens, at which point requests queue)
+#   Prefix cache hit:     0% for Claude Code (expected, it mutates prompt prefix)
+#                         >50% for OpenCode after a few turns
+#   Temperature:          44-60C under load, <45C idle
+#   Power:                70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+#   - Waiting > 0 for extended periods → requests queuing, model overloaded
+#   - KV cache usage near 100% → context too long, reduce --max-model-len
+#   - Decode throughput < 20 tok/s sustained → possible thermal throttling
+#   - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+#    → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+#    → Model name in LiteLLM config must exactly match HuggingFace repo name
+#
+# 3. LiteLLM returns UnsupportedParamsError
+#    → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+#    → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+#    → Run `claude /logout` first to clear the claude.ai session token,
+#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+#    → Normal for Claude Code (it mutates the prompt prefix each turn)
+#    → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+#    → Check driver version: nvidia-smi
+#    → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+#   docker stop vllm_qwen3
+#   docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_30b \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-30B-AWQ \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 131072 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Example: full-precision model on multi-GPU (e.g. 4x H100)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_fp16 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-Next \
+#     --tensor-parallel-size 4 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.90 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+#   model: "hosted_vllm/<new-model-name>"
+#
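+# Then restart LiteLLM. If it runs as the systemd service from STEP 4:
+#   sudo systemctl restart litellm
+#
+# If it was started with nohup instead (with Restart=always, systemd would
+# respawn a killed service process and the port would already be taken):
+#   pkill -f litellm
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &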
+#
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+#   https://huggingface.co/models?search=<model-name>+awq
+#   https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+#   - AWQ (recommended): fast Marlin kernels, good quality
+#   - GPTQ: similar to AWQ, widely available
+#   - FP8: 8-bit, native FP8 compute needs Ada/Hopper or newer GPUs (e.g. H100/H200)
+#   - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+#   ---------------------|----------------|-------------------------------
+#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+#   30-32B               | ~18 GiB        | 262k  (~57 GiB for KV cache)
+#   70-80B (MoE, 3B act) | ~45 GiB        | 262k  (~27 GiB for KV cache)
+#   70B (dense)          | ~38 GiB        | 131k  (~37 GiB for KV cache)
+#   120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves the KV cache a single full-length request needs). If it still OOMs,
+# reduce --gpu-memory-utilization to 0.85 or try a smaller model.
+#
+# --- Verifying the new model ---
+# Check loaded model:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"<model-name>",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+#   vLLM prefill throughput:    5,000-11,000 tok/s (FlashAttention v2)
+#   vLLM decode throughput:     40-99 tok/s (memory-bandwidth limited)
+#   Per-turn latency:           ~10-15s (small prompts, early conversation)
+#   KV cache usage:             2-5% for typical coding sessions
+#   Prefix cache hit rate:      0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+#   -----------------------|-----------------------|----------------------
+#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
+#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
+#   Prefix cache (Claude)  | 0% always             | 0% always
+#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
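Reviewer note on the VRAM sizing guide in vllm-setup.txt: the KV cache headroom it quotes can be approximated with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not how vLLM itself accounts for memory. `vllm_budget_gib`, `kv_budget_gib` and the `overhead_gib` parameter are hypothetical names introduced for illustration, and the inputs are the figures from the notes (80 GiB card, 0.92 utilization, ~45 GiB of weights).

```ruby
# Memory vLLM may use on the GPU: total VRAM times --gpu-memory-utilization.
def vllm_budget_gib(total_gib:, utilization:)
  total_gib * utilization
end

# What remains for the KV cache after weights and other overhead.
# overhead_gib is an assumption covering activations and similar buffers.
def kv_budget_gib(total_gib:, utilization:, weights_gib:, overhead_gib: 0.0)
  vllm_budget_gib(total_gib: total_gib, utilization: utilization) - weights_gib - overhead_gib
end

if __FILE__ == $PROGRAM_NAME
  # Figures from vllm-setup.txt: A100 80GB, 0.92 utilization, ~45 GiB of weights.
  kv = kv_budget_gib(total_gib: 80, utilization: 0.92, weights_gib: 45)
  puts format('KV cache budget: %.1f GiB', kv)
end
```

With those inputs the budget is 0.92 × 80 = 73.6 GiB, leaving about 28.6 GiB after the weights. That is in line with the ~27 GiB KV cache reported in the notes once a small overhead is subtracted.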