author Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
committer Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
commit d8575832ae0022f94cd786b15f8b88de0bf18672 (patch)
tree 75872514846cfddb1434281a59b6673344023ff7 /snippets/hyperstack
parent 8dca92ea40b191b9de367197aac7e1f882ed3d43 (diff)
Add vLLM + LiteLLM support; rename script; add README

- Replace Ollama (disabled by default) with vLLM Docker container + LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434, polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to override config at runtime
- New `test` command: end-to-end inference test over WireGuard against vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM) from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
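
The readiness polling mentioned in the commit message (120 attempts, 5 s apart, for a 10-minute budget) can be sketched in Ruby as below. This is only an illustration: the commit itself generates a remote shell loop, and `wait_until_ready` and its parameters are hypothetical names. The probe is passed as a block so the sketch runs without a VM.

```ruby
# Hypothetical helper (not part of hyperstack.rb): retries a readiness probe
# a fixed number of times and reports which attempt succeeded.
# The defaults mirror the 120 x 5 s = 10 minute budget from the commit message.
def wait_until_ready(attempts: 120, interval: 5)
  attempts.times do |i|
    return i + 1 if yield                 # probe succeeded on attempt i + 1
    sleep(interval) if i < attempts - 1   # no need to sleep after the last try
  end
  nil                                     # never became ready within the budget
end

# Usage with a stand-in probe that succeeds on its third call.
calls = 0
result = wait_until_ready(attempts: 5, interval: 0) { (calls += 1) >= 3 }
puts result.inspect
```

In a real run the block would issue the HTTP check against `/v1/models`; returning `nil` on exhaustion lets the caller decide how to report the timeout.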
Diffstat (limited to 'snippets/hyperstack')
-rw-r--r-- snippets/hyperstack/README.md 157
-rw-r--r-- snippets/hyperstack/hyperstack-vm.toml 30
-rw-r--r-- snippets/hyperstack/hyperstack.rb (renamed from snippets/hyperstack/hyperstack_vm.rb) 402
-rw-r--r-- snippets/hyperstack/vllm-setup.txt 487
4 files changed, 1059 insertions, 17 deletions
diff --git a/snippets/hyperstack/README.md b/snippets/hyperstack/README.md
new file mode 100644
index 0000000..e5cc7ea
--- /dev/null
+++ b/snippets/hyperstack/README.md
@@ -0,0 +1,157 @@
+# hyperstack
+
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+
+## Architecture
+
+```
+Claude Code (local) Hyperstack VM (A100 80GB)
+┌─────────────────┐ ┌──────────────────────────────────┐
+│ claude CLI │── Anthropic API ─▶│ LiteLLM proxy (:4000) │
+│ │ /v1/messages │ Anthropic → OpenAI translation │
+│ │ via WireGuard │ │ │
+└─────────────────┘ │ ▼ │
+ │ vLLM engine (:11434) │
+OpenCode (local) │ bullpoint/Qwen3-Coder-Next- │
+┌─────────────────┐ │ AWQ-4bit (45 GB, MoE 80B) │
+│ opencode │── OpenAI API ────▶│ FlashAttention v2 │
+│ │ /v1/chat/... │ prefix caching │
+└─────────────────┘ └──────────────────────────────────┘
+```
+
+Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
+The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+
+## Prerequisites
+
+- Hyperstack account with API key in `~/.hyperstack`
+- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
+- WireGuard setup script: `wg1-setup.sh` (present in this directory)
+- Ruby with `toml-rb` gem: `bundle install`
+
+## Quickstart
+
+```bash
+# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
+ruby hyperstack.rb create
+
+# Verify everything is working
+ruby hyperstack.rb test
+
+# Use Claude Code against the local vLLM
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+
+# Tear down
+ruby hyperstack.rb delete
+```
+
+## Using Claude Code with vLLM
+
+WireGuard (`wg1`) must be active before connecting.
+
+```bash
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+```
+
+If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
+
+```bash
+claude /logout
+```
+
+**Fish shell alias** (add to `~/.config/fish/config.fish`):
+
+```fish
+alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ ANTHROPIC_API_KEY=sk-litellm-master \
+ claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+```
+
+**Available model aliases** — all map to the same vLLM model:
+
+| Alias | Use case |
+|-------|----------|
+| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
+| `claude-opus-4-20250514` | |
+| `claude-sonnet-4-20250514` | |
+| `claude-haiku-3-5-20241022` | |
+
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
+
+## Using OpenCode with vLLM
+
+OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed:
+
+```bash
+OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+OPENAI_API_KEY=EMPTY \
+opencode
+```
+
+Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+
+## CLI reference
+
+```
+ruby hyperstack.rb [--config path] <command> [options]
+
+Commands:
+ create Deploy a new VM and run full provisioning
+ delete Destroy the tracked VM
+ status Show VM and WireGuard status
+ test Run end-to-end inference tests (vLLM + LiteLLM)
+
+create options:
+ --replace Delete existing tracked VM before creating
+ --dry-run Print the plan without making changes
+ --vllm / --no-vllm Override config: enable/disable vLLM+LiteLLM setup
+ --ollama / --no-ollama Override config: enable/disable Ollama setup
+```
+
+## Configuration
+
+Edit `hyperstack-vm.toml` to change defaults. Key sections:
+
+| Section | Purpose |
+|---------|---------|
+| `[vm]` | Flavor, image, environment name |
+| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
+| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
+| `[wireguard]` | Auto-setup script path |
+
+## Monitoring vLLM
+
+```bash
+# Live engine stats (throughput, KV cache, prefix cache hit rate)
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# Last 1 minute of stats
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# GPU stats (every 5 s)
+ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+
+# LiteLLM proxy log
+ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
+```
+
+Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+
+| Metric | Expected |
+|--------|----------|
+| Prefill throughput | 5,000–11,000 tok/s |
+| Decode throughput | 40–99 tok/s |
+| KV cache usage | 2–5% for typical sessions |
+| Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) |
+| Prefix cache hit (OpenCode) | >50% after warm-up |
+
+## Switching models
+
+Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config.
+
+See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
diff --git a/snippets/hyperstack/hyperstack-vm.toml b/snippets/hyperstack/hyperstack-vm.toml
index 2d83b0f..0ea3cfc 100644
--- a/snippets/hyperstack/hyperstack-vm.toml
+++ b/snippets/hyperstack/hyperstack-vm.toml
@@ -31,7 +31,10 @@ connect_timeout_sec = 10
[network]
wireguard_udp_port = 56710
wireguard_subnet = "192.168.3.0/24"
+# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
ollama_port = 11434
+# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
+litellm_port = 4000
allowed_ssh_cidrs = ["0.0.0.0/0"]
allowed_wireguard_cidrs = ["0.0.0.0/0"]
@@ -42,13 +45,36 @@ configure_ufw = true
configure_ollama_host = false
[ollama]
-install = true
+# Disabled in favour of vLLM; set install = true to switch back to Ollama.
+install = false
models_dir = "/ephemeral/ollama/models"
listen_host = "0.0.0.0:11434"
gpu_overhead_mb = 2000
-num_parallel = 4
+num_parallel = 1
+context_length = 32768
pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
+# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# Use --vllm / --no-vllm CLI flags to override install at runtime.
+[vllm]
+install = true
+model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+# HuggingFace model cache on ephemeral NVMe (fast; survives reboots on most providers).
+hug_cache_dir = "/ephemeral/hug"
+container_name = "vllm_qwen3"
+max_model_len = 262144
+gpu_memory_utilization = 0.92
+tensor_parallel_size = 1
+tool_call_parser = "qwen3_coder"
+# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
+litellm_master_key = "sk-litellm-master"
+litellm_claude_model_names = [
+ "claude-sonnet-4-20250514",
+ "claude-opus-4-20250514",
+ "claude-opus-4-6-20260604",
+ "claude-haiku-3-5-20241022"
+]
+
[wireguard]
auto_setup = true
setup_script = "./wg1-setup.sh"
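
The `wireguard_subnet` value in this config determines the VM's tunnel address: the first host in the network, so `192.168.3.0/24` gives `192.168.3.1`. A minimal sketch of that derivation with Ruby's standard `ipaddr` library follows. `first_host` is a hypothetical name, and integer arithmetic is one possible approach rather than necessarily what the script does.

```ruby
require 'ipaddr'

# Hypothetical helper: returns the first host address of a CIDR subnet
# (network address + 1), i.e. the VM's end of the WireGuard tunnel.
def first_host(cidr)
  network = IPAddr.new(cidr) # IPAddr masks the input down to the network address
  IPAddr.new(network.to_i + 1, network.family).to_s
end

puts first_host('192.168.3.0/24')
```

Working on the integer form avoids manipulating dotted-quad strings, so the same helper should also cope with subnets whose network address does not end in `.0`.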
diff --git a/snippets/hyperstack/hyperstack_vm.rb b/snippets/hyperstack/hyperstack.rb
index ac60da9..c84d013 100644
--- a/snippets/hyperstack/hyperstack_vm.rb
+++ b/snippets/hyperstack/hyperstack.rb
@@ -62,7 +62,8 @@ module HyperstackVM
'network' => {
'wireguard_udp_port' => 56_710,
'wireguard_subnet' => '192.168.3.0/24',
- 'ollama_port' => 11_434,
+ 'ollama_port' => 11_434, # reused by vLLM for firewall compatibility
+ 'litellm_port' => 4_000,
'allowed_ssh_cidrs' => ['0.0.0.0/0'],
'allowed_wireguard_cidrs' => ['0.0.0.0/0']
},
@@ -73,13 +74,34 @@ module HyperstackVM
'configure_ollama_host' => false
},
'ollama' => {
- 'install' => true,
+ # Disabled in favour of vLLM; set install: true to use Ollama instead.
+ 'install' => false,
'models_dir' => '/ephemeral/ollama/models',
'listen_host' => '0.0.0.0:11434',
'gpu_overhead_mb' => 2000,
- 'num_parallel' => 4,
+ 'num_parallel' => 1,
+ 'context_length' => 32_768,
'pull_models' => ['qwen3-coder:30b', 'gpt-oss:20b', 'gpt-oss:120b', 'nemotron-3-super']
},
+ 'vllm' => {
+ # vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI chat completions.
+ 'install' => true,
+ 'model' => 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
+ 'hug_cache_dir' => '/ephemeral/hug',
+ 'container_name' => 'vllm_qwen3',
+ 'max_model_len' => 262_144,
+ 'gpu_memory_utilization' => 0.92,
+ 'tensor_parallel_size' => 1,
+ 'tool_call_parser' => 'qwen3_coder',
+ # LiteLLM maps each Claude model alias to the vLLM model; add new Anthropic IDs here.
+ 'litellm_claude_model_names' => %w[
+ claude-sonnet-4-20250514
+ claude-opus-4-20250514
+ claude-opus-4-6-20260604
+ claude-haiku-3-5-20241022
+ ],
+ 'litellm_master_key' => 'sk-litellm-master'
+ },
'wireguard' => {
'auto_setup' => true,
'setup_script' => './wg1-setup.sh'
@@ -216,6 +238,19 @@ module HyperstackVM
Integer(fetch('network', 'ollama_port'))
end
+ def litellm_port
+ Integer(fetch('network', 'litellm_port'))
+ end
+
+ # Derives the VM's WireGuard IP as the first host in the subnet (network + 1).
+ # E.g. 192.168.3.0/24 → 192.168.3.1
+ def wireguard_gateway_ip
+ base = IPAddr.new(wireguard_subnet).to_s
+ parts = base.split('.').map(&:to_i)
+ parts[-1] += 1
+ parts.join('.')
+ end
+
def allowed_ssh_cidrs
Array(fetch('network', 'allowed_ssh_cidrs')).map(&:to_s)
end
@@ -260,10 +295,58 @@ module HyperstackVM
Integer(fetch('ollama', 'num_parallel'))
end
+ # Maximum context length for Ollama inference; keeps KV cache bounded
+ # on single-GPU setups to avoid slow prefill at large context sizes.
+ def ollama_context_length
+ Integer(fetch('ollama', 'context_length'))
+ end
+
def ollama_pull_models
Array(fetch('ollama', 'pull_models')).map(&:to_s)
end
+ def vllm_install_enabled?
+ truthy?(fetch('vllm', 'install'))
+ end
+
+ def vllm_model
+ fetch('vllm', 'model')
+ end
+
+ def vllm_hug_cache_dir
+ fetch('vllm', 'hug_cache_dir')
+ end
+
+ def vllm_container_name
+ fetch('vllm', 'container_name')
+ end
+
+ def vllm_max_model_len
+ Integer(fetch('vllm', 'max_model_len'))
+ end
+
+ def vllm_gpu_memory_utilization
+ Float(fetch('vllm', 'gpu_memory_utilization'))
+ end
+
+ def vllm_tensor_parallel_size
+ Integer(fetch('vllm', 'tensor_parallel_size'))
+ end
+
+ def vllm_tool_call_parser
+ fetch('vllm', 'tool_call_parser')
+ end
+
+ # Claude model aliases that LiteLLM maps to the vLLM model.
+ # Must match what Claude Code sends in the model field.
+ def litellm_claude_model_names
+ Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
+ end
+
+ def litellm_master_key
+ fetch('vllm', 'litellm_master_key')
+ end
+
def local_client_checks_enabled?
truthy?(fetch('local_client', 'check_wg1_service'))
end
@@ -295,14 +378,17 @@ module HyperstackVM
rules << firewall_rule('udp', wireguard_udp_port, cidr)
end
+ # Port 11434: shared by Ollama and vLLM (WireGuard-subnet-restricted).
rules << firewall_rule('tcp', ollama_port, wireguard_subnet)
+ # Port 4000: LiteLLM Anthropic-API proxy (WireGuard-subnet-restricted).
+ rules << firewall_rule('tcp', litellm_port, wireguard_subnet)
rules.uniq
end
private
def validate!
- %w[auth hyperstack state vm ssh network bootstrap ollama wireguard local_client].each do |section|
+ %w[auth hyperstack state vm ssh network bootstrap ollama vllm wireguard local_client].each do |section|
raise Error, "Missing config section [#{section}]" unless @data.key?(section)
end
@@ -619,7 +705,10 @@ module HyperstackVM
@out = out
end
- def create(replace: false, dry_run: false)
+ def create(replace: false, dry_run: false, install_vllm: nil, install_ollama: nil)
+ # CLI flags override config; nil means "use config default".
+ @effective_vllm = install_vllm.nil? ? @config.vllm_install_enabled? : install_vllm
+ @effective_ollama = install_ollama.nil? ? @config.ollama_install_enabled? : install_ollama
existing_state = @state_store.load
if existing_state && existing_state['vm_id']
if replace
@@ -721,10 +810,36 @@ module HyperstackVM
print_local_wireguard_summary(state&.dig('public_ip'))
end
+ # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+ # Requires wg1 to be active and the VM to be fully provisioned.
+ def test
+ state = @state_store.load
+ raise Error, "No tracked VM state file found at #{@state_store.path}." if state.nil?
+
+ wg_ip = @config.wireguard_gateway_ip
+ info "Running end-to-end inference tests via WireGuard (#{wg_ip})..."
+
+ if @config.vllm_install_enabled?
+ test_vllm(wg_ip)
+ test_litellm(wg_ip)
+ end
+
+ if @config.ollama_install_enabled?
+ info " Ollama test: connect via SSH and run 'ollama list' to verify models."
+ end
+
+ info 'All inference tests passed.'
+ end
+
private
def resumable_state?(state)
- state['vm_id'] && (state['bootstrapped_at'].nil? || ollama_setup_needed?(state) || wireguard_setup_needed?(state))
+ state['vm_id'] && (
+ state['bootstrapped_at'].nil? ||
+ ollama_setup_needed?(state) ||
+ vllm_setup_needed?(state) ||
+ wireguard_setup_needed?(state)
+ )
end
def continue_create(state)
@@ -747,7 +862,7 @@ module HyperstackVM
# Install Ollama binary and configure the service (fast), but defer
# model pulls until after the WireGuard tunnel is up so that the user
# can monitor progress over the tunnel.
- if @config.ollama_install_enabled? && state['ollama_installed_at'].nil?
+ if effective_ollama? && state['ollama_installed_at'].nil?
install_ollama_service(state['public_ip'])
state['ollama_installed_at'] = Time.now.utc.iso8601
@state_store.save(state)
@@ -759,7 +874,7 @@ module HyperstackVM
@state_store.save(state)
end
- # Pull and verify models after the tunnel is established
+ # Pull and verify Ollama models after the tunnel is established.
if ollama_setup_needed?(state)
pull_ollama_models(state['public_ip'])
state['ollama_setup_at'] = Time.now.utc.iso8601
@@ -768,6 +883,15 @@ module HyperstackVM
@state_store.save(state)
end
+ # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+ # the tunnel is up so that model-download progress is visible locally.
+ if vllm_setup_needed?(state)
+ setup_vllm_stack(state['public_ip'])
+ state['vllm_setup_at'] = Time.now.utc.iso8601
+ state['vllm_model'] = @config.vllm_model
+ @state_store.save(state)
+ end
+
vm = @client.get_vm(vm_id)
state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
state['status'] = vm['status']
@@ -777,6 +901,12 @@ module HyperstackVM
info "VM ready: #{state['public_ip']} (id=#{state['vm_id']})"
print_local_wireguard_summary(state['public_ip'])
+ if effective_vllm?
+ wg_ip = @config.wireguard_gateway_ip
+ info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+ info " vLLM: http://#{wg_ip}:#{@config.ollama_port}/v1/models"
+ info " LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
+ end
end
def build_create_payload(vm_name, resolved)
@@ -897,7 +1027,7 @@ module HyperstackVM
end
def ollama_setup_needed?(state)
- return false unless @config.ollama_install_enabled?
+ return false unless effective_ollama?
# Re-run setup if state has no record, or if desired models changed
return true if state['ollama_setup_at'].nil?
@@ -1108,12 +1238,18 @@ module HyperstackVM
else
info 'Guest bootstrap is disabled in config.'
end
- if @config.ollama_install_enabled?
+ if effective_ollama?
info "Ollama will be installed with models stored under #{@config.ollama_models_dir}"
unless desired_ollama_models.empty?
info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
end
end
+ if effective_vllm?
+ info "vLLM will be installed: #{@config.vllm_model}"
+ info " Container: #{@config.vllm_container_name}, port #{@config.ollama_port}, max_model_len #{@config.vllm_max_model_len}"
+ info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
+ info " Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
+ end
if @config.wireguard_auto_setup?
info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>"
end
@@ -1139,6 +1275,10 @@ module HyperstackVM
info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
end
end
+ if vllm_setup_needed?(state)
+ info "vLLM would be installed: #{@config.vllm_model}"
+ info "LiteLLM proxy would be installed on port #{@config.litellm_port}"
+ end
if wireguard_setup_needed?(state)
info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}"
end
@@ -1197,7 +1337,10 @@ module HyperstackVM
script << "sudo ufw allow #{@config.ssh_port}/tcp comment 'Allow SSH' >/dev/null 2>&1 || true"
script << 'sudo ufw --force enable >/dev/null 2>&1 || true'
script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
- script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Ollama via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+ # Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
+ script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+ # Port 4000: LiteLLM proxy (Anthropic API → vLLM); open alongside the inference port.
+ script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
end
if @config.configure_ollama_host?
@@ -1260,6 +1403,7 @@ module HyperstackVM
script << "Environment=\"OLLAMA_MODELS=#{models_dir}\""
script << "Environment=\"OLLAMA_GPU_OVERHEAD=#{@config.ollama_gpu_overhead_mb}\""
script << "Environment=\"OLLAMA_NUM_PARALLEL=#{@config.ollama_num_parallel}\""
+ script << "Environment=\"OLLAMA_CONTEXT_LENGTH=#{@config.ollama_context_length}\""
script << "Environment=\"OLLAMA_HOST=#{listen_host}\""
script << 'OVERRIDE'
script << 'sudo systemctl daemon-reload'
@@ -1302,6 +1446,225 @@ module HyperstackVM
script.join("\n")
end
+ # Returns the effective Ollama flag: CLI override if set, else config default.
+ def effective_ollama?
+ defined?(@effective_ollama) ? @effective_ollama : @config.ollama_install_enabled?
+ end
+
+ # Returns the effective vLLM flag: CLI override if set, else config default.
+ def effective_vllm?
+ defined?(@effective_vllm) ? @effective_vllm : @config.vllm_install_enabled?
+ end
+
+ def vllm_setup_needed?(state)
+ return false unless effective_vllm?
+ # Re-run if never set up, or if the configured model changed since last setup.
+ return true if state['vllm_setup_at'].nil?
+
+ state['vllm_model'] != @config.vllm_model
+ end
+
+ def setup_vllm_stack(host)
+ info "Setting up vLLM Docker container on #{host}..."
+ output, status = run_ssh_command_streaming(host, vllm_install_script)
+ raise Error, "vLLM install failed: #{output.strip}" unless status.success?
+
+ info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
+ output, status = run_ssh_command_streaming(host, litellm_install_script)
+ raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
+ end
+
+ # Generates the remote shell script that pulls the vLLM Docker image, starts
+ # the container, and polls until the model is fully loaded (up to 10 minutes
+ # to cover the first-run ~45 GB model download).
+ def vllm_install_script
+ model = @config.vllm_model
+ cache_dir = @config.vllm_hug_cache_dir
+ container = @config.vllm_container_name
+ max_len = @config.vllm_max_model_len
+ gpu_util = @config.vllm_gpu_memory_utilization
+ tp_size = @config.vllm_tensor_parallel_size
+ parser = @config.vllm_tool_call_parser
+ port = @config.ollama_port # vLLM reuses the Ollama port for firewall compat
+
+ docker_run = [
+ 'docker run -d',
+ '--gpus all', '--ipc=host', '--network host',
+ "--name #{Shellwords.escape(container)}",
+ '--restart always',
+ "-v #{Shellwords.escape(cache_dir)}:/root/.cache/huggingface",
+ 'vllm/vllm-openai:latest',
+ "--model #{Shellwords.escape(model)}",
+ "--tensor-parallel-size #{tp_size}",
+ '--enable-auto-tool-choice',
+ "--tool-call-parser #{Shellwords.escape(parser)}",
+ '--enable-prefix-caching',
+ "--gpu-memory-utilization #{gpu_util}",
+ "--max-model-len #{max_len}",
+ '--host 0.0.0.0',
+ "--port #{port}"
+ ].join(' ')
+
+ script = []
+ script << 'set -euo pipefail'
+ script << "sudo mkdir -p #{Shellwords.escape(cache_dir)}"
+ script << "sudo chmod -R 0777 #{Shellwords.escape(cache_dir)}"
+ # Stop and remove any existing container so re-runs are idempotent.
+ script << "docker stop #{Shellwords.escape(container)} 2>/dev/null || true"
+ script << "docker rm #{Shellwords.escape(container)} 2>/dev/null || true"
+ script << 'docker pull vllm/vllm-openai:latest'
+ script << docker_run
+ # Poll until the model is loaded:
+ # first run: ~45 GB download (~2.5 min) + model load (~65 s) + CUDA graphs (~35 s) ≈ 4-5 min
+ # warm restart: model load + CUDA graphs ≈ 100 s
+ # Timeout: 120 × 5 s = 10 minutes
+ script << 'echo "Waiting for vLLM to become ready (up to 10 min for first model download)..."'
+ script << "for i in $(seq 1 120); do"
+ script << " if curl -sf http://localhost:#{port}/v1/models >/dev/null 2>&1; then echo vllm-ready; break; fi"
+ script << " state=$(docker inspect --format='{{.State.Status}}' #{Shellwords.escape(container)} 2>/dev/null || echo unknown)"
+ script << ' echo " vLLM not ready yet ($i/120, container=$state)..."'
+ script << ' sleep 5'
+ script << 'done'
+ script << "curl -sf http://localhost:#{port}/v1/models >/dev/null || { echo 'FATAL: vLLM did not become ready within 10 minutes'; exit 1; }"
+ script << 'echo vllm-install-ok'
+ script.join("\n")
+ end
+
+ # Generates the remote shell script that installs LiteLLM in a Python venv,
+ # writes a config mapping Claude model aliases to the vLLM endpoint, and
+ # starts the proxy as a systemd service on litellm_port.
+ def litellm_install_script
+ port = @config.litellm_port
+ vllm_port = @config.ollama_port
+ model = @config.vllm_model
+ claude_names = @config.litellm_claude_model_names
+ master_key = @config.litellm_master_key
+
+ # Build model_list YAML entries; each Claude alias maps to the vLLM model.
+ # "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions (not /v1/responses).
+ model_entries = claude_names.flat_map do |name|
+ [
+ " - model_name: \"#{name}\"",
+ ' litellm_params:',
+ " model: \"hosted_vllm/#{model}\"",
+ " api_base: \"http://localhost:#{vllm_port}/v1\"",
+ ' api_key: "EMPTY"'
+ ]
+ end
+
+ script = []
+ script << 'set -euo pipefail'
+ script << 'sudo apt-get install -y python3.12-venv'
+ script << 'sudo mkdir -p /ephemeral/litellm-env'
+ script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
+ script << 'python3 -m venv /ephemeral/litellm-env'
+ script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
+
+ # Write litellm-config.yaml via heredoc; drop_params silently discards
+ # Claude-specific params (e.g. context_management) that vLLM ignores.
+ script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
+ script << 'model_list:'
+ script.concat(model_entries)
+ script << ''
+ script << 'litellm_settings:'
+ script << ' drop_params: true'
+ script << ''
+ script << 'general_settings:'
+ script << " master_key: \"#{master_key}\""
+ script << 'LITELLM_YAML'
+
+ # Write systemd unit via heredoc; restart on failure so transient crashes self-heal.
+ script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
+ script << '[Unit]'
+ script << 'Description=LiteLLM Proxy'
+ script << 'After=network.target docker.service'
+ script << 'Requires=docker.service'
+ script << ''
+ script << '[Service]'
+ script << 'Type=simple'
+ script << 'User=ubuntu'
+ script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
+ script << 'Restart=always'
+ script << 'RestartSec=5'
+ script << ''
+ script << '[Install]'
+ script << 'WantedBy=multi-user.target'
+ script << 'LITELLM_UNIT'
+
+ script << 'sudo systemctl daemon-reload'
+ script << 'sudo systemctl enable --now litellm'
+ script << 'sleep 5'
+ script << 'systemctl is-active --quiet litellm'
+ script << 'echo litellm-install-ok'
+ script.join("\n")
+ end
+
+ # Tests the vLLM OpenAI-compatible API: lists loaded models and runs a
+ # short inference request to confirm the model accepts requests.
+ def test_vllm(wg_ip)
+ port = @config.ollama_port
+ model = @config.vllm_model
+
+ info " Testing vLLM models list at http://#{wg_ip}:#{port}/v1/models..."
+ uri = URI("http://#{wg_ip}:#{port}/v1/models")
+ resp = Net::HTTP.get_response(uri)
+ raise Error, "vLLM /v1/models returned HTTP #{resp.code}" unless resp.code == '200'
+
+ models = JSON.parse(resp.body).fetch('data', []).map { |m| m['id'] }
+ raise Error, "vLLM returned an empty model list (expected #{model})" if models.empty?
+
+ info " Models loaded: #{models.join(', ')}"
+ info " Testing vLLM inference..."
+ reply = vllm_chat(wg_ip, port, model, 'Say hello in five words.')
+ info " vLLM response: #{reply}"
+ rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Net::OpenTimeout, SocketError => e
+ raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+ end
+
+ # Tests the LiteLLM proxy using the Anthropic Messages API format,
+ # which is what Claude Code sends when pointed at a custom base URL.
+ def test_litellm(wg_ip)
+ port = @config.litellm_port
+ model = @config.litellm_claude_model_names.first
+ key = @config.litellm_master_key
+
+ info " Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..."
+ uri = URI("http://#{wg_ip}:#{port}/v1/messages")
+ req = Net::HTTP::Post.new(uri)
+ req['Content-Type'] = 'application/json'
+ req['x-api-key'] = key
+ req['anthropic-version'] = '2023-06-01'
+ req.body = JSON.generate(
+ 'model' => model,
+ 'max_tokens' => 50,
+ 'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }]
+ )
+ resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+ raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200'
+
+ text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip
+ info " LiteLLM response: #{text}"
+ rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Net::OpenTimeout, SocketError => e
+ raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+ end
+
+ # Sends a single OpenAI chat completion request and returns the reply text.
+ def vllm_chat(host, port, model, prompt)
+ uri = URI("http://#{host}:#{port}/v1/chat/completions")
+ req = Net::HTTP::Post.new(uri)
+ req['Content-Type'] = 'application/json'
+ req['Authorization'] = 'Bearer EMPTY'
+ req.body = JSON.generate(
+ 'model' => model,
+ 'messages' => [{ 'role' => 'user', 'content' => prompt }],
+ 'max_tokens' => 50
+ )
+ resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+ raise Error, "vLLM inference returned HTTP #{resp.code}" unless resp.code == '200'
+
+ JSON.parse(resp.body).dig('choices', 0, 'message', 'content').to_s.strip
+ end
+
def integer_or_nil(value)
value.nil? ? nil : Integer(value)
end
@@ -1347,7 +1710,7 @@ module HyperstackVM
}
global_parser = OptionParser.new do |opts|
- opts.banner = 'Usage: ruby hyperstack_vm.rb [--config path] <create|delete|status> [options]'
+ opts.banner = 'Usage: ruby hyperstack.rb [--config path] <create|delete|status|test> [options]'
opts.on('--config PATH', "Path to TOML config (default: #{global[:config_path]})") do |value|
global[:config_path] = value
end
@@ -1355,9 +1718,10 @@ module HyperstackVM
puts opts
puts
puts 'Commands:'
- puts ' create [--replace] [--dry-run]'
+ puts ' create [--replace] [--dry-run] [--vllm|--no-vllm] [--ollama|--no-ollama]'
puts ' delete [--vm-id ID] [--dry-run]'
puts ' status'
+ puts ' test'
exit 0
end
end
@@ -1384,12 +1748,18 @@ module HyperstackVM
when 'create'
replace = false
dry_run = false
+ install_vllm = nil
+ install_ollama = nil
parser = OptionParser.new do |opts|
opts.on('--replace', 'Delete the tracked VM before creating a new one') { replace = true }
opts.on('--dry-run', 'Resolve config and print the create plan without creating a VM') { dry_run = true }
+ opts.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { install_vllm = true }
+ opts.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { install_vllm = false }
+ opts.on('--ollama', 'Enable Ollama setup (overrides config)') { install_ollama = true }
+ opts.on('--no-ollama', 'Disable Ollama setup (overrides config)') { install_ollama = false }
end
parser.parse!(@argv)
- manager.create(replace: replace, dry_run: dry_run)
+ manager.create(replace: replace, dry_run: dry_run, install_vllm: install_vllm, install_ollama: install_ollama)
when 'delete'
vm_id = nil
dry_run = false
@@ -1403,8 +1773,10 @@ module HyperstackVM
manager.delete(vm_id: vm_id, dry_run: dry_run)
when 'status'
manager.status
+ when 'test'
+ manager.test
else
- raise Error, "Unknown command #{command.inspect}. Use create, delete, or status."
+ raise Error, "Unknown command #{command.inspect}. Use create, delete, status, or test."
end
end
end
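
`litellm_install_script` in the diff above assembles the LiteLLM config by concatenating YAML strings. As an alternative sketch, the same structure can be built as a Hash and serialized with Ruby's standard `yaml` library, which handles quoting automatically. `litellm_config_yaml` and its keyword arguments are hypothetical names. The keys (`model_list`, `litellm_params`, `drop_params`, `master_key`) and the `hosted_vllm/` prefix are taken from the diff.

```ruby
require 'yaml'

# Hypothetical helper: builds the LiteLLM proxy config as data, then dumps it.
# Every Claude alias maps to the same vLLM-served model.
def litellm_config_yaml(aliases:, model:, vllm_port:, master_key:)
  model_list = aliases.map do |name|
    {
      'model_name' => name,
      'litellm_params' => {
        'model' => "hosted_vllm/#{model}",
        'api_base' => "http://localhost:#{vllm_port}/v1",
        'api_key' => 'EMPTY'
      }
    }
  end

  config = {
    'model_list' => model_list,
    'litellm_settings' => { 'drop_params' => true },
    'general_settings' => { 'master_key' => master_key }
  }
  YAML.dump(config)
end

yaml = litellm_config_yaml(
  aliases: %w[claude-sonnet-4-20250514 claude-opus-4-20250514],
  model: 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
  vllm_port: 11_434,
  master_key: 'sk-litellm-master'
)
puts yaml
```

Building the config as data means an alias or key containing YAML-special characters should not produce a malformed file. The trade-off is that the emitted layout is chosen by the library rather than by the script.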
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+#   Claude Code (earth)                    Hyperstack VM (A100 80GB)
+#   ┌─────────────┐                        ┌──────────────────────────────┐
+#   │ claude CLI  │── Anthropic API ──>    │ LiteLLM proxy (:4000)        │
+#   │             │   /v1/messages         │ translates Anthropic →       │
+#   │             │   via WireGuard wg1    │ OpenAI chat completions      │
+#   └─────────────┘                        │         │                    │
+#                                          │         ▼                    │
+#   OpenCode (earth)                       │ vLLM engine (:11434)         │
+#   ┌─────────────┐                        │ /v1/chat/completions         │
+#   │ opencode    │── OpenAI API ──────>   │ FlashAttention v2            │
+#   │             │   /v1/chat/completions │ prefix caching               │
+#   └─────────────┘                        │ bullpoint/Qwen3-Coder-       │
+#                                          │ Next-AWQ-4bit (45GB)         │
+#                                          └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+# - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+# - Block-level prefix caching: partial KV cache reuse even when prompt
+# changes mid-sequence (Ollama requires exact prefix match from token 0)
+# - Chunked prefill: can interleave prefill and decode
+# - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+# - Claude Code speaks Anthropic Messages API (/v1/messages) only
+# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
+# - LiteLLM translates between them, mapping Claude model names to the
+# actual vLLM model
+#
+# Model details:
+# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+# - Architecture: MoE, 80B total params, 3B active per token
+# - 512 experts, 10 activated + 1 shared per token
+# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+# - Quantization: AWQ 4-bit, group size 32
+# - Disk size: ~45GB (vs ~151GB at BF16)
+# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+# - Context: 262,144 tokens (256k native)
+# - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+# - Minimum: 1x A100 80GB (PCIe or SXM)
+# - VRAM breakdown at gpu_memory_utilization=0.92:
+# Model weights: ~45 GiB
+# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+# CUDA graphs: ~3 GiB
+# Total: ~75 GiB / 80 GiB
+#
+# Ports:
+# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+# 4000/tcp - LiteLLM Anthropic-compatible proxy
+# Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+# sudo systemctl stop ollama
+# sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
+# on some providers but not guaranteed — model will re-download if lost).
+#
+# sudo mkdir -p /ephemeral/hug
+# sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+# docker pull vllm/vllm-openai:latest
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3 \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+# --tensor-parallel-size 1 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.92 \
+# --max-model-len 262144 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# Flags explained:
+#   --tensor-parallel-size 1         Single GPU (use 2/4 for multi-GPU setups)
+#   --enable-auto-tool-choice        Enables function/tool calling
+#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
+#   --enable-prefix-caching          Block-level KV cache reuse across requests
+#   --gpu-memory-utilization 0.92    Use 92% of VRAM (rest for OS/overhead)
+#   --max-model-len 262144           Full 256k context window
+#   --port 11434                     Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+# curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Quick inference test:
+# curl -s http://localhost:11434/v1/chat/completions \
+# -H "Content-Type: application/json" \
+# -H "Authorization: Bearer EMPTY" \
+# -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+# "messages":[{"role":"user","content":"Hello"}],
+# "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 marks the system Python as externally managed, so a bare pip install is refused):
+#
+# sudo apt-get install -y python3.12-venv
+# sudo mkdir -p /ephemeral/litellm-env
+# sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+# python3 -m venv /ephemeral/litellm-env
+# /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+# sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+# model_list:
+#   - model_name: "claude-sonnet-4-20250514"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-opus-4-20250514"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-opus-4-6-20260604"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-3-5-haiku-20241022"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#
+# litellm_settings:
+#   drop_params: true
+#
+# general_settings:
+#   master_key: "sk-litellm-master"
+# YAML
+#
+# Config notes:
+# - model_name values must match what Claude Code sends (Claude model IDs)
+# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+# (not /v1/responses which vLLM doesn't fully support for complex messages)
+# - drop_params: true — silently drops Claude-specific parameters like
+# context_management that vLLM doesn't understand
+# - master_key is the API key clients must send
+# - Add new model_name entries when Anthropic releases new model IDs
+#
+# Start LiteLLM:
+#
+# nohup /ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 \
+# --port 4000 \
+# > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+# curl -s http://localhost:4000/v1/messages \
+# -H "Content-Type: application/json" \
+# -H "x-api-key: sk-litellm-master" \
+# -H "anthropic-version: 2023-06-01" \
+# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+# "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup:
+#
+# sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+# [Unit]
+# Description=LiteLLM Proxy
+# After=network.target docker.service
+# Requires=docker.service
+#
+# [Service]
+# Type=simple
+# User=ubuntu
+# ExecStart=/ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 --port 4000
+# Restart=always
+# RestartSec=5
+#
+# [Install]
+# WantedBy=multi-user.target
+# UNIT
+#
+# sudo systemctl daemon-reload
+# sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+# comment 'vLLM via wg1'
+# sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+# comment 'LiteLLM proxy via wg1'
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+# ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+# ANTHROPIC_API_KEY=sk-litellm-master \
+# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+# alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+# ANTHROPIC_API_KEY=sk-litellm-master \
+# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+# OPENAI_API_KEY=EMPTY \
+# opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+# - Avg prompt throughput: prefill speed (tokens/s), higher = faster
+# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+# - GPU KV cache usage: % of KV cache memory in use (proportional to
+# active context length vs max capacity)
+# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
+# Claude Code, higher for OpenCode)
+# - Running/Waiting: active and queued request counts
+#
+# Follow live (all stats):
+# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+# Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+# Avg generation throughput: 49.4 tokens/s,
+# Running: 1 reqs, Waiting: 0 reqs,
+# GPU KV cache usage: 4.6%,
+# Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (client address, method, path, status):
+# docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log (nohup; with the systemd unit use: journalctl -u litellm -f) ---
+# tail -f /ephemeral/litellm.log
+#
+# --- GPU hardware stats ---
+# Snapshot:
+# nvidia-smi
+#
+# Continuous (every 5 seconds):
+# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+# --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill)
+# Decode throughput: 40-99 tok/s (varies with output length per sample)
+# KV cache usage: 0-5% for short conversations, grows with context
+# (100% = 298k tokens, at which point requests queue)
+# Prefix cache hit: 0% for Claude Code (expected, it mutates prompt prefix)
+# >50% for OpenCode after a few turns
+# Temperature: 44-60C under load, <45C idle
+# Power: 70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+# - Waiting > 0 for extended periods → requests queuing, model overloaded
+# - KV cache usage near 100% → context too long, reduce --max-model-len
+# - Decode throughput < 20 tok/s sustained → possible thermal throttling
+# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+# → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+# → Model name in LiteLLM config must exactly match HuggingFace repo name
+#
+# 3. LiteLLM returns UnsupportedParamsError
+# → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+# → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+# → Run `claude /logout` first to clear the claude.ai session token,
+# then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+# → Normal for Claude Code (it mutates the prompt prefix each turn)
+# → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+# → Check driver version: nvidia-smi
+# → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+# docker stop vllm_qwen3
+# docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB; confirm the exact AWQ repo name on HuggingFace first)
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3_30b \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model Qwen/Qwen3-Coder-30B-AWQ \
+# --tensor-parallel-size 1 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.92 \
+# --max-model-len 131072 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# Example: full-precision model on multi-GPU (e.g. 4x H100)
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3_fp16 \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model Qwen/Qwen3-Coder-Next \
+# --tensor-parallel-size 4 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.90 \
+# --max-model-len 262144 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+# model: "hosted_vllm/<new-model-name>"
+#
+# Then restart LiteLLM (if it runs under the systemd unit, use "sudo systemctl restart litellm" instead):
+# pkill -f litellm
+# nohup /ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 --port 4000 \
+# > /ephemeral/litellm.log 2>&1 &
+#
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+# https://huggingface.co/models?search=<model-name>+awq
+# https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+# - AWQ (recommended): fast Marlin kernels, good quality
+# - GPTQ: similar to AWQ, widely available
+# - FP8: 8-bit; native FP8 compute needs Ada/Hopper or newer GPUs (e.g. H100/H200)
+# - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+# Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+# ---------------------|----------------|-------------------------------
+# 7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+# 14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+# 30-32B               | ~18 GiB        | 262k (~57 GiB for KV cache)
+# 70-80B (MoE, 3B act) | ~45 GiB        | 262k (~27 GiB for KV cache)
+# 70B (dense)          | ~38 GiB        | 131k (~37 GiB for KV cache)
+# 120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves the KV cache needed for one full-length request). If it still OOMs,
+# reduce --gpu-memory-utilization to 0.85 or try a smaller model.
+#
+# --- Verifying the new model ---
+# Check loaded model:
+# curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+# curl -s http://localhost:11434/v1/chat/completions \
+# -H "Content-Type: application/json" \
+# -H "Authorization: Bearer EMPTY" \
+# -d '{"model":"<model-name>",
+# "messages":[{"role":"user","content":"Hello"}],
+# "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+# curl -s http://localhost:4000/v1/messages \
+# -H "Content-Type: application/json" \
+# -H "x-api-key: sk-litellm-master" \
+# -H "anthropic-version: 2023-06-01" \
+# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+# "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2)
+# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited)
+# Per-turn latency: ~10-15s (small prompts, early conversation)
+# KV cache usage: 2-5% for typical coding sessions
+# Prefix cache hit rate: 0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                        | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+# -----------------------|-----------------------|----------------------
+# Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+# Decode throughput      | ~40 tok/s             | 40-99 tok/s
+# Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+# Context window         | 32k (was truncating)  | 262k (full, no truncation)
+# Prefix cache (Claude)  | 0% always             | 0% always
+# Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+# VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
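The 75 GiB figure in the last row follows from the VRAM breakdown earlier in the file (weights, KV cache, CUDA graphs). The Ruby sketch below redoes that arithmetic and converts the "GPU KV cache usage" percentage from the engine log into an approximate token count. The constants are the approximate values quoted in vllm-setup.txt; the method names are made up for illustration, and the per-token cost is only a rough average derived from those two quoted numbers.

```ruby
# Approximate figures quoted in vllm-setup.txt for one A100 80GB at
# gpu_memory_utilization=0.92; they are observations, not guarantees.
WEIGHTS_GIB        = 45.0
KV_CACHE_GIB       = 27.0
CUDA_GRAPHS_GIB    = 3.0
KV_CAPACITY_TOKENS = 298_000

# Sum of the three budget items from the hardware breakdown.
def vram_total_gib
  WEIGHTS_GIB + KV_CACHE_GIB + CUDA_GRAPHS_GIB
end

# Converts the "GPU KV cache usage" percentage from the engine log into
# an approximate number of cached tokens.
def kv_tokens_in_use(usage_percent)
  (KV_CAPACITY_TOKENS * usage_percent / 100.0).round
end

# Rough average KV cache cost per token in KiB, derived only from the
# quoted cache size and token capacity.
def kv_kib_per_token
  KV_CACHE_GIB * 1024 * 1024 / KV_CAPACITY_TOKENS
end

puts "total VRAM: #{vram_total_gib} GiB"
puts "tokens cached at 4.6% usage: #{kv_tokens_in_use(4.6)}"
puts "average KiB per token: #{kv_kib_per_token.round(1)}"
```

Swapping in the weight size of another model from the sizing table gives a quick estimate of how much of the budget is left for KV cache before starting a container.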