Add vLLM + LiteLLM support; rename script; add README

- Replace Ollama (disabled by default) with vLLM Docker container + LiteLLM
  Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434, polls
  until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude model
  aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
  override config at runtime
- New `test` command: end-to-end inference test over WireGuard against vLLM
  (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM) from the
  WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI reference,
  monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
author: Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
committer: Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
commit: d8575832ae0022f94cd786b15f8b88de0bf18672 (patch)
tree: 75872514846cfddb1434281a59b6673344023ff7 /snippets/hyperstack/vllm-setup.txt
parent: 8dca92ea40b191b9de367197aac7e1f882ed3d43 (diff)
1 files changed, 487 insertions, 0 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+#   Claude Code (earth)                    Hyperstack VM (A100 80GB)
+#   ┌─────────────┐                       ┌──────────────────────────────┐
+#   │ claude CLI   │── Anthropic API ──>  │ LiteLLM proxy (:4000)       │
+#   │              │   /v1/messages        │   translates Anthropic →    │
+#   │              │   via WireGuard wg1   │   OpenAI chat completions   │
+#   └─────────────┘                       │         │                    │
+#                                         │         ▼                    │
+#   OpenCode (earth)                      │ vLLM engine (:11434)        │
+#   ┌─────────────┐                       │   /v1/chat/completions      │
+#   │ opencode     │── OpenAI API ──────> │   FlashAttention v2         │
+#   │              │   /v1/chat/completions│   prefix caching            │
+#   └─────────────┘                       │   bullpoint/Qwen3-Coder-    │
+#                                         │     Next-AWQ-4bit (45GB)    │
+#                                         └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+#   - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+#   - Block-level prefix caching: partial KV cache reuse even when prompt
+#     changes mid-sequence (Ollama requires exact prefix match from token 0)
+#   - Chunked prefill: can interleave prefill and decode
+#   - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+#   - Claude Code speaks Anthropic Messages API (/v1/messages) only
+#   - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
+#   - LiteLLM translates between them, mapping Claude model names to the
+#     actual vLLM model
+#
+# Model details:
+#   - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+#   - Architecture: MoE, 80B total params, 3B active per token
+#   - 512 experts, 10 activated + 1 shared per token
+#   - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+#   - Quantization: AWQ 4-bit, group size 32
+#   - Disk size: ~45GB (vs ~151GB at BF16)
+#   - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+#   - Context: 262,144 tokens (256k native)
+#   - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+#   - Minimum: 1x A100 80GB (PCIe or SXM)
+#   - VRAM breakdown at gpu_memory_utilization=0.92:
+#       Model weights:  ~45 GiB
+#       KV cache:       ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+#       CUDA graphs:    ~3 GiB
+#       Total:          ~75 GiB / 80 GiB
+#
+# Ports:
+#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+#   4000/tcp  - LiteLLM Anthropic-compatible proxy
+#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+#   sudo systemctl stop ollama
+#   sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
+# on some providers but not guaranteed — model will re-download if lost).
+#
+#   sudo mkdir -p /ephemeral/hug
+#   sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+#   docker pull vllm/vllm-openai:latest
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Flags explained:
+#   --tensor-parallel-size 1    Single GPU (use 2/4 for multi-GPU setups)
+#   --enable-auto-tool-choice   Enables function/tool calling
+#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
+#   --enable-prefix-caching     Block-level KV cache reuse across requests
+#   --gpu-memory-utilization 0.92   Use 92% of VRAM (rest for OS/overhead)
+#   --max-model-len 262144      Full 256k context window
+#   --port 11434                Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
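+#
+# To wait unattended instead of watching the log, the check above can be
+# wrapped in a polling loop. This is a minimal sketch, not part of the
+# original setup: wait_until is a hypothetical helper name, and the 600 s
+# timeout and 10 s interval in the usage line are assumed values chosen to
+# sit comfortably above the ~4-5 minute cold start.

```shell
# wait_until TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 once TIMEOUT_S seconds have elapsed.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Usage (assumed values): poll the vLLM model list for up to 10 minutes.
#   wait_until 600 10 curl -sf http://localhost:11434/v1/models && echo "vLLM ready"
```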
+#
+# Quick inference test:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 marks the system Python as externally
+# managed per PEP 668, so pip refuses to install packages outside a venv):
+#
+#   sudo apt-get install -y python3.12-venv
+#   sudo mkdir -p /ephemeral/litellm-env
+#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+#   python3 -m venv /ephemeral/litellm-env
+#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+#   model_list:
+#     - model_name: "claude-sonnet-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-6-20260604"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-3-5-haiku-20241022"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#
+#   litellm_settings:
+#     drop_params: true
+#
+#   general_settings:
+#     master_key: "sk-litellm-master"
+#   YAML
+#
+# Config notes:
+#   - model_name values must match what Claude Code sends (Claude model IDs)
+#   - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+#     (not /v1/responses which vLLM doesn't fully support for complex messages)
+#   - drop_params: true — silently drops Claude-specific parameters like
+#     context_management that vLLM doesn't understand
+#   - master_key is the API key clients must send
+#   - Add new model_name entries when Anthropic releases new model IDs
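+#
+# Because the entries differ only in model_name, a new alias can be generated
+# rather than copied by hand. A minimal sketch: litellm_entry is a
+# hypothetical helper, and its default model and api_base are simply the
+# values already used in the config above.

```shell
# litellm_entry ALIAS [HF_MODEL] [API_BASE]
# Prints one model_list entry that maps ALIAS to the vLLM-served model.
litellm_entry() {
  alias_name=$1
  model=${2:-bullpoint/Qwen3-Coder-Next-AWQ-4bit}
  api_base=${3:-http://localhost:11434/v1}
  printf '  - model_name: "%s"\n' "$alias_name"
  printf '    litellm_params:\n'
  printf '      model: "hosted_vllm/%s"\n' "$model"
  printf '      api_base: "%s"\n' "$api_base"
  printf '      api_key: "EMPTY"\n'
}

# Usage: print an entry for an alias that Claude Code is known to send.
#   litellm_entry "claude-sonnet-4-20250514"
```
+#
+# Paste the output under model_list: in litellm-config.yaml, keeping the
+# two-space indentation, then restart LiteLLM.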
+#
+# Start LiteLLM:
+#
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 \
+#     --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup:
+#
+#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+#   [Unit]
+#   Description=LiteLLM Proxy
+#   After=network.target docker.service
+#   Requires=docker.service
+#
+#   [Service]
+#   Type=simple
+#   User=ubuntu
+#   ExecStart=/ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000
+#   Restart=always
+#   RestartSec=5
+#
+#   [Install]
+#   WantedBy=multi-user.target
+#   UNIT
+#
+#   sudo systemctl daemon-reload
+#   sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+#     comment 'vLLM via wg1'
+#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+#     comment 'LiteLLM proxy via wg1'
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#   ANTHROPIC_API_KEY=sk-litellm-master \
+#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#     ANTHROPIC_API_KEY=sk-litellm-master \
+#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+#   OPENAI_API_KEY=EMPTY \
+#   opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+#   - Avg prompt throughput:     prefill speed (tokens/s), higher = faster
+#   - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+#   - GPU KV cache usage:        % of KV cache memory in use (proportional to
+#                                 active context length vs max capacity)
+#   - Prefix cache hit rate:     % of prompt tokens served from cache (0% for
+#                                 Claude Code, higher for OpenCode)
+#   - Running/Waiting:           active and queued request counts
+#
+# Follow live (all stats):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+#               Avg generation throughput: 49.4 tokens/s,
+#               Running: 1 reqs, Waiting: 0 reqs,
+#               GPU KV cache usage: 4.6%,
+#               Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (client address, method, path, status):
+#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log ---
+#   tail -f /ephemeral/litellm.log     (when started with nohup)
+#   journalctl -u litellm -f           (when running as the systemd service)
+#
+# --- GPU hardware stats ---
+# Snapshot:
+#   nvidia-smi
+#
+# Continuous (every 5 seconds):
+#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+#     --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+#   Prefill throughput:   5,000-11,000 tok/s (bursts higher during batch prefill)
+#   Decode throughput:    40-99 tok/s (varies with output length per sample)
+#   KV cache usage:       0-5% for short conversations, grows with context
+#                         (100% = 298k tokens, at which point requests queue)
+#   Prefix cache hit:     0% for Claude Code (expected, it mutates prompt prefix)
+#                         >50% for OpenCode after a few turns
+#   Temperature:          44-60C under load, <45C idle
+#   Power:                70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+#   - Waiting > 0 for extended periods → requests queuing, model overloaded
+#   - KV cache usage near 100% → context too long, reduce --max-model-len
+#   - Decode throughput < 20 tok/s sustained → possible thermal throttling
+#   - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
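+#
+# The first three warning signs can be checked mechanically from the
+# "Engine 000" log lines. A minimal sketch, assuming each stats record is a
+# single log line using the field names from the example output above.
+# engine_warnings is a hypothetical helper, the 95% KV threshold is an
+# arbitrary stand-in for "near 100%", and one sample cannot show whether low
+# decode throughput is sustained.

```shell
# engine_warnings: reads vLLM log lines on stdin, prints one WARN line per
# threshold breached. Lines that do not match the stats format are ignored.
engine_warnings() {
  sed -n 's|.*Avg generation throughput: \([0-9.]*\) tokens/s.*Running: \([0-9]*\) reqs.*Waiting: \([0-9]*\) reqs.*GPU KV cache usage: \([0-9.]*\)%.*|\1 \2 \3 \4|p' |
  awk '{
    gen = $1; running = $2; waiting = $3; kv = $4
    if (waiting > 0)             print "WARN: " waiting " request(s) waiting"
    if (kv >= 95)                print "WARN: KV cache usage at " kv "%"
    # Only flag slow decode while a request is actually running (idle shows 0.0).
    if (running > 0 && gen < 20) print "WARN: decode throughput " gen " tok/s"
  }'
}

# Usage: check the last minute of engine stats.
#   docker logs --since 1m vllm_qwen3 2>&1 | engine_warnings
```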
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+#    → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+#    → Model name in LiteLLM config must exactly match HuggingFace repo name
+#
+# 3. LiteLLM returns UnsupportedParamsError
+#    → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+#    → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+#    → Run `claude /logout` first to clear the claude.ai session token,
+#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+#    → Normal for Claude Code (it mutates the prompt prefix each turn)
+#    → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+#    → Check driver version: nvidia-smi
+#    → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+#   docker stop vllm_qwen3
+#   docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: Qwen3-Coder 30B (smaller, faster, fits easily on A100 80GB).
+# Confirm the exact AWQ repo ID on HuggingFace before using the --model value.
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_30b \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-30B-AWQ \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 131072 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Example: full-precision model on multi-GPU (e.g. 4x H100)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_fp16 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-Next \
+#     --tensor-parallel-size 4 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.90 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+#   model: "hosted_vllm/<new-model-name>"
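+#
+# The edit can be scripted so that all aliases switch at once. A minimal
+# sketch: set_litellm_model is a hypothetical helper, and it assumes the
+# config keeps the exact  model: "hosted_vllm/..."  layout from Step 4 and
+# that the new model name contains no | or & characters. The config was
+# written with sudo tee, so run this as root or chown the file first.

```shell
# set_litellm_model CONFIG NEW_MODEL
# Rewrites every  model: "hosted_vllm/..."  line in CONFIG to use NEW_MODEL.
set_litellm_model() {
  config=$1
  new_model=$2
  tmp=$(mktemp) || return 1
  status=0
  # Write to a temp file first, then overwrite CONFIG in place so that its
  # owner and permissions are preserved.
  if sed "s|\(model: \"hosted_vllm/\)[^\"]*\"|\1${new_model}\"|" "$config" > "$tmp"; then
    cat "$tmp" > "$config" || status=1
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Usage:
#   set_litellm_model /ephemeral/litellm-config.yaml "Qwen/Qwen3-Coder-Next"
```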
+#
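+# Then restart LiteLLM. If it runs as the systemd service from Step 4, use
+# `sudo systemctl restart litellm` instead of the pkill/nohup commands below: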
+#   pkill -f litellm
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+#   https://huggingface.co/models?search=<model-name>+awq
+#   https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+#   - AWQ (recommended): fast Marlin kernels, good quality
+#   - GPTQ: similar to AWQ, widely available
+#   - FP8: 8-bit, native FP8 compute needs Ada Lovelace or Hopper+ GPUs (e.g. H100/H200)
+#   - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+#   ---------------------|----------------|-------------------------------
+#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+#   30-32B               | ~18 GiB        | 262k  (~57 GiB for KV cache)
+#   70-80B (MoE, 3B act) | ~45 GiB        | 262k  (~27 GiB for KV cache)
+#   70B (dense)          | ~38 GiB        | 131k  (~37 GiB for KV cache)
+#   120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves the KV cache needed to hold one full-length request). If it still
+# OOMs, reduce --gpu-memory-utilization to 0.85 or try a smaller model.
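+#
+# The KV figures in the table are plain subtraction from the ~75 GiB budget.
+# A sketch of that arithmetic, taking the table's rounded numbers as given:
+# kv_headroom_gib is a hypothetical helper, inputs are whole GiB, and the
+# optional third argument is the ~3 GiB CUDA graph reservation from the
+# hardware breakdown (only the MoE row subtracts it).

```shell
# kv_headroom_gib USABLE_GIB WEIGHTS_GIB [RESERVED_GIB]
# Prints the GiB left for KV cache; a negative result means it will not fit.
kv_headroom_gib() {
  usable=$1
  weights=$2
  reserved=${3:-0}
  echo $((usable - weights - reserved))
}

kv_headroom_gib 75 18      # 30-32B row: prints 57
kv_headroom_gib 75 38      # 70B dense row: prints 37
kv_headroom_gib 75 45 3    # 80B MoE row, minus CUDA graphs: prints 27
```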
+#
+# --- Verifying the new model ---
+# Check loaded model:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"<model-name>",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+#   vLLM prefill throughput:    5,000-11,000 tok/s (FlashAttention v2)
+#   vLLM decode throughput:     40-99 tok/s (memory-bandwidth limited)
+#   Per-turn latency:           ~10-15s (small prompts, early conversation)
+#   KV cache usage:             2-5% for typical coding sessions
+#   Prefix cache hit rate:      0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+#   -----------------------|-----------------------|----------------------
+#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
+#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
+#   Prefix cache (Claude)  | 0% always             | 0% always
+#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)