| author | Paul Buetow <paul@buetow.org> | 2026-03-18 09:10:14 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-03-18 09:10:14 +0200 |
| commit | d8575832ae0022f94cd786b15f8b88de0bf18672 (patch) | |
| tree | 75872514846cfddb1434281a59b6673344023ff7 /snippets/hyperstack/vllm-setup.txt | |
| parent | 8dca92ea40b191b9de367197aac7e1f882ed3d43 (diff) | |
Add vLLM + LiteLLM support; rename script; add README
- Replace Ollama (disabled by default) with vLLM Docker container +
LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434,
polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude
model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
override config at runtime
- New `test` command: end-to-end inference test over WireGuard against
vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM)
from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI
reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'snippets/hyperstack/vllm-setup.txt')
| -rw-r--r-- | snippets/hyperstack/vllm-setup.txt | 487 |
1 files changed, 487 insertions, 0 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+#   Claude Code (earth)                  Hyperstack VM (A100 80GB)
+#   ┌─────────────┐                      ┌──────────────────────────────┐
+#   │ claude CLI  │── Anthropic API ──>  │  LiteLLM proxy (:4000)       │
+#   │             │   /v1/messages       │  translates Anthropic →      │
+#   │             │   via WireGuard wg1  │  OpenAI chat completions     │
+#   └─────────────┘                      │         │                    │
+#                                        │         ▼                    │
+#   OpenCode (earth)                     │  vLLM engine (:11434)        │
+#   ┌─────────────┐                      │  /v1/chat/completions        │
+#   │ opencode    │── OpenAI API ──────> │  FlashAttention v2           │
+#   │             │  /v1/chat/completions│  prefix caching              │
+#   └─────────────┘                      │  bullpoint/Qwen3-Coder-      │
+#                                        │  Next-AWQ-4bit (45GB)        │
+#                                        └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+# - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+# - Block-level prefix caching: partial KV cache reuse even when prompt
+#   changes mid-sequence (Ollama requires exact prefix match from token 0)
+# - Chunked prefill: can interleave prefill and decode
+# - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+# - Claude Code speaks Anthropic Messages API (/v1/messages) only
+# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
+# - LiteLLM translates between them, mapping Claude model names to the
+#   actual vLLM model
+#
+# Model details:
+# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+# - Architecture: MoE, 80B total params, 3B active per token
+# - 512 experts, 10 activated + 1 shared per token
+# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+# - Quantization: AWQ 4-bit, group size 32
+# - Disk size: ~45GB (vs ~151GB at BF16)
+# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+# - Context: 262,144 tokens (256k native)
+# - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+# - Minimum: 1x A100 80GB (PCIe or SXM)
+# - VRAM breakdown at gpu_memory_utilization=0.92:
+#     Model weights: ~45 GiB
+#     KV cache:      ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+#     CUDA graphs:   ~3 GiB
+#     Total:         ~75 GiB / 80 GiB
+#
+# Ports:
+#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+#   4000/tcp  - LiteLLM Anthropic-compatible proxy
+#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+#   sudo systemctl stop ollama
+#   sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
+# on some providers but not guaranteed — model will re-download if lost).
+#
+#   sudo mkdir -p /ephemeral/hug
+#   sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+#   docker pull vllm/vllm-openai:latest
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Flags explained:
+#   --tensor-parallel-size 1          Single GPU (use 2/4 for multi-GPU setups)
+#   --enable-auto-tool-choice         Enables function/tool calling
+#   --tool-call-parser qwen3_coder    Parser for qwen3-coder tool format
+#   --enable-prefix-caching           Block-level KV cache reuse across requests
+#   --gpu-memory-utilization 0.92     Use 92% of VRAM (rest for OS/overhead)
+#   --max-model-len 262144            Full 256k context window
+#   --port 11434                      Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Quick inference test:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 requires this):
+#
+#   sudo apt-get install -y python3.12-venv
+#   sudo mkdir -p /ephemeral/litellm-env
+#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+#   python3 -m venv /ephemeral/litellm-env
+#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+#   model_list:
+#     - model_name: "claude-sonnet-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-6-20260604"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-haiku-3-5-20241022"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#
+#   litellm_settings:
+#     drop_params: true
+#
+#   general_settings:
+#     master_key: "sk-litellm-master"
+#   YAML
+#
+# Config notes:
+# - model_name values must match what Claude Code sends (Claude model IDs)
+# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+#   (not /v1/responses which vLLM doesn't fully support for complex messages)
+# - drop_params: true — silently drops Claude-specific parameters like
+#   context_management that vLLM doesn't understand
+# - master_key is the API key clients must send
+# - Add new model_name entries when Anthropic releases new model IDs
+#
+# Start LiteLLM:
+#
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 \
+#     --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup:
+#
+#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+#   [Unit]
+#   Description=LiteLLM Proxy
+#   After=network.target docker.service
+#   Requires=docker.service
+#
+#   [Service]
+#   Type=simple
+#   User=ubuntu
+#   ExecStart=/ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000
+#   Restart=always
+#   RestartSec=5
+#
+#   [Install]
+#   WantedBy=multi-user.target
+#   UNIT
+#
+#   sudo systemctl daemon-reload
+#   sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+#     comment 'vLLM via wg1'
+#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+#     comment 'LiteLLM proxy via wg1'
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#   ANTHROPIC_API_KEY=sk-litellm-master \
+#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#     ANTHROPIC_API_KEY=sk-litellm-master \
+#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+#   OPENAI_API_KEY=EMPTY \
+#   opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+# - Avg prompt throughput: prefill speed (tokens/s), higher = faster
+# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+# - GPU KV cache usage: % of KV cache memory in use (proportional to
+#   active context length vs max capacity)
+# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
+#   Claude Code, higher for OpenCode)
+# - Running/Waiting: active and queued request counts
+#
+# Follow live (all stats):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+#     Avg generation throughput: 49.4 tokens/s,
+#     Running: 1 reqs, Waiting: 0 reqs,
+#     GPU KV cache usage: 4.6%,
+#     Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (method, status, duration):
+#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log ---
+#   tail -f /ephemeral/litellm.log
+#
+# --- GPU hardware stats ---
+# Snapshot:
+#   nvidia-smi
+#
+# Continuous (every 5 seconds):
+#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+#     --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+#   Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill)
+#   Decode throughput:  40-99 tok/s (varies with output length per sample)
+#   KV cache usage:     0-5% for short conversations, grows with context
+#                       (100% = 298k tokens, at which point requests queue)
+#   Prefix cache hit:   0% for Claude Code (expected, it mutates prompt prefix)
+#                       >50% for OpenCode after a few turns
+#   Temperature:        44-60C under load, <45C idle
+#   Power:              70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+# - Waiting > 0 for extended periods → requests queuing, model overloaded
+# - KV cache usage near 100% → context too long, reduce --max-model-len
+# - Decode throughput < 20 tok/s sustained → possible thermal throttling
+# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+#    → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+#    → Model name in LiteLLM config must exactly match HuggingFace repo name
+#
+# 3. LiteLLM returns UnsupportedParamsError
+#    → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+#    → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+#    → Run `claude /logout` first to clear the claude.ai session token,
+#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+#    → Normal for Claude Code (it mutates the prompt prefix each turn)
+#    → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+#    → Check driver version: nvidia-smi
+#    → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+#   docker stop vllm_qwen3
+#   docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_30b \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-30B-AWQ \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 131072 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Example: full-precision model on multi-GPU (e.g. 4x H100)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_fp16 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-Next \
+#     --tensor-parallel-size 4 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.90 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+#   model: "hosted_vllm/<new-model-name>"
+#
+# Then restart LiteLLM:
+#   pkill -f litellm
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+#   https://huggingface.co/models?search=<model-name>+awq
+#   https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+# - AWQ (recommended): fast Marlin kernels, good quality
+# - GPTQ: similar to AWQ, widely available
+# - FP8: 8-bit, needs Hopper+ GPUs (H100/H200)
+# - BF16/FP16: full precision, needs more VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+#   ---------------------|----------------|-------------------------------
+#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+#   30-32B               | ~18 GiB        | 262k (~57 GiB for KV cache)
+#   70-80B (MoE, 3B act) | ~45 GiB        | 262k (~27 GiB for KV cache)
+#   70B (dense)          | ~38 GiB        | 131k (~37 GiB for KV cache)
+#   120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves KV cache memory). If still OOM, reduce --gpu-memory-utilization
+# to 0.85 or try a smaller model.
+#
+# --- Verifying the new model ---
+# Check loaded model:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"<model-name>",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+#   vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2)
+#   vLLM decode throughput:  40-99 tok/s (memory-bandwidth limited)
+#   Per-turn latency:        ~10-15s (small prompts, early conversation)
+#   KV cache usage:          2-5% for typical coding sessions
+#   Prefix cache hit rate:   0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+#   -----------------------|-----------------------|----------------------
+#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
+#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
+#   Prefix cache (Claude)  | 0% always             | 0% always
+#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
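The commit message says the vLLM setup polls until the model is loaded, and the setup notes verify this by hand with `curl` against `/v1/models`. A minimal bash sketch of such a readiness loop follows. The function name `wait_for_vllm`, its argument order, and its default timeout and interval are illustrative assumptions; this is not the code in `hyperstack.rb`.

```shell
#!/usr/bin/env bash
# Sketch: wait until a vLLM server lists the expected model on /v1/models.
# wait_for_vllm and its defaults are hypothetical, chosen for illustration.

wait_for_vllm() {
  local base_url="$1"          # e.g. http://localhost:11434
  local model="$2"             # e.g. bullpoint/Qwen3-Coder-Next-AWQ-4bit
  local timeout_s="${3:-600}"  # give up after this many seconds (assumed default)
  local interval_s="${4:-10}"  # seconds between attempts (must be > 0)
  local waited=0

  while [ "$waited" -lt "$timeout_s" ]; do
    # curl -f fails on HTTP errors; grep -F matches the quoted model ID literally.
    if curl -sf "${base_url}/v1/models" 2>/dev/null | grep -qF "\"${model}\""; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done

  echo "timed out after ${timeout_s}s" >&2
  return 1
}
```

A possible invocation on the VM would be `wait_for_vllm http://localhost:11434 bullpoint/Qwen3-Coder-Next-AWQ-4bit 600 10`. Matching on the quoted model ID rather than on HTTP status alone is a design choice: it only reports success once the endpoint actually names the model the caller expects.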

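The LiteLLM config in the setup notes repeats the same `litellm_params` block for every Claude model alias, and the notes advise adding entries as new model IDs appear. One way to avoid hand-editing that repetition is to generate the `model_list` section, sketched below in bash. The helper name `litellm_model_list` is hypothetical; the emitted keys are the ones already used in the config from the notes.

```shell
#!/usr/bin/env bash
# Sketch: print a LiteLLM model_list mapping several alias names to one vLLM model.
# litellm_model_list is a hypothetical helper, not part of LiteLLM or hyperstack.rb.

litellm_model_list() {
  local backend="$1"   # HuggingFace repo name served by vLLM
  local api_base="$2"  # e.g. http://localhost:11434/v1
  shift 2              # remaining arguments are the alias names

  echo "model_list:"
  local name
  for name in "$@"; do
    printf '  - model_name: "%s"\n' "$name"
    printf '    litellm_params:\n'
    printf '      model: "hosted_vllm/%s"\n' "$backend"
    printf '      api_base: "%s"\n' "$api_base"
    printf '      api_key: "EMPTY"\n'
  done
}
```

For example, `litellm_model_list bullpoint/Qwen3-Coder-Next-AWQ-4bit http://localhost:11434/v1 claude-sonnet-4-20250514 claude-opus-4-20250514` prints two entries. The `litellm_settings` and `general_settings` sections from the notes would still need to be appended to the generated file.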