author Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
committer Paul Buetow <paul@buetow.org> 2026-03-18 09:10:14 +0200
commit d8575832ae0022f94cd786b15f8b88de0bf18672
tree 75872514846cfddb1434281a59b6673344023ff7 /snippets/hyperstack/vllm-setup.txt
parent 8dca92ea40b191b9de367197aac7e1f882ed3d43
Add vLLM + LiteLLM support; rename script; add README
- Replace Ollama (disabled by default) with vLLM Docker container + LiteLLM
  Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434, polls
  until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude model
  aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
  override config at runtime
- New `test` command: end-to-end inference test over WireGuard against vLLM
  (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM) from the
  WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI reference,
  monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Diffstat (limited to 'snippets/hyperstack/vllm-setup.txt')
-rw-r--r-- snippets/hyperstack/vllm-setup.txt 487
1 files changed, 487 insertions, 0 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+# Claude Code (earth)                    Hyperstack VM (A100 80GB)
+# ┌─────────────┐                        ┌──────────────────────────────┐
+# │ claude CLI  │── Anthropic API ──>    │ LiteLLM proxy (:4000)        │
+# │             │   /v1/messages         │ translates Anthropic →       │
+# │             │   via WireGuard wg1    │ OpenAI chat completions      │
+# └─────────────┘                        │          │                   │
+#                                        │          ▼                   │
+# OpenCode (earth)                       │ vLLM engine (:11434)         │
+# ┌─────────────┐                        │ /v1/chat/completions         │
+# │ opencode    │── OpenAI API ──────>   │ FlashAttention v2            │
+# │             │   /v1/chat/completions │ prefix caching               │
+# └─────────────┘                        │ bullpoint/Qwen3-Coder-       │
+#                                        │ Next-AWQ-4bit (45GB)         │
+#                                        └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+# - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+# - Block-level prefix caching: partial KV cache reuse even when prompt
+# changes mid-sequence (Ollama requires exact prefix match from token 0)
+# - Chunked prefill: can interleave prefill and decode
+# - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+# - Claude Code speaks only the Anthropic Messages API (/v1/messages)
+# - vLLM is addressed here through its OpenAI-compatible Chat Completions
+# API (/v1/chat/completions)
+# - LiteLLM translates between them, mapping Claude model names to the
+# actual vLLM model
+#
+# Model details:
+# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+# - Architecture: MoE, 80B total params, 3B active per token
+# - 512 experts, 10 activated + 1 shared per token
+# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+# - Quantization: AWQ 4-bit, group size 32
+# - Disk size: ~45GB (vs ~151GB at BF16)
+# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+# - Context: 262,144 tokens (256k native)
+# - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+# - Minimum: 1x A100 80GB (PCIe or SXM)
+# - VRAM breakdown at gpu_memory_utilization=0.92:
+# Model weights: ~45 GiB
+# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+# CUDA graphs: ~3 GiB
+# Total: ~75 GiB / 80 GiB
+#
+# Ports:
+# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+# 4000/tcp - LiteLLM Anthropic-compatible proxy
+# Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+# sudo systemctl stop ollama
+# sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
+# on some providers but not guaranteed — model will re-download if lost).
+#
+# sudo mkdir -p /ephemeral/hug
+# sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+# docker pull vllm/vllm-openai:latest
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3 \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+# --tensor-parallel-size 1 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.92 \
+# --max-model-len 262144 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# Flags explained:
+# --tensor-parallel-size 1        Single GPU (use 2/4 for multi-GPU setups)
+# --enable-auto-tool-choice       Enables function/tool calling
+# --tool-call-parser qwen3_coder  Parser for qwen3-coder tool format
+# --enable-prefix-caching         Block-level KV cache reuse across requests
+# --gpu-memory-utilization 0.92   Use 92% of VRAM (rest for OS/overhead)
+# --max-model-len 262144          Full 256k context window
+# --port 11434                    Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+# curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Quick inference test:
+# curl -s http://localhost:11434/v1/chat/completions \
+# -H "Content-Type: application/json" \
+# -H "Authorization: Bearer EMPTY" \
+# -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+# "messages":[{"role":"user","content":"Hello"}],
+# "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
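The startup check in Step 3 can be scripted instead of watched by hand. The following is a minimal polling sketch, not the logic used by hyperstack.rb: `wait_until` is a hypothetical helper name, and the 600-second budget in the usage comment only mirrors the ten-minute allowance mentioned in the commit message.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or the timeout elapses.
# Returns 0 on success, 1 on timeout. (Hypothetical helper, not part of vLLM.)
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}

# Example (assumes the container from Step 3 is starting on port 11434):
#   wait_until 600 10 curl -sf http://localhost:11434/v1/models && echo "vLLM ready"
```

Passing the probe as a plain command keeps the helper reusable: the same function can poll the LiteLLM port once the proxy is installed.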
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 marks the system Python as
+# externally managed, so a system-wide pip install is refused):
+#
+# sudo apt-get install -y python3.12-venv
+# sudo mkdir -p /ephemeral/litellm-env
+# sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+# python3 -m venv /ephemeral/litellm-env
+# /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+# sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+# model_list:
+#   - model_name: "claude-sonnet-4-20250514"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-opus-4-20250514"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-opus-4-6-20260604"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#   - model_name: "claude-3-5-haiku-20241022"
+#     litellm_params:
+#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#       api_base: "http://localhost:11434/v1"
+#       api_key: "EMPTY"
+#
+# litellm_settings:
+#   drop_params: true
+#
+# general_settings:
+#   master_key: "sk-litellm-master"
+# YAML
+#
+# Config notes:
+# - model_name values must match what Claude Code sends (Claude model IDs)
+# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+# (not /v1/responses which vLLM doesn't fully support for complex messages)
+# - drop_params: true — silently drops Claude-specific parameters like
+# context_management that vLLM doesn't understand
+# - master_key is the API key clients must send
+# - Add new model_name entries when Anthropic releases new model IDs
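Because every alias in the config points at the same backend, the `model_list` section can be generated rather than copied by hand when a new Claude model ID appears. This is a minimal sketch under that assumption; `emit_model_list` is a hypothetical helper, and the `api_base` and `api_key` values are simply the ones used in the config above.

```shell
# emit_model_list BACKEND_MODEL ALIAS [ALIAS...]
# Prints a LiteLLM model_list section that maps each alias to the same
# vLLM-served model. (Hypothetical helper; values mirror the config above.)
emit_model_list() {
  backend=$1
  shift
  printf 'model_list:\n'
  for name in "$@"; do
    printf '  - model_name: "%s"\n' "$name"
    printf '    litellm_params:\n'
    printf '      model: "hosted_vllm/%s"\n' "$backend"
    printf '      api_base: "http://localhost:11434/v1"\n'
    printf '      api_key: "EMPTY"\n'
  done
}

# Example:
#   emit_model_list bullpoint/Qwen3-Coder-Next-AWQ-4bit \
#     claude-sonnet-4-20250514 claude-opus-4-20250514
```

The output covers only `model_list`; the `litellm_settings` and `general_settings` sections still have to be appended as shown in the config.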
+#
+# Start LiteLLM:
+#
+# nohup /ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 \
+# --port 4000 \
+# > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+# curl -s http://localhost:4000/v1/messages \
+# -H "Content-Type: application/json" \
+# -H "x-api-key: sk-litellm-master" \
+# -H "anthropic-version: 2023-06-01" \
+# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+# "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup (stop the nohup
+# instance first, since both would bind port 4000):
+#
+# sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+# [Unit]
+# Description=LiteLLM Proxy
+# After=network.target docker.service
+# Requires=docker.service
+#
+# [Service]
+# Type=simple
+# User=ubuntu
+# ExecStart=/ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 --port 4000
+# Restart=always
+# RestartSec=5
+#
+# [Install]
+# WantedBy=multi-user.target
+# UNIT
+#
+# sudo systemctl daemon-reload
+# sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+# comment 'vLLM via wg1'
+# sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+# comment 'LiteLLM proxy via wg1'
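The two rules in Step 5 differ only in port and comment, so they can be produced from a short list and reviewed before anything touches the firewall. A minimal dry-run sketch follows; `print_ufw_rules` and its `PORT=COMMENT` argument format are assumptions made for illustration, and the function only prints commands, it never runs `ufw` itself.

```shell
# print_ufw_rules SUBNET PORT=COMMENT [PORT=COMMENT...]
# Prints one "ufw allow" command per port, restricted to SUBNET.
# Dry run only: nothing is executed. (Hypothetical helper.)
print_ufw_rules() {
  subnet=$1
  shift
  for spec in "$@"; do
    port=${spec%%=*}      # text before the first "="
    comment=${spec#*=}    # text after the first "="
    printf "sudo ufw allow from %s to any port %s proto tcp comment '%s'\n" \
      "$subnet" "$port" "$comment"
  done
}

# Example:
#   print_ufw_rules 192.168.3.0/24 '11434=vLLM via wg1' '4000=LiteLLM proxy via wg1'
```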
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+# ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+# ANTHROPIC_API_KEY=sk-litellm-master \
+# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+# alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+# ANTHROPIC_API_KEY=sk-litellm-master \
+# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+# OPENAI_API_KEY=EMPTY \
+# opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+# - Avg prompt throughput: prefill speed (tokens/s), higher = faster
+# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+# - GPU KV cache usage: % of KV cache memory in use (proportional to
+# active context length vs max capacity)
+# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
+# Claude Code, higher for OpenCode)
+# - Running/Waiting: active and queued request counts
+#
+# Follow live (all stats):
+# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+# Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+# Avg generation throughput: 49.4 tokens/s,
+# Running: 1 reqs, Waiting: 0 reqs,
+# GPU KV cache usage: 4.6%,
+# Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (client address, method, path, status):
+# docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log ---
+# Started with nohup:
+# tail -f /ephemeral/litellm.log
+# Running as the systemd service from Step 4:
+# journalctl -u litellm -f
+#
+# --- GPU hardware stats ---
+# Snapshot:
+# nvidia-smi
+#
+# Continuous (every 5 seconds):
+# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+# --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill)
+# Decode throughput: 40-99 tok/s (varies with output length per sample)
+# KV cache usage: 0-5% for short conversations, grows with context
+# (100% = 298k tokens, at which point requests queue)
+# Prefix cache hit: 0% for Claude Code (expected, it mutates prompt prefix)
+# >50% for OpenCode after a few turns
+# Temperature: 44-60C under load, <45C idle
+# Power: 70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+# - Waiting > 0 for extended periods → requests queuing, model overloaded
+# - KV cache usage near 100% → context too long, reduce --max-model-len
+# - Decode throughput < 20 tok/s sustained → possible thermal throttling
+# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
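The warning signs above can be checked mechanically against a single "Engine 000" log line. The sketch below assumes the log line has the field labels shown in the example output; `extract_metric` and `engine_warnings` are hypothetical helpers, and the 95% KV cache threshold is an assumed reading of "near 100%". It inspects one sample only, so it cannot tell a sustained problem from a momentary dip.

```shell
# extract_metric LABEL
# Reads one log line on stdin and prints the number that follows "LABEL: ".
extract_metric() {
  sed -n "s/.*$1: \([0-9.]*\).*/\1/p"
}

# engine_warnings LOG_LINE
# Prints one WARN line per threshold crossed; prints nothing when healthy.
engine_warnings() {
  line=$1
  running=$(printf '%s\n' "$line" | extract_metric 'Running')
  waiting=$(printf '%s\n' "$line" | extract_metric 'Waiting')
  decode=$(printf '%s\n' "$line" | extract_metric 'Avg generation throughput')
  kv=$(printf '%s\n' "$line" | extract_metric 'GPU KV cache usage')
  awk -v r="$running" -v w="$waiting" -v d="$decode" -v k="$kv" 'BEGIN {
    if (w + 0 > 0)               print "WARN: requests waiting in queue"
    # Decode speed is only meaningful while a request is running.
    if (r + 0 > 0 && d + 0 < 20) print "WARN: decode throughput below 20 tok/s"
    if (k + 0 >= 95)             print "WARN: KV cache nearly full"   # assumed threshold
  }'
}

# Example:
#   engine_warnings "$(docker logs --since 1m vllm_qwen3 2>&1 | grep 'Engine 000' | tail -n 1)"
```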
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+# → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+# → The part after "hosted_vllm/" in the LiteLLM config must exactly match
+# the --model value vLLM was started with (the HuggingFace repo name)
+#
+# 3. LiteLLM returns UnsupportedParamsError
+# → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+# → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+# → Run `claude /logout` first to clear the claude.ai session token,
+# then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+# → Normal for Claude Code (it mutates the prompt prefix each turn)
+# → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+# → Check driver version: nvidia-smi
+# → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+# docker stop vllm_qwen3
+# docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3_30b \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model Qwen/Qwen3-Coder-30B-AWQ \
+# --tensor-parallel-size 1 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.92 \
+# --max-model-len 131072 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# Example: unquantized BF16 model on multi-GPU (e.g. 4x H100)
+#
+# docker run -d \
+# --gpus all \
+# --ipc=host \
+# --network host \
+# --name vllm_qwen3_fp16 \
+# --restart always \
+# -v /ephemeral/hug:/root/.cache/huggingface \
+# vllm/vllm-openai:latest \
+# --model Qwen/Qwen3-Coder-Next \
+# --tensor-parallel-size 4 \
+# --enable-auto-tool-choice \
+# --tool-call-parser qwen3_coder \
+# --enable-prefix-caching \
+# --gpu-memory-utilization 0.90 \
+# --max-model-len 262144 \
+# --host 0.0.0.0 \
+# --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+# model: "hosted_vllm/<new-model-name>"
+#
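+# Then restart LiteLLM (if it runs as the systemd service from Step 4, use
+# `sudo systemctl restart litellm` instead of the commands below):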
+# pkill -f litellm
+# nohup /ephemeral/litellm-env/bin/litellm \
+# --config /ephemeral/litellm-config.yaml \
+# --host 0.0.0.0 --port 4000 \
+# > /ephemeral/litellm.log 2>&1 &
+#
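Updating the `model` field by hand in every alias entry is easy to get wrong. The sketch below rewrites all of them in one pass; `set_backend_model` is a hypothetical helper, and it assumes each entry uses the quoted `model: "hosted_vllm/..."` form from the config in Step 4. Since that file was written with `sudo tee`, the function may need to run as root.

```shell
# set_backend_model CONFIG_FILE NEW_MODEL
# Replaces the model behind every hosted_vllm/ entry in CONFIG_FILE.
# model_name aliases are left untouched. (Hypothetical helper.)
set_backend_model() {
  config=$1
  new_model=$2
  tmp=$(mktemp) || return 1
  # "|" is the sed delimiter because model names contain "/".
  sed "s|model: \"hosted_vllm/[^\"]*\"|model: \"hosted_vllm/$new_model\"|" \
    "$config" > "$tmp" && cat "$tmp" > "$config"
  status=$?
  rm -f "$tmp"
  return $status
}

# Example:
#   set_backend_model /ephemeral/litellm-config.yaml Qwen/Qwen3-Coder-Next
```

Writing back with `cat` rather than `mv` keeps the original file's owner and permissions.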
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+# https://huggingface.co/models?search=<model-name>+awq
+# https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+# - AWQ (recommended): fast Marlin kernels, good quality
+# - GPTQ: similar to AWQ, widely available
+# - FP8: 8-bit; native FP8 compute needs Ada Lovelace or Hopper GPUs
+# (e.g. H100/H200)
+# - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+# Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+# ---------------------|----------------|-------------------------------
+# 7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+# 14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+# 30-32B               | ~18 GiB        | 262k (~57 GiB for KV cache)
+# 70-80B (MoE, 3B act) | ~45 GiB        | 262k (~27 GiB for KV cache)
+# 70B (dense)          | ~38 GiB        | 131k (~37 GiB for KV cache)
+# 120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves the KV cache needed to hold one full-length request). If still OOM,
+# reduce --gpu-memory-utilization to 0.85 or try a smaller model.
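The "halve --max-model-len first" advice amounts to walking down a ladder of context lengths until the container starts. A minimal sketch of that ladder is below; `fallback_lengths` is a hypothetical helper, and the floor value is an arbitrary stopping point chosen for illustration, not a vLLM limit.

```shell
# fallback_lengths START FLOOR
# Prints START, START/2, START/4, ... on one line, stopping before the
# value drops below FLOOR. (Hypothetical helper.)
fallback_lengths() {
  len=$1
  floor=$2
  out=""
  while [ "$len" -ge "$floor" ]; do
    out="${out:+$out }$len"
    len=$(( len / 2 ))
  done
  printf '%s\n' "$out"
}

# Example:
#   fallback_lengths 262144 65536
#   # prints: 262144 131072 65536
```

Each value can be tried in turn as the `--max-model-len` argument; the first one that lets vLLM finish startup is the largest context the remaining VRAM supports.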
+#
+# --- Verifying the new model ---
+# Check loaded model:
+# curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+# curl -s http://localhost:11434/v1/chat/completions \
+# -H "Content-Type: application/json" \
+# -H "Authorization: Bearer EMPTY" \
+# -d '{"model":"<model-name>",
+# "messages":[{"role":"user","content":"Hello"}],
+# "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+# curl -s http://localhost:4000/v1/messages \
+# -H "Content-Type: application/json" \
+# -H "x-api-key: sk-litellm-master" \
+# -H "anthropic-version: 2023-06-01" \
+# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+# "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2)
+# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited)
+# Per-turn latency: ~10-15s (small prompts, early conversation)
+# KV cache usage: 2-5% for typical coding sessions
+# Prefix cache hit rate: 0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                        | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+# -----------------------|-----------------------|----------------------
+# Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+# Decode throughput      | ~40 tok/s             | 40-99 tok/s
+# Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+# Context window         | 32k (was truncating)  | 262k (full, no truncation)
+# Prefix cache (Claude)  | 0% always             | 0% always
+# Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+# VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)