moved

author: Paul Buetow <paul@buetow.org> 2026-03-21 09:46:58 +0200
committer: Paul Buetow <paul@buetow.org> 2026-03-21 09:46:58 +0200
commit: c693f37a6115f3567cd4fcff4c256a6d20dd6fac (patch)
tree: 04e18f502616535013bab0c7c513a1aabdb9c2f2 /snippets/hyperstack/vllm-setup.txt
parent: 3f6ef419f52c3361c8914a27c7949c2c8f2be1c8 (diff)
1 files changed, 0 insertions, 487 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
deleted file mode 100644
index 9ea44a7..0000000
--- a/snippets/hyperstack/vllm-setup.txt
+++ /dev/null
@@ -1,487 +0,0 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
-#
-# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
-#
-# Architecture:
-#
-#   Claude Code (earth)                    Hyperstack VM (A100 80GB)
-#   ┌─────────────┐                       ┌──────────────────────────────┐
-#   │ claude CLI   │── Anthropic API ──>  │ LiteLLM proxy (:4000)       │
-#   │              │   /v1/messages        │   translates Anthropic →    │
-#   │              │   via WireGuard wg1   │   OpenAI chat completions   │
-#   └─────────────┘                       │         │                    │
-#                                         │         ▼                    │
-#   OpenCode (earth)                      │ vLLM engine (:11434)        │
-#   ┌─────────────┐                       │   /v1/chat/completions      │
-#   │ opencode     │── OpenAI API ──────> │   FlashAttention v2         │
-#   │              │   /v1/chat/completions│   prefix caching            │
-#   └─────────────┘                       │   bullpoint/Qwen3-Coder-    │
-#                                         │     Next-AWQ-4bit (45GB)    │
-#                                         └──────────────────────────────┘
-#
-# Why vLLM instead of Ollama:
-#   - FlashAttention v2: ~1.5-2x faster prefill for long prompts
-#   - Block-level prefix caching: partial KV cache reuse even when prompt
-#     changes mid-sequence (Ollama requires exact prefix match from token 0)
-#   - Chunked prefill: can interleave prefill and decode
-#   - Marlin kernels for AWQ MoE quantization
-#
-# Why LiteLLM:
-#   - Claude Code speaks Anthropic Messages API (/v1/messages) only
-#   - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-#   - LiteLLM translates between them, mapping Claude model names to the
-#     actual vLLM model
-#
-# Model details:
-#   - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
-#   - Architecture: MoE, 80B total params, 3B active per token
-#   - 512 experts, 10 activated + 1 shared per token
-#   - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
-#   - Quantization: AWQ 4-bit, group size 32
-#   - Disk size: ~45GB (vs ~151GB at BF16)
-#   - VRAM usage: ~45 GiB weights + ~27 GiB KV cache at 92% utilization
-#   - Context: 262,144 tokens (256k native)
-#   - vLLM requirement: >= 0.15.0
-#
-# Hardware requirements:
-#   - Minimum: 1x A100 80GB (PCIe or SXM)
-#   - VRAM breakdown at gpu_memory_utilization=0.92:
-#       Model weights:  ~45 GiB
-#       KV cache:       ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
-#       CUDA graphs:    ~3 GiB
-#       Total:          ~75 GiB / 80 GiB
-#
-# Ports:
-#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-#   4000/tcp  - LiteLLM Anthropic-compatible proxy
-#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
-
-# ===========================================================================
-# STEP 1: Prerequisites
-# ===========================================================================
-# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
-# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
-# - Ollama stopped and disabled if previously running:
-#
-#   sudo systemctl stop ollama
-#   sudo systemctl disable ollama
-
-# ===========================================================================
-# STEP 2: Storage setup
-# ===========================================================================
-# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
-# on some providers but not guaranteed — model will re-download if lost).
-#
-#   sudo mkdir -p /ephemeral/hug
-#   sudo chmod -R 0777 /ephemeral/hug
-
-# ===========================================================================
-# STEP 3: vLLM Docker container
-# ===========================================================================
-# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
-# After download, model loading takes ~65s and CUDA graph capture ~35s.
-# Total cold start: ~4-5 minutes.
-#
-#   docker pull vllm/vllm-openai:latest
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3 \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
-#     --tensor-parallel-size 1 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.92 \
-#     --max-model-len 262144 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# Flags explained:
-#   --tensor-parallel-size 1    Single GPU (use 2/4 for multi-GPU setups)
-#   --enable-auto-tool-choice   Enables function/tool calling
-#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
-#   --enable-prefix-caching     Block-level KV cache reuse across requests
-#   --gpu-memory-utilization 0.92   Use 92% of VRAM (rest for OS/overhead)
-#   --max-model-len 262144      Full 256k context window
-#   --port 11434                Reuse Ollama port for firewall compatibility
-#
-# Verify startup (wait for "Application startup complete"):
-#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
-#
-# Verify model loaded:
-#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Quick inference test:
-#   curl -s http://localhost:11434/v1/chat/completions \
-#     -H "Content-Type: application/json" \
-#     -H "Authorization: Bearer EMPTY" \
-#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
-#          "messages":[{"role":"user","content":"Hello"}],
-#          "max_tokens":50}'
-#
-# Monitor performance (prefix cache hit rate, throughput):
-#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-
-# ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 marks the system Python as
-# externally managed, so a system-wide pip install is refused):
-#
-#   sudo apt-get install -y python3.12-venv
-#   sudo mkdir -p /ephemeral/litellm-env
-#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-#   python3 -m venv /ephemeral/litellm-env
-#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-#   model_list:
-#     - model_name: "claude-sonnet-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-6-20260604"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-haiku-3-5-20241022"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#
-#   litellm_settings:
-#     drop_params: true
-#
-#   general_settings:
-#     master_key: "sk-litellm-master"
-#   YAML
-#
-# Config notes:
-#   - model_name values must match what Claude Code sends (Claude model IDs)
-#   - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-#     (not /v1/responses which vLLM doesn't fully support for complex messages)
-#   - drop_params: true — silently drops Claude-specific parameters like
-#     context_management that vLLM doesn't understand
-#   - master_key is the API key clients must send
-#   - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 \
-#     --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-#   [Unit]
-#   Description=LiteLLM Proxy
-#   After=network.target docker.service
-#   Requires=docker.service
-#
-#   [Service]
-#   Type=simple
-#   User=ubuntu
-#   ExecStart=/ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000
-#   Restart=always
-#   RestartSec=5
-#
-#   [Install]
-#   WantedBy=multi-user.target
-#   UNIT
-#
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now litellm
-
-# ===========================================================================
-# STEP 5: Firewall rules
-# ===========================================================================
-# Allow access from WireGuard subnet only:
-#
-#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
-#     comment 'vLLM via wg1'
-#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-#     comment 'LiteLLM proxy via wg1'
-
-# ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
-# ===========================================================================
-#
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#     ANTHROPIC_API_KEY=sk-litellm-master \
-#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
-#
-#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
-#   OPENAI_API_KEY=EMPTY \
-#   opencode
-#
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
-
-# ===========================================================================
-# STEP 7: Monitoring & troubleshooting
-# ===========================================================================
-#
-# --- Live engine stats ---
-# vLLM logs engine metrics every 10 seconds. Key fields:
-#   - Avg prompt throughput:     prefill speed (tokens/s), higher = faster
-#   - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
-#   - GPU KV cache usage:        % of KV cache memory in use (proportional to
-#                                 active context length vs max capacity)
-#   - Prefix cache hit rate:     % of prompt tokens served from cache (0% for
-#                                 Claude Code, higher for OpenCode)
-#   - Running/Waiting:           active and queued request counts
-#
-# Follow live (all stats):
-#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# Example output:
-#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
-#               Avg generation throughput: 49.4 tokens/s,
-#               Running: 1 reqs, Waiting: 0 reqs,
-#               GPU KV cache usage: 4.6%,
-#               Prefix cache hit rate: 0.0%
-#
-# --- Request-level monitoring ---
-# See individual HTTP requests (client address, method, path, status):
-#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
-#
-# Example output:
-#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
-#
-# --- One-liner: last minute stats ---
-# Useful for periodic checks without following the log:
-#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# --- LiteLLM proxy log ---
-#   tail -f /ephemeral/litellm.log
-#
-# --- GPU hardware stats ---
-# Snapshot:
-#   nvidia-smi
-#
-# Continuous (every 5 seconds):
-#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
-#     --format=csv -l 5
-#
-# --- Interpreting the stats ---
-#
-# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
-#   Prefill throughput:   5,000-11,000 tok/s (bursts higher during batch prefill)
-#   Decode throughput:    40-99 tok/s (varies with output length per sample)
-#   KV cache usage:       0-5% for short conversations, grows with context
-#                         (100% = 298k tokens, at which point requests queue)
-#   Prefix cache hit:     0% for Claude Code (expected, it mutates prompt prefix)
-#                         >50% for OpenCode after a few turns
-#   Temperature:          44-60C under load, <45C idle
-#   Power:                70W idle, 230-240W under load, 300W max
-#
-# Warning signs:
-#   - Waiting > 0 for extended periods → requests queuing, model overloaded
-#   - KV cache usage near 100% → context too long, reduce --max-model-len
-#   - Decode throughput < 20 tok/s sustained → possible thermal throttling
-#   - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
-#
-# Common issues:
-#
-# 1. OOM on startup with --max-model-len 262144
-#    → Reduce to 131072 or 65536
-#
-# 2. "model does not exist" from vLLM
-#    → Model name in LiteLLM config must exactly match HuggingFace repo name
-#
-# 3. LiteLLM returns UnsupportedParamsError
-#    → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-#    → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-#    → Run `claude /logout` first to clear the claude.ai session token,
-#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-#    → Normal for Claude Code (it mutates the prompt prefix each turn)
-#    → OpenCode should show increasing cache hit rates after a few turns
-#
-# 7. vLLM container won't start (CUDA version mismatch)
-#    → Check driver version: nvidia-smi
-#    → vLLM requires CUDA 12.x or newer and driver >= 535
-
-# ===========================================================================
-# STEP 8: Loading / switching models
-# ===========================================================================
-#
-# vLLM serves one model per container. To switch models, stop the current
-# container and start a new one with different --model.
-#
-# --- Stop current model ---
-#   docker stop vllm_qwen3
-#   docker rm vllm_qwen3
-#
-# --- Run a different model ---
-# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
-# as needed. The HuggingFace model downloads automatically on first start.
-#
-# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3_30b \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model Qwen/Qwen3-Coder-30B-AWQ \
-#     --tensor-parallel-size 1 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.92 \
-#     --max-model-len 131072 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# Example: full-precision model on multi-GPU (e.g. 4x H100)
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3_fp16 \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model Qwen/Qwen3-Coder-Next \
-#     --tensor-parallel-size 4 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.90 \
-#     --max-model-len 262144 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-#   model: "hosted_vllm/<new-model-name>"
-#
-# Then restart LiteLLM:
-#   pkill -f litellm
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# --- Finding models ---
-# Search HuggingFace for vLLM-compatible quantized models:
-#   https://huggingface.co/models?search=<model-name>+awq
-#   https://huggingface.co/models?search=<model-name>+gptq
-#
-# Supported quantization formats in vLLM:
-#   - AWQ (recommended): fast Marlin kernels, good quality
-#   - GPTQ: similar to AWQ, widely available
-#   - FP8: 8-bit, native on Ada/Hopper or newer GPUs (e.g. L40S, H100/H200)
-#   - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
-#
-# --- VRAM sizing guide ---
-# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
-#
-#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
-#   ---------------------|----------------|-------------------------------
-#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
-#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
-#   30-32B               | ~18 GiB        | 262k  (~57 GiB for KV cache)
-#   70-80B (MoE, 3B act) | ~45 GiB        | 262k  (~27 GiB for KV cache)
-#   70B (dense)          | ~38 GiB        | 131k  (~37 GiB for KV cache)
-#   120B+                | won't fit      | use multi-GPU or smaller quant
-#
-# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
-# halves the KV cache needed to hold one full-length request). If still OOM,
-# reduce --gpu-memory-utilization to 0.85 or try a smaller model.
-#
-# --- Verifying the new model ---
-# Check loaded model:
-#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Test inference:
-#   curl -s http://localhost:11434/v1/chat/completions \
-#     -H "Content-Type: application/json" \
-#     -H "Authorization: Bearer EMPTY" \
-#     -d '{"model":"<model-name>",
-#          "messages":[{"role":"user","content":"Hello"}],
-#          "max_tokens":50}'
-#
-# Test via LiteLLM (Anthropic API):
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-
-# ===========================================================================
-# Performance characteristics (A100 80GB PCIe, single GPU)
-# ===========================================================================
-#
-# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
-#
-#   vLLM prefill throughput:    5,000-11,000 tok/s (FlashAttention v2)
-#   vLLM decode throughput:     40-99 tok/s (memory-bandwidth limited)
-#   Per-turn latency:           ~10-15s (small prompts, early conversation)
-#   KV cache usage:             2-5% for typical coding sessions
-#   Prefix cache hit rate:      0% (Claude Code), expected >50% (OpenCode)
-#
-# Comparison with Ollama on same hardware (A100 80GB PCIe):
-#
-#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
-#   -----------------------|-----------------------|----------------------
-#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
-#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
-#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
-#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
-#   Prefix cache (Claude)  | 0% always             | 0% always
-#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
-#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
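
The warning signs listed in STEP 7 of the removed guide can be checked mechanically against the "Engine 000" stats lines. The following is a minimal sketch, not part of the original file. It assumes each stats record arrives on a single log line with the field labels shown in the guide's example output. The helper names `field` and `check_engine_line` are hypothetical, and the 90% KV cache threshold is an assumed reading of "near 100%". It looks at one line at a time, so it cannot tell a sustained slowdown from a momentary dip, and it skips the prefill check because prefill throughput is legitimately zero during decode-only intervals.

```shell
#!/usr/bin/env bash
# Sketch: flag warning signs in one vLLM "Engine 000" stats line.
# Assumption: the whole stats record is on one line, labelled as in the
# guide's example output. Thresholds: Waiting > 0 and decode < 20 tok/s
# come from the guide; KV cache >= 90% is an assumed cut-off.

# Print the number that follows "<label>: " in the line, or nothing.
field() {
  # $1 = full log line, $2 = field label
  printf '%s\n' "$1" | sed -n "s/.*$2: \([0-9.]*\).*/\1/p"
}

check_engine_line() {
  local line=$1 running waiting kv decode
  running=$(field "$line" "Running")
  waiting=$(field "$line" "Waiting")
  kv=$(field "$line" "GPU KV cache usage")
  decode=$(field "$line" "Avg generation throughput")

  # Bail out on lines that do not carry the expected fields.
  if [ -z "$running" ] || [ -z "$waiting" ] || [ -z "$kv" ] || [ -z "$decode" ]; then
    echo "SKIP: not a stats line"
    return 1
  fi

  awk -v r="$running" -v w="$waiting" -v kv="$kv" -v d="$decode" 'BEGIN {
    ok = 1
    if (w > 0)           { print "WARN: requests waiting (" w ")"; ok = 0 }
    if (kv >= 90)        { print "WARN: KV cache usage " kv "%";   ok = 0 }
    if (r > 0 && d < 20) { print "WARN: decode " d " tok/s";       ok = 0 }
    if (ok) print "OK"
  }'
}

# Possible use with the container name from the guide (not run here):
#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000" |
#     while IFS= read -r l; do check_engine_line "$l"; done
```

Because the decode check only fires while `Running` is above zero, an idle engine reporting 0.0 tokens/s does not raise a false alarm.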