Diffstat (limited to 'snippets/hyperstack/vllm-setup.txt')
| -rw-r--r-- | snippets/hyperstack/vllm-setup.txt | 487 |
1 files changed, 0 insertions, 487 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
deleted file mode 100644
index 9ea44a7..0000000
--- a/snippets/hyperstack/vllm-setup.txt
+++ /dev/null
@@ -1,487 +0,0 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
-#
-# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
-#
-# Architecture:
-#
-#   Claude Code (earth)                  Hyperstack VM (A100 80GB)
-#   ┌─────────────┐                      ┌──────────────────────────────┐
-#   │ claude CLI  │── Anthropic API ──>  │ LiteLLM proxy (:4000)        │
-#   │             │  /v1/messages        │ translates Anthropic →       │
-#   │             │  via WireGuard wg1   │ OpenAI chat completions      │
-#   └─────────────┘                      │         │                    │
-#                                        │         ▼                    │
-#   OpenCode (earth)                     │ vLLM engine (:11434)         │
-#   ┌─────────────┐                      │ /v1/chat/completions         │
-#   │ opencode    │── OpenAI API ──────> │ FlashAttention v2            │
-#   │             │  /v1/chat/completions│ prefix caching               │
-#   └─────────────┘                      │ bullpoint/Qwen3-Coder-       │
-#                                        │ Next-AWQ-4bit (45GB)         │
-#                                        └──────────────────────────────┘
-#
-# Why vLLM instead of Ollama:
-#   - FlashAttention v2: ~1.5-2x faster prefill for long prompts
-#   - Block-level prefix caching: partial KV cache reuse even when prompt
-#     changes mid-sequence (Ollama requires exact prefix match from token 0)
-#   - Chunked prefill: can interleave prefill and decode
-#   - Marlin kernels for AWQ MoE quantization
-#
-# Why LiteLLM:
-#   - Claude Code speaks Anthropic Messages API (/v1/messages) only
-#   - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-#   - LiteLLM translates between them, mapping Claude model names to the
-#     actual vLLM model
-#
-# Model details:
-#   - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
-#   - Architecture: MoE, 80B total params, 3B active per token
-#   - 512 experts, 10 activated + 1 shared per token
-#   - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
-#   - Quantization: AWQ 4-bit, group size 32
-#   - Disk size: ~45GB (vs ~151GB at BF16)
-#   - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
-#   - Context: 262,144 tokens (256k native)
-#   - vLLM requirement: >= 0.15.0
-#
-# Hardware requirements:
-#   - Minimum: 1x A100 80GB (PCIe or SXM)
-#   - VRAM breakdown at gpu_memory_utilization=0.92:
-#       Model weights: ~45 GiB
-#       KV cache:      ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
-#       CUDA graphs:   ~3 GiB
-#       Total:         ~75 GiB / 80 GiB
-#
-# Ports:
-#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-#   4000/tcp  - LiteLLM Anthropic-compatible proxy
-#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
-
-# ===========================================================================
-# STEP 1: Prerequisites
-# ===========================================================================
-# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
-# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
-# - Ollama stopped and disabled if previously running:
-#
-#   sudo systemctl stop ollama
-#   sudo systemctl disable ollama
-
-# ===========================================================================
-# STEP 2: Storage setup
-# ===========================================================================
-# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
-# on some providers but not guaranteed — model will re-download if lost).
-#
-#   sudo mkdir -p /ephemeral/hug
-#   sudo chmod -R 0777 /ephemeral/hug
-
-# ===========================================================================
-# STEP 3: vLLM Docker container
-# ===========================================================================
-# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
-# After download, model loading takes ~65s and CUDA graph capture ~35s.
-# Total cold start: ~4-5 minutes.
-#
-#   docker pull vllm/vllm-openai:latest
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3 \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
-#     --tensor-parallel-size 1 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.92 \
-#     --max-model-len 262144 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# Flags explained:
-#   --tensor-parallel-size 1         Single GPU (use 2/4 for multi-GPU setups)
-#   --enable-auto-tool-choice        Enables function/tool calling
-#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
-#   --enable-prefix-caching          Block-level KV cache reuse across requests
-#   --gpu-memory-utilization 0.92    Use 92% of VRAM (rest for OS/overhead)
-#   --max-model-len 262144           Full 256k context window
-#   --port 11434                     Reuse Ollama port for firewall compatibility
-#
-# Verify startup (wait for "Application startup complete"):
-#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
-#
-# Verify model loaded:
-#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Quick inference test:
-#   curl -s http://localhost:11434/v1/chat/completions \
-#     -H "Content-Type: application/json" \
-#     -H "Authorization: Bearer EMPTY" \
-#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
-#          "messages":[{"role":"user","content":"Hello"}],
-#          "max_tokens":50}'
-#
-# Monitor performance (prefix cache hit rate, throughput):
-#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-
-# ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 requires this):
-#
-#   sudo apt-get install -y python3.12-venv
-#   sudo mkdir -p /ephemeral/litellm-env
-#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-#   python3 -m venv /ephemeral/litellm-env
-#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-#   model_list:
-#     - model_name: "claude-sonnet-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-6-20260604"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-haiku-3-5-20241022"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#
-#   litellm_settings:
-#     drop_params: true
-#
-#   general_settings:
-#     master_key: "sk-litellm-master"
-#   YAML
-#
-# Config notes:
-#   - model_name values must match what Claude Code sends (Claude model IDs)
-#   - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-#     (not /v1/responses which vLLM doesn't fully support for complex messages)
-#   - drop_params: true — silently drops Claude-specific parameters like
-#     context_management that vLLM doesn't understand
-#   - master_key is the API key clients must send
-#   - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 \
-#     --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-#   [Unit]
-#   Description=LiteLLM Proxy
-#   After=network.target docker.service
-#   Requires=docker.service
-#
-#   [Service]
-#   Type=simple
-#   User=ubuntu
-#   ExecStart=/ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000
-#   Restart=always
-#   RestartSec=5
-#
-#   [Install]
-#   WantedBy=multi-user.target
-#   UNIT
-#
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now litellm
-
-# ===========================================================================
-# STEP 5: Firewall rules
-# ===========================================================================
-# Allow access from WireGuard subnet only:
-#
-#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
-#     comment 'vLLM via wg1'
-#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-#     comment 'LiteLLM proxy via wg1'
-
-# ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
-# ===========================================================================
-#
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#     ANTHROPIC_API_KEY=sk-litellm-master \
-#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
-#
-#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
-#   OPENAI_API_KEY=EMPTY \
-#   opencode
-#
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
-
-# ===========================================================================
-# STEP 7: Monitoring & troubleshooting
-# ===========================================================================
-#
-# --- Live engine stats ---
-# vLLM logs engine metrics every 10 seconds. Key fields:
-#   - Avg prompt throughput: prefill speed (tokens/s), higher = faster
-#   - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
-#   - GPU KV cache usage: % of KV cache memory in use (proportional to
-#     active context length vs max capacity)
-#   - Prefix cache hit rate: % of prompt tokens served from cache (0% for
-#     Claude Code, higher for OpenCode)
-#   - Running/Waiting: active and queued request counts
-#
-# Follow live (all stats):
-#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# Example output:
-#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
-#     Avg generation throughput: 49.4 tokens/s,
-#     Running: 1 reqs, Waiting: 0 reqs,
-#     GPU KV cache usage: 4.6%,
-#     Prefix cache hit rate: 0.0%
-#
-# --- Request-level monitoring ---
-# See individual HTTP requests (method, status, duration):
-#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
-#
-# Example output:
-#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
-#
-# --- One-liner: last minute stats ---
-# Useful for periodic checks without following the log:
-#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# --- LiteLLM proxy log ---
-#   tail -f /ephemeral/litellm.log
-#
-# --- GPU hardware stats ---
-# Snapshot:
-#   nvidia-smi
-#
-# Continuous (every 5 seconds):
-#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
-#     --format=csv -l 5
-#
-# --- Interpreting the stats ---
-#
-# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
-#   Prefill throughput:  5,000-11,000 tok/s (bursts higher during batch prefill)
-#   Decode throughput:   40-99 tok/s (varies with output length per sample)
-#   KV cache usage:      0-5% for short conversations, grows with context
-#                        (100% = 298k tokens, at which point requests queue)
-#   Prefix cache hit:    0% for Claude Code (expected, it mutates prompt prefix)
-#                        >50% for OpenCode after a few turns
-#   Temperature:         44-60C under load, <45C idle
-#   Power:               70W idle, 230-240W under load, 300W max
-#
-# Warning signs:
-#   - Waiting > 0 for extended periods → requests queuing, model overloaded
-#   - KV cache usage near 100% → context too long, reduce --max-model-len
-#   - Decode throughput < 20 tok/s sustained → possible thermal throttling
-#   - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
-#
-# Common issues:
-#
-# 1. OOM on startup with --max-model-len 262144
-#    → Reduce to 131072 or 65536
-#
-# 2. "model does not exist" from vLLM
-#    → Model name in LiteLLM config must exactly match HuggingFace repo name
-#
-# 3. LiteLLM returns UnsupportedParamsError
-#    → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-#    → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-#    → Run `claude /logout` first to clear the claude.ai session token,
-#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-#    → Normal for Claude Code (it mutates the prompt prefix each turn)
-#    → OpenCode should show increasing cache hit rates after a few turns
-#
-# 7. vLLM container won't start (CUDA version mismatch)
-#    → Check driver version: nvidia-smi
-#    → vLLM requires CUDA >= 12.x and driver >= 535
-
-# ===========================================================================
-# STEP 8: Loading / switching models
-# ===========================================================================
-#
-# vLLM serves one model per container. To switch models, stop the current
-# container and start a new one with different --model.
-#
-# --- Stop current model ---
-#   docker stop vllm_qwen3
-#   docker rm vllm_qwen3
-#
-# --- Run a different model ---
-# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
-# as needed. The HuggingFace model downloads automatically on first start.
-#
-# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3_30b \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model Qwen/Qwen3-Coder-30B-AWQ \
-#     --tensor-parallel-size 1 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.92 \
-#     --max-model-len 131072 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# Example: full-precision model on multi-GPU (e.g. 4x H100)
-#
-#   docker run -d \
-#     --gpus all \
-#     --ipc=host \
-#     --network host \
-#     --name vllm_qwen3_fp16 \
-#     --restart always \
-#     -v /ephemeral/hug:/root/.cache/huggingface \
-#     vllm/vllm-openai:latest \
-#     --model Qwen/Qwen3-Coder-Next \
-#     --tensor-parallel-size 4 \
-#     --enable-auto-tool-choice \
-#     --tool-call-parser qwen3_coder \
-#     --enable-prefix-caching \
-#     --gpu-memory-utilization 0.90 \
-#     --max-model-len 262144 \
-#     --host 0.0.0.0 \
-#     --port 11434
-#
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-#   model: "hosted_vllm/<new-model-name>"
-#
-# Then restart LiteLLM:
-#   pkill -f litellm
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# --- Finding models ---
-# Search HuggingFace for vLLM-compatible quantized models:
-#   https://huggingface.co/models?search=<model-name>+awq
-#   https://huggingface.co/models?search=<model-name>+gptq
-#
-# Supported quantization formats in vLLM:
-#   - AWQ (recommended): fast Marlin kernels, good quality
-#   - GPTQ: similar to AWQ, widely available
-#   - FP8: 8-bit, needs Hopper+ GPUs (H100/H200)
-#   - BF16/FP16: full precision, needs more VRAM
-#
-# --- VRAM sizing guide ---
-# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
-#
-#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
-#   ---------------------|----------------|-------------------------------
-#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
-#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
-#   30-32B               | ~18 GiB        | 262k (~57 GiB for KV cache)
-#   70-80B (MoE, 3B act) | ~45 GiB        | 262k (~27 GiB for KV cache)
-#   70B (dense)          | ~38 GiB        | 131k (~37 GiB for KV cache)
-#   120B+                | won't fit      | use multi-GPU or smaller quant
-#
-# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
-# halves KV cache memory). If still OOM, reduce --gpu-memory-utilization
-# to 0.85 or try a smaller model.
-#
-# --- Verifying the new model ---
-# Check loaded model:
-#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Test inference:
-#   curl -s http://localhost:11434/v1/chat/completions \
-#     -H "Content-Type: application/json" \
-#     -H "Authorization: Bearer EMPTY" \
-#     -d '{"model":"<model-name>",
-#          "messages":[{"role":"user","content":"Hello"}],
-#          "max_tokens":50}'
-#
-# Test via LiteLLM (Anthropic API):
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-
-# ===========================================================================
-# Performance characteristics (A100 80GB PCIe, single GPU)
-# ===========================================================================
-#
-# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
-#
-#   vLLM prefill throughput:  5,000-11,000 tok/s (FlashAttention v2)
-#   vLLM decode throughput:   40-99 tok/s (memory-bandwidth limited)
-#   Per-turn latency:         ~10-15s (small prompts, early conversation)
-#   KV cache usage:           2-5% for typical coding sessions
-#   Prefix cache hit rate:    0% (Claude Code), expected >50% (OpenCode)
-#
-# Comparison with Ollama on same hardware (A100 80GB PCIe):
-#
-#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
-#   -----------------------|-----------------------|----------------------
-#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
-#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
-#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
-#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
-#   Prefix cache (Claude)  | 0% always             | 0% always
-#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
-#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
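
The "Warning signs" list in STEP 7 of the deleted file can be checked mechanically against one "Engine 000" stats line. The following is a minimal sketch, not part of the original file: the function names, the dictionary keys, and the 95% cutoff used for "near 100%" are assumptions made for illustration, and the field patterns are taken only from the example output quoted in the file, so other vLLM versions may log a different format. A single line cannot show whether a condition is sustained, so treat each result as a hint.

```python
import re

# Field patterns copied from the example "Engine 000" output in the deleted
# file. The dictionary keys are illustrative names, not vLLM identifiers.
_FIELDS = {
    "prompt_tps": r"Avg prompt throughput:\s*([\d.]+)\s*tokens/s",
    "gen_tps": r"Avg generation throughput:\s*([\d.]+)\s*tokens/s",
    "running": r"Running:\s*(\d+)\s*reqs",
    "waiting": r"Waiting:\s*(\d+)\s*reqs",
    "kv_usage_pct": r"GPU KV cache usage:\s*([\d.]+)%",
    "prefix_hit_pct": r"Prefix cache hit rate:\s*([\d.]+)%",
}


def parse_engine_stats(line):
    """Return the numeric fields found in one stats line as floats."""
    stats = {}
    for name, pattern in _FIELDS.items():
        match = re.search(pattern, line)
        if match:
            stats[name] = float(match.group(1))
    return stats


def warning_signs(stats, kv_near_full_pct=95.0):
    """Apply the thresholds from the file's "Warning signs" list.

    kv_near_full_pct is an assumed cutoff for "near 100%". Throughput of
    exactly 0 is ignored because an idle engine reports 0 tokens/s.
    """
    found = []
    if stats.get("waiting", 0) > 0:
        found.append("requests queuing")
    if stats.get("kv_usage_pct", 0) >= kv_near_full_pct:
        found.append("KV cache near capacity")
    if 0 < stats.get("gen_tps", 0) < 20:
        found.append("decode throughput below 20 tok/s")
    if 0 < stats.get("prompt_tps", 0) < 2000:
        found.append("prefill throughput below 2,000 tok/s")
    return found


# The example output from the file, joined into a single line.
SAMPLE = (
    "Engine 000: Avg prompt throughput: 5555.2 tokens/s, "
    "Avg generation throughput: 49.4 tokens/s, "
    "Running: 1 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 4.6%, "
    "Prefix cache hit rate: 0.0%"
)

if __name__ == "__main__":
    sample_stats = parse_engine_stats(SAMPLE)
    print(sample_stats)
    print(warning_signs(sample_stats))
```

One way to use it would be to pipe `docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"` into a small loop that calls `parse_engine_stats` on each line.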

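
The LiteLLM config in STEP 4 repeats the same `litellm_params` for every Claude model ID, and the file's notes say to add an entry whenever a new ID appears. The sketch below generates that YAML from a list of IDs. It is an illustration only: `render_litellm_config` and the constant names are hypothetical helpers, and all values are copied from the config shown in the deleted file.

```python
# Values copied from the config in the deleted file.
VLLM_MODEL = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
API_BASE = "http://localhost:11434/v1"

CLAUDE_MODEL_IDS = [
    "claude-sonnet-4-20250514",
    "claude-opus-4-20250514",
    "claude-opus-4-6-20260604",
    "claude-haiku-3-5-20241022",
]


def render_litellm_config(model_ids, vllm_model=VLLM_MODEL,
                          api_base=API_BASE, master_key="sk-litellm-master"):
    """Render the config text with one model_list entry per Claude model ID.

    Every entry points at the same vLLM model through the "hosted_vllm/"
    prefix, mirroring the layout of the original file.
    """
    lines = ["model_list:"]
    for model_id in model_ids:
        lines += [
            f'  - model_name: "{model_id}"',
            "    litellm_params:",
            f'      model: "hosted_vllm/{vllm_model}"',
            f'      api_base: "{api_base}"',
            '      api_key: "EMPTY"',
        ]
    lines += [
        "",
        "litellm_settings:",
        "  drop_params: true",
        "",
        "general_settings:",
        f'  master_key: "{master_key}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_litellm_config(CLAUDE_MODEL_IDS), end="")
```

Redirecting the output to `/ephemeral/litellm-config.yaml` and restarting LiteLLM would then replace the hand-edited heredoc. Switching models as in STEP 8 would only need a different `vllm_model` argument.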