author Paul Buetow <paul@buetow.org> 2026-03-21 09:46:58 +0200
committer Paul Buetow <paul@buetow.org> 2026-03-21 09:46:58 +0200
commit c693f37a6115f3567cd4fcff4c256a6d20dd6fac
tree 04e18f502616535013bab0c7c513a1aabdb9c2f2 /snippets/hyperstack/vllm-setup.txt
parent 3f6ef419f52c3361c8914a27c7949c2c8f2be1c8
moved
Diffstat (limited to 'snippets/hyperstack/vllm-setup.txt')
-rw-r--r-- snippets/hyperstack/vllm-setup.txt 487
1 files changed, 0 insertions, 487 deletions
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
deleted file mode 100644
index 9ea44a7..0000000
--- a/snippets/hyperstack/vllm-setup.txt
+++ /dev/null
@@ -1,487 +0,0 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
-#
-# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
-#
-# Architecture:
-#
-#   Claude Code (earth)                   Hyperstack VM (A100 80GB)
-#   ┌─────────────┐                       ┌──────────────────────────────┐
-#   │ claude CLI  │── Anthropic API ──>   │ LiteLLM proxy (:4000)        │
-#   │             │   /v1/messages        │ translates Anthropic →       │
-#   │             │   via WireGuard wg1   │ OpenAI chat completions      │
-#   └─────────────┘                       │         │                    │
-#                                         │         ▼                    │
-#   OpenCode (earth)                      │ vLLM engine (:11434)         │
-#   ┌─────────────┐                       │ /v1/chat/completions         │
-#   │ opencode    │── OpenAI API ──────>  │ FlashAttention v2            │
-#   │             │   /v1/chat/completions│ prefix caching               │
-#   └─────────────┘                       │ bullpoint/Qwen3-Coder-       │
-#                                         │ Next-AWQ-4bit (45GB)         │
-#                                         └──────────────────────────────┘
-#
-# Why vLLM instead of Ollama:
-# - FlashAttention v2: ~1.5-2x faster prefill for long prompts
-# - Block-level prefix caching: partial KV cache reuse even when prompt
-# changes mid-sequence (Ollama requires exact prefix match from token 0)
-# - Chunked prefill: can interleave prefill and decode
-# - Marlin kernels for AWQ MoE quantization
-#
-# Why LiteLLM:
-# - Claude Code speaks Anthropic Messages API (/v1/messages) only
-# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-# - LiteLLM translates between them, mapping Claude model names to the
-# actual vLLM model
-#
-# Model details:
-# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
-# - Architecture: MoE, 80B total params, 3B active per token
-# - 512 experts, 10 activated + 1 shared per token
-# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
-# - Quantization: AWQ 4-bit, group size 32
-# - Disk size: ~45GB (vs ~151GB at BF16)
-# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
-# - Context: 262,144 tokens (256k native)
-# - vLLM requirement: >= 0.15.0
-#
-# Hardware requirements:
-# - Minimum: 1x A100 80GB (PCIe or SXM)
-# - VRAM breakdown at gpu_memory_utilization=0.92:
-# Model weights: ~45 GiB
-# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
-# CUDA graphs: ~3 GiB
-# Total: ~75 GiB / 80 GiB
-#
-# Ports:
-# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-# 4000/tcp - LiteLLM Anthropic-compatible proxy
-# Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
-
-# ===========================================================================
-# STEP 1: Prerequisites
-# ===========================================================================
-# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
-# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
-# - Ollama stopped and disabled if previously running:
-#
-# sudo systemctl stop ollama
-# sudo systemctl disable ollama
-
-# ===========================================================================
-# STEP 2: Storage setup
-# ===========================================================================
-# The HuggingFace model cache lives on ephemeral storage: fast NVMe that
-# survives reboots on some providers, but this is not guaranteed. If the cache
-# is lost, the model re-downloads on the next start.
-#
-# sudo mkdir -p /ephemeral/hug
-# sudo chmod -R 0777 /ephemeral/hug
-
-# ===========================================================================
-# STEP 3: vLLM Docker container
-# ===========================================================================
-# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
-# After download, model loading takes ~65s and CUDA graph capture ~35s.
-# Total cold start: ~4-5 minutes.
-#
-# docker pull vllm/vllm-openai:latest
-#
-# docker run -d \
-#   --gpus all \
-#   --ipc=host \
-#   --network host \
-#   --name vllm_qwen3 \
-#   --restart always \
-#   -v /ephemeral/hug:/root/.cache/huggingface \
-#   vllm/vllm-openai:latest \
-#   --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
-#   --tensor-parallel-size 1 \
-#   --enable-auto-tool-choice \
-#   --tool-call-parser qwen3_coder \
-#   --enable-prefix-caching \
-#   --gpu-memory-utilization 0.92 \
-#   --max-model-len 262144 \
-#   --host 0.0.0.0 \
-#   --port 11434
-#
-# Flags explained:
-# --tensor-parallel-size 1         Single GPU (use 2/4 for multi-GPU setups)
-# --enable-auto-tool-choice        Enables function/tool calling
-# --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
-# --enable-prefix-caching          Block-level KV cache reuse across requests
-# --gpu-memory-utilization 0.92    Use 92% of VRAM (rest for OS/overhead)
-# --max-model-len 262144           Full 256k context window
-# --port 11434                     Reuse Ollama port for firewall compatibility
-#
-# Verify startup (wait for "Application startup complete"):
-# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
-#
-# Verify model loaded:
-# curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Quick inference test:
-# curl -s http://localhost:11434/v1/chat/completions \
-#   -H "Content-Type: application/json" \
-#   -H "Authorization: Bearer EMPTY" \
-#   -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
-#        "messages":[{"role":"user","content":"Hello"}],
-#        "max_tokens":50}'
-#
-# Monitor performance (prefix cache hit rate, throughput):
-# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
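-#
-# A minimal readiness-wait sketch, assuming the vLLM OpenAI server's /health
-# endpoint answers 200 only once the engine is up (the 10-minute budget and
-# 5 s poll interval are arbitrary choices, not measured values):
-#
-# for i in $(seq 1 120); do
-#   curl -sf http://localhost:11434/health > /dev/null && break
-#   sleep 5
-# done
-# curl -sf http://localhost:11434/health > /dev/null \
-#   && echo "vLLM ready" || echo "vLLM not ready after 10 minutes"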
-
-# ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 marks the system Python as externally
-# managed, so a system-wide pip install is refused):
-#
-# sudo apt-get install -y python3.12-venv
-# sudo mkdir -p /ephemeral/litellm-env
-# sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-# python3 -m venv /ephemeral/litellm-env
-# /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-# sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-# model_list:
-#   - model_name: "claude-sonnet-4-20250514"
-#     litellm_params:
-#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#       api_base: "http://localhost:11434/v1"
-#       api_key: "EMPTY"
-#   - model_name: "claude-opus-4-20250514"
-#     litellm_params:
-#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#       api_base: "http://localhost:11434/v1"
-#       api_key: "EMPTY"
-#   - model_name: "claude-opus-4-6-20260604"
-#     litellm_params:
-#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#       api_base: "http://localhost:11434/v1"
-#       api_key: "EMPTY"
-#   - model_name: "claude-3-5-haiku-20241022"
-#     litellm_params:
-#       model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#       api_base: "http://localhost:11434/v1"
-#       api_key: "EMPTY"
-#
-# litellm_settings:
-#   drop_params: true
-#
-# general_settings:
-#   master_key: "sk-litellm-master"
-# YAML
-#
-# Config notes:
-# - model_name values must match what Claude Code sends (Claude model IDs)
-# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-# (not /v1/responses which vLLM doesn't fully support for complex messages)
-# - drop_params: true — silently drops Claude-specific parameters like
-# context_management that vLLM doesn't understand
-# - master_key is the API key clients must send
-# - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-# nohup /ephemeral/litellm-env/bin/litellm \
-#   --config /ephemeral/litellm-config.yaml \
-#   --host 0.0.0.0 \
-#   --port 4000 \
-#   > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-# curl -s http://localhost:4000/v1/messages \
-#   -H "Content-Type: application/json" \
-#   -H "x-api-key: sk-litellm-master" \
-#   -H "anthropic-version: 2023-06-01" \
-#   -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#        "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-# sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-# [Unit]
-# Description=LiteLLM Proxy
-# After=network.target docker.service
-# Requires=docker.service
-#
-# [Service]
-# Type=simple
-# User=ubuntu
-# ExecStart=/ephemeral/litellm-env/bin/litellm \
-#   --config /ephemeral/litellm-config.yaml \
-#   --host 0.0.0.0 --port 4000
-# Restart=always
-# RestartSec=5
-#
-# [Install]
-# WantedBy=multi-user.target
-# UNIT
-#
-# sudo systemctl daemon-reload
-# sudo systemctl enable --now litellm
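-#
-# Under the systemd unit, LiteLLM writes to the journal rather than
-# /ephemeral/litellm.log. A short sketch for checking the service, using
-# standard systemctl/journalctl commands:
-#
-# systemctl is-active litellm
-# sudo journalctl -u litellm -f
-# sudo journalctl -u litellm --since "10 min ago" | grep -i error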
-
-# ===========================================================================
-# STEP 5: Firewall rules
-# ===========================================================================
-# Allow access from WireGuard subnet only:
-#
-# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
-#   comment 'vLLM via wg1'
-# sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-#   comment 'LiteLLM proxy via wg1'
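-#
-# A quick check sketch, assuming ufw is already enabled on the VM and that
-# 192.168.3.1 is the VM's wg1 address (the address the client examples use):
-#
-# sudo ufw status verbose | grep -E "11434|4000"   # on the VM
-# nc -zv 192.168.3.1 11434                         # on the client
-# nc -zv 192.168.3.1 4000                          # on the client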
-
-# ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
-# ===========================================================================
-#
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-# ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-# alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
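-#
-# For bash or zsh, a function is a rough equivalent of the fish alias above
-# (a sketch; claude_local is a hypothetical name, add it to ~/.bashrc or
-# ~/.zshrc):
-#
-# claude_local() {
-#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions "$@"
-# }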
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
-#
-# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
-# OPENAI_API_KEY=EMPTY \
-# opencode
-#
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
-
-# ===========================================================================
-# STEP 7: Monitoring & troubleshooting
-# ===========================================================================
-#
-# --- Live engine stats ---
-# vLLM logs engine metrics every 10 seconds. Key fields:
-# - Avg prompt throughput: prefill speed (tokens/s), higher = faster
-# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
-# - GPU KV cache usage: % of KV cache memory in use (proportional to
-# active context length vs max capacity)
-# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
-# Claude Code, higher for OpenCode)
-# - Running/Waiting: active and queued request counts
-#
-# Follow live (all stats):
-# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# Example output:
-# Engine 000: Avg prompt throughput: 5555.2 tokens/s,
-# Avg generation throughput: 49.4 tokens/s,
-# Running: 1 reqs, Waiting: 0 reqs,
-# GPU KV cache usage: 4.6%,
-# Prefix cache hit rate: 0.0%
-#
-# --- Request-level monitoring ---
-# See individual HTTP requests (method, status, duration):
-# docker logs -f vllm_qwen3 2>&1 | grep "POST"
-#
-# Example output:
-# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
-#
-# --- One-liner: last minute stats ---
-# Useful for periodic checks without following the log:
-# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
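-#
-# To pull single fields out of the latest stats line, a grep sketch that
-# assumes the log format matches the example output above (field names may
-# change between vLLM versions). It prints the two matched fields, one per
-# line:
-#
-# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000" | tail -n 1 \
-#   | grep -oE "(generation throughput|KV cache usage): [0-9.]+"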
-#
-# --- LiteLLM proxy log ---
-# tail -f /ephemeral/litellm.log
-#
-# --- GPU hardware stats ---
-# Snapshot:
-# nvidia-smi
-#
-# Continuous (every 5 seconds):
-# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
-# --format=csv -l 5
-#
-# --- Interpreting the stats ---
-#
-# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
-# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill)
-# Decode throughput:  40-99 tok/s (varies with output length per sample)
-# KV cache usage:     0-5% for short conversations, grows with context
-#                     (100% = 298k tokens, at which point requests queue)
-# Prefix cache hit:   0% for Claude Code (expected, it mutates prompt prefix)
-#                     >50% for OpenCode after a few turns
-# Temperature:        44-60C under load, <45C idle
-# Power:              70W idle, 230-240W under load, 300W max
-#
-# Warning signs:
-# - Waiting > 0 for extended periods → requests queuing, model overloaded
-# - KV cache usage near 100% → contexts fill the cache and new requests queue;
-#   cap per-request length with a lower --max-model-len
-# - Decode throughput < 20 tok/s sustained → possible thermal throttling
-# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
-#
-# Common issues:
-#
-# 1. OOM on startup with --max-model-len 262144
-# → Reduce to 131072 or 65536
-#
-# 2. "model does not exist" from vLLM
-# → The part after "hosted_vllm/" in the LiteLLM config must exactly match
-#   the name vLLM serves (the --model value, here the HuggingFace repo name)
-#
-# 3. LiteLLM returns UnsupportedParamsError
-# → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-# → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-# → Run `claude /logout` first to clear the claude.ai session token,
-# then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-# → Normal for Claude Code (it mutates the prompt prefix each turn)
-# → OpenCode should show increasing cache hit rates after a few turns
-#
-# 7. vLLM container won't start (CUDA version mismatch)
-# → Check driver version: nvidia-smi
-# → vLLM requires CUDA >= 12.x and driver >= 535
-
-# ===========================================================================
-# STEP 8: Loading / switching models
-# ===========================================================================
-#
-# vLLM serves one model per container. To switch models, stop the current
-# container and start a new one with different --model.
-#
-# --- Stop current model ---
-# docker stop vllm_qwen3
-# docker rm vllm_qwen3
-#
-# --- Run a different model ---
-# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
-# as needed. The HuggingFace model downloads automatically on first start.
-#
-# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB; confirm
-# the AWQ repo name on HuggingFace before use)
-#
-# docker run -d \
-#   --gpus all \
-#   --ipc=host \
-#   --network host \
-#   --name vllm_qwen3_30b \
-#   --restart always \
-#   -v /ephemeral/hug:/root/.cache/huggingface \
-#   vllm/vllm-openai:latest \
-#   --model Qwen/Qwen3-Coder-30B-AWQ \
-#   --tensor-parallel-size 1 \
-#   --enable-auto-tool-choice \
-#   --tool-call-parser qwen3_coder \
-#   --enable-prefix-caching \
-#   --gpu-memory-utilization 0.92 \
-#   --max-model-len 131072 \
-#   --host 0.0.0.0 \
-#   --port 11434
-#
-# Example: full-precision model on multi-GPU (e.g. 4x H100)
-#
-# docker run -d \
-#   --gpus all \
-#   --ipc=host \
-#   --network host \
-#   --name vllm_qwen3_fp16 \
-#   --restart always \
-#   -v /ephemeral/hug:/root/.cache/huggingface \
-#   vllm/vllm-openai:latest \
-#   --model Qwen/Qwen3-Coder-Next \
-#   --tensor-parallel-size 4 \
-#   --enable-auto-tool-choice \
-#   --tool-call-parser qwen3_coder \
-#   --enable-prefix-caching \
-#   --gpu-memory-utilization 0.90 \
-#   --max-model-len 262144 \
-#   --host 0.0.0.0 \
-#   --port 11434
-#
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-# model: "hosted_vllm/<new-model-name>"
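-#
-# A sed sketch for that edit, assuming GNU sed and that every model_list
-# entry should point at the same backend (NEW_MODEL is a hypothetical
-# variable; the .bak suffix keeps a copy of the old file):
-#
-# NEW_MODEL="<new-model-name>"
-# sudo sed -i.bak -E "s|(model: \"hosted_vllm/)[^\"]+|\1${NEW_MODEL}|" \
-#   /ephemeral/litellm-config.yaml
-# grep "hosted_vllm/" /ephemeral/litellm-config.yaml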
-#
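-# Then restart LiteLLM (if it runs under the systemd unit from STEP 4, use
-# `sudo systemctl restart litellm` instead of the commands below):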
-# pkill -f litellm
-# nohup /ephemeral/litellm-env/bin/litellm \
-#   --config /ephemeral/litellm-config.yaml \
-#   --host 0.0.0.0 --port 4000 \
-#   > /ephemeral/litellm.log 2>&1 &
-#
-# --- Finding models ---
-# Search HuggingFace for vLLM-compatible quantized models:
-# https://huggingface.co/models?search=<model-name>+awq
-# https://huggingface.co/models?search=<model-name>+gptq
-#
-# Supported quantization formats in vLLM:
-# - AWQ (recommended): fast Marlin kernels, good quality
-# - GPTQ: similar to AWQ, widely available
-# - FP8: 8-bit; native FP8 compute needs Ada Lovelace or Hopper GPUs
-#   (e.g. L40S, H100/H200)
-# - BF16/FP16: unquantized 16-bit weights, needs the most VRAM
-#
-# --- VRAM sizing guide ---
-# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
-#
-# Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
-# ---------------------|----------------|-------------------------------
-# 7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
-# 14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
-# 30-32B               | ~18 GiB        | 262k (~57 GiB for KV cache)
-# 70-80B (MoE, 3B act) | ~45 GiB        | 262k (~27 GiB for KV cache)
-# 70B (dense)          | ~38 GiB        | 131k (~37 GiB for KV cache)
-# 120B+                | won't fit      | use multi-GPU or smaller quant
-#
-# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
-# halves the KV cache memory one full-length request needs). If still OOM,
-# reduce --gpu-memory-utilization to 0.85 or try a smaller model.
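-#
-# The weight column can be roughly cross-checked with a back-of-envelope
-# estimate. This awk sketch assumes AWQ stores, per group of 32 weights, one
-# FP16 scale and one 4-bit zero point (4 + 16/32 + 4/32 = 4.625 bits per
-# weight). Other models may use a different group size, and layers kept in
-# 16-bit are ignored, so real usage runs somewhat higher:
-#
-# for b in 8 14 32 70 80; do
-#   awk -v b="$b" 'BEGIN { printf "%3dB params: ~%.1f GiB\n", b, b * 1e9 * 4.625 / 8 / 2^30 }'
-# done
-#
-# For 80B this gives about 43 GiB, in line with the ~45 GiB in the table.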
-#
-# --- Verifying the new model ---
-# Check loaded model:
-# curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Test inference:
-# curl -s http://localhost:11434/v1/chat/completions \
-#   -H "Content-Type: application/json" \
-#   -H "Authorization: Bearer EMPTY" \
-#   -d '{"model":"<model-name>",
-#        "messages":[{"role":"user","content":"Hello"}],
-#        "max_tokens":50}'
-#
-# Test via LiteLLM (Anthropic API):
-# curl -s http://localhost:4000/v1/messages \
-#   -H "Content-Type: application/json" \
-#   -H "x-api-key: sk-litellm-master" \
-#   -H "anthropic-version: 2023-06-01" \
-#   -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#        "messages":[{"role":"user","content":"Hello"}]}'
-
-# ===========================================================================
-# Performance characteristics (A100 80GB PCIe, single GPU)
-# ===========================================================================
-#
-# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
-#
-# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2)
-# vLLM decode throughput:  40-99 tok/s (memory-bandwidth limited)
-# Per-turn latency:        ~10-15s (small prompts, early conversation)
-# KV cache usage:          2-5% for typical coding sessions
-# Prefix cache hit rate:   0% (Claude Code), expected >50% (OpenCode)
-#
-# Comparison with Ollama on same hardware (A100 80GB PCIe):
-#
-#                        | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
-# -----------------------|-----------------------|----------------------
-# Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
-# Decode throughput      | ~40 tok/s             | 40-99 tok/s
-# Per-turn latency       | ~28s (32k ctx)        | ~10-15s
-# Context window         | 32k (was truncating)  | 262k (full, no truncation)
-# Prefix cache (Claude)  | 0% always             | 0% always
-# Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
-# VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)