From d8575832ae0022f94cd786b15f8b88de0bf18672 Mon Sep 17 00:00:00 2001
From: Paul Buetow
Date: Wed, 18 Mar 2026 09:10:14 +0200
Subject: Add vLLM + LiteLLM support; rename script; add README
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace Ollama (disabled by default) with vLLM Docker container + LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434, polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to override config at runtime
- New `test` command: end-to-end inference test over WireGuard against vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM) from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 snippets/hyperstack/README.md | 157 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 snippets/hyperstack/README.md

(limited to 'snippets/hyperstack/README.md')

diff --git a/snippets/hyperstack/README.md b/snippets/hyperstack/README.md
new file mode 100644
index 0000000..e5cc7ea
--- /dev/null
+++ b/snippets/hyperstack/README.md
@@ -0,0 +1,157 @@
+# hyperstack
+
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+
+## Architecture
+
+```
+Claude Code (local)                   Hyperstack VM (A100 80GB)
+┌─────────────────┐                   ┌──────────────────────────────────┐
+│ claude CLI      │── Anthropic API ─▶│ LiteLLM proxy (:4000)            │
+│                 │   /v1/messages    │ Anthropic → OpenAI translation   │
+│                 │   via WireGuard   │         │                        │
+└─────────────────┘                   │         ▼                        │
+                                      │ vLLM engine (:11434)             │
+OpenCode (local)                      │ bullpoint/Qwen3-Coder-Next-      │
+┌─────────────────┐                   │ AWQ-4bit (45 GB, MoE 80B)        │
+│ opencode        │── OpenAI API ────▶│ FlashAttention v2                │
+│                 │   /v1/chat/...    │ prefix caching                   │
+└─────────────────┘                   └──────────────────────────────────┘
+```
+
+Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
+The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+
+## Prerequisites
+
+- Hyperstack account with API key in `~/.hyperstack`
+- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
+- WireGuard setup script: `wg1-setup.sh` (present in this directory)
+- Ruby with `toml-rb` gem: `bundle install`
+
+## Quickstart
+
+```bash
+# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
+ruby hyperstack.rb create
+
+# Verify everything is working
+ruby hyperstack.rb test
+
+# Use Claude Code against the local vLLM
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+
+# Tear down
+ruby hyperstack.rb delete
+```
+
+## Using Claude Code with vLLM
+
+WireGuard (`wg1`) must be active before connecting.
+
+```bash
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+```
+
+If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
+
+```bash
+claude /logout
+```
+
+**Fish shell alias** (add to `~/.config/fish/config.fish`):
+
+```fish
+alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+  ANTHROPIC_API_KEY=sk-litellm-master \
+  claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+```
+
+**Available model aliases** — all map to the same vLLM model:
+
+| Alias | Use case |
+|-------|----------|
+| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
+| `claude-opus-4-20250514` | |
+| `claude-sonnet-4-20250514` | |
+| `claude-haiku-3-5-20241022` | |
+
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
+
+## Using OpenCode with vLLM
+
+OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed:
+
+```bash
+OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+OPENAI_API_KEY=EMPTY \
+opencode
+```
+
+Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+
+## CLI reference
+
+```
+ruby hyperstack.rb [--config path] [options]
+
+Commands:
+  create    Deploy a new VM and run full provisioning
+  delete    Destroy the tracked VM
+  status    Show VM and WireGuard status
+  test      Run end-to-end inference tests (vLLM + LiteLLM)
+
+create options:
+  --replace               Delete existing tracked VM before creating
+  --dry-run               Print the plan without making changes
+  --vllm / --no-vllm      Override config: enable/disable vLLM+LiteLLM setup
+  --ollama / --no-ollama  Override config: enable/disable Ollama setup
+```
+
+## Configuration
+
+Edit `hyperstack-vm.toml` to change defaults. Key sections:
+
+| Section | Purpose |
+|---------|---------|
+| `[vm]` | Flavor, image, environment name |
+| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
+| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
+| `[wireguard]` | Auto-setup script path |
+
+## Monitoring vLLM
+
+```bash
+# Live engine stats (throughput, KV cache, prefix cache hit rate)
+ssh ubuntu@ 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# Last 1 minute of stats
+ssh ubuntu@ 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# GPU stats (every 5 s)
+ssh ubuntu@ 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+
+# LiteLLM proxy log
+ssh ubuntu@ 'sudo journalctl -fu litellm'
+```
+
+Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+
+| Metric | Expected |
+|--------|----------|
+| Prefill throughput | 5,000–11,000 tok/s |
+| Decode throughput | 40–99 tok/s |
+| KV cache usage | 2–5% for typical sessions |
+| Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) |
+| Prefix cache hit (OpenCode) | >50% after warm-up |
+
+## Switching models
+
+Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config.
+
+See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
-- 
cgit v1.2.3
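
The three endpoints that the commit message lists for the `test` command (`/v1/models` and `/v1/chat/completions` on vLLM, `/v1/messages` on LiteLLM) can also be probed by hand. What follows is a minimal sketch and not the code in `hyperstack.rb`: the helper names `smoke_requests` and `run_smoke_test` and the `RUN_SMOKE_TEST` environment variable are hypothetical, the `anthropic-version` header value is an assumption, and the address, ports, key, and model names are simply the README defaults.

```ruby
require "json"
require "net/http"
require "uri"

# README defaults; adjust if hyperstack-vm.toml overrides them.
VM_IP        = "192.168.3.1"                          # VM address on the wg1 tunnel
VLLM_BASE    = "http://#{VM_IP}:11434"                # vLLM, OpenAI-compatible API
LITELLM_BASE = "http://#{VM_IP}:4000"                 # LiteLLM, Anthropic-style API
VLLM_MODEL   = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
CLAUDE_ALIAS = "claude-opus-4-6-20260604"
LITELLM_KEY  = "sk-litellm-master"

# Hypothetical helper: builds the three requests without sending anything.
def smoke_requests
  prompt = [{ role: "user", content: "Reply with the single word: ok" }]
  json   = { "Content-Type" => "application/json" }

  models = Net::HTTP::Get.new(URI("#{VLLM_BASE}/v1/models"))

  chat = Net::HTTP::Post.new(URI("#{VLLM_BASE}/v1/chat/completions"), json)
  chat.body = JSON.generate({ model: VLLM_MODEL, max_tokens: 16, messages: prompt })

  # Header value below is assumed; the proxy may accept requests without it.
  anthropic = json.merge("x-api-key" => LITELLM_KEY, "anthropic-version" => "2023-06-01")
  messages = Net::HTTP::Post.new(URI("#{LITELLM_BASE}/v1/messages"), anthropic)
  messages.body = JSON.generate({ model: CLAUDE_ALIAS, max_tokens: 16, messages: prompt })

  [models, chat, messages]
end

# Hypothetical helper: sends each request and prints the HTTP status code.
def run_smoke_test
  smoke_requests.each do |req|
    uri = req.uri
    res = Net::HTTP.start(uri.host, uri.port, open_timeout: 5, read_timeout: 120) do |http|
      http.request(req)
    end
    puts "#{req.method} #{uri} -> #{res.code}"
  end
end

# Only touches the network when explicitly asked to (hypothetical switch).
run_smoke_test if ENV["RUN_SMOKE_TEST"] == "1"
```

With `wg1` up, running the file with `RUN_SMOKE_TEST=1` set prints one status line per endpoint. Keeping request construction separate from sending lets the URLs and headers be inspected without a running VM.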