# hyperstack Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy. ## Architecture ``` Claude Code (local) Hyperstack VM (A100 80GB) ┌─────────────────┐ ┌──────────────────────────────────┐ │ claude CLI │── Anthropic API ─▶│ LiteLLM proxy (:4000) │ │ │ /v1/messages │ Anthropic → OpenAI translation │ │ │ via WireGuard │ │ │ └─────────────────┘ │ ▼ │ │ vLLM engine (:11434) │ OpenCode (local) │ bullpoint/Qwen3-Coder-Next- │ ┌─────────────────┐ │ AWQ-4bit (45 GB, MoE 80B) │ │ opencode │── OpenAI API ────▶│ FlashAttention v2 │ │ │ /v1/chat/... │ prefix caching │ └─────────────────┘ └──────────────────────────────────┘ ``` Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`). The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`. ## Prerequisites - Hyperstack account with API key in `~/.hyperstack` - SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML) - WireGuard setup script: `wg1-setup.sh` (present in this directory) - Ruby with `toml-rb` gem: `bundle install` ## Quickstart ```bash # Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run) ruby hyperstack.rb create # Verify everything is working ruby hyperstack.rb test # Use Claude Code against the local vLLM ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \ ANTHROPIC_API_KEY=sk-litellm-master \ claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions # Tear down ruby hyperstack.rb delete ``` ## Using Claude Code with vLLM WireGuard (`wg1`) must be active before connecting. ```bash ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \ ANTHROPIC_API_KEY=sk-litellm-master \ claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions ``` If you see an **"Auth conflict"** warning, clear the saved claude.ai session first: ```bash claude /logout ``` **Fish shell alias** (add to `~/.config/fish/config.fish`): ```fish alias claude-local='ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \ ANTHROPIC_API_KEY=sk-litellm-master \ claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions' ``` **Available model aliases** — all map to the same vLLM model: | Alias | Use case | |-------|----------| | `claude-opus-4-6-20260604` | Recommended (most future-proof) | | `claude-opus-4-20250514` | | | `claude-sonnet-4-20250514` | | | `claude-haiku-3-5-20241022` | | Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released. ## Using OpenCode with vLLM OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed: ```bash OPENAI_BASE_URL=http://hyperstack.wg1:11434/v1 \ OPENAI_API_KEY=EMPTY \ opencode ``` Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config. ## CLI reference ``` ruby hyperstack.rb [--config path] [options] Commands: create Deploy a new VM and run full provisioning delete Destroy the tracked VM status Show VM and WireGuard status test Run end-to-end inference tests (vLLM + LiteLLM) create options: --replace Delete existing tracked VM before creating --dry-run Print the plan without making changes --vllm / --no-vllm Override config: enable/disable vLLM+LiteLLM setup --ollama / --no-ollama Override config: enable/disable Ollama setup ``` ## Configuration Edit `hyperstack-vm.toml` to change defaults. Key sections: | Section | Purpose | |---------|---------| | `[vm]` | Flavor, image, environment name | | `[vllm]` | Model, container settings, LiteLLM key and Claude aliases | | `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) | | `[network]` | Ports, WireGuard subnet, allowed CIDRs | | `[wireguard]` | Auto-setup script path | ## Monitoring vLLM ```bash # Live engine stats (throughput, KV cache, prefix cache hit rate) ssh ubuntu@ 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"' # Last 1 minute of stats ssh ubuntu@ 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"' # GPU stats (every 5 s) ssh ubuntu@ 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5' # LiteLLM proxy log ssh ubuntu@ 'sudo journalctl -fu litellm' ``` Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit): | Metric | Expected | |--------|----------| | Prefill throughput | 5,000–11,000 tok/s | | Decode throughput | 40–99 tok/s | | KV cache usage | 2–5% for typical sessions | | Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) | | Prefix cache hit (OpenCode) | >50% after warm-up | ## Switching models Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config. See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.