From d8575832ae0022f94cd786b15f8b88de0bf18672 Mon Sep 17 00:00:00 2001
From: Paul Buetow <paul@buetow.org>
Date: Wed, 18 Mar 2026 09:10:14 +0200
Subject: Add vLLM + LiteLLM support; rename script; add README
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace Ollama (disabled by default) with vLLM Docker container +
  LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434,
  polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude
  model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
  override config at runtime
- New `test` command: end-to-end inference test over WireGuard against
  vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM)
  from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI
  reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---
 snippets/hyperstack/README.md | 157 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 snippets/hyperstack/README.md

(limited to 'snippets/hyperstack/README.md')

diff --git a/snippets/hyperstack/README.md b/snippets/hyperstack/README.md
new file mode 100644
index 0000000..e5cc7ea
--- /dev/null
+++ b/snippets/hyperstack/README.md
@@ -0,0 +1,157 @@
+# hyperstack
+
+Automates the Hyperstack GPU VM lifecycle: VM creation, bootstrapping, WireGuard tunnel setup, vLLM inference, and the LiteLLM proxy.
+
+## Architecture
+
+```
+Claude Code (local)                    Hyperstack VM (A100 80GB)
+┌──────────────────┐                   ┌──────────────────────────────────┐
+│ claude CLI       │── Anthropic API ─▶│ LiteLLM proxy (:4000)            │
+│                  │   /v1/messages    │   Anthropic → OpenAI translation │
+│                  │   via WireGuard   │             │                    │
+└──────────────────┘                   │             ▼                    │
+                                       │ vLLM engine (:11434)             │
+OpenCode (local)                       │   bullpoint/Qwen3-Coder-Next-    │
+┌──────────────────┐                   │   AWQ-4bit (45 GB, MoE 80B)      │
+│ opencode         │── OpenAI API ────▶│   FlashAttention v2              │
+│                  │   /v1/chat/...    │   prefix caching                 │
+└──────────────────┘                   └──────────────────────────────────┘
+```
+
+Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
+The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+
+## Prerequisites
+
+- Hyperstack account with API key in `~/.hyperstack`
+- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
+- WireGuard setup script: `wg1-setup.sh` (present in this directory)
+- Ruby with `toml-rb` gem: `bundle install`
+
+## Quickstart
+
+```bash
+# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
+ruby hyperstack.rb create
+
+# Verify everything is working
+ruby hyperstack.rb test
+
+# Use Claude Code against the vLLM model on the VM (through the LiteLLM proxy)
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+
+# Tear down
+ruby hyperstack.rb delete
+```
+
+## Using Claude Code with vLLM
+
+WireGuard (`wg1`) must be active before connecting.
+
+```bash
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+```
+
+If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
+
+```bash
+claude /logout
+```
+
+**Fish shell alias** (add to `~/.config/fish/config.fish`):
+
+```fish
+alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+  ANTHROPIC_API_KEY=sk-litellm-master \
+  claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+```
+
+**Available model aliases** — all map to the same vLLM model:
+
+| Alias | Use case |
+|-------|----------|
+| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
+| `claude-opus-4-20250514` | |
+| `claude-sonnet-4-20250514` | |
+| `claude-haiku-3-5-20241022` | |
+
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
+
+## Using OpenCode with vLLM
+
+OpenCode speaks the OpenAI API natively, so it connects directly to vLLM on port 11434 and does not need the LiteLLM proxy:
+
+```bash
+OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+OPENAI_API_KEY=EMPTY \
+opencode
+```
+
+Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+
+## CLI reference
+
+```
+ruby hyperstack.rb [--config path] <command> [options]
+
+Commands:
+  create   Deploy a new VM and run full provisioning
+  delete   Destroy the tracked VM
+  status   Show VM and WireGuard status
+  test     Run end-to-end inference tests (vLLM + LiteLLM)
+
+create options:
+  --replace               Delete existing tracked VM before creating
+  --dry-run               Print the plan without making changes
+  --vllm / --no-vllm      Override config: enable/disable vLLM+LiteLLM setup
+  --ollama / --no-ollama  Override config: enable/disable Ollama setup
+```
+
+## Configuration
+
+Edit `hyperstack-vm.toml` to change defaults. Key sections:
+
+| Section | Purpose |
+|---------|---------|
+| `[vm]` | Flavor, image, environment name |
+| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
+| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
+| `[wireguard]` | Auto-setup script path |
+
+## Monitoring vLLM
+
+```bash
+# Live engine stats (throughput, KV cache, prefix cache hit rate)
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# Last 1 minute of stats
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# GPU stats (every 5 s)
+ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+
+# LiteLLM proxy log
+ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
+```
+
+Healthy baseline (A100 80GB PCIe, Qwen3-Coder-Next AWQ 4-bit):
+
+| Metric | Expected |
+|--------|----------|
+| Prefill throughput | 5,000–11,000 tok/s |
+| Decode throughput | 40–99 tok/s |
+| KV cache usage | 2–5% for typical sessions |
+| Prefix cache hit (Claude Code) | 0% (expected: the prompt prefix changes on every turn) |
+| Prefix cache hit (OpenCode) | >50% after warm-up |
+
+## Switching models
+
+To switch models, stop the current vLLM container and start a new one with a different `--model`. Then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` so that LiteLLM is reinstalled with the updated config.
+
+See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
-- 
cgit v1.2.3