docs: update README for two-VM architecture and Pi coding agent setup

Amp-Thread-ID: https://ampcode.com/threads/T-019d0f66-43e6-77bc-944b-b623c1679f87
Co-authored-by: Amp <amp@ampcode.com>
author: Paul Buetow <paul@buetow.org> 2026-03-21 10:20:28 +0200
committer: Paul Buetow <paul@buetow.org> 2026-03-21 10:20:28 +0200
commit: c0a1c966c92f5e32488e22562452c2daab9ac931 (patch)
tree: e778489dfcc7660e88b4cfe4d97ef71734f89689 /README.md
parent: 8fdba30d44037a91623c7cf05da7f1e2a298c47e (diff)
1 files changed, 121 insertions, 70 deletions
diff --git a/README.md b/README.md
index 730b310..cba656c 100644
--- a/README.md
+++ b/README.md
@@ -1,26 +1,39 @@
 # hyperstack
 
 Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
 
 ## Architecture
 
 ```
-Claude Code (local)                    Hyperstack VM (A100 80GB)
-┌─────────────────┐                   ┌──────────────────────────────────┐
-│ claude CLI       │── Anthropic API ─▶│ LiteLLM proxy (:4000)           │
-│                  │   /v1/messages    │   Anthropic → OpenAI translation │
-│                  │   via WireGuard   │             │                    │
-└─────────────────┘                   │             ▼                    │
-                                      │ vLLM engine (:11434)            │
-OpenCode (local)                      │   bullpoint/Qwen3-Coder-Next-   │
-┌─────────────────┐                   │   AWQ-4bit (45 GB, MoE 80B)     │
-│ opencode         │── OpenAI API ────▶│   FlashAttention v2             │
-│                  │   /v1/chat/...    │   prefix caching                │
-└─────────────────┘                   └──────────────────────────────────┘
+                        WireGuard tunnel (wg1, 192.168.3.0/24)
+                        earth = .2 ──────────────────────────────────────────┐
+                                │                                            │
+         ┌──────────────────────┼────────────────────────────────────────────┐│
+         │                      │                                            ││
+         ▼                      ▼                                            ▼▼
+  Hyperstack VM1 (A100 80GB)         Hyperstack VM2 (A100 80GB)
+  192.168.3.1 / hyperstack1.wg1      192.168.3.3 / hyperstack2.wg1
+  ┌──────────────────────────────┐    ┌──────────────────────────────────┐
+  │ vLLM (:11434)                │    │ vLLM (:11434)                    │
+  │   Nemotron-3-Super 120B      │    │   Qwen3-Coder-Next 80B (MoE)    │
+  │   (hybrid Mamba+MoE, AWQ-4b) │    │   (AWQ-4bit)                     │
+  │                              │    │                                  │
+  │ LiteLLM (:4000)              │    │ LiteLLM (:4000)                  │
+  │   Anthropic API → OpenAI     │    │   Anthropic API → OpenAI         │
+  └──────────────────────────────┘    └──────────────────────────────────┘
+         ▲                                     ▲
+         │ OpenAI /v1/chat/completions         │ OpenAI /v1/chat/completions
+         │                                     │
+  ┌──────┴──────┐                       ┌──────┴──────┐
+  │ Pi (local)  │                       │ Pi (local)  │
+  │ ./pi-vm1    │                       │ ./pi-vm2    │
+  │ Nemotron 3  │                       │ Qwen3 Coder │
+  └─────────────┘                       └─────────────┘
 ```
 
-Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
-The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+Both VMs share a single WireGuard interface (`wg1`) on the local machine.
+Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
 
 ## Prerequisites
 
@@ -31,28 +44,31 @@ The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
   Set explicit CIDRs or `HYPERSTACK_OPERATOR_CIDR` if you deploy from a different network.
 - WireGuard setup script: `wg1-setup.sh` (present in this directory)
 - Ruby with `toml-rb` gem: `bundle install`
+- [Pi](https://pi.dev) coding agent installed
 
-## Quickstart
+## Quickstart (two-VM setup)
 
 ```bash
-# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
-ruby hyperstack.rb create
+# Deploy both VMs in parallel, set up WireGuard + vLLM + LiteLLM (~10 min)
+ruby hyperstack.rb create-both
 
-# Verify everything is working
-ruby hyperstack.rb test
+# Verify both VMs are working
+ruby hyperstack.rb --config hyperstack-vm1.toml test
+ruby hyperstack.rb --config hyperstack-vm2.toml test
 
-# Use Claude Code against the local vLLM
-ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \
-ANTHROPIC_API_KEY=sk-litellm-master \
-claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+# Launch Pi coding agents — one per terminal
+./pi-vm1   # Nemotron-3-Super 120B on VM1
+./pi-vm2   # Qwen3-Coder-Next on VM2
 
-# Tear down
-# Also removes the tracked local wg1 peer, hostname alias, and pinned SSH host key.
-ruby hyperstack.rb delete
+# Tear down both VMs
+ruby hyperstack.rb delete-both
 ```
 
 ## Using Pi
 
+Pi is the primary coding agent frontend. Each VM has a wrapper script that launches Pi
+with the correct model routed to that VM's vLLM instance.
+
 Bring both VMs up first:
 
 ```bash
@@ -62,19 +78,78 @@ ruby hyperstack.rb create-both
 Then start one Pi session per terminal:
 
 ```bash
-./pi-vm1
-./pi-vm2
+./pi-vm1   # → hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
+./pi-vm2   # → hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit
 ```
 
 These wrappers `cd` into this repo before launching Pi, so the project-local
-settings in `.pi/settings.json` still apply.
+settings in `pi/agent/settings.json` and model definitions in `pi/agent/models.json` apply.
+
+Pi model definitions are in `pi/agent/models.json` — two providers (`hyperstack1`, `hyperstack2`)
+are configured, each pointing at its VM's vLLM endpoint over WireGuard. All model presets
+from the TOML configs are registered so you can hot-switch models within Pi using `model switch`.
+
+**Fish shell abbreviations** (see `hyperstack.fish`):
+
+```fish
+abbr pi-hyperstack-nemotron pi --model hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
+abbr pi-hyperstack-coder    pi --model hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit
+```
+
+## Single-VM setup
+
+A single VM can be deployed with the default config:
+
+```bash
+ruby hyperstack.rb create                # uses hyperstack-vm.toml
+ruby hyperstack.rb test
+ruby hyperstack.rb delete
+```
+
+## VM configuration
+
+| Config file | Default model | WireGuard IP | Hostname |
+|---|---|---|---|
+| `hyperstack-vm1.toml` | Nemotron-3-Super 120B (AWQ-4bit) | `192.168.3.1` | `hyperstack1.wg1` |
+| `hyperstack-vm2.toml` | Qwen3-Coder-Next 80B (AWQ-4bit) | `192.168.3.3` | `hyperstack2.wg1` |
+| `hyperstack-vm.toml` | Qwen3-Coder-Next (single-VM mode) | `192.168.3.1` | `hyperstack.wg1` |
+
+Each VM has independent state files so they can be managed separately:
+
+```bash
+ruby hyperstack.rb --config hyperstack-vm1.toml status
+ruby hyperstack.rb --config hyperstack-vm2.toml status
+```
+
+## Switching models
+
+Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:
+
+```bash
+ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
+ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super
+```
+
+Available presets (both VMs share the same set):
+
+| Preset | Model | VRAM | Context |
+|---|---|---|---|
+| `nemotron-super` | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 262K |
+| `qwen3-coder-next` | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
+| `gpt-oss-120b` | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
+| `gpt-oss-20b` | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
+| `qwen25-coder-32b` | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
+| `qwen3-coder-30b` | Qwen3-Coder-30B-A3B (MoE, AWQ) | ~18 GB | 65K |
+| `deepseek-r1-32b` | DeepSeek-R1-Distill-Qwen-32B (AWQ) | ~18 GB | 32K |
+| `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
+| `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
 
 ## Using Claude Code with vLLM
 
 WireGuard (`wg1`) must be active before connecting.
 
 ```bash
-ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \
+ANTHROPIC_BASE_URL=http://hyperstack1.wg1:4000 \
 ANTHROPIC_API_KEY=sk-litellm-master \
 claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
 ```
@@ -85,15 +160,7 @@ If you see an **"Auth conflict"** warning, clear the saved claude.ai session fir
 claude /logout
 ```
 
-**Fish shell alias** (add to `~/.config/fish/config.fish`):
-
-```fish
-alias claude-local='ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \
-  ANTHROPIC_API_KEY=sk-litellm-master \
-  claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-```
-
-**Available model aliases** — all map to the same vLLM model:
+**Available model aliases** — all map to the same vLLM model on that VM:
 
 | Alias | Use case |
 |-------|----------|
@@ -102,19 +169,7 @@ alias claude-local='ANTHROPIC_BASE_URL=http://hyperstack.wg1:4000 \
 | `claude-sonnet-4-20250514` | |
 | `claude-haiku-3-5-20241022` | |
 
-Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
-
-## Using OpenCode with vLLM
-
-OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed:
-
-```bash
-OPENAI_BASE_URL=http://hyperstack.wg1:11434/v1 \
-OPENAI_API_KEY=EMPTY \
-opencode
-```
-
-Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in the TOML as they are released.
 
 ## CLI reference
 
@@ -122,12 +177,15 @@ Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode con
 ruby hyperstack.rb [--config path] <command> [options]
 
 Commands:
-  create   Deploy a new VM and run full provisioning
-  delete   Destroy the tracked VM
-  status   Show VM and WireGuard status
-  test     Run end-to-end inference tests (vLLM + LiteLLM)
-
-create options:
+  create       Deploy a new VM and run full provisioning
+  create-both  Deploy VM1 + VM2 in parallel (uses hyperstack-vm1/vm2.toml)
+  delete       Destroy the tracked VM
+  delete-both  Destroy both VM1 and VM2
+  status       Show VM and WireGuard status
+  test         Run end-to-end inference tests (vLLM + LiteLLM)
+  model switch <preset>  Hot-switch the running vLLM model
+
+create / create-both options:
   --replace          Delete existing tracked VM before creating
   --dry-run          Print the plan without making changes
   --vllm / --no-vllm    Override config: enable/disable vLLM+LiteLLM setup
@@ -136,12 +194,14 @@ create options:
 
 ## Configuration
 
-Edit `hyperstack-vm.toml` to change defaults. Key sections:
+Edit `hyperstack-vm1.toml` / `hyperstack-vm2.toml` (or `hyperstack-vm.toml` for single-VM).
+Key sections:
 
 | Section | Purpose |
 |---------|---------|
 | `[vm]` | Flavor, image, environment name |
 | `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[vllm.presets.*]` | Named model presets for hot-switching |
 | `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
 | `[network]` | Ports, WireGuard subnet, allowed CIDRs |
 | `[wireguard]` | Auto-setup script path |
@@ -157,10 +217,7 @@ clear that trust file for intentional reprovisioning; unexpected host key change
 
 ```bash
 # Live engine stats (throughput, KV cache, prefix cache hit rate)
-ssh ubuntu@<vm-ip> 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
-
-# Last 1 minute of stats
-ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"'
 
 # GPU stats (every 5 s)
 ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
@@ -169,18 +226,12 @@ ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power
 ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
 ```
 
-Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+Healthy baseline (A100 80GB PCIe):
 
 | Metric | Expected |
 |--------|----------|
 | Prefill throughput | 5,000–11,000 tok/s |
 | Decode throughput | 40–99 tok/s |
 | KV cache usage | 2–5% for typical sessions |
-| Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) |
-| Prefix cache hit (OpenCode) | >50% after warm-up |
-
-## Switching models
-
-Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config.
 
 See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
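
The two-VM layout described in the updated README can be summarized as a small lookup of endpoints. The following is a minimal sketch only: `VMS`, `endpoints`, and the port constants are hypothetical names and are not part of `hyperstack.rb`. The hostnames, ports, and API paths are taken from the README text in this diff, and the sketch assumes plain `http://` over the WireGuard tunnel, as in the Claude Code example.

```ruby
# Hypothetical helper, not part of hyperstack.rb.
# Maps each config file to the WireGuard hostname listed in the README table.
VMS = {
  "hyperstack-vm1.toml" => "hyperstack1.wg1",
  "hyperstack-vm2.toml" => "hyperstack2.wg1",
}.freeze

VLLM_PORT = 11434    # vLLM, OpenAI-compatible API
LITELLM_PORT = 4000  # LiteLLM proxy, Anthropic API translation

# Returns the two endpoint URLs for a given VM hostname.
def endpoints(host)
  {
    vllm: "http://#{host}:#{VLLM_PORT}/v1/chat/completions",
    litellm: "http://#{host}:#{LITELLM_PORT}/v1/messages",
  }
end

# Print one line per VM so the mapping can be checked at a glance.
VMS.each do |config, host|
  urls = endpoints(host)
  puts "#{config}: vLLM #{urls[:vllm]}, LiteLLM #{urls[:litellm]}"
end
```

This only builds and prints the URLs; it does not contact the VMs. The actual end-to-end check remains `ruby hyperstack.rb --config <file> test`.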