From d8575832ae0022f94cd786b15f8b88de0bf18672 Mon Sep 17 00:00:00 2001
From: Paul Buetow <paul@buetow.org>
Date: Wed, 18 Mar 2026 09:10:14 +0200
Subject: Add vLLM + LiteLLM support; rename script; add README
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Replace Ollama (now disabled by default) with a vLLM Docker container +
  LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434,
  polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude
  model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
  override config at runtime
- New `test` command: end-to-end inference test over WireGuard against
  vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM)
  from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI
  reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---
 snippets/hyperstack/README.md          |  157 +++
 snippets/hyperstack/hyperstack-vm.toml |   30 +-
 snippets/hyperstack/hyperstack.rb      | 1790 ++++++++++++++++++++++++++++++++
 snippets/hyperstack/hyperstack_vm.rb   | 1418 -------------------------
 snippets/hyperstack/vllm-setup.txt     |  487 +++++++++
 5 files changed, 2462 insertions(+), 1420 deletions(-)
 create mode 100644 snippets/hyperstack/README.md
 create mode 100644 snippets/hyperstack/hyperstack.rb
 delete mode 100644 snippets/hyperstack/hyperstack_vm.rb
 create mode 100644 snippets/hyperstack/vllm-setup.txt
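
Note (illustrative only, not part of the commit): the alias mapping described
in the commit message can be pictured with the Ruby sketch below. It builds a
LiteLLM-style `model_list` in which every Claude alias points at the single
model served by vLLM. The helper name `litellm_model_list`, the `hosted_vllm/`
provider prefix, and the `api_base` value are assumptions made for this
sketch; the config that hyperstack.rb actually writes may differ.

```ruby
require 'yaml'

# Hypothetical helper: map each Claude alias to the one model vLLM serves.
def litellm_model_list(aliases, vllm_model, api_base)
  aliases.map do |name|
    {
      'model_name' => name,
      'litellm_params' => {
        'model' => "hosted_vllm/#{vllm_model}", # assumed provider prefix
        'api_base' => api_base                  # assumed local vLLM endpoint
      }
    }
  end
end

aliases = %w[claude-sonnet-4-20250514 claude-opus-4-20250514]
config = {
  'model_list' => litellm_model_list(
    aliases, 'bullpoint/Qwen3-Coder-Next-AWQ-4bit', 'http://127.0.0.1:11434/v1'
  )
}

# Print the structure as YAML, the format LiteLLM reads its config in.
puts config.to_yaml
```

Because every alias resolves to the same backend, adding a new Anthropic
model ID only means appending one more name to the alias list.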

(limited to 'snippets/hyperstack')
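
Note (illustrative only, not part of the commit): the three endpoints the new
`test` command exercises can be sketched as plain Net::HTTP requests. The
snippet below only builds the requests and never sends them, so it runs
without a VM. The helper `build_probes`, the prompt text, and the `max_tokens`
value are placeholders invented for this sketch; the `x-api-key` and
`anthropic-version` headers follow the public Anthropic Messages API, and the
real implementation in hyperstack.rb may look different.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: build (but do not send) the three probe requests.
def build_probes(host:, vllm_port:, litellm_port:, vllm_model:, claude_alias:, master_key:)
  messages = [{ 'role' => 'user', 'content' => 'Say OK.' }]

  # vLLM, OpenAI-compatible: list the served models.
  models = Net::HTTP::Get.new(URI("http://#{host}:#{vllm_port}/v1/models"))

  # vLLM, OpenAI-compatible: one short chat completion.
  chat = Net::HTTP::Post.new(URI("http://#{host}:#{vllm_port}/v1/chat/completions"))
  chat['content-type'] = 'application/json'
  chat.body = JSON.generate({ 'model' => vllm_model, 'max_tokens' => 16, 'messages' => messages })

  # LiteLLM, Anthropic-compatible: the same prompt via /v1/messages.
  anthropic = Net::HTTP::Post.new(URI("http://#{host}:#{litellm_port}/v1/messages"))
  anthropic['content-type'] = 'application/json'
  anthropic['x-api-key'] = master_key
  anthropic['anthropic-version'] = '2023-06-01'
  anthropic.body = JSON.generate({ 'model' => claude_alias, 'max_tokens' => 16, 'messages' => messages })

  { models: models, chat: chat, anthropic: anthropic }
end

probes = build_probes(
  host: '192.168.3.1', vllm_port: 11_434, litellm_port: 4_000,
  vllm_model: 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
  claude_alias: 'claude-sonnet-4-20250514', master_key: 'sk-litellm-master'
)
probes.each_value { |req| puts "#{req.method} #{req.uri}" }
```

Sending each request with `Net::HTTP.start(uri.host, uri.port)` over an active
wg1 tunnel would turn the sketch into a live check.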

diff --git a/snippets/hyperstack/README.md b/snippets/hyperstack/README.md
new file mode 100644
index 0000000..e5cc7ea
--- /dev/null
+++ b/snippets/hyperstack/README.md
@@ -0,0 +1,157 @@
+# hyperstack
+
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+
+## Architecture
+
+```
+Claude Code (local)                    Hyperstack VM (A100 80GB)
+┌─────────────────┐                   ┌──────────────────────────────────┐
+│ claude CLI       │── Anthropic API ─▶│ LiteLLM proxy (:4000)           │
+│                  │   /v1/messages    │   Anthropic → OpenAI translation │
+│                  │   via WireGuard   │             │                    │
+└─────────────────┘                   │             ▼                    │
+                                      │ vLLM engine (:11434)            │
+OpenCode (local)                      │   bullpoint/Qwen3-Coder-Next-   │
+┌─────────────────┐                   │   AWQ-4bit (45 GB, MoE 80B)     │
+│ opencode         │── OpenAI API ────▶│   FlashAttention v2             │
+│                  │   /v1/chat/...    │   prefix caching                │
+└─────────────────┘                   └──────────────────────────────────┘
+```
+
+Both local clients connect over a WireGuard tunnel (`wg1`, subnet `192.168.3.0/24`).
+The VM gets `192.168.3.1`; your local machine gets `192.168.3.2`.
+
+## Prerequisites
+
+- Hyperstack account with API key in `~/.hyperstack`
+- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
+- WireGuard setup script: `wg1-setup.sh` (present in this directory)
+- Ruby with `toml-rb` gem: `bundle install`
+
+## Quickstart
+
+```bash
+# Deploy VM, set up WireGuard + vLLM + LiteLLM (~10 min on first run)
+ruby hyperstack.rb create
+
+# Verify everything is working
+ruby hyperstack.rb test
+
+# Use Claude Code against the vLLM backend on the VM (via the LiteLLM proxy)
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+
+# Tear down
+ruby hyperstack.rb delete
+```
+
+## Using Claude Code with vLLM
+
+WireGuard (`wg1`) must be active before connecting.
+
+```bash
+ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+ANTHROPIC_API_KEY=sk-litellm-master \
+claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+```
+
+If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
+
+```bash
+claude /logout
+```
+
+**Fish shell alias** (add to `~/.config/fish/config.fish`):
+
+```fish
+alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+  ANTHROPIC_API_KEY=sk-litellm-master \
+  claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+```
+
+**Available model aliases** — all map to the same vLLM model:
+
+| Alias | Use case |
+|-------|----------|
+| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
+| `claude-opus-4-20250514` | |
+| `claude-sonnet-4-20250514` | |
+| `claude-haiku-3-5-20241022` | |
+
+Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in `hyperstack-vm.toml` as they are released.
+
+## Using OpenCode with vLLM
+
+OpenCode speaks OpenAI natively — connect directly to vLLM, no LiteLLM needed:
+
+```bash
+OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+OPENAI_API_KEY=EMPTY \
+opencode
+```
+
+Set the model name to `bullpoint/Qwen3-Coder-Next-AWQ-4bit` in your OpenCode config.
+
+## CLI reference
+
+```
+ruby hyperstack.rb [--config path] <command> [options]
+
+Commands:
+  create   Deploy a new VM and run full provisioning
+  delete   Destroy the tracked VM
+  status   Show VM and WireGuard status
+  test     Run end-to-end inference tests (vLLM + LiteLLM)
+
+create options:
+  --replace                Delete existing tracked VM before creating
+  --dry-run                Print the plan without making changes
+  --vllm / --no-vllm       Override config: enable/disable vLLM+LiteLLM setup
+  --ollama / --no-ollama   Override config: enable/disable Ollama setup
+```
+
+## Configuration
+
+Edit `hyperstack-vm.toml` to change defaults. Key sections:
+
+| Section | Purpose |
+|---------|---------|
+| `[vm]` | Flavor, image, environment name |
+| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
+| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
+| `[wireguard]` | Auto-setup script path |
+
+## Monitoring vLLM
+
+```bash
+# Live engine stats (throughput, KV cache, prefix cache hit rate)
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# Last 1 minute of stats
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"'
+
+# GPU stats (every 5 s)
+ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+
+# LiteLLM proxy log
+ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
+```
+
+Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+
+| Metric | Expected |
+|--------|----------|
+| Prefill throughput | 5,000–11,000 tok/s |
+| Decode throughput | 40–99 tok/s |
+| KV cache usage | 2–5% for typical sessions |
+| Prefix cache hit (Claude Code) | 0% (expected — prompt prefix mutates each turn) |
+| Prefix cache hit (OpenCode) | >50% after warm-up |
+
+## Switching models
+
+Stop the current container, start a new one with a different `--model`, then update `vllm.model` in `hyperstack-vm.toml` and re-run `ruby hyperstack.rb create` to reinstall LiteLLM with the updated config.
+
+See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
diff --git a/snippets/hyperstack/hyperstack-vm.toml b/snippets/hyperstack/hyperstack-vm.toml
index 2d83b0f..0ea3cfc 100644
--- a/snippets/hyperstack/hyperstack-vm.toml
+++ b/snippets/hyperstack/hyperstack-vm.toml
@@ -31,7 +31,10 @@ connect_timeout_sec = 10
 [network]
 wireguard_udp_port = 56710
 wireguard_subnet = "192.168.3.0/24"
+# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
+# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
+litellm_port = 4000
 allowed_ssh_cidrs = ["0.0.0.0/0"]
 allowed_wireguard_cidrs = ["0.0.0.0/0"]
 
@@ -42,13 +45,36 @@ configure_ufw = true
 configure_ollama_host = false
 
 [ollama]
-install = true
+# Disabled in favour of vLLM; set install = true to switch back to Ollama.
+install = false
 models_dir = "/ephemeral/ollama/models"
 listen_host = "0.0.0.0:11434"
 gpu_overhead_mb = 2000
-num_parallel = 4
+num_parallel = 1
+context_length = 32768
 pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
 
+# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# Use --vllm / --no-vllm CLI flags to override install at runtime.
+[vllm]
+install = true
+model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+# HuggingFace model cache on ephemeral NVMe (fast; survives reboots on most providers).
+hug_cache_dir = "/ephemeral/hug"
+container_name = "vllm_qwen3"
+max_model_len = 262144
+gpu_memory_utilization = 0.92
+tensor_parallel_size = 1
+tool_call_parser = "qwen3_coder"
+litellm_master_key = "sk-litellm-master"
+# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
+litellm_claude_model_names = [
+  "claude-sonnet-4-20250514",
+  "claude-opus-4-20250514",
+  "claude-opus-4-6-20260604",
+  "claude-haiku-3-5-20241022"
+]
+
 [wireguard]
 auto_setup = true
 setup_script = "./wg1-setup.sh"
diff --git a/snippets/hyperstack/hyperstack.rb b/snippets/hyperstack/hyperstack.rb
new file mode 100644
index 0000000..c84d013
--- /dev/null
+++ b/snippets/hyperstack/hyperstack.rb
@@ -0,0 +1,1790 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+begin
+  require 'bundler/setup'
+rescue LoadError
+  nil
+rescue Gem::Exception => e
+  # Ruby can ship with a Bundler library version whose matching executable
+  # is not installed locally. Fall back to direct gem loading in that case.
+  raise unless e.is_a?(Gem::GemNotFoundException) || e.is_a?(Gem::LoadError)
+end
+
+require 'json'
+require 'net/http'
+require 'open3'
+require 'optparse'
+require 'ipaddr'
+require 'shellwords'
+require 'socket'
+require 'time'
+require 'timeout'
+
+begin
+  require 'toml-rb'
+rescue LoadError
+  warn "Missing dependency: toml-rb. Run `bundle install` in #{__dir__} first."
+  exit 2
+end
+
+module HyperstackVM
+  class Error < StandardError; end
+
+  class Config
+    DEFAULTS = {
+      'auth' => {
+        'api_key_file' => '~/.hyperstack'
+      },
+      'hyperstack' => {
+        'base_url' => 'https://infrahub-api.nexgencloud.com/v1'
+      },
+      'state' => {
+        'file' => '.hyperstack-vm-state.json'
+      },
+      'vm' => {
+        'name_prefix' => 'hyperstack',
+        'hostname' => 'hyperstack',
+        'flavor_name' => 'n3-A100x1',
+        'image_name' => 'Ubuntu Server 24.04 LTS R570 CUDA 12.8 with Docker',
+        'assign_floating_ip' => true,
+        'create_bootable_volume' => false,
+        'enable_port_randomization' => false,
+        'labels' => %w[gpt-oss-120b wireguard]
+      },
+      'ssh' => {
+        'username' => 'ubuntu',
+        'private_key_path' => '~/.ssh/id_rsa',
+        'hyperstack_key_name' => 'earth',
+        'port' => 22,
+        'connect_timeout_sec' => 10
+      },
+      'network' => {
+        'wireguard_udp_port' => 56_710,
+        'wireguard_subnet' => '192.168.3.0/24',
+        'ollama_port' => 11_434, # reused by vLLM for firewall compatibility
+        'litellm_port' => 4_000,
+        'allowed_ssh_cidrs' => ['0.0.0.0/0'],
+        'allowed_wireguard_cidrs' => ['0.0.0.0/0']
+      },
+      'bootstrap' => {
+        'enable_guest_bootstrap' => true,
+        'install_wireguard' => true,
+        'configure_ufw' => true,
+        'configure_ollama_host' => false
+      },
+      'ollama' => {
+        # Disabled in favour of vLLM; set [ollama] install = true in the TOML to use Ollama instead.
+        'install' => false,
+        'models_dir' => '/ephemeral/ollama/models',
+        'listen_host' => '0.0.0.0:11434',
+        'gpu_overhead_mb' => 2000,
+        'num_parallel' => 1,
+        'context_length' => 32_768,
+        'pull_models' => ['qwen3-coder:30b', 'gpt-oss:20b', 'gpt-oss:120b', 'nemotron-3-super']
+      },
+      'vllm' => {
+        # vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI chat completions.
+        'install' => true,
+        'model' => 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
+        'hug_cache_dir' => '/ephemeral/hug',
+        'container_name' => 'vllm_qwen3',
+        'max_model_len' => 262_144,
+        'gpu_memory_utilization' => 0.92,
+        'tensor_parallel_size' => 1,
+        'tool_call_parser' => 'qwen3_coder',
+        # LiteLLM maps each Claude model alias to the vLLM model; add new Anthropic IDs here.
+        'litellm_claude_model_names' => %w[
+          claude-sonnet-4-20250514
+          claude-opus-4-20250514
+          claude-opus-4-6-20260604
+          claude-haiku-3-5-20241022
+        ],
+        'litellm_master_key' => 'sk-litellm-master'
+      },
+      'wireguard' => {
+        'auto_setup' => true,
+        'setup_script' => './wg1-setup.sh'
+      },
+      'local_client' => {
+        'check_wg1_service' => true,
+        'interface_name' => 'wg1',
+        'config_path' => '/etc/wireguard/wg1.conf'
+      }
+    }.freeze
+
+    attr_reader :path
+
+    def self.load(path)
+      expanded = File.expand_path(path)
+      raise Error, "Config file not found: #{expanded}" unless File.exist?(expanded)
+
+      raw = TomlRB.load_file(expanded)
+      new(raw, expanded)
+    rescue TomlRB::ParseError => e
+      raise Error, "Failed to parse TOML config #{expanded}: #{e.message}"
+    end
+
+    def initialize(raw, path)
+      @path = path
+      @data = deep_merge(DEFAULTS, raw || {})
+      validate!
+    end
+
+    def api_key
+      key_path = expand_path(fetch('auth', 'api_key_file'))
+      raise Error, "API key file not found: #{key_path}" unless File.exist?(key_path)
+
+      token = File.readlines(key_path, chomp: true).find { |line| !line.strip.empty? }&.strip
+      raise Error, "API key file is empty: #{key_path}" if token.nil? || token.empty?
+
+      token
+    rescue Errno::EACCES => e
+      raise Error, "Cannot read API key file #{key_path}: #{e.message}"
+    end
+
+    def api_base_url
+      fetch('hyperstack', 'base_url')
+    end
+
+    def state_file
+      expand_path(fetch('state', 'file'))
+    end
+
+    def environment_name
+      fetch('vm', 'environment_name')
+    end
+
+    def flavor_name
+      fetch('vm', 'flavor_name')
+    end
+
+    def image_name
+      fetch('vm', 'image_name')
+    end
+
+    def vm_name_prefix
+      fetch('vm', 'name_prefix')
+    end
+
+    def generated_vm_name
+      "#{vm_name_prefix}-#{Time.now.utc.strftime('%Y%m%d%H%M%S')}"
+    end
+
+    def vm_hostname
+      value = fetch('vm', 'hostname')
+      return nil if blank?(value)
+
+      value.to_s.downcase
+    end
+
+    def assign_floating_ip?
+      truthy?(fetch('vm', 'assign_floating_ip'))
+    end
+
+    def create_bootable_volume?
+      truthy?(fetch('vm', 'create_bootable_volume'))
+    end
+
+    def enable_port_randomization?
+      truthy?(fetch('vm', 'enable_port_randomization'))
+    end
+
+    def labels
+      Array(fetch('vm', 'labels')).map(&:to_s)
+    end
+
+    def user_data
+      custom = custom_user_data
+      return custom unless custom.nil? || custom.empty?
+      return nil if vm_hostname.nil?
+
+      default_hostname_cloud_init
+    rescue Errno::ENOENT => e
+      raise Error, "User data file not found: #{e.message}"
+    rescue Errno::EACCES => e
+      raise Error, "Cannot read user data file: #{e.message}"
+    end
+
+    def ssh_username
+      fetch('ssh', 'username')
+    end
+
+    def ssh_private_key_path
+      expand_path(fetch('ssh', 'private_key_path'))
+    end
+
+    def ssh_key_name
+      fetch('ssh', 'hyperstack_key_name')
+    end
+
+    def ssh_port
+      Integer(fetch('ssh', 'port'))
+    end
+
+    def ssh_connect_timeout
+      Integer(fetch('ssh', 'connect_timeout_sec'))
+    end
+
+    def wireguard_udp_port
+      Integer(fetch('network', 'wireguard_udp_port'))
+    end
+
+    def wireguard_subnet
+      fetch('network', 'wireguard_subnet')
+    end
+
+    def ollama_port
+      Integer(fetch('network', 'ollama_port'))
+    end
+
+    def litellm_port
+      Integer(fetch('network', 'litellm_port'))
+    end
+
+    # Derives the VM's WireGuard IP as the first host in the subnet (network + 1).
+    # E.g. 192.168.3.0/24 → 192.168.3.1
+    def wireguard_gateway_ip
+      base = IPAddr.new(wireguard_subnet).to_s
+      parts = base.split('.').map(&:to_i)
+      parts[-1] += 1
+      parts.join('.')
+    end
+
+    def allowed_ssh_cidrs
+      Array(fetch('network', 'allowed_ssh_cidrs')).map(&:to_s)
+    end
+
+    def allowed_wireguard_cidrs
+      Array(fetch('network', 'allowed_wireguard_cidrs')).map(&:to_s)
+    end
+
+    def guest_bootstrap_enabled?
+      truthy?(fetch('bootstrap', 'enable_guest_bootstrap'))
+    end
+
+    def install_wireguard?
+      truthy?(fetch('bootstrap', 'install_wireguard'))
+    end
+
+    def configure_ufw?
+      truthy?(fetch('bootstrap', 'configure_ufw'))
+    end
+
+    def configure_ollama_host?
+      truthy?(fetch('bootstrap', 'configure_ollama_host'))
+    end
+
+    def ollama_install_enabled?
+      truthy?(fetch('ollama', 'install'))
+    end
+
+    def ollama_models_dir
+      fetch('ollama', 'models_dir')
+    end
+
+    def ollama_listen_host
+      fetch('ollama', 'listen_host')
+    end
+
+    def ollama_gpu_overhead_mb
+      Integer(fetch('ollama', 'gpu_overhead_mb'))
+    end
+
+    def ollama_num_parallel
+      Integer(fetch('ollama', 'num_parallel'))
+    end
+
+    # Maximum context length for Ollama inference; keeps KV cache bounded
+    # on single-GPU setups to avoid slow prefill at large context sizes.
+    def ollama_context_length
+      Integer(fetch('ollama', 'context_length'))
+    end
+
+    def ollama_pull_models
+      Array(fetch('ollama', 'pull_models')).map(&:to_s)
+    end
+
+    def vllm_install_enabled?
+      truthy?(fetch('vllm', 'install'))
+    end
+
+    def vllm_model
+      fetch('vllm', 'model')
+    end
+
+    def vllm_hug_cache_dir
+      fetch('vllm', 'hug_cache_dir')
+    end
+
+    def vllm_container_name
+      fetch('vllm', 'container_name')
+    end
+
+    def vllm_max_model_len
+      Integer(fetch('vllm', 'max_model_len'))
+    end
+
+    def vllm_gpu_memory_utilization
+      Float(fetch('vllm', 'gpu_memory_utilization'))
+    end
+
+    def vllm_tensor_parallel_size
+      Integer(fetch('vllm', 'tensor_parallel_size'))
+    end
+
+    def vllm_tool_call_parser
+      fetch('vllm', 'tool_call_parser')
+    end
+
+    # Claude model aliases that LiteLLM maps to the vLLM model.
+    # Must match what Claude Code sends in the model field.
+    def litellm_claude_model_names
+      Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
+    end
+
+    def litellm_master_key
+      fetch('vllm', 'litellm_master_key')
+    end
+
+    def local_client_checks_enabled?
+      truthy?(fetch('local_client', 'check_wg1_service'))
+    end
+
+    def local_interface_name
+      fetch('local_client', 'interface_name')
+    end
+
+    def local_wg_config_path
+      fetch('local_client', 'config_path')
+    end
+
+    def wireguard_auto_setup?
+      truthy?(fetch('wireguard', 'auto_setup'))
+    end
+
+    def wireguard_setup_script
+      expand_path(fetch('wireguard', 'setup_script'))
+    end
+
+    def desired_security_rules
+      rules = []
+
+      allowed_ssh_cidrs.each do |cidr|
+        rules << firewall_rule('tcp', ssh_port, cidr)
+      end
+
+      allowed_wireguard_cidrs.each do |cidr|
+        rules << firewall_rule('udp', wireguard_udp_port, cidr)
+      end
+
+      # Port 11434: shared by Ollama and vLLM (WireGuard-subnet-restricted).
+      rules << firewall_rule('tcp', ollama_port, wireguard_subnet)
+      # Port 4000: LiteLLM Anthropic-API proxy (WireGuard-subnet-restricted).
+      rules << firewall_rule('tcp', litellm_port, wireguard_subnet)
+      rules.uniq
+    end
+
+    private
+
+    def validate!
+      %w[auth hyperstack state vm ssh network bootstrap ollama vllm wireguard local_client].each do |section|
+        raise Error, "Missing config section [#{section}]" unless @data.key?(section)
+      end
+
+      %w[environment_name flavor_name image_name].each do |key|
+        raise Error, "Missing [vm].#{key} in config #{path}" if blank?(dig('vm', key))
+      end
+
+      if vm_hostname && vm_hostname !~ /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/
+        raise Error, "Invalid [vm].hostname #{vm_hostname.inspect}; use lowercase letters, digits, and hyphens only."
+      end
+
+      %w[username hyperstack_key_name].each do |key|
+        raise Error, "Missing [ssh].#{key} in config #{path}" if blank?(dig('ssh', key))
+      end
+
+      [wireguard_subnet, *allowed_ssh_cidrs, *allowed_wireguard_cidrs].each do |cidr|
+        IPAddr.new(cidr)
+      rescue IPAddr::InvalidAddressError => e
+        raise Error, "Invalid CIDR #{cidr.inspect}: #{e.message}"
+      end
+    end
+
+    def firewall_rule(protocol, port, cidr)
+      ip = IPAddr.new(cidr)
+      {
+        'direction' => 'ingress',
+        'ethertype' => ip.ipv4? ? 'IPv4' : 'IPv6',
+        'protocol' => protocol,
+        'port_range_min' => port,
+        'port_range_max' => port,
+        'remote_ip_prefix' => cidr
+      }
+    end
+
+    def fetch(section, key)
+      dig(section, key)
+    end
+
+    def dig(*keys)
+      keys.reduce(@data) do |memo, key|
+        memo.is_a?(Hash) ? memo[key] : nil
+      end
+    end
+
+    def blank?(value)
+      value.nil? || value.to_s.strip.empty?
+    end
+
+    def truthy?(value)
+      value == true
+    end
+
+    def custom_user_data
+      inline = dig('vm', 'user_data')
+      return inline unless inline.nil? || inline.empty?
+
+      file = dig('vm', 'user_data_file')
+      return nil if file.nil? || file.empty?
+
+      File.read(expand_path(file))
+    end
+
+    def default_hostname_cloud_init
+      <<~CLOUD_INIT
+        #cloud-config
+        preserve_hostname: false
+        hostname: #{vm_hostname}
+      CLOUD_INIT
+    end
+
+    def expand_path(value)
+      return nil if value.nil?
+
+      string = value.to_s
+      return File.expand_path(string) if string.start_with?('~')
+      return string if string.start_with?('/')
+
+      File.expand_path(string, File.dirname(path))
+    end
+
+    def deep_merge(left, right)
+      left.merge(right) do |_key, old_value, new_value|
+        if old_value.is_a?(Hash) && new_value.is_a?(Hash)
+          deep_merge(old_value, new_value)
+        else
+          new_value
+        end
+      end
+    end
+  end
+
+  class StateStore
+    def initialize(path)
+      @path = path
+    end
+
+    attr_reader :path
+
+    def load
+      return nil unless File.exist?(@path)
+
+      JSON.parse(File.read(@path))
+    rescue JSON::ParserError => e
+      raise Error, "Failed to parse state file #{@path}: #{e.message}"
+    end
+
+    def save(payload)
+      temp_path = "#{@path}.tmp"
+      File.write(temp_path, JSON.pretty_generate(payload))
+      File.rename(temp_path, @path)
+    end
+
+    def delete
+      File.delete(@path) if File.exist?(@path)
+    end
+  end
+
+  class HyperstackClient
+    def initialize(base_url:, api_key:)
+      @base_uri = URI(base_url)
+      @api_key = api_key
+    end
+
+    def list_environments
+      response = request(:get, '/core/environments')
+      response.fetch('environments', [])
+    end
+
+    def list_keypairs
+      response = request(:get, '/core/keypairs')
+      response.fetch('keypairs', [])
+    end
+
+    def list_flavors
+      response = request(:get, '/core/flavors')
+      Array(response['data']).flat_map do |entry|
+        Array(entry['flavors']).map do |flavor|
+          flavor.merge(
+            'region_name' => flavor['region_name'] || entry['region_name'],
+            'gpu' => flavor['gpu'] || entry['gpu']
+          )
+        end
+      end
+    end
+
+    def list_images
+      response = request(:get, '/core/images')
+      Array(response['images']).flat_map do |entry|
+        Array(entry['images']).map do |image|
+          image.merge(
+            'region_name' => image['region_name'] || entry['region_name'],
+            'type' => image['type'] || entry['type']
+          )
+        end
+      end
+    end
+
+    def list_vms
+      response = request(:get, '/core/virtual-machines')
+      response.fetch('instances', [])
+    end
+
+    def get_vm(vm_id)
+      response = request(:get, "/core/virtual-machines/#{vm_id}")
+      response.fetch('instance', nil)
+    end
+
+    def create_vm(payload)
+      request(:post, '/core/virtual-machines', payload)
+    end
+
+    def delete_vm(vm_id)
+      request(:delete, "/core/virtual-machines/#{vm_id}")
+    end
+
+    def create_vm_rule(vm_id, payload)
+      request(:post, "/core/virtual-machines/#{vm_id}/sg-rules", payload)
+    end
+
+    private
+
+    def request(method, path, payload = nil)
+      uri = @base_uri.dup
+      uri.path = "#{@base_uri.path}#{path}"
+
+      request = case method
+                when :get
+                  Net::HTTP::Get.new(uri)
+                when :post
+                  Net::HTTP::Post.new(uri)
+                when :delete
+                  Net::HTTP::Delete.new(uri)
+                else
+                  raise Error, "Unsupported HTTP method: #{method}"
+                end
+
+      request['accept'] = 'application/json'
+      request['api_key'] = @api_key
+      if payload
+        request['content-type'] = 'application/json'
+        request.body = JSON.generate(payload)
+      end
+
+      retries_left = 4
+      begin
+        response = Net::HTTP.start(
+          uri.host,
+          uri.port,
+          use_ssl: uri.scheme == 'https',
+          open_timeout: 30,
+          read_timeout: 120
+        ) { |http| http.request(request) }
+
+        parse_response(response)
+      rescue Timeout::Error, Errno::ECONNREFUSED, Errno::ECONNRESET,
+             Errno::EHOSTUNREACH, Errno::ENETUNREACH,
+             SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout => e
+        raise Error, "Hyperstack API request failed for #{path}: #{e.message}" if retries_left <= 0
+
+        retries_left -= 1
+        delay = (4 - retries_left) * 5
+        warn "API request to #{path} failed (#{e.class}: #{e.message}), retrying in #{delay}s (#{retries_left} left)..."
+        sleep delay
+        retry
+      end
+    end
+
+    def parse_response(response)
+      body = response.body.to_s
+      payload = body.empty? ? {} : JSON.parse(body)
+
+      if response.code.to_i >= 400 || payload['status'] == false
+        message = payload['message'] || payload['error_reason'] || response.message
+        raise Error, "Hyperstack API error (HTTP #{response.code}): #{message}"
+      end
+
+      payload
+    rescue JSON::ParserError => e
+      raise Error, "Failed to parse Hyperstack API response: #{e.message}"
+    end
+  end
+
+  class LocalWireGuard
+    def initialize(interface_name:, config_path:)
+      @interface_name = interface_name
+      @config_path = config_path
+    end
+
+    def status
+      {
+        'service_state' => service_state,
+        'config_path' => @config_path,
+        'endpoint' => configured_endpoint,
+        'config_readable' => !config_contents.nil?
+      }
+    end
+
+    private
+
+    def service_state
+      stdout, _stderr, status = Open3.capture3('systemctl', 'is-active', "wg-quick@#{@interface_name}")
+      value = stdout.to_s.strip
+      return value unless value.empty?
+      return 'active' if status.success?
+
+      'unknown'
+    end
+
+    def configured_endpoint
+      content = config_contents
+      return nil if content.nil?
+
+      parse_wireguard_config(content)['Endpoint']
+    end
+
+    def config_contents
+      return @config_contents if defined?(@config_contents)
+
+      @config_contents = File.read(@config_path)
+    rescue Errno::EACCES, Errno::ENOENT
+      stdout, _stderr, status = Open3.capture3('sudo', '-n', 'cat', @config_path)
+      @config_contents = status.success? ? stdout : nil
+    end
+
+    def parse_wireguard_config(content)
+      current_section = nil
+      peer = {}
+
+      content.each_line do |line|
+        stripped = line.strip
+        next if stripped.empty? || stripped.start_with?('#')
+
+        if stripped.start_with?('[') && stripped.end_with?(']')
+          current_section = stripped[1..-2]
+          next
+        end
+
+        key, value = stripped.split('=', 2).map { |part| part&.strip }
+        next unless current_section == 'Peer' && key && value
+
+        peer[key] = value
+      end
+
+      peer
+    end
+  end
+
+  class Manager
+    def initialize(config:, client:, state_store:, local_wireguard:, out: $stdout)
+      @config = config
+      @client = client
+      @state_store = state_store
+      @local_wireguard = local_wireguard
+      @out = out
+    end
+
+    def create(replace: false, dry_run: false, install_vllm: nil, install_ollama: nil)
+      # CLI flags override config; nil means "use config default".
+      @effective_vllm = install_vllm.nil? ? @config.vllm_install_enabled? : install_vllm
+      @effective_ollama = install_ollama.nil? ? @config.ollama_install_enabled? : install_ollama
+      existing_state = @state_store.load
+      if existing_state && existing_state['vm_id']
+        if replace
+          if dry_run
+            info "DRY RUN: would delete tracked VM #{existing_state['vm_id']} before creating a replacement."
+          else
+            delete(vm_id: existing_state['vm_id'], preserve_state_on_failure: true)
+          end
+        elsif resumable_state?(existing_state)
+          if dry_run
+            print_resume_dry_run(existing_state)
+            return
+          end
+
+          info "Resuming tracked VM #{existing_state['vm_id']} provisioning..."
+          continue_create(existing_state)
+          return
+        else
+          raise Error,
+                "State file #{@state_store.path} already tracks VM #{existing_state['vm_id']}. Use --replace or delete first."
+        end
+      end
+
+      resolved = resolve_dependencies
+      vm_name = @config.generated_vm_name
+      if dry_run
+        info "Planning VM #{vm_name} in #{resolved[:environment]['name']} using #{@config.flavor_name}..."
+      else
+        info "Creating VM #{vm_name} in #{resolved[:environment]['name']} using #{@config.flavor_name}..."
+      end
+
+      payload = build_create_payload(vm_name, resolved)
+      if dry_run
+        print_create_dry_run(vm_name, resolved, payload)
+        return
+      end
+
+      response = @client.create_vm(payload)
+      instance = Array(response['instances']).first
+      raise Error, 'Hyperstack create response did not include an instance ID.' unless instance && instance['id']
+
+      state = {
+        'vm_id' => instance['id'],
+        'vm_name' => vm_name,
+        'environment_name' => resolved[:environment]['name'],
+        'region' => resolved[:environment]['region'],
+        'flavor_name' => resolved[:flavor]['name'],
+        'image_name' => resolved[:image]['name'],
+        'key_name' => resolved[:keypair]['name'],
+        'public_ip' => instance['floating_ip'],
+        'created_at' => Time.now.utc.iso8601
+      }
+      @state_store.save(state)
+      continue_create(state)
+    end
+
+    def delete(vm_id: nil, preserve_state_on_failure: false, dry_run: false)
+      state = @state_store.load
+      target_vm_id = vm_id || state&.dig('vm_id')
+      raise Error, "No VM ID provided and no state file found at #{@state_store.path}." if target_vm_id.nil?
+
+      if dry_run
+        print_delete_dry_run(target_vm_id, state, preserve_state_on_failure: preserve_state_on_failure)
+        return
+      end
+
+      info "Deleting VM #{target_vm_id}..."
+      @client.delete_vm(target_vm_id)
+      wait_for_deletion(target_vm_id)
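+      @state_store.delete if !preserve_state_on_failure && state && state['vm_id'].to_i == target_vm_id.to_i
+      info "VM #{target_vm_id} deleted."
+    rescue Error
+      raise if preserve_state_on_failure
+
+      @state_store.delete if state && state['vm_id'].to_i == target_vm_id.to_i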
+      raise
+    end
+
+    def status
+      state = @state_store.load
+      if state.nil?
+        info "No tracked VM state file at #{@state_store.path}."
+      else
+        begin
+          vm = @client.get_vm(state['vm_id'])
+          desired = @config.desired_security_rules.map { |rule| normalize_rule(rule) }
+          current = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
+          missing_rules = desired - current
+
+          info "Tracked VM: #{state['vm_id']} #{vm['name']}"
+          info "Status: #{vm['status']} / #{vm['vm_state']}"
+          info "Public IP: #{connect_host_for(vm) || 'none'}"
+          info "Missing firewall rules: #{missing_rules.empty? ? 'none' : missing_rules.size}"
+        rescue Error => e
+          warn "Unable to load VM #{state['vm_id']}: #{e.message}"
+        end
+      end
+
+      print_local_wireguard_summary(state&.dig('public_ip'))
+    end
+
+    # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+    # Requires wg1 to be active and the VM to be fully provisioned.
+    def test
+      state = @state_store.load
+      raise Error, "No tracked VM state file found at #{@state_store.path}." if state.nil?
+
+      wg_ip = @config.wireguard_gateway_ip
+      info "Running end-to-end inference tests via WireGuard (#{wg_ip})..."
+
+      if @config.vllm_install_enabled?
+        test_vllm(wg_ip)
+        test_litellm(wg_ip)
+      end
+
+      if @config.ollama_install_enabled?
+        info "  Ollama test: connect via SSH and run 'ollama list' to verify models."
+      end
+
+      info 'All inference tests passed.' if @config.vllm_install_enabled?
+    end
+
+    private
+
+    def resumable_state?(state)
+      state['vm_id'] && (
+        (@config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil?) ||
+        ollama_setup_needed?(state) ||
+        vllm_setup_needed?(state) ||
+        wireguard_setup_needed?(state)
+      )
+    end
+
+    def continue_create(state)
+      vm_id = state['vm_id']
+
+      vm = wait_for_vm_ready(vm_id)
+      ensure_security_rules(vm)
+      vm = wait_for_connect_ip(vm_id)
+      state['public_ip'] = connect_host_for(vm)
+      state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
+      @state_store.save(state)
+
+      wait_for_ssh(state['public_ip'])
+      if @config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil?
+        bootstrap_guest(state['public_ip'])
+        state['bootstrapped_at'] = Time.now.utc.iso8601
+        @state_store.save(state)
+      end
+
+      # Install Ollama binary and configure the service (fast), but defer
+      # model pulls until after the WireGuard tunnel is up so that the user
+      # can monitor progress over the tunnel.
+      if effective_ollama? && state['ollama_installed_at'].nil?
+        install_ollama_service(state['public_ip'])
+        state['ollama_installed_at'] = Time.now.utc.iso8601
+        @state_store.save(state)
+      end
+
+      if wireguard_setup_needed?(state)
+        run_wireguard_setup(state['public_ip'])
+        state['wireguard_setup_at'] = Time.now.utc.iso8601
+        @state_store.save(state)
+      end
+
+      # Pull and verify Ollama models after the tunnel is established.
+      if ollama_setup_needed?(state)
+        pull_ollama_models(state['public_ip'])
+        state['ollama_setup_at'] = Time.now.utc.iso8601
+        state['ollama_models_dir'] = @config.ollama_models_dir
+        state['ollama_pulled_models'] = desired_ollama_models
+        @state_store.save(state)
+      end
+
+      # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+      # the tunnel is up so that model-download progress is visible locally.
+      if vllm_setup_needed?(state)
+        setup_vllm_stack(state['public_ip'])
+        state['vllm_setup_at'] = Time.now.utc.iso8601
+        state['vllm_model'] = @config.vllm_model
+        @state_store.save(state)
+      end
+
+      vm = @client.get_vm(vm_id)
+      state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
+      state['status'] = vm['status']
+      state['vm_state'] = vm['vm_state']
+      state['provisioned_at'] = Time.now.utc.iso8601
+      @state_store.save(state)
+
+      info "VM ready: #{state['public_ip']} (id=#{state['vm_id']})"
+      print_local_wireguard_summary(state['public_ip'])
+      if effective_vllm?
+        wg_ip = @config.wireguard_gateway_ip
+        info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+        info "  vLLM:    http://#{wg_ip}:#{@config.ollama_port}/v1/models"
+        info "  LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
+      end
+    end
+
+    def build_create_payload(vm_name, resolved)
+      payload = {
+        'name' => vm_name,
+        'count' => 1,
+        'environment_name' => resolved[:environment]['name'],
+        'flavor_name' => resolved[:flavor]['name'],
+        'image_name' => resolved[:image]['name'],
+        'key_name' => resolved[:keypair]['name'],
+        'assign_floating_ip' => @config.assign_floating_ip?,
+        'create_bootable_volume' => @config.create_bootable_volume?,
+        'enable_port_randomization' => @config.enable_port_randomization?,
+        'security_rules' => @config.desired_security_rules
+      }
+      payload['labels'] = @config.labels unless @config.labels.empty?
+      payload['user_data'] = @config.user_data if @config.user_data
+      payload
+    end
+
+    def resolve_dependencies
+      environment = @client.list_environments.find { |item| item['name'] == @config.environment_name }
+      raise Error, "Environment #{@config.environment_name.inspect} was not found in Hyperstack." unless environment
+
+      flavor = @client.list_flavors.find do |item|
+        item['name'] == @config.flavor_name && item['region_name'] == environment['region']
+      end
+      raise Error, "Flavor #{@config.flavor_name.inspect} is not available in #{environment['region']}." unless flavor
+
+      if flavor['stock_available'] == false
+        raise Error,
+              "Flavor #{@config.flavor_name.inspect} exists in #{environment['region']} but is out of stock."
+      end
+
+      image = @client.list_images.find do |item|
+        item['name'] == @config.image_name && item['region_name'] == environment['region']
+      end
+      raise Error, "Image #{@config.image_name.inspect} is not available in #{environment['region']}." unless image
+
+      keypair = @client.list_keypairs.find do |item|
+        item['name'] == @config.ssh_key_name && item.dig('environment', 'name') == environment['name']
+      end
+      unless keypair
+        raise Error,
+              "Keypair #{@config.ssh_key_name.inspect} was not found in environment #{environment['name']}."
+      end
+
+      {
+        environment: environment,
+        flavor: flavor,
+        image: image,
+        keypair: keypair
+      }
+    end
+
+    def wait_for_vm_ready(vm_id)
+      with_polling("VM #{vm_id} to become ready for firewall updates") do
+        vm = @client.get_vm(vm_id)
+        next nil if vm.nil?
+
+        raise Error, "VM #{vm_id} entered failed state #{vm['status']} / #{vm['vm_state']}." if failed_vm?(vm)
+
+        vm_ready_for_updates?(vm) ? vm : nil
+      end
+    end
+
+    def wait_for_connect_ip(vm_id)
+      ip_label = @config.assign_floating_ip? ? 'floating IP' : 'reachable IP'
+      with_polling("VM #{vm_id} to receive a #{ip_label}") do
+        vm = @client.get_vm(vm_id)
+        raise Error, "VM #{vm_id} entered failed state #{vm['status']} / #{vm['vm_state']}." if failed_vm?(vm)
+
+        connect_host_for(vm) ? vm : nil
+      end
+    end
+
+    def wait_for_ssh(host)
+      # Remove stale host key for this IP — VMs frequently reuse IPs after
+      # delete/recreate, causing StrictHostKeyChecking to reject the new key
+      remove_stale_host_key(host)
+      info "Waiting for SSH on #{host}:#{@config.ssh_port}..."
+      with_polling("SSH on #{host}:#{@config.ssh_port}") do
+        next nil unless tcp_open?(host, @config.ssh_port)
+
+        _stdout, stderr, status = run_ssh_command(host, 'true')
+        if status.success?
+          true
+        else
+          warn "SSH not ready yet: #{stderr.strip}" unless stderr.to_s.strip.empty?
+          nil
+        end
+      end
+    end
+
+    def ensure_security_rules(vm)
+      existing = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
+      desired = @config.desired_security_rules.map { |rule| normalize_rule(rule) }
+
+      (desired - existing).each do |rule|
+        info "Adding Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..."
+        @client.create_vm_rule(vm['id'], rule)
+      end
+    end
+
+    def bootstrap_guest(host)
+      info 'Bootstrapping Ubuntu guest over SSH...'
+      retries = 3
+      retries.times do |attempt|
+        stdout, stderr, status = run_ssh_command(host, guest_bootstrap_script)
+        return if status.success?
+
+        msg = stderr.strip.empty? ? stdout : stderr
+        raise Error, "Guest bootstrap failed after #{retries} attempts: #{msg}" if attempt == retries - 1
+
+        warn "Bootstrap attempt #{attempt + 1}/#{retries} failed (#{msg.lines.last&.strip}), retrying in 15s..."
+        sleep 15
+      end
+    end
+
+    def ollama_setup_needed?(state)
+      return false unless effective_ollama?
+      # Re-run setup if state has no record, or if desired models changed
+      return true if state['ollama_setup_at'].nil?
+
+      model_list_signature(desired_ollama_models) != model_list_signature(state['ollama_pulled_models'])
+    end
+
+    def install_ollama_service(host)
+      info "Installing and configuring Ollama on #{host}..."
+      output, status = run_ssh_command_streaming(host, ollama_install_script)
+      raise Error, "Ollama install failed: #{output.strip}" unless status.success?
+    end
+
+    def pull_ollama_models(host)
+      info "Pulling Ollama models on #{host}..."
+      output, status = run_ssh_command_streaming(host, ollama_pull_script)
+      raise Error, "Ollama model pull failed: #{output.strip}" unless status.success?
+
+      # Verify all models are actually present on the remote (belt-and-suspenders
+      # check in case ollama pull returned 0 without actually pulling the model)
+      verify_remote_models(host)
+    end
+
+    def verify_remote_models(host)
+      stdout, _stderr, status = run_ssh_command(host, 'ollama list')
+      return unless status.success?
+
+      remote_models = stdout.lines.drop(1).map { |l| l.split.first }.compact
+      missing = desired_ollama_models.reject { |m| remote_models.any? { |r| r.start_with?(m) } }
+      return if missing.empty?
+
+      raise Error, "Models missing after setup: #{missing.join(', ')}. Remote has: #{remote_models.join(', ')}"
+    end
+
+    def wireguard_setup_needed?(state)
+      return false unless @config.wireguard_auto_setup?
+
+      public_ip = state['public_ip'].to_s.strip
+      return true if public_ip.empty?
+
+      expected_endpoint = "#{public_ip}:#{@config.wireguard_udp_port}"
+      @local_wireguard.status['endpoint'] != expected_endpoint
+    end
+
+    def run_wireguard_setup(host)
+      validate_wireguard_setup_script!
+      retries = 3
+      retries.times do |attempt|
+        info "Running WireGuard auto-setup via #{@config.wireguard_setup_script} #{host}..."
+
+        status = run_wireguard_script(host)
+        return if status.success?
+
+        if attempt == retries - 1
+          raise Error, "WireGuard setup failed after #{retries} attempts (exit #{status.exitstatus})."
+        end
+
+        delay = (attempt + 1) * 15
+        warn "WireGuard setup attempt #{attempt + 1}/#{retries} failed (exit #{status.exitstatus}), retrying in #{delay}s..."
+        sleep delay
+      end
+    end
+
+    def run_wireguard_script(host)
+      Open3.popen2e('bash', @config.wireguard_setup_script, host) do |stdin, output, wait_thr|
+        stdin.sync = true
+        stdin.puts
+        stdin.close
+
+        output.each { |line| @out.print(line) }
+        wait_thr.value
+      end
+    end
+
+    def wait_for_deletion(vm_id)
+      info "Waiting for VM #{vm_id} deletion to complete..."
+      with_polling("VM #{vm_id} deletion", timeout: 300) do
+        @client.get_vm(vm_id)
+        nil
+      rescue Error => e
+        raise unless e.message.include?('not_found') || e.message.include?('does not exists')
+
+        true
+      end
+    end
+
+    def connect_host_for(vm)
+      return vm['floating_ip'] if @config.assign_floating_ip?
+
+      vm['floating_ip'] || vm['fixed_ip']
+    end
+
+    def validate_wireguard_setup_script!
+      script_path = @config.wireguard_setup_script
+      raise Error, "WireGuard setup script not found: #{script_path}" unless File.exist?(script_path)
+
+      mismatches = []
+      mismatches << "ssh.username must be 'ubuntu'" unless @config.ssh_username == 'ubuntu'
+      mismatches << "local_client.interface_name must be 'wg1'" unless @config.local_interface_name == 'wg1'
+      mismatches << 'network.wireguard_udp_port must be 56710' unless @config.wireguard_udp_port == 56_710
+      mismatches << "network.wireguard_subnet must be '192.168.3.0/24'" unless @config.wireguard_subnet == '192.168.3.0/24'
+
+      return if mismatches.empty?
+
+      raise Error, "Configured WireGuard settings do not match #{script_path}: #{mismatches.join('; ')}"
+    end
+
+    def remove_stale_host_key(host)
+      system('ssh-keygen', '-R', host, out: File::NULL, err: File::NULL)
+      # Also remove bracketed form for non-standard ports
+      if @config.ssh_port != 22
+        system('ssh-keygen', '-R', "[#{host}]:#{@config.ssh_port}", out: File::NULL, err: File::NULL)
+      end
+    end
+
+    def failed_vm?(vm)
+      [vm['status'], vm['vm_state'], vm['power_state']].compact.any? do |value|
+        value.to_s.downcase.match?(/error|failed|deleted|shelved/)
+      end
+    end
+
+    def vm_ready_for_updates?(vm)
+      %w[ACTIVE SHUTOFF HIBERNATED].include?(vm['status'].to_s.upcase)
+    end
+
+    def tcp_open?(host, port)
+      Socket.tcp(host, port, connect_timeout: @config.ssh_connect_timeout) do |sock|
+        sock.close
+        true
+      end
+    rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Errno::ENETUNREACH, SocketError, IOError
+      false
+    end
+
+    def run_ssh_command(host, remote_script)
+      Open3.capture3(*ssh_command(host), stdin_data: remote_script)
+    end
+
+    def run_ssh_command_streaming(host, remote_script)
+      combined_output = +''
+      Open3.popen2e(*ssh_command(host)) do |stdin, output, wait_thr|
+        stdin.write(remote_script)
+        stdin.close
+
+        output.each do |line|
+          combined_output << line
+          @out.print(line)
+        end
+
+        return [combined_output, wait_thr.value]
+      end
+    end
+
+    def ssh_command(host)
+      command = [
+        'ssh',
+        '-o', 'BatchMode=yes',
+        '-o', 'StrictHostKeyChecking=accept-new',
+        '-o', "ConnectTimeout=#{@config.ssh_connect_timeout}",
+        '-p', @config.ssh_port.to_s
+      ]
+      if File.exist?(@config.ssh_private_key_path)
+        command.concat(['-i', @config.ssh_private_key_path])
+      else
+        warn "SSH private key #{@config.ssh_private_key_path} does not exist; falling back to default ssh-agent identity."
+      end
+
+      command << "#{@config.ssh_username}@#{host}"
+      command << 'bash -se'
+      command
+    end
+
+    def with_polling(description, timeout: 900, interval: 5)
+      deadline = Time.now + timeout
+      loop do
+        result = yield
+        return result if result
+
+        raise Error, "Timed out waiting for #{description}." if Time.now >= deadline
+
+        sleep interval
+      end
+    end
+
+    def normalize_rule(rule)
+      {
+        'direction' => rule['direction'].to_s.downcase,
+        'ethertype' => rule['ethertype'].to_s,
+        'protocol' => rule['protocol'].to_s.downcase,
+        'port_range_min' => integer_or_nil(rule['port_range_min']),
+        'port_range_max' => integer_or_nil(rule['port_range_max']),
+        'remote_ip_prefix' => rule['remote_ip_prefix'].to_s
+      }
+    end
+
+    def print_create_dry_run(vm_name, resolved, payload)
+      info 'DRY RUN: no VM or state file will be created.'
+      info "State file: #{@state_store.path}"
+      info "Resolved environment: #{resolved[:environment]['name']} (region #{resolved[:environment]['region']})"
+      info "Resolved flavor: #{format_flavor(resolved[:flavor])}"
+      info "Resolved image: #{resolved[:image]['name']}"
+      info "Resolved SSH keypair: #{resolved[:keypair]['name']}"
+      info "Planned VM name: #{vm_name}"
+      info 'Create payload:'
+      @out.puts(JSON.pretty_generate(payload))
+      if @config.guest_bootstrap_enabled?
+        info 'Guest bootstrap script:'
+        @out.puts(guest_bootstrap_script)
+      else
+        info 'Guest bootstrap is disabled in config.'
+      end
+      if effective_ollama?
+        info "Ollama will be installed with models stored under #{@config.ollama_models_dir}"
+        unless desired_ollama_models.empty?
+          info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
+        end
+      end
+      if effective_vllm?
+        info "vLLM will be installed: #{@config.vllm_model}"
+        info "  Container: #{@config.vllm_container_name}, port #{@config.ollama_port}, max_model_len #{@config.vllm_max_model_len}"
+        info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
+        info "  Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
+      end
+      if @config.wireguard_auto_setup?
+        info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>"
+      end
+      print_local_wireguard_summary(nil)
+    end
+
+    def print_resume_dry_run(state)
+      info "DRY RUN: would resume provisioning tracked VM #{state['vm_id']}."
+      begin
+        vm = @client.get_vm(state['vm_id'])
+        info "Tracked VM status: #{vm['status']} / #{vm['vm_state']}"
+        info "Tracked VM public IP: #{connect_host_for(vm) || 'none'}"
+      rescue Error => e
+        warn "Unable to inspect tracked VM #{state['vm_id']}: #{e.message}"
+      end
+      if @config.guest_bootstrap_enabled?
+        info 'Guest bootstrap script:'
+        @out.puts(guest_bootstrap_script)
+      end
+      if ollama_setup_needed?(state)
+        info "Ollama would be installed with models stored under #{@config.ollama_models_dir}"
+        unless desired_ollama_models.empty?
+          info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
+        end
+      end
+      if vllm_setup_needed?(state)
+        info "vLLM would be installed: #{@config.vllm_model}"
+        info "LiteLLM proxy would be installed on port #{@config.litellm_port}"
+      end
+      if wireguard_setup_needed?(state)
+        info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}"
+      end
+      print_local_wireguard_summary(state['public_ip'])
+    end
+
+    def print_delete_dry_run(target_vm_id, state, preserve_state_on_failure:)
+      info 'DRY RUN: no VM will be deleted.'
+      begin
+        vm = @client.get_vm(target_vm_id)
+        info "Delete target: #{target_vm_id} #{vm['name']} (#{vm['status']} / #{vm['vm_state']})"
+        info "Delete target public IP: #{connect_host_for(vm) || 'none'}"
+      rescue Error => e
+        warn "Unable to inspect VM #{target_vm_id} before delete: #{e.message}"
+      end
+
+      if state && state['vm_id'].to_i == target_vm_id.to_i
+        action = preserve_state_on_failure ? 'would remain unchanged' : 'would be removed'
+        info "Tracked state file #{@state_store.path} #{action}."
+      else
+        info 'No tracked state entry would be modified.'
+      end
+    end
+
+    def format_flavor(flavor)
+      gpu = flavor['gpu'].to_s.empty? ? 'CPU-only' : flavor['gpu']
+      [
+        flavor['name'],
+        gpu,
+        "#{flavor['gpu_count']} GPU",
+        "#{flavor['ram']} GB RAM",
+        "#{flavor['cpu']} vCPU",
+        "stock=#{flavor['stock_available']}"
+      ].join(', ')
+    end
+
+    def guest_bootstrap_script
+      script = []
+      script << 'set -euo pipefail'
+
+      # Wait for any running unattended-upgrades or apt locks to release
+      # before attempting package operations (transient lock on fresh VMs)
+      script << 'echo "Waiting for apt locks to clear..."'
+      script << 'for i in $(seq 1 30); do'
+      script << '  if ! sudo fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1; then break; fi'
+      script << '  echo "  apt lock held, waiting ($i/30)..."; sleep 10'
+      script << 'done'
+      script << 'sudo systemctl stop unattended-upgrades.service 2>/dev/null || true'
+      script << 'sudo systemctl disable unattended-upgrades.service 2>/dev/null || true'
+
+      if @config.install_wireguard?
+        script << 'command -v wg >/dev/null 2>&1 || (sudo apt-get update && sudo apt-get install -y wireguard)'
+      end
+
+      if @config.configure_ufw?
+        script << "sudo ufw allow #{@config.ssh_port}/tcp comment 'Allow SSH' >/dev/null 2>&1 || true"
+        script << 'sudo ufw --force enable >/dev/null 2>&1 || true'
+        script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+        # Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
+        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+        # Port 4000: LiteLLM proxy (Anthropic API → vLLM); open alongside the inference port.
+        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
+      end
+
+      if @config.configure_ollama_host?
+        # Only write a minimal OLLAMA_HOST override if no override exists yet;
+        # ollama_install_script writes the full override (OLLAMA_MODELS, GPU_OVERHEAD, etc.)
+        script << "if systemctl list-unit-files | grep -q '^ollama.service'; then"
+        script << '  if [ ! -f /etc/systemd/system/ollama.service.d/override.conf ]; then'
+        script << '    sudo mkdir -p /etc/systemd/system/ollama.service.d'
+        script << "    cat <<'OVERRIDE' | sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null"
+        script << '[Service]'
+        script << "Environment=\"OLLAMA_HOST=0.0.0.0:#{@config.ollama_port}\""
+        script << 'OVERRIDE'
+        script << '    sudo systemctl daemon-reload'
+        script << '    sudo systemctl restart ollama || true'
+        script << '  fi'
+        script << 'fi'
+      end
+
+      script << 'echo bootstrap-ok'
+      script.join("\n")
+    end
+
+    def desired_ollama_models
+      normalized_model_list(@config.ollama_pull_models)
+    end
+
+    def normalized_model_list(models)
+      Array(models).each_with_object([]) do |model, ordered|
+        normalized = model.to_s.strip
+        next if normalized.empty? || ordered.include?(normalized)
+
+        ordered << normalized
+      end
+    end
+
+    def model_list_signature(models)
+      normalized_model_list(models).sort
+    end
+
+    # Installs the Ollama binary, configures the systemd override (models dir,
+    # listen host, GPU overhead, parallelism), and starts the service.  Model
+    # pulls are handled separately by ollama_pull_script so that the WireGuard
+    # tunnel can be established first.
+    def ollama_install_script
+      models_dir = @config.ollama_models_dir
+      listen_host = @config.ollama_listen_host
+
+      script = []
+      script << 'set -euo pipefail'
+      script << 'sudo pkill -f unattended-upgrade >/dev/null 2>&1 || true'
+      script << "if ! command -v ollama >/dev/null 2>&1; then curl -fsSL https://ollama.ai/install.sh | sh; fi"
+      if models_dir.start_with?('/ephemeral')
+        script << "mountpoint -q /ephemeral || { echo 'Expected /ephemeral mount is missing'; exit 1; }"
+      end
+      script << "sudo mkdir -p #{Shellwords.escape(models_dir)}"
+      script << "sudo chown -R ollama:ollama #{Shellwords.escape(File.dirname(models_dir))}"
+      script << 'sudo mkdir -p /etc/systemd/system/ollama.service.d'
+      script << "cat <<'OVERRIDE' | sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null"
+      script << '[Service]'
+      script << "Environment=\"OLLAMA_MODELS=#{models_dir}\""
+      script << "Environment=\"OLLAMA_GPU_OVERHEAD=#{@config.ollama_gpu_overhead_mb}\""
+      script << "Environment=\"OLLAMA_NUM_PARALLEL=#{@config.ollama_num_parallel}\""
+      script << "Environment=\"OLLAMA_CONTEXT_LENGTH=#{@config.ollama_context_length}\""
+      script << "Environment=\"OLLAMA_HOST=#{listen_host}\""
+      script << 'OVERRIDE'
+      script << 'sudo systemctl daemon-reload'
+      script << 'sudo systemctl enable --now ollama'
+      script << 'sudo systemctl restart ollama'
+      script << 'sleep 3'
+      script << 'systemctl is-active --quiet ollama'
+      script << 'echo ollama-install-ok'
+      script.join("\n")
+    end
+
+    # Pulls each configured model with retry and per-model + final verification.
+    # Run after WireGuard is up so the user can monitor progress over the tunnel.
+    def ollama_pull_script
+      models_dir = @config.ollama_models_dir
+      model_pulls = desired_ollama_models
+
+      script = []
+      script << 'set -euo pipefail'
+      # Pull each model with retry (transient network failures) and verify
+      # it is actually present afterwards
+      model_pulls.each do |model|
+        escaped = Shellwords.escape(model)
+        script << "echo \"Pulling model #{model}...\""
+        script << "for attempt in 1 2 3; do"
+        script << "  if ollama pull #{escaped}; then break; fi"
+        script << "  if [ \"$attempt\" -eq 3 ]; then echo \"FATAL: failed to pull #{model} after 3 attempts\"; exit 1; fi"
+        script << "  echo \"  pull attempt $attempt failed, retrying in 15s...\"; sleep 15"
+        script << "done"
+        script << "ollama show #{escaped} --modelfile >/dev/null 2>&1 || { echo \"FATAL: model #{model} not found after pull\"; exit 1; }"
+      end
+      # Final verification: ensure all expected models are listed
+      script << 'echo "Verifying all models are present..."'
+      model_pulls.each do |model|
+        escaped = Shellwords.escape(model)
+        script << "ollama show #{escaped} --modelfile >/dev/null 2>&1 || { echo \"FATAL: model #{model} missing in final check\"; exit 1; }"
+      end
+      script << "echo ollama-models-dir=#{models_dir}"
+      script << 'echo ollama-ok'
+      script.join("\n")
+    end
+
+    # Returns the effective Ollama flag: CLI override if set, else config default.
+    def effective_ollama?
+      defined?(@effective_ollama) ? @effective_ollama : @config.ollama_install_enabled?
+    end
+
+    # Returns the effective vLLM flag: CLI override if set, else config default.
+    def effective_vllm?
+      defined?(@effective_vllm) ? @effective_vllm : @config.vllm_install_enabled?
+    end
+
+    def vllm_setup_needed?(state)
+      return false unless effective_vllm?
+      # Re-run if never set up, or if the configured model changed since last setup.
+      return true if state['vllm_setup_at'].nil?
+
+      state['vllm_model'] != @config.vllm_model
+    end
+
+    def setup_vllm_stack(host)
+      info "Setting up vLLM Docker container on #{host}..."
+      output, status = run_ssh_command_streaming(host, vllm_install_script)
+      raise Error, "vLLM install failed: #{output.strip}" unless status.success?
+
+      info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
+      output, status = run_ssh_command_streaming(host, litellm_install_script)
+      raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
+    end
+
+    # Generates the remote shell script that pulls the vLLM Docker image, starts
+    # the container, and polls until the model is fully loaded (up to 10 minutes
+    # to cover the first-run ~45 GB model download).
+    def vllm_install_script
+      model     = @config.vllm_model
+      cache_dir = @config.vllm_hug_cache_dir
+      container = @config.vllm_container_name
+      max_len   = @config.vllm_max_model_len
+      gpu_util  = @config.vllm_gpu_memory_utilization
+      tp_size   = @config.vllm_tensor_parallel_size
+      parser    = @config.vllm_tool_call_parser
+      port      = @config.ollama_port  # vLLM reuses the Ollama port for firewall compat
+
+      docker_run = [
+        'docker run -d',
+        '--gpus all', '--ipc=host', '--network host',
+        "--name #{Shellwords.escape(container)}",
+        '--restart always',
+        "-v #{Shellwords.escape(cache_dir)}:/root/.cache/huggingface",
+        'vllm/vllm-openai:latest',
+        "--model #{Shellwords.escape(model)}",
+        "--tensor-parallel-size #{tp_size}",
+        '--enable-auto-tool-choice',
+        "--tool-call-parser #{Shellwords.escape(parser)}",
+        '--enable-prefix-caching',
+        "--gpu-memory-utilization #{gpu_util}",
+        "--max-model-len #{max_len}",
+        '--host 0.0.0.0',
+        "--port #{port}"
+      ].join(' ')
+
+      script = []
+      script << 'set -euo pipefail'
+      script << "sudo mkdir -p #{Shellwords.escape(cache_dir)}"
+      script << "sudo chmod -R 0777 #{Shellwords.escape(cache_dir)}"
+      # Stop and remove any existing container so re-runs are idempotent.
+      script << "docker stop #{Shellwords.escape(container)} 2>/dev/null || true"
+      script << "docker rm #{Shellwords.escape(container)} 2>/dev/null || true"
+      script << 'docker pull vllm/vllm-openai:latest'
+      script << docker_run
+      # Poll until the model is loaded:
+      #   first run:    ~45 GB download (~2.5 min) + model load (~65 s) + CUDA graphs (~35 s) ≈ 4-5 min
+      #   warm restart: model load + CUDA graphs ≈ 100 s
+      # Timeout: 120 × 5 s = 10 minutes
+      script << 'echo "Waiting for vLLM to become ready (up to 10 min for first model download)..."'
+      script << "for i in $(seq 1 120); do"
+      script << "  if curl -sf http://localhost:#{port}/v1/models >/dev/null 2>&1; then echo vllm-ready; break; fi"
+      script << "  state=$(docker inspect --format='{{.State.Status}}' #{Shellwords.escape(container)} 2>/dev/null || echo unknown)"
+      script << '  echo "  vLLM not ready yet ($i/120, container=$state)..."'
+      script << '  sleep 5'
+      script << 'done'
+      script << "curl -sf http://localhost:#{port}/v1/models >/dev/null || { echo 'FATAL: vLLM did not become ready within 10 minutes'; exit 1; }"
+      script << 'echo vllm-install-ok'
+      script.join("\n")
+    end
+
+    # Generates the remote shell script that installs LiteLLM in a Python venv,
+    # writes a config mapping Claude model aliases to the vLLM endpoint, and
+    # starts the proxy as a systemd service on litellm_port.
+    def litellm_install_script
+      port        = @config.litellm_port
+      vllm_port   = @config.ollama_port # vLLM is reached on the port configured as ollama_port
+      model       = @config.vllm_model
+      claude_names = @config.litellm_claude_model_names
+      master_key  = @config.litellm_master_key
+
+      # Build model_list YAML entries; each Claude alias maps to the vLLM model.
+      # "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions (not /v1/responses).
+      model_entries = claude_names.flat_map do |name|
+        [
+          "  - model_name: \"#{name}\"",
+          '    litellm_params:',
+          "      model: \"hosted_vllm/#{model}\"",
+          "      api_base: \"http://localhost:#{vllm_port}/v1\"",
+          '      api_key: "EMPTY"'
+        ]
+      end
+
+      script = []
+      script << 'set -euo pipefail'
+      script << 'sudo apt-get install -y python3.12-venv'
+      script << 'sudo mkdir -p /ephemeral/litellm-env'
+      script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
+      script << 'python3 -m venv /ephemeral/litellm-env'
+      script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
+
+      # Write litellm-config.yaml via heredoc; drop_params makes LiteLLM drop request
+      # params the backend does not support (e.g. Claude's context_management) instead of failing.
+      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
+      script << 'model_list:'
+      script.concat(model_entries)
+      script << ''
+      script << 'litellm_settings:'
+      script << '  drop_params: true'
+      script << ''
+      script << 'general_settings:'
+      script << "  master_key: \"#{master_key}\""
+      script << 'LITELLM_YAML'
+
+      # Write systemd unit via heredoc; Restart=always restarts the proxy after any exit, so transient crashes self-heal.
+      script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
+      script << '[Unit]'
+      script << 'Description=LiteLLM Proxy'
+      script << 'After=network.target docker.service'
+      script << 'Requires=docker.service'
+      script << ''
+      script << '[Service]'
+      script << 'Type=simple'
+      script << 'User=ubuntu'
+      script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
+      script << 'Restart=always'
+      script << 'RestartSec=5'
+      script << ''
+      script << '[Install]'
+      script << 'WantedBy=multi-user.target'
+      script << 'LITELLM_UNIT'
+
+      script << 'sudo systemctl daemon-reload'
+      script << 'sudo systemctl enable --now litellm'
+      script << 'sleep 5'
+      script << "systemctl is-active --quiet litellm || { echo 'FATAL: litellm service is not active after start'; exit 1; }"
+      script << 'echo litellm-install-ok'
+      script.join("\n")
+    end
+
+    # Tests the vLLM OpenAI-compatible API: lists loaded models and runs a
+    # short inference request to confirm the model accepts requests.
+    def test_vllm(wg_ip)
+      port  = @config.ollama_port
+      model = @config.vllm_model
+
+      info "  Testing vLLM models list at http://#{wg_ip}:#{port}/v1/models..."
+      uri  = URI("http://#{wg_ip}:#{port}/v1/models")
+      resp = Net::HTTP.get_response(uri)
+      raise Error, "vLLM /v1/models returned HTTP #{resp.code}" unless resp.code == '200'
+
+      models = JSON.parse(resp.body).fetch('data', []).map { |m| m['id'] }
+      raise Error, "vLLM returned an empty model list (expected #{model})" if models.empty?
+
+      info "    Models loaded: #{models.join(', ')}"
+      info "  Testing vLLM inference..."
+      reply = vllm_chat(wg_ip, port, model, 'Say hello in five words.')
+      info "    vLLM response: #{reply}"
+    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError, Net::OpenTimeout => e
+      raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+    end
+
+    # Tests the LiteLLM proxy using the Anthropic Messages API format,
+    # which is what Claude Code sends when pointed at a custom base URL.
+    def test_litellm(wg_ip)
+      port  = @config.litellm_port
+      model = @config.litellm_claude_model_names.first
+      key   = @config.litellm_master_key
+
+      info "  Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..."
+      uri = URI("http://#{wg_ip}:#{port}/v1/messages")
+      req = Net::HTTP::Post.new(uri)
+      req['Content-Type'] = 'application/json'
+      req['x-api-key'] = key
+      req['anthropic-version'] = '2023-06-01'
+      req.body = JSON.generate(
+        'model' => model,
+        'max_tokens' => 50,
+        'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }]
+      )
+      resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+      raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200'
+
+      text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip
+      info "    LiteLLM response: #{text}"
+    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError, Net::OpenTimeout => e
+      raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
+    end
+
+    # Sends a single OpenAI chat completion request and returns the reply text.
+    def vllm_chat(host, port, model, prompt)
+      uri = URI("http://#{host}:#{port}/v1/chat/completions")
+      req = Net::HTTP::Post.new(uri)
+      req['Content-Type'] = 'application/json'
+      req['Authorization'] = 'Bearer EMPTY'
+      req.body = JSON.generate(
+        'model' => model,
+        'messages' => [{ 'role' => 'user', 'content' => prompt }],
+        'max_tokens' => 50
+      )
+      resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
+      raise Error, "vLLM inference returned HTTP #{resp.code}" unless resp.code == '200'
+
+      JSON.parse(resp.body).dig('choices', 0, 'message', 'content').to_s.strip
+    end
+
+    def integer_or_nil(value)
+      value.nil? ? nil : Integer(value)
+    end
+
+    def print_local_wireguard_summary(expected_ip)
+      return unless @config.local_client_checks_enabled?
+
+      wg_status = @local_wireguard.status
+      endpoint = wg_status['endpoint']
+      info "Local WireGuard #{@config.local_interface_name}: #{wg_status['service_state']}"
+      if endpoint
+        info "Local WireGuard endpoint: #{endpoint}"
+        if expected_ip
+          host, = endpoint.split(':', 2)
+          if host == expected_ip
+            info 'Local WireGuard endpoint matches the managed VM IP.'
+          else
+            warn "Local WireGuard endpoint points to #{host}, expected #{expected_ip}."
+          end
+        end
+      else
+        warn "Unable to read #{@config.local_wg_config_path} for local WireGuard endpoint validation."
+      end
+    end
+
+    def info(message)
+      @out.puts(message)
+    end
+
+    def warn(message)
+      @out.puts("WARN: #{message}")
+    end
+  end
+
+  class CLI
+    def initialize(argv)
+      @argv = argv.dup
+    end
+
+    def run
+      global = {
+        config_path: File.join(__dir__, 'hyperstack-vm.toml')
+      }
+
+      global_parser = OptionParser.new do |opts|
+        opts.banner = 'Usage: ruby hyperstack.rb [--config path] <create|delete|status|test> [options]'
+        opts.on('--config PATH', "Path to TOML config (default: #{global[:config_path]})") do |value|
+          global[:config_path] = value
+        end
+        opts.on('-h', '--help', 'Show help') do
+          puts opts
+          puts
+          puts 'Commands:'
+          puts '  create [--replace] [--dry-run] [--vllm|--no-vllm] [--ollama|--no-ollama]'
+          puts '  delete [--vm-id ID] [--dry-run]'
+          puts '  status'
+          puts '  test'
+          exit 0
+        end
+      end
+      global_parser.order!(@argv)
+
+      command = @argv.shift
+      raise Error, 'Missing command. Use create, delete, status, or test.' if command.nil?
+
+      config = Config.load(global[:config_path])
+      state_store = StateStore.new(config.state_file)
+      client = HyperstackClient.new(base_url: config.api_base_url, api_key: config.api_key)
+      local_wireguard = LocalWireGuard.new(
+        interface_name: config.local_interface_name,
+        config_path: config.local_wg_config_path
+      )
+      manager = Manager.new(
+        config: config,
+        client: client,
+        state_store: state_store,
+        local_wireguard: local_wireguard
+      )
+
+      case command
+      when 'create'
+        replace = false
+        dry_run = false
+        install_vllm = nil
+        install_ollama = nil
+        parser = OptionParser.new do |opts|
+          opts.on('--replace', 'Delete the tracked VM before creating a new one') { replace = true }
+          opts.on('--dry-run', 'Resolve config and print the create plan without creating a VM') { dry_run = true }
+          opts.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { install_vllm = true }
+          opts.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { install_vllm = false }
+          opts.on('--ollama', 'Enable Ollama setup (overrides config)') { install_ollama = true }
+          opts.on('--no-ollama', 'Disable Ollama setup (overrides config)') { install_ollama = false }
+        end
+        parser.parse!(@argv)
+        manager.create(replace: replace, dry_run: dry_run, install_vllm: install_vllm, install_ollama: install_ollama)
+      when 'delete'
+        vm_id = nil
+        dry_run = false
+        parser = OptionParser.new do |opts|
+          opts.on('--vm-id ID', Integer, 'Delete a VM by ID instead of using the local state file') do |value|
+            vm_id = value
+          end
+          opts.on('--dry-run', 'Show which VM would be deleted without deleting it') { dry_run = true }
+        end
+        parser.parse!(@argv)
+        manager.delete(vm_id: vm_id, dry_run: dry_run)
+      when 'status'
+        manager.status
+      when 'test'
+        manager.test
+      else
+        raise Error, "Unknown command #{command.inspect}. Use create, delete, status, or test."
+      end
+    end
+  end
+end
+
+begin
+  HyperstackVM::CLI.new(ARGV).run
+rescue HyperstackVM::Error => e
+  warn "ERROR: #{e.message}"
+  exit 1
+end
diff --git a/snippets/hyperstack/hyperstack_vm.rb b/snippets/hyperstack/hyperstack_vm.rb
deleted file mode 100644
index ac60da9..0000000
--- a/snippets/hyperstack/hyperstack_vm.rb
+++ /dev/null
@@ -1,1418 +0,0 @@
-#!/usr/bin/env ruby
-# frozen_string_literal: true
-
-begin
-  require 'bundler/setup'
-rescue LoadError
-  nil
-rescue Gem::Exception => e
-  # Ruby can ship with a Bundler library version whose matching executable
-  # is not installed locally. Fall back to direct gem loading in that case.
-  raise unless e.is_a?(Gem::GemNotFoundException) || e.is_a?(Gem::LoadError)
-end
-
-require 'json'
-require 'net/http'
-require 'open3'
-require 'optparse'
-require 'ipaddr'
-require 'shellwords'
-require 'socket'
-require 'time'
-require 'timeout'
-
-begin
-  require 'toml-rb'
-rescue LoadError
-  warn "Missing dependency: toml-rb. Run `bundle install` in #{__dir__} first."
-  exit 2
-end
-
-module HyperstackVM
-  class Error < StandardError; end
-
-  class Config
-    DEFAULTS = {
-      'auth' => {
-        'api_key_file' => '~/.hyperstack'
-      },
-      'hyperstack' => {
-        'base_url' => 'https://infrahub-api.nexgencloud.com/v1'
-      },
-      'state' => {
-        'file' => '.hyperstack-vm-state.json'
-      },
-      'vm' => {
-        'name_prefix' => 'hyperstack',
-        'hostname' => 'hyperstack',
-        'flavor_name' => 'n3-A100x1',
-        'image_name' => 'Ubuntu Server 24.04 LTS R570 CUDA 12.8 with Docker',
-        'assign_floating_ip' => true,
-        'create_bootable_volume' => false,
-        'enable_port_randomization' => false,
-        'labels' => %w[gpt-oss-120b wireguard]
-      },
-      'ssh' => {
-        'username' => 'ubuntu',
-        'private_key_path' => '~/.ssh/id_rsa',
-        'hyperstack_key_name' => 'earth',
-        'port' => 22,
-        'connect_timeout_sec' => 10
-      },
-      'network' => {
-        'wireguard_udp_port' => 56_710,
-        'wireguard_subnet' => '192.168.3.0/24',
-        'ollama_port' => 11_434,
-        'allowed_ssh_cidrs' => ['0.0.0.0/0'],
-        'allowed_wireguard_cidrs' => ['0.0.0.0/0']
-      },
-      'bootstrap' => {
-        'enable_guest_bootstrap' => true,
-        'install_wireguard' => true,
-        'configure_ufw' => true,
-        'configure_ollama_host' => false
-      },
-      'ollama' => {
-        'install' => true,
-        'models_dir' => '/ephemeral/ollama/models',
-        'listen_host' => '0.0.0.0:11434',
-        'gpu_overhead_mb' => 2000,
-        'num_parallel' => 4,
-        'pull_models' => ['qwen3-coder:30b', 'gpt-oss:20b', 'gpt-oss:120b', 'nemotron-3-super']
-      },
-      'wireguard' => {
-        'auto_setup' => true,
-        'setup_script' => './wg1-setup.sh'
-      },
-      'local_client' => {
-        'check_wg1_service' => true,
-        'interface_name' => 'wg1',
-        'config_path' => '/etc/wireguard/wg1.conf'
-      }
-    }.freeze
-
-    attr_reader :path
-
-    def self.load(path)
-      expanded = File.expand_path(path)
-      raise Error, "Config file not found: #{expanded}" unless File.exist?(expanded)
-
-      raw = TomlRB.load_file(expanded)
-      new(raw, expanded)
-    rescue TomlRB::ParseError => e
-      raise Error, "Failed to parse TOML config #{expanded}: #{e.message}"
-    end
-
-    def initialize(raw, path)
-      @path = path
-      @data = deep_merge(DEFAULTS, raw || {})
-      validate!
-    end
-
-    def api_key
-      key_path = expand_path(fetch('auth', 'api_key_file'))
-      raise Error, "API key file not found: #{key_path}" unless File.exist?(key_path)
-
-      token = File.readlines(key_path, chomp: true).find { |line| !line.strip.empty? }&.strip
-      raise Error, "API key file is empty: #{key_path}" if token.nil? || token.empty?
-
-      token
-    rescue Errno::EACCES => e
-      raise Error, "Cannot read API key file #{key_path}: #{e.message}"
-    end
-
-    def api_base_url
-      fetch('hyperstack', 'base_url')
-    end
-
-    def state_file
-      expand_path(fetch('state', 'file'))
-    end
-
-    def environment_name
-      fetch('vm', 'environment_name')
-    end
-
-    def flavor_name
-      fetch('vm', 'flavor_name')
-    end
-
-    def image_name
-      fetch('vm', 'image_name')
-    end
-
-    def vm_name_prefix
-      fetch('vm', 'name_prefix')
-    end
-
-    def generated_vm_name
-      "#{vm_name_prefix}-#{Time.now.utc.strftime('%Y%m%d%H%M%S')}"
-    end
-
-    def vm_hostname
-      value = fetch('vm', 'hostname')
-      return nil if blank?(value)
-
-      value.to_s.downcase
-    end
-
-    def assign_floating_ip?
-      truthy?(fetch('vm', 'assign_floating_ip'))
-    end
-
-    def create_bootable_volume?
-      truthy?(fetch('vm', 'create_bootable_volume'))
-    end
-
-    def enable_port_randomization?
-      truthy?(fetch('vm', 'enable_port_randomization'))
-    end
-
-    def labels
-      Array(fetch('vm', 'labels')).map(&:to_s)
-    end
-
-    def user_data
-      custom = custom_user_data
-      return custom unless custom.nil? || custom.empty?
-      return nil if vm_hostname.nil?
-
-      default_hostname_cloud_init
-    rescue Errno::ENOENT => e
-      raise Error, "User data file not found: #{e.message}"
-    rescue Errno::EACCES => e
-      raise Error, "Cannot read user data file: #{e.message}"
-    end
-
-    def ssh_username
-      fetch('ssh', 'username')
-    end
-
-    def ssh_private_key_path
-      expand_path(fetch('ssh', 'private_key_path'))
-    end
-
-    def ssh_key_name
-      fetch('ssh', 'hyperstack_key_name')
-    end
-
-    def ssh_port
-      Integer(fetch('ssh', 'port'))
-    end
-
-    def ssh_connect_timeout
-      Integer(fetch('ssh', 'connect_timeout_sec'))
-    end
-
-    def wireguard_udp_port
-      Integer(fetch('network', 'wireguard_udp_port'))
-    end
-
-    def wireguard_subnet
-      fetch('network', 'wireguard_subnet')
-    end
-
-    def ollama_port
-      Integer(fetch('network', 'ollama_port'))
-    end
-
-    def allowed_ssh_cidrs
-      Array(fetch('network', 'allowed_ssh_cidrs')).map(&:to_s)
-    end
-
-    def allowed_wireguard_cidrs
-      Array(fetch('network', 'allowed_wireguard_cidrs')).map(&:to_s)
-    end
-
-    def guest_bootstrap_enabled?
-      truthy?(fetch('bootstrap', 'enable_guest_bootstrap'))
-    end
-
-    def install_wireguard?
-      truthy?(fetch('bootstrap', 'install_wireguard'))
-    end
-
-    def configure_ufw?
-      truthy?(fetch('bootstrap', 'configure_ufw'))
-    end
-
-    def configure_ollama_host?
-      truthy?(fetch('bootstrap', 'configure_ollama_host'))
-    end
-
-    def ollama_install_enabled?
-      truthy?(fetch('ollama', 'install'))
-    end
-
-    def ollama_models_dir
-      fetch('ollama', 'models_dir')
-    end
-
-    def ollama_listen_host
-      fetch('ollama', 'listen_host')
-    end
-
-    def ollama_gpu_overhead_mb
-      Integer(fetch('ollama', 'gpu_overhead_mb'))
-    end
-
-    def ollama_num_parallel
-      Integer(fetch('ollama', 'num_parallel'))
-    end
-
-    def ollama_pull_models
-      Array(fetch('ollama', 'pull_models')).map(&:to_s)
-    end
-
-    def local_client_checks_enabled?
-      truthy?(fetch('local_client', 'check_wg1_service'))
-    end
-
-    def local_interface_name
-      fetch('local_client', 'interface_name')
-    end
-
-    def local_wg_config_path
-      fetch('local_client', 'config_path')
-    end
-
-    def wireguard_auto_setup?
-      truthy?(fetch('wireguard', 'auto_setup'))
-    end
-
-    def wireguard_setup_script
-      expand_path(fetch('wireguard', 'setup_script'))
-    end
-
-    def desired_security_rules
-      rules = []
-
-      allowed_ssh_cidrs.each do |cidr|
-        rules << firewall_rule('tcp', ssh_port, cidr)
-      end
-
-      allowed_wireguard_cidrs.each do |cidr|
-        rules << firewall_rule('udp', wireguard_udp_port, cidr)
-      end
-
-      rules << firewall_rule('tcp', ollama_port, wireguard_subnet)
-      rules.uniq
-    end
-
-    private
-
-    def validate!
-      %w[auth hyperstack state vm ssh network bootstrap ollama wireguard local_client].each do |section|
-        raise Error, "Missing config section [#{section}]" unless @data.key?(section)
-      end
-
-      %w[environment_name flavor_name image_name].each do |key|
-        raise Error, "Missing [vm].#{key} in config #{path}" if blank?(dig('vm', key))
-      end
-
-      if vm_hostname && vm_hostname !~ /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/
-        raise Error, "Invalid [vm].hostname #{vm_hostname.inspect}; use lowercase letters, digits, and hyphens only."
-      end
-
-      %w[username hyperstack_key_name].each do |key|
-        raise Error, "Missing [ssh].#{key} in config #{path}" if blank?(dig('ssh', key))
-      end
-
-      [wireguard_subnet, *allowed_ssh_cidrs, *allowed_wireguard_cidrs].each do |cidr|
-        IPAddr.new(cidr)
-      rescue IPAddr::InvalidAddressError => e
-        raise Error, "Invalid CIDR #{cidr.inspect}: #{e.message}"
-      end
-    end
-
-    def firewall_rule(protocol, port, cidr)
-      ip = IPAddr.new(cidr)
-      {
-        'direction' => 'ingress',
-        'ethertype' => ip.ipv4? ? 'IPv4' : 'IPv6',
-        'protocol' => protocol,
-        'port_range_min' => port,
-        'port_range_max' => port,
-        'remote_ip_prefix' => cidr
-      }
-    end
-
-    def fetch(section, key)
-      dig(section, key)
-    end
-
-    def dig(*keys)
-      keys.reduce(@data) do |memo, key|
-        memo.is_a?(Hash) ? memo[key] : nil
-      end
-    end
-
-    def blank?(value)
-      value.nil? || value.to_s.strip.empty?
-    end
-
-    def truthy?(value)
-      value == true
-    end
-
-    def custom_user_data
-      inline = dig('vm', 'user_data')
-      return inline unless inline.nil? || inline.empty?
-
-      file = dig('vm', 'user_data_file')
-      return nil if file.nil? || file.empty?
-
-      File.read(expand_path(file))
-    end
-
-    def default_hostname_cloud_init
-      <<~CLOUD_INIT
-        #cloud-config
-        preserve_hostname: false
-        hostname: #{vm_hostname}
-      CLOUD_INIT
-    end
-
-    def expand_path(value)
-      return nil if value.nil?
-
-      string = value.to_s
-      return File.expand_path(string) if string.start_with?('~')
-      return string if string.start_with?('/')
-
-      File.expand_path(string, File.dirname(path))
-    end
-
-    def deep_merge(left, right)
-      left.merge(right) do |_key, old_value, new_value|
-        if old_value.is_a?(Hash) && new_value.is_a?(Hash)
-          deep_merge(old_value, new_value)
-        else
-          new_value
-        end
-      end
-    end
-  end
-
-  class StateStore
-    def initialize(path)
-      @path = path
-    end
-
-    attr_reader :path
-
-    def load
-      return nil unless File.exist?(@path)
-
-      JSON.parse(File.read(@path))
-    rescue JSON::ParserError => e
-      raise Error, "Failed to parse state file #{@path}: #{e.message}"
-    end
-
-    def save(payload)
-      temp_path = "#{@path}.tmp"
-      File.write(temp_path, JSON.pretty_generate(payload))
-      File.rename(temp_path, @path)
-    end
-
-    def delete
-      File.delete(@path) if File.exist?(@path)
-    end
-  end
-
-  class HyperstackClient
-    def initialize(base_url:, api_key:)
-      @base_uri = URI(base_url)
-      @api_key = api_key
-    end
-
-    def list_environments
-      response = request(:get, '/core/environments')
-      response.fetch('environments', [])
-    end
-
-    def list_keypairs
-      response = request(:get, '/core/keypairs')
-      response.fetch('keypairs', [])
-    end
-
-    def list_flavors
-      response = request(:get, '/core/flavors')
-      Array(response['data']).flat_map do |entry|
-        Array(entry['flavors']).map do |flavor|
-          flavor.merge(
-            'region_name' => flavor['region_name'] || entry['region_name'],
-            'gpu' => flavor['gpu'] || entry['gpu']
-          )
-        end
-      end
-    end
-
-    def list_images
-      response = request(:get, '/core/images')
-      Array(response['images']).flat_map do |entry|
-        Array(entry['images']).map do |image|
-          image.merge(
-            'region_name' => image['region_name'] || entry['region_name'],
-            'type' => image['type'] || entry['type']
-          )
-        end
-      end
-    end
-
-    def list_vms
-      response = request(:get, '/core/virtual-machines')
-      response.fetch('instances', [])
-    end
-
-    def get_vm(vm_id)
-      response = request(:get, "/core/virtual-machines/#{vm_id}")
-      response.fetch('instance', nil)
-    end
-
-    def create_vm(payload)
-      request(:post, '/core/virtual-machines', payload)
-    end
-
-    def delete_vm(vm_id)
-      request(:delete, "/core/virtual-machines/#{vm_id}")
-    end
-
-    def create_vm_rule(vm_id, payload)
-      request(:post, "/core/virtual-machines/#{vm_id}/sg-rules", payload)
-    end
-
-    private
-
-    def request(method, path, payload = nil)
-      uri = @base_uri.dup
-      uri.path = "#{@base_uri.path}#{path}"
-
-      request = case method
-                when :get
-                  Net::HTTP::Get.new(uri)
-                when :post
-                  Net::HTTP::Post.new(uri)
-                when :delete
-                  Net::HTTP::Delete.new(uri)
-                else
-                  raise Error, "Unsupported HTTP method: #{method}"
-                end
-
-      request['accept'] = 'application/json'
-      request['api_key'] = @api_key
-      if payload
-        request['content-type'] = 'application/json'
-        request.body = JSON.generate(payload)
-      end
-
-      retries_left = 4
-      begin
-        response = Net::HTTP.start(
-          uri.host,
-          uri.port,
-          use_ssl: uri.scheme == 'https',
-          open_timeout: 30,
-          read_timeout: 120
-        ) { |http| http.request(request) }
-
-        parse_response(response)
-      rescue Timeout::Error, Errno::ECONNREFUSED, Errno::ECONNRESET,
-             Errno::EHOSTUNREACH, Errno::ENETUNREACH,
-             SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout => e
-        raise Error, "Hyperstack API request failed for #{path}: #{e.message}" if retries_left <= 0
-
-        retries_left -= 1
-        delay = (4 - retries_left) * 5
-        warn "API request to #{path} failed (#{e.class}: #{e.message}), retrying in #{delay}s (#{retries_left} left)..."
-        sleep delay
-        retry
-      end
-    end
-
-    def parse_response(response)
-      body = response.body.to_s
-      payload = body.empty? ? {} : JSON.parse(body)
-
-      if response.code.to_i >= 400 || payload['status'] == false
-        message = payload['message'] || payload['error_reason'] || response.message
-        raise Error, "Hyperstack API error (HTTP #{response.code}): #{message}"
-      end
-
-      payload
-    rescue JSON::ParserError => e
-      raise Error, "Failed to parse Hyperstack API response: #{e.message}"
-    end
-  end
-
-  class LocalWireGuard
-    def initialize(interface_name:, config_path:)
-      @interface_name = interface_name
-      @config_path = config_path
-    end
-
-    def status
-      {
-        'service_state' => service_state,
-        'config_path' => @config_path,
-        'endpoint' => configured_endpoint,
-        'config_readable' => !config_contents.nil?
-      }
-    end
-
-    private
-
-    def service_state
-      stdout, _stderr, status = Open3.capture3('systemctl', 'is-active', "wg-quick@#{@interface_name}")
-      value = stdout.to_s.strip
-      return value unless value.empty?
-      return 'active' if status.success?
-
-      'unknown'
-    end
-
-    def configured_endpoint
-      content = config_contents
-      return nil if content.nil?
-
-      parse_wireguard_config(content)['Endpoint']
-    end
-
-    def config_contents
-      return @config_contents if defined?(@config_contents)
-
-      @config_contents = File.read(@config_path)
-    rescue Errno::EACCES, Errno::ENOENT
-      stdout, _stderr, status = Open3.capture3('sudo', '-n', 'cat', @config_path)
-      @config_contents = status.success? ? stdout : nil
-    end
-
-    def parse_wireguard_config(content)
-      current_section = nil
-      peer = {}
-
-      content.each_line do |line|
-        stripped = line.strip
-        next if stripped.empty? || stripped.start_with?('#')
-
-        if stripped.start_with?('[') && stripped.end_with?(']')
-          current_section = stripped[1..-2]
-          next
-        end
-
-        key, value = stripped.split('=', 2).map { |part| part&.strip }
-        next unless current_section == 'Peer' && key && value
-
-        peer[key] = value
-      end
-
-      peer
-    end
-  end
-
-  class Manager
-    def initialize(config:, client:, state_store:, local_wireguard:, out: $stdout)
-      @config = config
-      @client = client
-      @state_store = state_store
-      @local_wireguard = local_wireguard
-      @out = out
-    end
-
-    def create(replace: false, dry_run: false)
-      existing_state = @state_store.load
-      if existing_state && existing_state['vm_id']
-        if replace
-          if dry_run
-            info "DRY RUN: would delete tracked VM #{existing_state['vm_id']} before creating a replacement."
-          else
-            delete(vm_id: existing_state['vm_id'], preserve_state_on_failure: true)
-          end
-        elsif resumable_state?(existing_state)
-          if dry_run
-            print_resume_dry_run(existing_state)
-            return
-          end
-
-          info "Resuming tracked VM #{existing_state['vm_id']} provisioning..."
-          continue_create(existing_state)
-          return
-        else
-          raise Error,
-                "State file #{@state_store.path} already tracks VM #{existing_state['vm_id']}. Use --replace or delete first."
-        end
-      end
-
-      resolved = resolve_dependencies
-      vm_name = @config.generated_vm_name
-      if dry_run
-        info "Planning VM #{vm_name} in #{resolved[:environment]['name']} using #{@config.flavor_name}..."
-      else
-        info "Creating VM #{vm_name} in #{resolved[:environment]['name']} using #{@config.flavor_name}..."
-      end
-
-      payload = build_create_payload(vm_name, resolved)
-      if dry_run
-        print_create_dry_run(vm_name, resolved, payload)
-        return
-      end
-
-      response = @client.create_vm(payload)
-      instance = Array(response['instances']).first
-      raise Error, 'Hyperstack create response did not include an instance ID.' unless instance && instance['id']
-
-      state = {
-        'vm_id' => instance['id'],
-        'vm_name' => vm_name,
-        'environment_name' => resolved[:environment]['name'],
-        'region' => resolved[:environment]['region'],
-        'flavor_name' => resolved[:flavor]['name'],
-        'image_name' => resolved[:image]['name'],
-        'key_name' => resolved[:keypair]['name'],
-        'public_ip' => instance['floating_ip'],
-        'created_at' => Time.now.utc.iso8601
-      }
-      @state_store.save(state)
-      continue_create(state)
-    end
-
-    def delete(vm_id: nil, preserve_state_on_failure: false, dry_run: false)
-      state = @state_store.load
-      target_vm_id = vm_id || state&.dig('vm_id')
-      raise Error, "No VM ID provided and no state file found at #{@state_store.path}." if target_vm_id.nil?
-
-      if dry_run
-        print_delete_dry_run(target_vm_id, state, preserve_state_on_failure: preserve_state_on_failure)
-        return
-      end
-
-      info "Deleting VM #{target_vm_id}..."
-      @client.delete_vm(target_vm_id)
-      wait_for_deletion(target_vm_id)
-      @state_store.delete unless preserve_state_on_failure
-      info "VM #{target_vm_id} deleted."
-    rescue Error
-      raise if preserve_state_on_failure
-
-      @state_store.delete
-      raise
-    end
-
-    def status
-      state = @state_store.load
-      if state.nil?
-        info "No tracked VM state file at #{@state_store.path}."
-      else
-        begin
-          vm = @client.get_vm(state['vm_id'])
-          desired = @config.desired_security_rules.map { |rule| normalize_rule(rule) }
-          current = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
-          missing_rules = desired - current
-
-          info "Tracked VM: #{state['vm_id']} #{vm['name']}"
-          info "Status: #{vm['status']} / #{vm['vm_state']}"
-          info "Public IP: #{connect_host_for(vm) || 'none'}"
-          info "Missing firewall rules: #{missing_rules.empty? ? 'none' : missing_rules.size}"
-        rescue Error => e
-          warn "Unable to load VM #{state['vm_id']}: #{e.message}"
-        end
-      end
-
-      print_local_wireguard_summary(state&.dig('public_ip'))
-    end
-
-    private
-
-    def resumable_state?(state)
-      state['vm_id'] && (state['bootstrapped_at'].nil? || ollama_setup_needed?(state) || wireguard_setup_needed?(state))
-    end
-
-    def continue_create(state)
-      vm_id = state['vm_id']
-
-      vm = wait_for_vm_ready(vm_id)
-      ensure_security_rules(vm)
-      vm = wait_for_connect_ip(vm_id)
-      state['public_ip'] = connect_host_for(vm)
-      state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
-      @state_store.save(state)
-
-      wait_for_ssh(state['public_ip'])
-      if @config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil?
-        bootstrap_guest(state['public_ip'])
-        state['bootstrapped_at'] = Time.now.utc.iso8601
-        @state_store.save(state)
-      end
-
-      # Install Ollama binary and configure the service (fast), but defer
-      # model pulls until after the WireGuard tunnel is up so that the user
-      # can monitor progress over the tunnel.
-      if @config.ollama_install_enabled? && state['ollama_installed_at'].nil?
-        install_ollama_service(state['public_ip'])
-        state['ollama_installed_at'] = Time.now.utc.iso8601
-        @state_store.save(state)
-      end
-
-      if wireguard_setup_needed?(state)
-        run_wireguard_setup(state['public_ip'])
-        state['wireguard_setup_at'] = Time.now.utc.iso8601
-        @state_store.save(state)
-      end
-
-      # Pull and verify models after the tunnel is established
-      if ollama_setup_needed?(state)
-        pull_ollama_models(state['public_ip'])
-        state['ollama_setup_at'] = Time.now.utc.iso8601
-        state['ollama_models_dir'] = @config.ollama_models_dir
-        state['ollama_pulled_models'] = desired_ollama_models
-        @state_store.save(state)
-      end
-
-      vm = @client.get_vm(vm_id)
-      state['security_rules'] = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
-      state['status'] = vm['status']
-      state['vm_state'] = vm['vm_state']
-      state['provisioned_at'] = Time.now.utc.iso8601
-      @state_store.save(state)
-
-      info "VM ready: #{state['public_ip']} (id=#{state['vm_id']})"
-      print_local_wireguard_summary(state['public_ip'])
-    end
-
-    def build_create_payload(vm_name, resolved)
-      payload = {
-        'name' => vm_name,
-        'count' => 1,
-        'environment_name' => resolved[:environment]['name'],
-        'flavor_name' => resolved[:flavor]['name'],
-        'image_name' => resolved[:image]['name'],
-        'key_name' => resolved[:keypair]['name'],
-        'assign_floating_ip' => @config.assign_floating_ip?,
-        'create_bootable_volume' => @config.create_bootable_volume?,
-        'enable_port_randomization' => @config.enable_port_randomization?,
-        'security_rules' => @config.desired_security_rules
-      }
-      payload['labels'] = @config.labels unless @config.labels.empty?
-      payload['user_data'] = @config.user_data if @config.user_data
-      payload
-    end
-
-    def resolve_dependencies
-      environment = @client.list_environments.find { |item| item['name'] == @config.environment_name }
-      raise Error, "Environment #{@config.environment_name.inspect} was not found in Hyperstack." unless environment
-
-      flavor = @client.list_flavors.find do |item|
-        item['name'] == @config.flavor_name && item['region_name'] == environment['region']
-      end
-      raise Error, "Flavor #{@config.flavor_name.inspect} is not available in #{environment['region']}." unless flavor
-
-      if flavor['stock_available'] == false
-        raise Error,
-              "Flavor #{@config.flavor_name.inspect} exists in #{environment['region']} but is out of stock."
-      end
-
-      image = @client.list_images.find do |item|
-        item['name'] == @config.image_name && item['region_name'] == environment['region']
-      end
-      raise Error, "Image #{@config.image_name.inspect} is not available in #{environment['region']}." unless image
-
-      keypair = @client.list_keypairs.find do |item|
-        item['name'] == @config.ssh_key_name && item.dig('environment', 'name') == environment['name']
-      end
-      unless keypair
-        raise Error,
-              "Keypair #{@config.ssh_key_name.inspect} was not found in environment #{environment['name']}."
-      end
-
-      {
-        environment: environment,
-        flavor: flavor,
-        image: image,
-        keypair: keypair
-      }
-    end
-
-    def wait_for_vm_ready(vm_id)
-      with_polling("VM #{vm_id} to become ready for firewall updates") do
-        vm = @client.get_vm(vm_id)
-        next nil if vm.nil?
-
-        raise Error, "VM #{vm_id} entered failed state #{vm['status']} / #{vm['vm_state']}." if failed_vm?(vm)
-
-        vm_ready_for_updates?(vm) ? vm : nil
-      end
-    end
-
-    def wait_for_connect_ip(vm_id)
-      ip_label = @config.assign_floating_ip? ? 'floating IP' : 'reachable IP'
-      with_polling("VM #{vm_id} to receive a #{ip_label}") do
-        vm = @client.get_vm(vm_id)
-        raise Error, "VM #{vm_id} entered failed state #{vm['status']} / #{vm['vm_state']}." if failed_vm?(vm)
-
-        connect_host_for(vm) ? vm : nil
-      end
-    end
-
-    def wait_for_ssh(host)
-      # Remove stale host key for this IP — VMs frequently reuse IPs after
-      # delete/recreate, causing StrictHostKeyChecking to reject the new key
-      remove_stale_host_key(host)
-      info "Waiting for SSH on #{host}:#{@config.ssh_port}..."
-      with_polling("SSH on #{host}:#{@config.ssh_port}") do
-        next nil unless tcp_open?(host, @config.ssh_port)
-
-        stdout, stderr, status = run_ssh_command(host, 'true')
-        if status.success?
-          true
-        else
-          warn "SSH not ready yet: #{stderr.strip}" unless stderr.to_s.strip.empty?
-          nil
-        end
-      end
-    end
-
-    def ensure_security_rules(vm)
-      existing = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
-      desired = @config.desired_security_rules.map { |rule| normalize_rule(rule) }
-
-      (desired - existing).each do |rule|
-        info "Adding Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..."
-        @client.create_vm_rule(vm['id'], rule)
-      end
-    end
-
-    def bootstrap_guest(host)
-      info 'Bootstrapping Ubuntu guest over SSH...'
-      retries = 3
-      retries.times do |attempt|
-        stdout, stderr, status = run_ssh_command(host, guest_bootstrap_script)
-        return if status.success?
-
-        msg = stderr.strip.empty? ? stdout : stderr
-        raise Error, "Guest bootstrap failed after #{retries} attempts: #{msg}" if attempt == retries - 1
-
-        warn "Bootstrap attempt #{attempt + 1}/#{retries} failed (#{msg.lines.last&.strip}), retrying in 15s..."
-        sleep 15
-      end
-    end
-
-    def ollama_setup_needed?(state)
-      return false unless @config.ollama_install_enabled?
-      # Re-run setup if state has no record, or if desired models changed
-      return true if state['ollama_setup_at'].nil?
-
-      model_list_signature(desired_ollama_models) != model_list_signature(state['ollama_pulled_models'])
-    end
-
-    def install_ollama_service(host)
-      info "Installing and configuring Ollama on #{host}..."
-      output, status = run_ssh_command_streaming(host, ollama_install_script)
-      raise Error, "Ollama install failed: #{output.strip}" unless status.success?
-    end
-
-    def pull_ollama_models(host)
-      info "Pulling Ollama models on #{host}..."
-      output, status = run_ssh_command_streaming(host, ollama_pull_script)
-      raise Error, "Ollama model pull failed: #{output.strip}" unless status.success?
-
-      # Verify all models are actually present on the remote (belt-and-suspenders
-      # check in case ollama pull returned 0 without actually pulling the model)
-      verify_remote_models(host)
-    end
-
-    def verify_remote_models(host)
-      stdout, _stderr, status = run_ssh_command(host, 'ollama list')
-      return unless status.success?
-
-      remote_models = stdout.lines.drop(1).map { |l| l.split.first }.compact
-      missing = desired_ollama_models.reject { |m| remote_models.any? { |r| r.start_with?(m) } }
-      return if missing.empty?
-
-      raise Error, "Models missing after setup: #{missing.join(', ')}. Remote has: #{remote_models.join(', ')}"
-    end
-
-    def wireguard_setup_needed?(state)
-      return false unless @config.wireguard_auto_setup?
-
-      public_ip = state['public_ip'].to_s.strip
-      return true if public_ip.empty?
-
-      expected_endpoint = "#{public_ip}:#{@config.wireguard_udp_port}"
-      @local_wireguard.status['endpoint'] != expected_endpoint
-    end
-
-    def run_wireguard_setup(host)
-      validate_wireguard_setup_script!
-      retries = 3
-      retries.times do |attempt|
-        info "Running WireGuard auto-setup via #{@config.wireguard_setup_script} #{host}..."
-
-        status = run_wireguard_script(host)
-        return if status.success?
-
-        if attempt == retries - 1
-          raise Error, "WireGuard setup failed after #{retries} attempts (exit #{status.exitstatus})."
-        end
-
-        delay = (attempt + 1) * 15
-        warn "WireGuard setup attempt #{attempt + 1}/#{retries} failed (exit #{status.exitstatus}), retrying in #{delay}s..."
-        sleep delay
-      end
-    end
-
-    def run_wireguard_script(host)
-      Open3.popen2e('bash', @config.wireguard_setup_script, host) do |stdin, output, wait_thr|
-        stdin.sync = true
-        stdin.puts
-        stdin.close
-
-        output.each { |line| @out.print(line) }
-        wait_thr.value
-      end
-    end
-
-    def wait_for_deletion(vm_id)
-      info "Waiting for VM #{vm_id} deletion to complete..."
-      with_polling("VM #{vm_id} deletion", timeout: 300) do
-        @client.get_vm(vm_id)
-        nil
-      rescue Error => e
-        raise unless e.message.include?('not_found') || e.message.include?('does not exists')
-
-        true
-      end
-    end
-
-    def connect_host_for(vm)
-      return vm['floating_ip'] if @config.assign_floating_ip?
-
-      vm['floating_ip'] || vm['fixed_ip']
-    end
-
-    def validate_wireguard_setup_script!
-      script_path = @config.wireguard_setup_script
-      raise Error, "WireGuard setup script not found: #{script_path}" unless File.exist?(script_path)
-
-      mismatches = []
-      mismatches << "ssh.username must be 'ubuntu'" unless @config.ssh_username == 'ubuntu'
-      mismatches << "local_client.interface_name must be 'wg1'" unless @config.local_interface_name == 'wg1'
-      mismatches << 'network.wireguard_udp_port must be 56710' unless @config.wireguard_udp_port == 56_710
-      mismatches << "network.wireguard_subnet must be '192.168.3.0/24'" unless @config.wireguard_subnet == '192.168.3.0/24'
-
-      return if mismatches.empty?
-
-      raise Error, "Configured WireGuard settings do not match #{script_path}: #{mismatches.join('; ')}"
-    end
-
-    def remove_stale_host_key(host)
-      system('ssh-keygen', '-R', host, out: File::NULL, err: File::NULL)
-      # Also remove bracketed form for non-standard ports
-      if @config.ssh_port != 22
-        system('ssh-keygen', '-R', "[#{host}]:#{@config.ssh_port}", out: File::NULL, err: File::NULL)
-      end
-    end
-
-    def failed_vm?(vm)
-      [vm['status'], vm['vm_state'], vm['power_state']].compact.any? do |value|
-        value.to_s.downcase.match?(/error|failed|deleted|shelved/)
-      end
-    end
-
-    def vm_ready_for_updates?(vm)
-      %w[ACTIVE SHUTOFF HIBERNATED].include?(vm['status'].to_s.upcase)
-    end
-
-    def tcp_open?(host, port)
-      Socket.tcp(host, port, connect_timeout: @config.ssh_connect_timeout) do |sock|
-        sock.close
-        true
-      end
-    rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Errno::ENETUNREACH, SocketError, IOError
-      false
-    end
-
-    def run_ssh_command(host, remote_script)
-      Open3.capture3(*ssh_command(host), stdin_data: remote_script)
-    end
-
-    def run_ssh_command_streaming(host, remote_script)
-      combined_output = +''
-      Open3.popen2e(*ssh_command(host)) do |stdin, output, wait_thr|
-        stdin.write(remote_script)
-        stdin.close
-
-        output.each do |line|
-          combined_output << line
-          @out.print(line)
-        end
-
-        return [combined_output, wait_thr.value]
-      end
-    end
-
-    def ssh_command(host)
-      command = [
-        'ssh',
-        '-o', 'BatchMode=yes',
-        '-o', 'StrictHostKeyChecking=accept-new',
-        '-o', "ConnectTimeout=#{@config.ssh_connect_timeout}",
-        '-p', @config.ssh_port.to_s
-      ]
-      if File.exist?(@config.ssh_private_key_path)
-        command.concat(['-i', @config.ssh_private_key_path])
-      else
-        warn "SSH private key #{@config.ssh_private_key_path} does not exist; falling back to default ssh-agent identity."
-      end
-
-      command << "#{@config.ssh_username}@#{host}"
-      command << 'bash -se'
-      command
-    end
-
-    def with_polling(description, timeout: 900, interval: 5)
-      deadline = Time.now + timeout
-      loop do
-        result = yield
-        return result if result
-
-        raise Error, "Timed out waiting for #{description}." if Time.now >= deadline
-
-        sleep interval
-      end
-    end
-
-    def normalize_rule(rule)
-      {
-        'direction' => rule['direction'].to_s.downcase,
-        'ethertype' => rule['ethertype'].to_s,
-        'protocol' => rule['protocol'].to_s.downcase,
-        'port_range_min' => integer_or_nil(rule['port_range_min']),
-        'port_range_max' => integer_or_nil(rule['port_range_max']),
-        'remote_ip_prefix' => rule['remote_ip_prefix'].to_s
-      }
-    end
-
-    def print_create_dry_run(vm_name, resolved, payload)
-      info 'DRY RUN: no VM or state file will be created.'
-      info "State file: #{@state_store.path}"
-      info "Resolved environment: #{resolved[:environment]['name']} (region #{resolved[:environment]['region']})"
-      info "Resolved flavor: #{format_flavor(resolved[:flavor])}"
-      info "Resolved image: #{resolved[:image]['name']}"
-      info "Resolved SSH keypair: #{resolved[:keypair]['name']}"
-      info "Planned VM name: #{vm_name}"
-      info 'Create payload:'
-      @out.puts(JSON.pretty_generate(payload))
-      if @config.guest_bootstrap_enabled?
-        info 'Guest bootstrap script:'
-        @out.puts(guest_bootstrap_script)
-      else
-        info 'Guest bootstrap is disabled in config.'
-      end
-      if @config.ollama_install_enabled?
-        info "Ollama will be installed with models stored under #{@config.ollama_models_dir}"
-        unless desired_ollama_models.empty?
-          info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
-        end
-      end
-      if @config.wireguard_auto_setup?
-        info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>"
-      end
-      print_local_wireguard_summary(nil)
-    end
-
-    def print_resume_dry_run(state)
-      info "DRY RUN: would resume provisioning tracked VM #{state['vm_id']}."
-      begin
-        vm = @client.get_vm(state['vm_id'])
-        info "Tracked VM status: #{vm['status']} / #{vm['vm_state']}"
-        info "Tracked VM public IP: #{connect_host_for(vm) || 'none'}"
-      rescue Error => e
-        warn "Unable to inspect tracked VM #{state['vm_id']}: #{e.message}"
-      end
-      if @config.guest_bootstrap_enabled?
-        info 'Guest bootstrap script:'
-        @out.puts(guest_bootstrap_script)
-      end
-      if ollama_setup_needed?(state)
-        info "Ollama would be installed with models stored under #{@config.ollama_models_dir}"
-        unless desired_ollama_models.empty?
-          info "Ollama models to pre-pull: #{desired_ollama_models.join(', ')}"
-        end
-      end
-      if wireguard_setup_needed?(state)
-        info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}"
-      end
-      print_local_wireguard_summary(state['public_ip'])
-    end
-
-    def print_delete_dry_run(target_vm_id, state, preserve_state_on_failure:)
-      info 'DRY RUN: no VM will be deleted.'
-      begin
-        vm = @client.get_vm(target_vm_id)
-        info "Delete target: #{target_vm_id} #{vm['name']} (#{vm['status']} / #{vm['vm_state']})"
-        info "Delete target public IP: #{connect_host_for(vm) || 'none'}"
-      rescue Error => e
-        warn "Unable to inspect VM #{target_vm_id} before delete: #{e.message}"
-      end
-
-      if state && state['vm_id'].to_i == target_vm_id.to_i
-        action = preserve_state_on_failure ? 'would remain unchanged' : 'would be removed'
-        info "Tracked state file #{@state_store.path} #{action}."
-      else
-        info 'No tracked state entry would be modified.'
-      end
-    end
-
-    def format_flavor(flavor)
-      gpu = flavor['gpu'].to_s.empty? ? 'CPU-only' : flavor['gpu']
-      [
-        flavor['name'],
-        gpu,
-        "#{flavor['gpu_count']} GPU",
-        "#{flavor['ram']} GB RAM",
-        "#{flavor['cpu']} vCPU",
-        "stock=#{flavor['stock_available']}"
-      ].join(', ')
-    end
-
-    def guest_bootstrap_script
-      script = []
-      script << 'set -euo pipefail'
-
-      # Wait for any running unattended-upgrades or apt locks to release
-      # before attempting package operations (transient lock on fresh VMs)
-      script << 'echo "Waiting for apt locks to clear..."'
-      script << 'for i in $(seq 1 30); do'
-      script << '  if ! fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1; then break; fi'
-      script << '  echo "  apt lock held, waiting ($i/30)..."; sleep 10'
-      script << 'done'
-      script << 'sudo systemctl stop unattended-upgrades.service 2>/dev/null || true'
-      script << 'sudo systemctl disable unattended-upgrades.service 2>/dev/null || true'
-
-      if @config.install_wireguard?
-        script << 'which wg >/dev/null 2>&1 || (sudo apt-get update && sudo apt-get install -y wireguard)'
-      end
-
-      if @config.configure_ufw?
-        script << "sudo ufw allow #{@config.ssh_port}/tcp comment 'Allow SSH' >/dev/null 2>&1 || true"
-        script << 'sudo ufw --force enable >/dev/null 2>&1 || true'
-        script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
-        script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Ollama via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
-      end
-
-      if @config.configure_ollama_host?
-        # Only write a minimal OLLAMA_HOST override if no override exists yet;
-        # ollama_setup_script writes the full override (OLLAMA_MODELS, GPU_OVERHEAD, etc.)
-        script << "if systemctl list-unit-files | grep -q '^ollama.service'; then"
-        script << '  if [ ! -f /etc/systemd/system/ollama.service.d/override.conf ]; then'
-        script << '    sudo mkdir -p /etc/systemd/system/ollama.service.d'
-        script << "    cat <<'OVERRIDE' | sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null"
-        script << '[Service]'
-        script << "Environment=\"OLLAMA_HOST=0.0.0.0:#{@config.ollama_port}\""
-        script << 'OVERRIDE'
-        script << '    sudo systemctl daemon-reload'
-        script << '    sudo systemctl restart ollama || true'
-        script << '  fi'
-        script << 'fi'
-      end
-
-      script << 'echo bootstrap-ok'
-      script.join("\n")
-    end
-
-    def desired_ollama_models
-      normalized_model_list(@config.ollama_pull_models)
-    end
-
-    def normalized_model_list(models)
-      Array(models).each_with_object([]) do |model, ordered|
-        normalized = model.to_s.strip
-        next if normalized.empty? || ordered.include?(normalized)
-
-        ordered << normalized
-      end
-    end
-
-    def model_list_signature(models)
-      normalized_model_list(models).sort
-    end
-
-    # Installs the Ollama binary, configures the systemd override (models dir,
-    # listen host, GPU overhead, parallelism), and starts the service.  Model
-    # pulls are handled separately by ollama_pull_script so that the WireGuard
-    # tunnel can be established first.
-    def ollama_install_script
-      models_dir = @config.ollama_models_dir
-      listen_host = @config.ollama_listen_host
-
-      script = []
-      script << 'set -euo pipefail'
-      script << 'sudo pkill -f unattended-upgrade >/dev/null 2>&1 || true'
-      script << "if ! command -v ollama >/dev/null 2>&1; then curl -fsSL https://ollama.ai/install.sh | sh; fi"
-      if models_dir.start_with?('/ephemeral')
-        script << "mountpoint -q /ephemeral || { echo 'Expected /ephemeral mount is missing'; exit 1; }"
-      end
-      script << "sudo mkdir -p #{Shellwords.escape(models_dir)}"
-      script << "sudo chown -R ollama:ollama #{Shellwords.escape(File.dirname(models_dir))}"
-      script << 'sudo mkdir -p /etc/systemd/system/ollama.service.d'
-      script << "cat <<'OVERRIDE' | sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null"
-      script << '[Service]'
-      script << "Environment=\"OLLAMA_MODELS=#{models_dir}\""
-      script << "Environment=\"OLLAMA_GPU_OVERHEAD=#{@config.ollama_gpu_overhead_mb}\""
-      script << "Environment=\"OLLAMA_NUM_PARALLEL=#{@config.ollama_num_parallel}\""
-      script << "Environment=\"OLLAMA_HOST=#{listen_host}\""
-      script << 'OVERRIDE'
-      script << 'sudo systemctl daemon-reload'
-      script << 'sudo systemctl enable --now ollama'
-      script << 'sudo systemctl restart ollama'
-      script << 'sleep 3'
-      script << 'systemctl is-active --quiet ollama'
-      script << 'echo ollama-install-ok'
-      script.join("\n")
-    end
-
-    # Pulls each configured model with retry and per-model + final verification.
-    # Run after WireGuard is up so the user can monitor progress over the tunnel.
-    def ollama_pull_script
-      models_dir = @config.ollama_models_dir
-      model_pulls = desired_ollama_models
-
-      script = []
-      script << 'set -euo pipefail'
-      # Pull each model with retry (transient network failures) and verify
-      # it is actually present afterwards
-      model_pulls.each do |model|
-        escaped = Shellwords.escape(model)
-        script << "echo \"Pulling model #{model}...\""
-        script << "for attempt in 1 2 3; do"
-        script << "  if ollama pull #{escaped}; then break; fi"
-        script << "  if [ \"$attempt\" -eq 3 ]; then echo \"FATAL: failed to pull #{model} after 3 attempts\"; exit 1; fi"
-        script << "  echo \"  pull attempt $attempt failed, retrying in 15s...\"; sleep 15"
-        script << "done"
-        script << "ollama show #{escaped} --modelfile >/dev/null 2>&1 || { echo \"FATAL: model #{model} not found after pull\"; exit 1; }"
-      end
-      # Final verification: ensure all expected models are listed
-      script << 'echo "Verifying all models are present..."'
-      model_pulls.each do |model|
-        escaped = Shellwords.escape(model)
-        script << "ollama show #{escaped} --modelfile >/dev/null 2>&1 || { echo \"FATAL: model #{model} missing in final check\"; exit 1; }"
-      end
-      script << "echo ollama-models-dir=#{models_dir}"
-      script << 'echo ollama-ok'
-      script.join("\n")
-    end
-
-    def integer_or_nil(value)
-      value.nil? ? nil : Integer(value)
-    end
-
-    def print_local_wireguard_summary(expected_ip)
-      return unless @config.local_client_checks_enabled?
-
-      wg_status = @local_wireguard.status
-      endpoint = wg_status['endpoint']
-      info "Local WireGuard #{@config.local_interface_name}: #{wg_status['service_state']}"
-      if endpoint
-        info "Local WireGuard endpoint: #{endpoint}"
-        if expected_ip
-          host, = endpoint.split(':', 2)
-          if host == expected_ip
-            info 'Local WireGuard endpoint matches the managed VM IP.'
-          else
-            warn "Local WireGuard endpoint points to #{host}, expected #{expected_ip}."
-          end
-        end
-      else
-        warn "Unable to read #{@config.local_wg_config_path} for local WireGuard endpoint validation."
-      end
-    end
-
-    def info(message)
-      @out.puts(message)
-    end
-
-    def warn(message)
-      @out.puts("WARN: #{message}")
-    end
-  end
-
-  class CLI
-    def initialize(argv)
-      @argv = argv.dup
-    end
-
-    def run
-      global = {
-        config_path: File.join(__dir__, 'hyperstack-vm.toml')
-      }
-
-      global_parser = OptionParser.new do |opts|
-        opts.banner = 'Usage: ruby hyperstack_vm.rb [--config path] <create|delete|status> [options]'
-        opts.on('--config PATH', "Path to TOML config (default: #{global[:config_path]})") do |value|
-          global[:config_path] = value
-        end
-        opts.on('-h', '--help', 'Show help') do
-          puts opts
-          puts
-          puts 'Commands:'
-          puts '  create [--replace] [--dry-run]'
-          puts '  delete [--vm-id ID] [--dry-run]'
-          puts '  status'
-          exit 0
-        end
-      end
-      global_parser.order!(@argv)
-
-      command = @argv.shift
-      raise Error, 'Missing command. Use create, delete, or status.' if command.nil?
-
-      config = Config.load(global[:config_path])
-      state_store = StateStore.new(config.state_file)
-      client = HyperstackClient.new(base_url: config.api_base_url, api_key: config.api_key)
-      local_wireguard = LocalWireGuard.new(
-        interface_name: config.local_interface_name,
-        config_path: config.local_wg_config_path
-      )
-      manager = Manager.new(
-        config: config,
-        client: client,
-        state_store: state_store,
-        local_wireguard: local_wireguard
-      )
-
-      case command
-      when 'create'
-        replace = false
-        dry_run = false
-        parser = OptionParser.new do |opts|
-          opts.on('--replace', 'Delete the tracked VM before creating a new one') { replace = true }
-          opts.on('--dry-run', 'Resolve config and print the create plan without creating a VM') { dry_run = true }
-        end
-        parser.parse!(@argv)
-        manager.create(replace: replace, dry_run: dry_run)
-      when 'delete'
-        vm_id = nil
-        dry_run = false
-        parser = OptionParser.new do |opts|
-          opts.on('--vm-id ID', Integer, 'Delete a VM by ID instead of using the local state file') do |value|
-            vm_id = value
-          end
-          opts.on('--dry-run', 'Show which VM would be deleted without deleting it') { dry_run = true }
-        end
-        parser.parse!(@argv)
-        manager.delete(vm_id: vm_id, dry_run: dry_run)
-      when 'status'
-        manager.status
-      else
-        raise Error, "Unknown command #{command.inspect}. Use create, delete, or status."
-      end
-    end
-  end
-end
-
-begin
-  HyperstackVM::CLI.new(ARGV).run
-rescue HyperstackVM::Error => e
-  warn "ERROR: #{e.message}"
-  exit 1
-end
diff --git a/snippets/hyperstack/vllm-setup.txt b/snippets/hyperstack/vllm-setup.txt
new file mode 100644
index 0000000..9ea44a7
--- /dev/null
+++ b/snippets/hyperstack/vllm-setup.txt
@@ -0,0 +1,487 @@
+# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+#
+# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
+# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+#
+# Architecture:
+#
+#   Claude Code (earth)                    Hyperstack VM (A100 80GB)
+#   ┌─────────────┐                       ┌──────────────────────────────┐
+#   │ claude CLI   │── Anthropic API ──>  │ LiteLLM proxy (:4000)       │
+#   │              │   /v1/messages        │   translates Anthropic →    │
+#   │              │   via WireGuard wg1   │   OpenAI chat completions   │
+#   └─────────────┘                       │         │                    │
+#                                         │         ▼                    │
+#   OpenCode (earth)                      │ vLLM engine (:11434)        │
+#   ┌─────────────┐                       │   /v1/chat/completions      │
+#   │ opencode     │── OpenAI API ──────> │   FlashAttention v2         │
+#   │              │   /v1/chat/completions│   prefix caching            │
+#   └─────────────┘                       │   bullpoint/Qwen3-Coder-    │
+#                                         │     Next-AWQ-4bit (45GB)    │
+#                                         └──────────────────────────────┘
+#
+# Why vLLM instead of Ollama:
+#   - FlashAttention v2: ~1.5-2x faster prefill for long prompts
+#   - Block-level prefix caching: partial KV cache reuse even when prompt
+#     changes mid-sequence (Ollama requires exact prefix match from token 0)
+#   - Chunked prefill: can interleave prefill and decode
+#   - Marlin kernels for AWQ MoE quantization
+#
+# Why LiteLLM:
+#   - Claude Code speaks Anthropic Messages API (/v1/messages) only
+#   - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
+#   - LiteLLM translates between them, mapping Claude model names to the
+#     actual vLLM model
+#
+# Model details:
+#   - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
+#   - Architecture: MoE, 80B total params, 3B active per token
+#   - 512 experts, 10 activated + 1 shared per token
+#   - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
+#   - Quantization: AWQ 4-bit, group size 32
+#   - Disk size: ~45GB (vs ~151GB at BF16)
+#   - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
+#   - Context: 262,144 tokens (256k native)
+#   - vLLM requirement: >= 0.15.0
+#
+# Hardware requirements:
+#   - Minimum: 1x A100 80GB (PCIe or SXM)
+#   - VRAM breakdown at gpu_memory_utilization=0.92:
+#       Model weights:  ~45 GiB
+#       KV cache:       ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
+#       CUDA graphs:    ~3 GiB
+#       Total:          ~75 GiB / 80 GiB
+#
+# Ports:
+#   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
+#   4000/tcp  - LiteLLM Anthropic-compatible proxy
+#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
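+#
+# On the VM this restriction can be expressed as UFW rules. A minimal sketch,
+# assuming UFW is the host firewall and is already enabled (the comment
+# labels are illustrative, not taken from the script):
+#
+#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp comment 'vLLM via wg1'
+#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp comment 'LiteLLM via wg1'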
+
+# ===========================================================================
+# STEP 1: Prerequisites
+# ===========================================================================
+# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
+# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
+# - Ollama stopped and disabled if previously running:
+#
+#   sudo systemctl stop ollama
+#   sudo systemctl disable ollama
+
+# ===========================================================================
+# STEP 2: Storage setup
+# ===========================================================================
+# HuggingFace model cache on ephemeral storage (fast NVMe). It survives reboots
+# on some providers, but that is not guaranteed; the model re-downloads if lost.
+#
+#   sudo mkdir -p /ephemeral/hug
+#   sudo chmod -R 0777 /ephemeral/hug
+
+# ===========================================================================
+# STEP 3: vLLM Docker container
+# ===========================================================================
+# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
+# After download, model loading takes ~65s and CUDA graph capture ~35s.
+# Total cold start: ~4-5 minutes.
+#
+#   docker pull vllm/vllm-openai:latest
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Flags explained:
+#   --tensor-parallel-size 1    Single GPU (use 2/4 for multi-GPU setups)
+#   --enable-auto-tool-choice   Enables function/tool calling
+#   --tool-call-parser qwen3_coder   Parser for qwen3-coder tool format
+#   --enable-prefix-caching     Block-level KV cache reuse across requests
+#   --gpu-memory-utilization 0.92   Use 92% of VRAM (rest for OS/overhead)
+#   --max-model-len 262144      Full 256k context window
+#   --port 11434                Reuse Ollama port for firewall compatibility
+#
+# Verify startup (wait for "Application startup complete"):
+#   docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
+#
+# Verify model loaded:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Quick inference test:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Monitor performance (prefix cache hit rate, throughput):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+
+# ===========================================================================
+# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
+# ===========================================================================
+# Install in a Python venv (Ubuntu 24.04 blocks system-wide pip installs, PEP 668):
+#
+#   sudo apt-get install -y python3.12-venv
+#   sudo mkdir -p /ephemeral/litellm-env
+#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
+#   python3 -m venv /ephemeral/litellm-env
+#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
+#
+# Write config file:
+#
+#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
+#   model_list:
+#     - model_name: "claude-sonnet-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-20250514"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-opus-4-6-20260604"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#     - model_name: "claude-3-5-haiku-20241022"
+#       litellm_params:
+#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+#         api_base: "http://localhost:11434/v1"
+#         api_key: "EMPTY"
+#
+#   litellm_settings:
+#     drop_params: true
+#
+#   general_settings:
+#     master_key: "sk-litellm-master"
+#   YAML
+#
+# Config notes:
+#   - model_name values must match what Claude Code sends (Claude model IDs)
+#   - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
+#     (not /v1/responses which vLLM doesn't fully support for complex messages)
+#   - drop_params: true — silently drops Claude-specific parameters like
+#     context_management that vLLM doesn't understand
+#   - master_key is the API key clients must send
+#   - Add new model_name entries when Anthropic releases new model IDs
+#
+# Start LiteLLM:
+#
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 \
+#     --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# Verify:
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+#
+# For production, create a systemd service instead of nohup:
+#
+#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
+#   [Unit]
+#   Description=LiteLLM Proxy
+#   After=network.target docker.service
+#   Requires=docker.service
+#
+#   [Service]
+#   Type=simple
+#   User=ubuntu
+#   ExecStart=/ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000
+#   Restart=always
+#   RestartSec=5
+#
+#   [Install]
+#   WantedBy=multi-user.target
+#   UNIT
+#
+#   sudo systemctl daemon-reload
+#   sudo systemctl enable --now litellm
+
+# ===========================================================================
+# STEP 5: Firewall rules
+# ===========================================================================
+# Allow access from WireGuard subnet only:
+#
+#   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
+#     comment 'vLLM via wg1'
+#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
+#     comment 'LiteLLM proxy via wg1'
+
+# ===========================================================================
+# STEP 6: Client configuration (on earth / local machine)
+# ===========================================================================
+#
+# --- Claude Code ---
+# Launch with environment variables pointing at LiteLLM proxy:
+#
+#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#   ANTHROPIC_API_KEY=sk-litellm-master \
+#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
+#
+# Fish shell alias (add to ~/.config/fish/config.fish):
+#
+#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
+#     ANTHROPIC_API_KEY=sk-litellm-master \
+#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
+#
+# --- OpenCode ---
+# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+#
+#   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
+#   OPENAI_API_KEY=EMPTY \
+#   opencode
+#
+# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+
+# ===========================================================================
+# STEP 7: Monitoring & troubleshooting
+# ===========================================================================
+#
+# --- Live engine stats ---
+# vLLM logs engine metrics every 10 seconds. Key fields:
+#   - Avg prompt throughput:     prefill speed (tokens/s), higher = faster
+#   - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
+#   - GPU KV cache usage:        % of KV cache memory in use (proportional to
+#                                 active context length vs max capacity)
+#   - Prefix cache hit rate:     % of prompt tokens served from cache (0% for
+#                                 Claude Code, higher for OpenCode)
+#   - Running/Waiting:           active and queued request counts
+#
+# Follow live (all stats):
+#   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# Example output:
+#   Engine 000: Avg prompt throughput: 5555.2 tokens/s,
+#               Avg generation throughput: 49.4 tokens/s,
+#               Running: 1 reqs, Waiting: 0 reqs,
+#               GPU KV cache usage: 4.6%,
+#               Prefix cache hit rate: 0.0%
+#
+# --- Request-level monitoring ---
+# See individual HTTP requests (client address, method, path, status):
+#   docker logs -f vllm_qwen3 2>&1 | grep "POST"
+#
+# Example output:
+#   127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+#
+# --- One-liner: last minute stats ---
+# Useful for periodic checks without following the log:
+#   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
+#
+# --- LiteLLM proxy log ---
+#   tail -f /ephemeral/litellm.log   # under systemd: journalctl -u litellm -f
+#
+# --- GPU hardware stats ---
+# Snapshot:
+#   nvidia-smi
+#
+# Continuous (every 5 seconds):
+#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
+#     --format=csv -l 5
+#
+# --- Interpreting the stats ---
+#
+# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
+#   Prefill throughput:   5,000-11,000 tok/s (bursts higher during batch prefill)
+#   Decode throughput:    40-99 tok/s (varies with output length per sample)
+#   KV cache usage:       0-5% for short conversations, grows with context
+#                         (100% = 298k tokens, at which point requests queue)
+#   Prefix cache hit:     0% for Claude Code (expected: it mutates the prompt prefix)
+#                         >50% for OpenCode after a few turns
+#   Temperature:          44-60C under load, <45C idle
+#   Power:                70W idle, 230-240W under load, 300W max
+#
+# Warning signs:
+#   - Waiting > 0 for extended periods → requests queuing, model overloaded
+#   - KV cache usage near 100% → context too long, reduce --max-model-len
+#   - Decode throughput < 20 tok/s sustained → possible thermal throttling
+#   - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
+#
+# Common issues:
+#
+# 1. OOM on startup with --max-model-len 262144
+#    → Reduce to 131072 or 65536
+#
+# 2. "model does not exist" from vLLM
+#    → Name after "hosted_vllm/" in LiteLLM config must equal vLLM's --model value
+#
+# 3. LiteLLM returns UnsupportedParamsError
+#    → Ensure drop_params: true is in litellm_settings
+#
+# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
+#    → Use "hosted_vllm/" prefix in model field, not "openai/"
+#
+# 5. Claude Code "Auth conflict" warning
+#    → Run `claude /logout` first to clear the claude.ai session token,
+#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
+#
+# 6. Prefix cache hit rate stays at 0%
+#    → Normal for Claude Code (it mutates the prompt prefix each turn)
+#    → OpenCode should show increasing cache hit rates after a few turns
+#
+# 7. vLLM container won't start (CUDA version mismatch)
+#    → Check driver version: nvidia-smi
+#    → vLLM requires CUDA >= 12.x and driver >= 535
+
+# ===========================================================================
+# STEP 8: Loading / switching models
+# ===========================================================================
+#
+# vLLM serves one model per container. To switch models, stop the current
+# container and start a new one with different --model.
+#
+# --- Stop current model ---
+#   docker stop vllm_qwen3
+#   docker rm vllm_qwen3
+#
+# --- Run a different model ---
+# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
+# as needed. The HuggingFace model downloads automatically on first start.
+#
+# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_30b \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-30B-AWQ \
+#     --tensor-parallel-size 1 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.92 \
+#     --max-model-len 131072 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# Example: full-precision model on multi-GPU (e.g. 4x H100)
+#
+#   docker run -d \
+#     --gpus all \
+#     --ipc=host \
+#     --network host \
+#     --name vllm_qwen3_fp16 \
+#     --restart always \
+#     -v /ephemeral/hug:/root/.cache/huggingface \
+#     vllm/vllm-openai:latest \
+#     --model Qwen/Qwen3-Coder-Next \
+#     --tensor-parallel-size 4 \
+#     --enable-auto-tool-choice \
+#     --tool-call-parser qwen3_coder \
+#     --enable-prefix-caching \
+#     --gpu-memory-utilization 0.90 \
+#     --max-model-len 262144 \
+#     --host 0.0.0.0 \
+#     --port 11434
+#
+# --- Update LiteLLM config to match ---
+# After switching models, update the model field in litellm-config.yaml
+# to match the new HuggingFace model name:
+#
+#   model: "hosted_vllm/<new-model-name>"
+#
+# Then restart LiteLLM (under systemd use: sudo systemctl restart litellm):
+#   pkill -f litellm
+#   nohup /ephemeral/litellm-env/bin/litellm \
+#     --config /ephemeral/litellm-config.yaml \
+#     --host 0.0.0.0 --port 4000 \
+#     > /ephemeral/litellm.log 2>&1 &
+#
+# --- Finding models ---
+# Search HuggingFace for vLLM-compatible quantized models:
+#   https://huggingface.co/models?search=<model-name>+awq
+#   https://huggingface.co/models?search=<model-name>+gptq
+#
+# Supported quantization formats in vLLM:
+#   - AWQ (recommended): fast Marlin kernels, good quality
+#   - GPTQ: similar to AWQ, widely available
+#   - FP8: 8-bit, native support needs Ada Lovelace or Hopper GPUs (e.g. L40S, H100)
+#   - BF16/FP16: full precision, needs more VRAM
+#
+# --- VRAM sizing guide ---
+# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
+#
+#   Model size (params)  | AWQ 4-bit VRAM | Max context (remaining for KV)
+#   ---------------------|----------------|-------------------------------
+#   7-8B                 | ~5 GiB         | 262k+ (plenty of KV headroom)
+#   14B                  | ~9 GiB         | 262k+ (plenty of KV headroom)
+#   30-32B               | ~18 GiB        | 262k  (~57 GiB for KV cache)
+#   70-80B (MoE, 3B act) | ~45 GiB        | 262k  (~27 GiB for KV cache)
+#   70B (dense)          | ~38 GiB        | 131k  (~37 GiB for KV cache)
+#   120B+                | won't fit      | use multi-GPU or smaller quant
+#
+# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
+# halves the KV cache one full-length request needs). If still OOM, reduce
+# --gpu-memory-utilization to 0.85 or try a smaller model.
+#
+# --- Verifying the new model ---
+# Check loaded model:
+#   curl -s http://localhost:11434/v1/models | python3 -m json.tool
+#
+# Test inference:
+#   curl -s http://localhost:11434/v1/chat/completions \
+#     -H "Content-Type: application/json" \
+#     -H "Authorization: Bearer EMPTY" \
+#     -d '{"model":"<model-name>",
+#          "messages":[{"role":"user","content":"Hello"}],
+#          "max_tokens":50}'
+#
+# Test via LiteLLM (Anthropic API):
+#   curl -s http://localhost:4000/v1/messages \
+#     -H "Content-Type: application/json" \
+#     -H "x-api-key: sk-litellm-master" \
+#     -H "anthropic-version: 2023-06-01" \
+#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
+#          "messages":[{"role":"user","content":"Hello"}]}'
+
+# ===========================================================================
+# Performance characteristics (A100 80GB PCIe, single GPU)
+# ===========================================================================
+#
+# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
+#
+#   vLLM prefill throughput:    5,000-11,000 tok/s (FlashAttention v2)
+#   vLLM decode throughput:     40-99 tok/s (memory-bandwidth limited)
+#   Per-turn latency:           ~10-15s (small prompts, early conversation)
+#   KV cache usage:             2-5% for typical coding sessions
+#   Prefix cache hit rate:      0% (Claude Code), expected >50% (OpenCode)
+#
+# Comparison with Ollama on same hardware (A100 80GB PCIe):
+#
+#                          | Ollama (Q4_K_M)       | vLLM (AWQ 4-bit)
+#   -----------------------|-----------------------|----------------------
+#   Prefill throughput     | ~1,000 tok/s (est.)   | 5,000-11,000 tok/s
+#   Decode throughput      | ~40 tok/s             | 40-99 tok/s
+#   Per-turn latency       | ~28s (32k ctx)        | ~10-15s
+#   Context window         | 32k (was truncating)  | 262k (full, no truncation)
+#   Prefix cache (Claude)  | 0% always             | 0% always
+#   Prefix cache (OpenCode)| 85-95% when warm      | expected similar or better
+#   VRAM usage             | 52-61 GiB             | 75 GiB (more KV cache)
-- 
cgit v1.2.3