| Age | Commit message | Author |
|
|
hyperstack.rb and wg1-setup.sh for multi-VM WireGuard support
|
|
- model switch now passes pull_image: false to avoid surprise multi-GB
image downloads when the upstream vLLM image has been updated;
docker pull is still run on initial install (pull_image: true default)
- mount /ephemeral/vllm_cache → /root/.cache/vllm so torch.compile
artifacts survive container restarts; saves ~30-60 s on warm switches
- add vllm_compile_cache_dir helper (sibling of hug_cache_dir)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
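A minimal sketch of how the pull_image flag and the compile-cache mount described above might fit together. The method vllm_docker_commands, its keyword arguments, and the container name are hypothetical; only the two paths, the pull_image default, and the vllm_compile_cache_dir helper name come from the commit message.

```ruby
# Host directory that holds torch.compile artifacts (path from the commit message).
def vllm_compile_cache_dir
  "/ephemeral/vllm_cache"
end

# Hypothetical builder: returns the docker commands as argument arrays.
# pull_image defaults to true (initial install); a model switch passes false.
def vllm_docker_commands(image:, container_name:, pull_image: true)
  cmds = []
  cmds << ["docker", "pull", image] if pull_image
  cmds << [
    "docker", "run", "-d", "--name", container_name,
    # Bind-mount the cache so compiled artifacts survive container restarts.
    "-v", "#{vllm_compile_cache_dir}:/root/.cache/vllm",
    image
  ]
  cmds
end
```

With pull_image: false the returned list holds only the docker run command, so a switch reuses whatever image is already on the VM.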
|
|
- Created ConfigLoader for TOML loading and validation
- Kept Config for configuration value access only
- Reduced Config from 489 lines to ~200 lines
- Fixed CLI to use ConfigLoader and pass @path to Config
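The split described above could look roughly like this. It is a sketch, not the actual classes: the parser: argument, the fetch method, and the REQUIRED_SECTIONS list are assumptions made for illustration, and TOML parsing is injected as a callable so the example does not depend on a particular gem.

```ruby
# Value access only: no file I/O, no validation.
class Config
  attr_reader :path

  def initialize(data, path)
    @data = data
    @path = path
  end

  # Look up a nested value, e.g. fetch("vllm", "model").
  def fetch(*keys)
    @data.dig(*keys)
  end
end

# Loading and validation only; hands a validated hash to Config.
class ConfigLoader
  REQUIRED_SECTIONS = %w[vllm].freeze # assumption, for illustration

  # parser is any callable that maps a file path to a Hash.
  def initialize(path, parser:)
    @path = path
    @parser = parser
  end

  def load
    data = @parser.call(@path)
    raise ArgumentError, "#{@path}: expected a table at top level" unless data.is_a?(Hash)

    missing = REQUIRED_SECTIONS.reject { |key| data.key?(key) }
    raise ArgumentError, "#{@path}: missing sections: #{missing.join(', ')}" unless missing.empty?

    Config.new(data, @path)
  end
end
```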
|
|
Show the currently loaded model (from state file, or config default)
so it's immediately visible without running `model list`.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
known_hosts
- hyperstack-vm.toml: set tool_call_parser=llama3_json for nemotron-super so vLLM
accepts tool_choice requests from opencode; the model does not call tools
spontaneously, so the vLLM 0.17.1 token_ids crash in llama3_json is not triggered
- hyperstack.rb: wait_for_ssh now also removes the WireGuard hostname
(hyperstack.wg1) from known_hosts alongside the IP, preventing
StrictHostKeyChecking failures across VM recreates
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
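One way to express the known_hosts cleanup described above is to issue one removal per host identity. The commit does not say how wait_for_ssh removes entries; this sketch assumes ssh-keygen -R, and the method name known_hosts_cleanup_commands is hypothetical.

```ruby
# Build one `ssh-keygen -R HOST` command per identity the VM is reachable under.
# Assumption: entries are removed with ssh-keygen -R; the real code may differ.
def known_hosts_cleanup_commands(ip, wg_hostname)
  [ip, wg_hostname].compact.uniq.map { |host| ["ssh-keygen", "-R", host] }
end
```

Removing both the IP and the WireGuard hostname matters because a recreated VM presents a new host key under either name.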
|
|
deepseek-r1-32b, qwen3-32b, devstral presets
- hyperstack.rb: add extra_vllm_args array field to preset resolver and
vllm_install_script; flags are appended verbatim to the docker run command,
enabling per-preset vLLM flags (reasoning parsers, Mistral loader)
- hyperstack.rb: show extra_args in dry-run model switch output
- hyperstack-vm.toml: fix nemotron-super to use actual NVIDIA Nemotron-3-Super-120B-A12B
AWQ (cyankiwi) with trust_remote_code=true; previous preset incorrectly used llama-3.3-70b
- hyperstack-vm.toml: add deepseek-r1-32b (--reasoning-parser deepseek_r1, ~18 GB)
- hyperstack-vm.toml: add qwen3-32b (--reasoning-parser deepseek_r1, ~18 GB)
- hyperstack-vm.toml: add devstral (Mistral tokenizer+config format, ~15 GB); --load_format
mistral omitted because AWQ weights are in standard HF safetensors format
All 6 new/updated presets end-to-end tested on A100 80GB (vLLM 0.17.1).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
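A sketch of how per-preset flags might be appended verbatim, as described above. The method name vllm_serve_args and the hash keys other than extra_vllm_args and trust_remote_code are assumptions, and the model name in the usage line is a placeholder rather than a real preset value.

```ruby
# Build the vLLM argument list for a preset hash.
# extra_vllm_args is appended as-is, with no parsing or validation.
def vllm_serve_args(preset)
  args = ["--model", preset.fetch("model")]
  args << "--trust-remote-code" if preset["trust_remote_code"]
  args.concat(Array(preset["extra_vllm_args"]))
  args
end

# Placeholder preset for illustration only.
example = {
  "model" => "example/placeholder-model-awq",
  "extra_vllm_args" => ["--reasoning-parser", "deepseek_r1"]
}
puts vllm_serve_args(example).join(" ")
```

Appending verbatim keeps the resolver simple, at the cost of letting a typo in the TOML reach the docker run command unchecked.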
|
|
WireGuard
New vLLM model presets (all end-to-end tested on A100 80GB):
- gpt-oss-20b: openai/gpt-oss-20b — MoE 20B, ~14 GB MXFP4, ultra-fast (3.6B active)
- gpt-oss-120b: openai/gpt-oss-120b — MoE 120B, ~65 GB MXFP4, powerful reasoning
- qwen25-coder-32b: Qwen/Qwen2.5-Coder-32B-Instruct-AWQ — ~18 GB, best 32B coder
- qwen3-coder-30b: QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ — ~18 GB Qwen3 coder
gpt-oss models disable --enable-auto-tool-choice (tool_call_parser = ""): vLLM 0.17.1's
llama3_json parser crashes on gpt-oss responses because the new token_ids field in the
response is passed as an unexpected keyword argument to extract_tool_calls().
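The empty-parser convention above can be sketched as follows. The method name tool_choice_flags is hypothetical; the two flags are the vLLM server options named in this log.

```ruby
# Return the tool-calling flags for a preset, or none when the parser is blank.
# An empty tool_call_parser therefore also drops --enable-auto-tool-choice.
def tool_choice_flags(tool_call_parser)
  parser = tool_call_parser.to_s.strip
  return [] if parser.empty?

  ["--enable-auto-tool-choice", "--tool-call-parser", parser]
end
```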
gpt-oss-120b max_model_len raised to 40960: Claude Code's system prompt alone is ~33K
tokens, so 16K was insufficient. 40K allows Claude Code to connect with headroom.
Use wireguard_gateway_hostname (hyperstack.wg1) instead of raw 192.168.3.1 IP for all
connection URLs (tests, ready message, dry-run output). The hostname is derived from the
wg interface name and resolves via /etc/hosts.
Fix test max_tokens: raised from 50 to 500 so reasoning models (e.g. gpt-oss) have
enough tokens to complete chain-of-thought before producing content.
Fix qwen25-coder-32b max_model_len: model config has max_position_embeddings=32768,
not 128K as assumed. Using 65536 caused a vLLM pydantic validation error.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
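A small sketch of deriving the gateway hostname from the WireGuard interface name, as described above. The "hyperstack" prefix as a parameter, the connection_urls helper, and the http scheme are assumptions; the ports are the ones this log mentions for vLLM (11434) and LiteLLM (4000).

```ruby
# Derive the hostname from the interface name, e.g. "wg1" -> "hyperstack.wg1".
# The name is expected to resolve via /etc/hosts on the client.
def wireguard_gateway_hostname(interface, prefix: "hyperstack")
  "#{prefix}.#{interface}"
end

# Hypothetical helper: connection URLs built from the hostname, not the raw IP.
def connection_urls(interface)
  host = wireguard_gateway_hostname(interface)
  { vllm: "http://#{host}:11434/v1", litellm: "http://#{host}:4000" }
end
```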
|
|
Replace cyankiwi/Llama-3_3-Nemotron-Super-49B-v1_5-AWQ-4bit with
casperhansen/llama-3.3-70b-instruct-awq for the nemotron-super preset.
The NAS model's config.json has num_key_value_heads=null by design for
its heterogeneous per-layer attention architecture, which is incompatible
with vLLM's pydantic ModelConfig validation (requires int). No working
AWQ quant for this architecture exists; Llama-3.3-70B-Instruct AWQ is
a proven drop-in for the extended-analysis use case.
Also fix test_vllm to use the model reported by /v1/models instead of
the static config default, so tests pass after a model switch.
Add trust_remote_code support to vllm_install_script for future models
that require custom HuggingFace model code.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
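The test_vllm fix above amounts to reading the model id from the /v1/models response instead of the config. A minimal sketch, assuming the OpenAI-style response shape ({"data": [{"id": ...}]}); the method name served_model_id and the fallback: argument are hypothetical.

```ruby
require "json"

# Pick the model id the server actually reports; fall back to the
# configured default if the response is empty or not valid JSON.
def served_model_id(models_json, fallback:)
  first = JSON.parse(models_json).fetch("data", []).first
  first ? first.fetch("id") : fallback
rescue JSON::ParserError
  fallback
end
```

Using the reported id keeps the chat-completion test valid after a model switch, when the running model no longer matches the static default.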
|
|
- New [vllm.presets.*] TOML section with two presets:
  qwen3-coder-next: bullpoint/Qwen3-Coder-Next-AWQ-4bit (256k ctx, coding)
  nemotron-super: solidrust/Llama-3.3-Nemotron-Super-49B-v1-AWQ (131k ctx, analysis)
- New CLI subcommand: `model list` — show presets, mark the active one
- New CLI subcommand: `model switch PRESET [--dry-run]` — switch the
running VM to a different preset without redeploying:
1. stops old Docker container (if container_name differs)
2. starts new container and waits for model readiness
3. hot-reloads LiteLLM config via litellm_reload_script (no venv reinstall)
4. updates state file with new vllm_model / vllm_container_name / vllm_preset
- New `create --model PRESET` flag — deploy with a non-default preset
- vllm_install_script and litellm_install_script now accept preset_config:
and model_override: keyword arguments so callers can override individual
fields without duplicating the full config
- State file now tracks vllm_container_name and vllm_preset for clean
container lifecycle management across switches
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
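The four switch steps above can be sketched as a single function. This is an illustration under assumptions: the switch_model name, the runner: callable, its step symbols, and the preset hash keys are hypothetical; the three state keys are the ones named in the commit message.

```ruby
# Run the switch steps in order and return the updated state hash.
# runner is any callable taking (step, argument); it performs the real work.
def switch_model(state, preset_name, preset, runner:)
  old_name = state["vllm_container_name"]
  new_name = preset.fetch("container_name")
  model = preset.fetch("model")

  # 1. Stop the old container only if the container name differs.
  runner.call(:stop_container, old_name) if old_name && old_name != new_name
  # 2. Start the new container and wait for the model to be ready.
  runner.call(:start_container, new_name)
  runner.call(:wait_until_ready, model)
  # 3. Hot-reload the LiteLLM config (no venv reinstall).
  runner.call(:reload_litellm, model)
  # 4. Record the new model, container, and preset in the state.
  state.merge(
    "vllm_model" => model,
    "vllm_container_name" => new_name,
    "vllm_preset" => preset_name
  )
end
```

Passing a runner that only records its arguments gives a natural dry-run mode: the same sequence is walked without touching the VM.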
|
|
- Replace Ollama (now disabled by default) with a vLLM Docker container plus a
LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434,
polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude
model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
override config at runtime
- New `test` command: end-to-end inference test over WireGuard against
vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM)
from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI
reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
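The readiness polling mentioned above (up to 10 minutes on first download) could be structured like this. It is a sketch: the wait_until_ready name, the probe: callable, and the 5-second interval are assumptions; only the 10-minute ceiling comes from the commit message. The clock and sleeper are injectable so the loop can be exercised without real waiting.

```ruby
# Poll probe until it returns true or the timeout elapses.
# Returns true when ready, false on timeout.
def wait_until_ready(probe:, timeout: 600, interval: 5,
                     clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                     sleeper: ->(seconds) { sleep(seconds) })
  deadline = clock.call + timeout
  loop do
    return true if probe.call
    return false if clock.call >= deadline

    sleeper.call(interval)
  end
end
```

In practice the probe would be something like a request to /v1/models over WireGuard that returns true once the model is listed.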
|