| Age | Commit message (Collapse) | Author |
|
|
max_position_embeddings=131072 in the model's config.json; exceeding it causes
NaN outputs or CUDA out-of-bounds errors. vLLM rejected 163840 at startup.
Conversations that hit the 135K error must be restarted as a fresh opencode
conversation instead.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
131K was still too small: 135K-token conversations were observed in practice.
Physical KV capacity is about 168K tokens (10560 blocks × 16), so 160K is safe without OOM.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
MXFP4 KV cache is compact enough that vLLM allocated 10560 KV blocks of 16
tokens (about 168K tokens) at 0.92 utilization. The 40K limit was too
conservative and caused negative max_tokens errors in long Claude Code sessions.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
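The capacity arithmetic above can be checked with a small sketch. The block count (10560), block size (16) and the old 40960 limit come from the commit messages. The method names and the 45,000-token prompt are hypothetical. It also assumes, without confirmation from the source, that the negative max_tokens error arises when the completion budget is computed as the context limit minus the prompt length.

```ruby
# Token capacity of a paged KV cache: number of blocks times tokens per block.
def kv_token_capacity(num_blocks, block_size)
  num_blocks * block_size
end

# Assumption: the completion budget is the context limit minus the prompt length.
def remaining_tokens(max_model_len, prompt_tokens)
  max_model_len - prompt_tokens
end

capacity = kv_token_capacity(10_560, 16)
puts capacity                            # 168960
puts remaining_tokens(40_960, 45_000)    # -4040: the prompt already exceeds the old limit
puts remaining_tokens(capacity, 45_000)  # 123960
```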
|
|
Tested 1M context (NoPE allows arbitrary max_position_embeddings without
YaRN): it OOMs on A100 80GB because too little VRAM remains after the 60GB of
model weights. 256K (262144) is the practical ceiling on this hardware.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
Both Nemotron and Qwen3-XML use identical <tool_call><function=name>
<parameter=p>value</parameter></function></tool_call> format.
qwen3_xml correctly parses Nemotron's output; tool calling now works
with opencode and other API clients.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
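To make the shared format concrete, here is a minimal regex-based sketch that extracts the function name and parameters from one such block. It is an illustration only, not vLLM's qwen3_xml parser. The method name, the constants and the sample call (read_file, path) are hypothetical.

```ruby
# Illustrative parser for the XML-style tool call format quoted above.
# This is NOT vLLM's qwen3_xml implementation; all names are hypothetical.
TOOL_CALL_RE = %r{<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>}m
PARAM_RE     = %r{<parameter=([^>]+)>(.*?)</parameter>}m

# Returns one hash per tool call: { name: String, arguments: Hash }.
def parse_tool_calls(text)
  text.scan(TOOL_CALL_RE).map do |name, body|
    args = body.scan(PARAM_RE).to_h { |key, value| [key, value.strip] }
    { name: name, arguments: args }
  end
end

sample = "<tool_call><function=read_file>\n" \
         "<parameter=path>README.md</parameter></function></tool_call>"
p parse_tool_calls(sample)
```

A real parser also has to cope with streaming output and malformed blocks, which this sketch ignores.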
|
|
vLLM 0.17.1 has no tool call parser for Nemotron's custom XML format
(<tool_call><function=...><parameter=...>). Setting llama3_json produced
garbage output. Reverted to tool_call_parser="" with a clear comment.
Added --reasoning-parser nemotron_v3 via extra_vllm_args so <think> tokens
are properly exposed as reasoning_content in the API response.
For agentic work requiring tool calls, switch to qwen3-coder-next or devstral.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
known_hosts
- hyperstack-vm.toml: set tool_call_parser=llama3_json for nemotron-super so vLLM
accepts tool_choice requests from opencode; model won't spontaneously call tools
so the vLLM 0.17.1 token_ids crash in llama3_json won't trigger
- hyperstack.rb: wait_for_ssh now also removes the WireGuard hostname
(hyperstack.wg1) from known_hosts alongside the IP, preventing
StrictHostKeyChecking failures across VM recreates
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
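The known_hosts cleanup described above amounts to removing every name under which the VM can appear. A minimal sketch, assuming the removal is done with `ssh-keygen -R`: the helper name and the example IP (taken from the documentation range) are hypothetical and not from hyperstack.rb.

```ruby
require "shellwords"

# Hypothetical helper: build one `ssh-keygen -R` command per name under
# which the VM may be recorded in known_hosts (IP and WireGuard hostname).
def known_hosts_cleanup_commands(*hosts)
  hosts.compact.uniq.map { |host| "ssh-keygen -R #{Shellwords.escape(host)}" }
end

# 203.0.113.10 is a placeholder address, not the real VM IP.
puts known_hosts_cleanup_commands("203.0.113.10", "hyperstack.wg1")
```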
|
|
deepseek-r1-32b, qwen3-32b, devstral presets
- hyperstack.rb: add extra_vllm_args array field to preset resolver and
vllm_install_script; flags are appended verbatim to the docker run command,
enabling per-preset vLLM flags (reasoning parsers, Mistral loader)
- hyperstack.rb: show extra_args in dry-run model switch output
- hyperstack-vm.toml: fix nemotron-super to use actual NVIDIA Nemotron-3-Super-120B-A12B
AWQ (cyankiwi) with trust_remote_code=true; previous preset incorrectly used llama-3.3-70b
- hyperstack-vm.toml: add deepseek-r1-32b (--reasoning-parser deepseek_r1, ~18 GB)
- hyperstack-vm.toml: add qwen3-32b (--reasoning-parser deepseek_r1, ~18 GB)
- hyperstack-vm.toml: add devstral (Mistral tokenizer+config format, ~15 GB); --load_format
mistral omitted because AWQ weights are in standard HF safetensors format
All 6 new/updated presets end-to-end tested on A100 80GB (vLLM 0.17.1).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
WireGuard
New vLLM model presets (all end-to-end tested on A100 80GB):
- gpt-oss-20b: openai/gpt-oss-20b — MoE 20B, ~14 GB MXFP4, ultra-fast (3.6B active)
- gpt-oss-120b: openai/gpt-oss-120b — MoE 120B, ~65 GB MXFP4, powerful reasoning
- qwen25-coder-32b: Qwen/Qwen2.5-Coder-32B-Instruct-AWQ — ~18 GB, best 32B coder
- qwen3-coder-30b: QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ — ~18 GB Qwen3 coder
gpt-oss models disable --enable-auto-tool-choice (tool_call_parser = ""): vLLM 0.17.1's
llama3_json parser crashes on gpt-oss responses because the new token_ids field in the
response is passed as an unexpected keyword argument to extract_tool_calls().
gpt-oss-120b max_model_len raised to 40960: Claude Code's system prompt alone is ~33K
tokens, so 16K was insufficient. 40K allows Claude Code to connect with headroom.
Use wireguard_gateway_hostname (hyperstack.wg1) instead of raw 192.168.3.1 IP for all
connection URLs (tests, ready message, dry-run output). The hostname is derived from the
wg interface name and resolves via /etc/hosts.
Fix test max_tokens: raised from 50 to 500 so reasoning models (e.g. gpt-oss) have
enough tokens to complete chain-of-thought before producing content.
Fix qwen25-coder-32b max_model_len: model config has max_position_embeddings=32768,
not 128K as assumed. Using 65536 caused a vLLM pydantic validation error.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
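The qwen25-coder-32b fix suggests checking a preset's max_model_len against the model's own config before starting vLLM. A minimal sketch of such a pre-flight check follows. The method name is hypothetical and the check is not part of hyperstack.rb as far as the log shows. Only the max_position_embeddings field and the two values (32768, 65536) come from the message above.

```ruby
require "json"

# Hypothetical pre-flight check: reject a max_model_len larger than the
# max_position_embeddings declared in the model's config.json.
def check_max_model_len(config_json, requested)
  limit = JSON.parse(config_json).fetch("max_position_embeddings")
  return requested if requested <= limit

  raise ArgumentError,
        "max_model_len #{requested} exceeds max_position_embeddings #{limit}"
end

config = '{"max_position_embeddings": 32768}'
puts check_max_model_len(config, 32_768)
begin
  check_max_model_len(config, 65_536)
rescue ArgumentError => e
  puts e.message
end
```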
|
|
Replace cyankiwi/Llama-3_3-Nemotron-Super-49B-v1_5-AWQ-4bit with
casperhansen/llama-3.3-70b-instruct-awq for the nemotron-super preset.
The NAS model's config.json has num_key_value_heads=null by design for
its heterogeneous per-layer attention architecture, which is incompatible
with vLLM's pydantic ModelConfig validation (requires int). No working
AWQ quant for this architecture exists; Llama-3.3-70B-Instruct AWQ is
a proven drop-in for the extended-analysis use case.
Also fix test_vllm to use the model reported by /v1/models instead of
the static config default, so tests pass after a model switch.
Add trust_remote_code support to vllm_install_script for future models
that require custom HuggingFace model code.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
- New [vllm.presets.*] TOML section with two presets:
qwen3-coder-next bullpoint/Qwen3-Coder-Next-AWQ-4bit (256k ctx, coding)
nemotron-super solidrust/Llama-3.3-Nemotron-Super-49B-v1-AWQ (131k ctx, analysis)
- New CLI subcommand: `model list` — show presets, mark the active one
- New CLI subcommand: `model switch PRESET [--dry-run]` — switch the
running VM to a different preset without redeploying:
1. stops old Docker container (if container_name differs)
2. starts new container and waits for model readiness
3. hot-reloads LiteLLM config via litellm_reload_script (no venv reinstall)
4. updates state file with new vllm_model / vllm_container_name / vllm_preset
- New `create --model PRESET` flag — deploy with a non-default preset
- vllm_install_script and litellm_install_script now accept preset_config: and
model_override: keyword arguments so callers can override individual fields
without duplicating the full config
- State file now tracks vllm_container_name and vllm_preset for clean
container lifecycle management across switches
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
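The `model switch` sequence listed above can be sketched as a pure planning function that returns the steps to run. The step symbols, hash keys other than vllm_container_name, and the container names are hypothetical. Only the ordering and the "stop only if container_name differs" rule come from the message.

```ruby
# Hypothetical planner mirroring the four switch steps; nothing is executed.
def switch_plan(state, preset)
  steps = []
  # 1. Stop the old container only when the container name changes.
  if state[:vllm_container_name] != preset[:container_name]
    steps << [:stop_container, state[:vllm_container_name]]
  end
  # 2. Start the new container and wait for the model to become ready.
  steps << [:start_container, preset[:container_name]]
  steps << [:wait_for_model, preset[:model]]
  # 3. Hot-reload the LiteLLM config for the new model.
  steps << [:reload_litellm, preset[:model]]
  # 4. Record the new preset in the state file.
  steps << [:update_state, preset[:name]]
  steps
end

state  = { vllm_container_name: "vllm-qwen" }  # placeholder container name
preset = { name: "nemotron-super",
           container_name: "vllm-nemotron",    # placeholder container name
           model: "solidrust/Llama-3.3-Nemotron-Super-49B-v1-AWQ" }
switch_plan(state, preset).each { |step| p step }
```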
|
|
- Replace Ollama (disabled by default) with vLLM Docker container +
LiteLLM Anthropic-API proxy as the default inference backend
- vLLM setup: pulls vllm/vllm-openai, starts container on port 11434,
polls until model is loaded (up to 10 min for first 45 GB download)
- LiteLLM setup: installs in Python venv, writes config mapping Claude
model aliases to the vLLM model, runs as a systemd service on port 4000
- New CLI flags on `create`: --vllm/--no-vllm, --ollama/--no-ollama to
override config at runtime
- New `test` command: end-to-end inference test over WireGuard against
vLLM (/v1/models + /v1/chat/completions) and LiteLLM (/v1/messages)
- UFW rules now open both port 11434 (inference) and 4000 (LiteLLM)
from the WireGuard subnet
- Rename hyperstack_vm.rb → hyperstack.rb
- Add README.md with quickstart, Claude Code / OpenCode usage, CLI
reference, monitoring commands, and VRAM sizing notes
- Add vllm-setup.txt: detailed manual setup notes and architecture docs
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
|
retries, apt lock waits, and model verification