summaryrefslogtreecommitdiff
path: root/hyperstack-vm1.toml
AgeCommit message (Collapse)Author
2026-05-24chore(config): remove gpt-oss-120b references since qwen3.6 is betterPaul Buetow
2026-05-24replace qwen3-coder-next with qwen3.6-27b across configs, docs, and toolingPaul Buetow
2026-05-24feat(cli): replace --config with --vm 1|2|both, remove create-both/delete-bothPaul Buetow
- Drop single-VM default hyperstack-vm.toml and @config_path/@config_explicit machinery - Add global --vm flag (default: 1) mapping to hyperstack-vm1.toml and/or hyperstack-vm2.toml - Fold create-both and delete-both into create/delete --vm both - Teach status, watch, test, model to accept --vm (default: 1) - Update help text and README/AGENTS/fish abbreviations accordingly
2026-04-11Rename VM1 configs: default hyperstack-vm1.toml, Nemotron in -nemotronPaul Buetow
Move the former hyperstack-vm1-coder.toml to hyperstack-vm1.toml as the standard VM1 profile (Qwen3-Coder-Next on single GPU). Preserve the dual-H100 Nemotron-3-Super stack as hyperstack-vm1-nemotron.toml. Point create-both at hyperstack-vm1.toml and refresh README for current defaults. Made-with: Cursor
2026-03-24gpt-oss-120b: enable reasoning via openai_gptoss parserPaul Buetow
- Add --reasoning-parser openai_gptoss to gpt-oss-120b vLLM config in all three toml files; extracts <|channel|>analysis thinking blocks into reasoning_content in API responses - Mark gpt-oss-120b as reasoning: true in pi/agent/models.json for all three providers (hyperstack, hyperstack1, hyperstack2) - Update vm1 state file Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22Upgrade VM1 to H100x2 with 1M context for Nemotron-3-SuperPaul Buetow
Switch VM1 from n3-H100x1 to n3-H100x2 to run Nemotron-3-Super with 1M token context window via tensor parallelism. The dual-GPU setup (160 GB total VRAM) provides enough KV cache headroom to override the model's config.json limit of 262144 tokens. Key changes: - flavor_name: n3-H100x1 → n3-H100x2 - tensor_parallel_size: 1 → 2 - max_model_len: 131072 → 1048576 (with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1) - gpu_memory_utilization: 0.92 → 0.85 (headroom for Mamba cache + sampler warmup) - Remove --enforce-eager: no longer needed with dual-GPU VRAM budget - Disable prefix caching: on NemotronH it forces Mamba "all" cache mode which pre-allocates states for all max_num_seqs and OOMs before the sampler warmup pass; per-request allocation is cheaper at startup Add two new vllm config fields to hyperstack.rb: - extra_docker_env: passes -e KEY=VALUE flags to Docker before the image name (used for VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and PYTORCH_ALLOC_CONF=expandable_segments:True) - enable_prefix_caching: makes --enable-prefix-caching conditional (default true for backward compat; false for NemotronH) Both fields are supported in [vllm] defaults and [vllm.presets.*] overrides with the same fallback semantics as existing fields. Update pi/agent/models.json: Nemotron vm1 entry renamed to "Nemotron 3 Super 120B 1M [vm1]" with contextWindow 1048576. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21Fix nemotron-3-super vLLM OOM: cap context and add --enforce-eagerPaul Buetow
The [vllm] defaults had max_model_len=262144 without --enforce-eager, causing the vLLM container to OOM on startup (CUDA graph capture costs ~3-4 GB on top of ~60 GB nemotron weights on the A100 80GB). Also switch flavor to n3-H100x1 since n3-A100x1 is out of stock in CANADA-1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21Fix Nemotron OOM; add VM lifecycle fish abbrs; document automated setupPaul Buetow
- hyperstack-vm1/vm2.toml: reduce nemotron-super max_model_len 262144→131072 and add --enforce-eager to disable CUDA graph capture (~3-4 GB overhead). Nemotron 120B weights (~60 GB) leave too little VRAM headroom for KV cache allocation and CUDA graph buffers at 262K context on a single A100 80GB. 131K context with eager mode is stable. README VRAM table updated to match. - hyperstack.fish: add hyperstack-create/delete/test and hyperstack-create/delete-both abbreviations for VM lifecycle management alongside the existing pi-* aliases. - README.md: add "Automated setup reference" section with single-VM and two-VM command flows before the manual vLLM Docker setup section. End-to-end tested: single VM (GPT-OSS 120B), dual VM (Nemotron + Qwen3-Coder), pi queries on all three models — all passed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21Remove LiteLLM and Claude Code repo references (task 301)Paul Buetow
2026-03-21initial importPaul Buetow