| field | value | date |
|---|---|---|
| author | Paul Buetow <paul@buetow.org> | 2026-03-21 14:56:38 +0200 |
| committer | Paul Buetow <paul@buetow.org> | 2026-03-21 14:56:38 +0200 |
| commit | 1acb65324e8d7d520483535843a753757b0dd4a0 | |
| tree | fccaf3d2e10144489e4a480c3e70523133238fb6 /hyperstack-vm.toml | |
| parent | 535f0adad3b0237b7d465bbd8231d5559c7be7d9 | |
Fix Nemotron OOM; add VM lifecycle fish abbrs; document automated setup
- hyperstack-vm1/vm2.toml: reduce nemotron-super max_model_len 262144→131072
  and add --enforce-eager to disable CUDA graph capture (~3-4 GB overhead).
  Nemotron 120B weights (~60 GB) leave too little VRAM headroom for KV cache
  allocation and CUDA graph buffers at 262K context on a single A100 80GB.
  131K context with eager mode is stable. README VRAM table updated to match.
- hyperstack.fish: add hyperstack-create/delete/test and hyperstack-create/delete-both
  abbreviations for VM lifecycle management alongside the existing pi-* aliases.
- README.md: add "Automated setup reference" section with single-VM and two-VM
  command flows before the manual vLLM Docker setup section.
End-to-end tested: single VM (GPT-OSS 120B), dual VM (Nemotron + Qwen3-Coder),
pi queries on all three models — all passed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diffstat (limited to 'hyperstack-vm.toml')
0 files changed, 0 insertions, 0 deletions
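
The VRAM reasoning in the commit message can be sketched as rough arithmetic. This is only an illustration using the approximate figures quoted above (80 GB card, ~60 GB weights, ~3-4 GB CUDA graph overhead); the function name `kv_headroom_gb` is hypothetical, and the estimate ignores activations, allocator fragmentation, and any memory utilization cap the server applies, so real headroom will be smaller.

```python
def kv_headroom_gb(total_gb: float, weights_gb: float, graph_overhead_gb: float = 0.0) -> float:
    """Rough upper bound on the memory left for KV cache, in GB.

    Hypothetical helper: subtracts model weights and CUDA graph buffers
    from total device memory and ignores every other consumer.
    """
    return total_gb - weights_gb - graph_overhead_gb


# Figures taken from the commit message (all approximate).
A100_GB = 80.0
NEMOTRON_WEIGHTS_GB = 60.0
CUDA_GRAPH_OVERHEAD_GB = 4.0  # upper end of the quoted ~3-4 GB

with_graphs = kv_headroom_gb(A100_GB, NEMOTRON_WEIGHTS_GB, CUDA_GRAPH_OVERHEAD_GB)
eager = kv_headroom_gb(A100_GB, NEMOTRON_WEIGHTS_GB)  # --enforce-eager: no graph buffers

print(f"headroom with CUDA graphs: {with_graphs:.1f} GB")
print(f"headroom in eager mode:    {eager:.1f} GB")
```

Under these assumptions, eager mode frees the graph buffers for KV cache, and halving the context length halves the worst-case KV cache a single full-length sequence can demand.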
