# Plan: VM1 on Hyperstack L40 with Qwen3.6 MoE + TurboQuant **Prepared:** 2026-05-24 **Scope:** Research and planning only — no code changes, no provisioning. --- ## 1. GPU and VM sizing (Hyperstack L40) | Item | Assessment | |---|---| | **Flavor** | Hyperstack’s GPU flavors use the `n3-*` prefix (see current `n3-A100x1` / `n3-H100x1`). The L40 48 GB flavor is expected to be named `n3-L40x1` or `n3-L40Sx1`; exact string must be verified via the Hyperstack console/API before updating `hyperstack-vm1.toml`. | | **VRAM** | 48 GB (vs 80 GB on the current A100). That is a hard ceiling for both model weights and KV cache. | | **Cost** | L40/L40S nodes are generally cheaper than A100/H100 on Hyperstack. Assuming the tiered pricing model, an L40 should reduce the hourly cost of VM1, but the final price depends on the exact `flavor_name` and any egress charges. | ## 2. Model choice: what actually fits on 48 GB The prompt mentions **Qwen3.6 MoE (e.g. 235B-A22B)**. A 235B-parameter model in BF16 would require **> 400 GB** of VRAM, which is impossible on a single L40. The only Qwen3.6 MoE that is publicly released and could *potentially* fit is **Qwen3.6-35B-A3B** (35B total / 3B active), but even that is **~70 GB in BF16**. **Realistic options to make it fit in 48 GB:** | Option | Weight size (est.) | Fit on 48 GB? | Notes | |---|---|---|---| | **AWQ 4-bit** Qwen3.6-35B-A3B | ~18 GB | Yes | Needs a community or official AWQ checkpoint (not yet listed as official at the time of writing, but AWQ/GPTQ variants usually appear quickly). | | **FP8** Qwen3.6-35B-A3B (if available) | ~35 GB | Tight | Leaves ~10 GB for KV cache, activations and CUDA graphs. vLLM profiling may tip it over. | | **Qwen3.6-27B dense** (current VM2 default) | ~27 GB FP8 | Yes | Not MoE; defeats the purpose of the task. | **Recommendation:** Target an **AWQ 4-bit (or GPTQ 4-bit) Qwen3.6-35B-A3B** checkpoint, or wait for an official **FP8** checkpoint and accept a reduced `max_model_len`. Do not attempt the 235B-A22B variant on a single L40. ## 3. vLLM + TurboQuant compatibility TurboQuant is a KV-cache compression backend in vLLM. Key upstream state: - **PR #39931** (merged 2026-05-05) added TurboQuant support for *hybrid* architectures (attention + Mamba/MoE). - **Issue #41726** reports a fatal crash during **chunked continuation prefill** on hybrid MoE models (e.g. Qwen3.5-9B NVFP4). Root cause: TurboQuant’s `_continuation_prefill` path requests workspace memory that was not reserved during warmup. - **PR #40798** is open as a candidate fix but **not yet merged**. **Implications for Qwen3.6-35B-A3B:** - Because Qwen3.6 uses a hybrid attention+Mamba architecture, it is in the exact class of models affected by #41726. - If TurboQuant is enabled (`--kv-cache-dtype turboquant_k8v4`, `--kv-cache-dtype turboquant_4bit_nc`, etc.), any long prompt that crosses a chunked-prefill boundary will likely trigger: ``` AssertionError: Workspace is locked but allocation ... requires X MB, current size is Y MB. ``` **Mitigations available today:** 1. **Disable chunked prefill:** Pass `--no-enable-chunked-prefill` in `extra_vllm_args`. This avoids the `_continuation_prefill` path entirely. Trade-off: large prefills are no longer split into chunks, which can increase latency for long inputs and may OOM if a single prefill is very large. 2. **Use `--enforce-eager`:** Disables CUDA graph capture, which slightly changes memory layout but does **not** solve the workspace lock issue by itself. It is useful mainly to save a few GB of VRAM on tight GPUs. 3. **Wait for PR #40798** to merge and land in a stable vLLM image. ## 4. Recommended `hyperstack-vm1.toml` changes (conceptual) ```toml [vm] # Verify exact flavor string with Hyperstack API before deploying. flavor_name = "n3-L40x1" # or n3-L40Sx1 labels = ["qwen36-moe", "wireguard"] [vllm] install = true model = "Qwen/Qwen3.6-35B-A3B-AWQ" # or the best available quantized MoE container_name = "vllm_qwen36_moe" max_model_len = 65536 # conservative for 48 GB; can raise if AWQ gpu_memory_utilization = 0.92 tensor_parallel_size = 1 tool_call_parser = "qwen3_coder" # TurboQuant KV cache on a hybrid MoE extra_vllm_args = [ "--reasoning-parser", "qwen3", "--kv-cache-dtype", "turboquant_k8v4", "--no-enable-chunked-prefill" # mitigation for issue #41726 ] # Nightly image post-PR-39931 is required; pin to a known-good digest until 0.20.2+ docker_image = "vllm/vllm-openai:nightly" ``` **VRAM estimate (AWQ 4-bit + TurboQuant K8V4 on L40 48 GB):** | Consumer | Est. size | |---|---| | AWQ weights (35B params @ 4-bit) | ~18 GB | | Activations / MoE routing / logits | ~4–6 GB | | CUDA graphs (if not eager) | ~2 GB | | KV cache (TurboQuant) | ~20–24 GB | | **Headroom** | **~0–4 GB** | Because headroom is thin, `gpu_memory_utilization=0.92` is appropriate. If profiling OOMs, raise it to `0.95` or drop `max_model_len`. If vLLM still OOMs during startup, try `--enforce-eager` to reclaim the CUDA-graph memory. ## 5. CLI and WireGuard implications | Area | Impact | |---|---| | `--vm 1 / 2 / both` | No structural changes. The CLI already resolves `hyperstack-vm1.toml` independently via its own state file. Switching the flavor/model is transparent to `--vm 2`. | | WireGuard | `wireguard_server_ip = "192.168.3.1"` stays the same. Recreating VM1 yields a new public IP, so the local `wg1.conf` peer endpoint must be refreshed (`ruby hyperstack.rb --vm 1 create` already handles this via `wg1-setup.sh`). The tunnel subnet `192.168.3.0/24` is unchanged. | | Port 11434 / firewall | Unchanged. Port 56710 UDP and 22 TCP remain locked to `allowed_wireguard_cidrs` / `allowed_ssh_cidrs`. | | Dual-VM routing | The client can continue to round-robin or fallback between `192.168.3.1` (VM1, MoE) and `192.168.3.3` (VM2, dense). No code changes needed. | ## 6. Risks | Risk | Severity | Mitigation | |---|---|---| | **TurboQuant crash (#41726)** on hybrid MoE | High | Disable chunked prefill now; migrate to fixed vLLM nightly once PR #40798 lands. | | **Model does not fit** in 48 GB if no AWQ/FP8 checkpoint exists | High | Confirm a 4-bit or FP8 checkpoint is on HuggingFace before provisioning. Fallback to Qwen3.6-27B dense (moves goalposts). | | **Performance regression** from no chunked prefill | Medium | Expect higher TTFB on long prompts. Monitor with `ruby hyperstack.rb --vm 1 test`. | | **Flavor unavailability** | Medium | Have a fallback flavor ready (e.g. `n3-A100x1` on VM1 if L40 is sold out), or accept A100 pricing. | | **Nightly Docker image instability** | Medium | Pin to a specific digest (`vllm/vllm-openai@sha256:...`) after first successful smoke test. | ## 7. Step-by-step migration plan (if you decide to proceed) 1. **Verify asset availability** - Confirm Hyperstack offers an L40 flavor and note its exact name. - Locate a Qwen3.6-35B-A3B AWQ/FP8 checkpoint on HuggingFace. If none exists, abort or pivot to the dense 27B. 2. **Snapshot / backup** - Ensure VM2 (A100 dense) is stable and passing tests (`ruby hyperstack.rb --vm 2 test`). - Save current VM1 state file as `.hyperstack-vm1-state.json.bak` in case a fast rollback is needed. 3. **Update configuration** - Edit `hyperstack-vm1.toml`: - `flavor_name` → L40 flavor. - `[vllm]` block → new model ID, container name, conservative `max_model_len`. - Add `docker_image = "vllm/vllm-openai:nightly"` (or a pinned digest). - Add TurboQuant arg and chunked-prefill mitigation to `extra_vllm_args`. - Update `[vm] labels` to reflect the new model. 4. **Provision** ```bash ruby hyperstack.rb --vm 1 create --replace ``` The `--replace` flag tears down the old A100 VM1 and rebuilds it on L40. 5. **Post-create validation** - Check WireGuard handshake: `sudo wg show wg1 latest-handshakes`. - Ping tunnel IP: `ping -c 3 192.168.3.1`. - Query vLLM: `curl -s http://192.168.3.1:11434/v1/models`. - Run the automated test suite: `ruby hyperstack.rb --vm 1 test`. 6. **Smoke test for TurboQuant stability** - Send a conversation with a very long system prompt (> 4096 tokens) and tool schemas to force a chunked-prefill boundary. - If the engine crashes with the workspace assertion, apply the fallback: - Add `--enforce-eager` to `extra_vllm_args`, or - Fall back to `--kv-cache-dtype fp8` (loses TurboQuant compression but is stable). 7. **Dual-VM confirmation** - Run `ruby hyperstack.rb --vm both test` to ensure both endpoints are healthy and reachable through the WireGuard tunnel. 8. **Monitor and iterate** - Watch VRAM usage with `nvidia-smi` inside the VM. - Adjust `max_model_len` and `gpu_memory_utilization` as needed. - Once upstream PR #40798 merges, rebuild the Docker image with the fixed vLLM version and re-enable chunked prefill. --- ## Bottom line The L40 is a cost-efficient target *if* a quantized Qwen3.6-35B-A3B checkpoint is available. The biggest blocker is the open vLLM issue #41726 (TurboQuant + hybrid MoE crash on chunked prefill). Disabling chunked prefill is a viable short-term workaround, but it comes with a latency trade-off and must be validated before making VM1 the default endpoint.