PLAN-L40.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157

# Plan: VM1 on Hyperstack L40 with Qwen3.6 MoE + TurboQuant

**Prepared:** 2026-05-24  
**Scope:** Research and planning only — no code changes, no provisioning.

---

## 1. GPU and VM sizing (Hyperstack L40)

| Item | Assessment |
|---|---|
| **Flavor** | Hyperstack’s GPU flavors use the `n3-*` prefix (see current `n3-A100x1` / `n3-H100x1`). The L40 48 GB flavor is expected to be named `n3-L40x1` or `n3-L40Sx1`; exact string must be verified via the Hyperstack console/API before updating `hyperstack-vm1.toml`. |
| **VRAM** | 48 GB (vs 80 GB on the current A100). That is a hard ceiling for both model weights and KV cache. |
| **Cost** | L40/L40S nodes are generally cheaper than A100/H100 on Hyperstack. Assuming the tiered pricing model, an L40 should reduce the hourly cost of VM1, but the final price depends on the exact `flavor_name` and any egress charges. |

## 2. Model choice: what actually fits on 48 GB

The prompt mentions **Qwen3.6 MoE (e.g. 235B-A22B)**. A 235B-parameter model in BF16 would require **> 400 GB** of VRAM, which is impossible on a single L40. The only Qwen3.6 MoE that is publicly released and could *potentially* fit is **Qwen3.6-35B-A3B** (35B total / 3B active), but even that is **~70 GB in BF16**.

**Realistic options to make it fit in 48 GB:**

| Option | Weight size (est.) | Fit on 48 GB? | Notes |
|---|---|---|---|
| **AWQ 4-bit** Qwen3.6-35B-A3B | ~18 GB | Yes | Needs a community or official AWQ checkpoint (not yet listed as official at the time of writing, but AWQ/GPTQ variants usually appear quickly). |
| **FP8** Qwen3.6-35B-A3B (if available) | ~35 GB | Tight | Leaves ~10 GB for KV cache, activations and CUDA graphs. vLLM profiling may tip it over. |
| **Qwen3.6-27B dense** (current VM2 default) | ~27 GB FP8 | Yes | Not MoE; defeats the purpose of the task. |

**Recommendation:** Target an **AWQ 4-bit (or GPTQ 4-bit) Qwen3.6-35B-A3B** checkpoint, or wait for an official **FP8** checkpoint and accept a reduced `max_model_len`. Do not attempt the 235B-A22B variant on a single L40.

## 3. vLLM + TurboQuant compatibility

TurboQuant is a KV-cache compression backend in vLLM. Key upstream state:

- **PR #39931** (merged 2026-05-05) added TurboQuant support for *hybrid* architectures (attention + Mamba/MoE).
- **Issue #41726** reports a fatal crash during **chunked continuation prefill** on hybrid MoE models (e.g. Qwen3.5-9B NVFP4). Root cause: TurboQuant’s `_continuation_prefill` path requests workspace memory that was not reserved during warmup.
- **PR #40798** is open as a candidate fix but **not yet merged**.

**Implications for Qwen3.6-35B-A3B:**
- Because Qwen3.6 uses a hybrid attention+Mamba architecture, it is in the exact class of models affected by #41726.
- If TurboQuant is enabled (`--kv-cache-dtype turboquant_k8v4`, `--kv-cache-dtype turboquant_4bit_nc`, etc.), any long prompt that crosses a chunked-prefill boundary will likely trigger:
  ```
  AssertionError: Workspace is locked but allocation ... requires X MB, current size is Y MB.
  ```

**Mitigations available today:**
1. **Disable chunked prefill:** Pass `--no-enable-chunked-prefill` in `extra_vllm_args`. This avoids the `_continuation_prefill` path entirely. Trade-off: large prefills are no longer split into chunks, which can increase latency for long inputs and may OOM if a single prefill is very large.
2. **Use `--enforce-eager`:** Disables CUDA graph capture, which slightly changes memory layout but does **not** solve the workspace lock issue by itself. It is useful mainly to save a few GB of VRAM on tight GPUs.
3. **Wait for PR #40798** to merge and land in a stable vLLM image.

## 4. Recommended `hyperstack-vm1.toml` changes (conceptual)

```toml
[vm]
# Verify exact flavor string with Hyperstack API before deploying.
flavor_name = "n3-L40x1"          # or n3-L40Sx1
labels = ["qwen36-moe", "wireguard"]

[vllm]
install = true
model = "Qwen/Qwen3.6-35B-A3B-AWQ"   # or the best available quantized MoE
container_name = "vllm_qwen36_moe"
max_model_len = 65536                  # conservative for 48 GB; can raise if AWQ
gpu_memory_utilization = 0.92
tensor_parallel_size = 1
tool_call_parser = "qwen3_coder"

# TurboQuant KV cache on a hybrid MoE
extra_vllm_args = [
  "--reasoning-parser", "qwen3",
  "--kv-cache-dtype", "turboquant_k8v4",
  "--no-enable-chunked-prefill"        # mitigation for issue #41726
]

# Nightly image post-PR-39931 is required; pin to a known-good digest until 0.20.2+
docker_image = "vllm/vllm-openai:nightly"
```

**VRAM estimate (AWQ 4-bit + TurboQuant K8V4 on L40 48 GB):**

| Consumer | Est. size |
|---|---|
| AWQ weights (35B params @ 4-bit) | ~18 GB |
| Activations / MoE routing / logits | ~4–6 GB |
| CUDA graphs (if not eager) | ~2 GB |
| KV cache (TurboQuant) | ~20–24 GB |
| **Headroom** | **~0–4 GB** |

Because headroom is thin, `gpu_memory_utilization=0.92` is appropriate. If profiling OOMs, raise it to `0.95` or drop `max_model_len`. If vLLM still OOMs during startup, try `--enforce-eager` to reclaim the CUDA-graph memory.

## 5. CLI and WireGuard implications

| Area | Impact |
|---|---|
| `--vm 1 / 2 / both` | No structural changes. The CLI already resolves `hyperstack-vm1.toml` independently via its own state file. Switching the flavor/model is transparent to `--vm 2`. |
| WireGuard | `wireguard_server_ip = "192.168.3.1"` stays the same. Recreating VM1 yields a new public IP, so the local `wg1.conf` peer endpoint must be refreshed (`ruby hyperstack.rb --vm 1 create` already handles this via `wg1-setup.sh`). The tunnel subnet `192.168.3.0/24` is unchanged. |
| Port 11434 / firewall | Unchanged. Port 56710 UDP and 22 TCP remain locked to `allowed_wireguard_cidrs` / `allowed_ssh_cidrs`. |
| Dual-VM routing | The client can continue to round-robin or fallback between `192.168.3.1` (VM1, MoE) and `192.168.3.3` (VM2, dense). No code changes needed. |

## 6. Risks

| Risk | Severity | Mitigation |
|---|---|---|
| **TurboQuant crash (#41726)** on hybrid MoE | High | Disable chunked prefill now; migrate to fixed vLLM nightly once PR #40798 lands. |
| **Model does not fit** in 48 GB if no AWQ/FP8 checkpoint exists | High | Confirm a 4-bit or FP8 checkpoint is on HuggingFace before provisioning. Fallback to Qwen3.6-27B dense (moves goalposts). |
| **Performance regression** from no chunked prefill | Medium | Expect higher TTFB on long prompts. Monitor with `ruby hyperstack.rb --vm 1 test`. |
| **Flavor unavailability** | Medium | Have a fallback flavor ready (e.g. `n3-A100x1` on VM1 if L40 is sold out), or accept A100 pricing. |
| **Nightly Docker image instability** | Medium | Pin to a specific digest (`vllm/vllm-openai@sha256:...`) after first successful smoke test. |

## 7. Step-by-step migration plan (if you decide to proceed)

1. **Verify asset availability**
   - Confirm Hyperstack offers an L40 flavor and note its exact name.
   - Locate a Qwen3.6-35B-A3B AWQ/FP8 checkpoint on HuggingFace. If none exists, abort or pivot to the dense 27B.

2. **Snapshot / backup**
   - Ensure VM2 (A100 dense) is stable and passing tests (`ruby hyperstack.rb --vm 2 test`).
   - Save current VM1 state file as `.hyperstack-vm1-state.json.bak` in case a fast rollback is needed.

3. **Update configuration**
   - Edit `hyperstack-vm1.toml`:
     - `flavor_name` → L40 flavor.
     - `[vllm]` block → new model ID, container name, conservative `max_model_len`.
     - Add `docker_image = "vllm/vllm-openai:nightly"` (or a pinned digest).
     - Add TurboQuant arg and chunked-prefill mitigation to `extra_vllm_args`.
   - Update `[vm] labels` to reflect the new model.

4. **Provision**
   ```bash
   ruby hyperstack.rb --vm 1 create --replace
   ```
   The `--replace` flag tears down the old A100 VM1 and rebuilds it on L40.

5. **Post-create validation**
   - Check WireGuard handshake: `sudo wg show wg1 latest-handshakes`.
   - Ping tunnel IP: `ping -c 3 192.168.3.1`.
   - Query vLLM: `curl -s http://192.168.3.1:11434/v1/models`.
   - Run the automated test suite: `ruby hyperstack.rb --vm 1 test`.

6. **Smoke test for TurboQuant stability**
   - Send a conversation with a very long system prompt (> 4096 tokens) and tool schemas to force a chunked-prefill boundary.
   - If the engine crashes with the workspace assertion, apply the fallback:
     - Add `--enforce-eager` to `extra_vllm_args`, or
     - Fall back to `--kv-cache-dtype fp8` (loses TurboQuant compression but is stable).

7. **Dual-VM confirmation**
   - Run `ruby hyperstack.rb --vm both test` to ensure both endpoints are healthy and reachable through the WireGuard tunnel.

8. **Monitor and iterate**
   - Watch VRAM usage with `nvidia-smi` inside the VM.
   - Adjust `max_model_len` and `gpu_memory_utilization` as needed.
   - Once upstream PR #40798 merges, rebuild the Docker image with the fixed vLLM version and re-enable chunked prefill.

---

## Bottom line

The L40 is a cost-efficient target *if* a quantized Qwen3.6-35B-A3B checkpoint is available. The biggest blocker is the open vLLM issue #41726 (TurboQuant + hybrid MoE crash on chunked prefill). Disabling chunked prefill is a viable short-term workaround, but it comes with a latency trade-off and must be validated before making VM1 the default endpoint.