author:    Paul Buetow <paul@buetow.org>, 2026-05-24 14:02:34 +0300
committer: Paul Buetow <paul@buetow.org>, 2026-05-24 14:02:34 +0300
commit:    c8bd4d1e7a34ebf452d3d6c843d5cef785abe608
tree:      ec1e6c19379c3ba86f6d80d90286eceae393b983
parent:    f16f4b753b3bf317e6da79f479ff5f506ed34b47
replace qwen3-coder-next with qwen3.6-27b across configs, docs, and tooling
-rw-r--r--  AGENTS.md                                            4
-rw-r--r--  README.md                                           32
-rw-r--r--  hyperstack-vm1.toml                                 21
-rw-r--r--  hyperstack-vm2.toml                                 10
-rw-r--r--  hypr.fish                                            2
-rw-r--r--  lib/hyperstack/config.rb                             4
-rw-r--r--  logo.svg                                             2
-rw-r--r--  pi/agent/extensions/nemotron-tool-repair/index.ts    2
-rw-r--r--  pi/agent/models.json                               353
9 files changed, 321 insertions, 109 deletions
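The `hyperstack-vm1.toml` hunk in this commit carries the comment that each preset "overrides the matching [vllm] field; unset fields fall back to [vllm] defaults". The code that implements this in `lib/hyperstack/config.rb` is not part of the diff, so the following is only a minimal Python sketch of that merge rule, not the project's Ruby implementation. The function name `resolve_preset` and the dictionary layout are assumptions made for illustration; the values are copied from the TOML hunk.

```python
from typing import Any


def resolve_preset(vllm: dict[str, Any], name: str) -> dict[str, Any]:
    """Return the effective vLLM settings for a named preset (hypothetical helper)."""
    presets = vllm.get("presets", {})
    if name not in presets:
        raise KeyError(f"unknown preset: {name}")
    # Start from the [vllm] defaults, leaving out the nested presets table.
    effective = {key: value for key, value in vllm.items() if key != "presets"}
    # Fields set in the preset win; fields it leaves unset keep the defaults.
    effective.update(presets[name])
    return effective


# Values taken from the hyperstack-vm1.toml hunk in this commit.
vllm_section = {
    "install": True,
    "model": "Qwen/Qwen3.6-27B-FP8",
    "hug_cache_dir": "/ephemeral/hug",
    "container_name": "vllm_qwen36_27b",
    "max_model_len": 262144,
    "gpu_memory_utilization": 0.92,
    "tensor_parallel_size": 1,
    "tool_call_parser": "qwen3_coder",
    "presets": {
        "qwen36-27b": {
            "model": "Qwen/Qwen3.6-27B-FP8",
            "container_name": "vllm_qwen36_27b",
            "max_model_len": 262144,
            "gpu_memory_utilization": 0.92,
            "tensor_parallel_size": 1,
            "tool_call_parser": "qwen3_coder",
            "extra_vllm_args": ["--reasoning-parser", "qwen3"],
        },
    },
}

effective = resolve_preset(vllm_section, "qwen36-27b")
print(effective["hug_cache_dir"])    # inherited from the [vllm] defaults
print(effective["extra_vllm_args"])  # set only by the preset
```

Under this reading, `hug_cache_dir` never needs repeating in a preset, while `extra_vllm_args` applies only when the `qwen36-27b` preset is active.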
diff --git a/AGENTS.md b/AGENTS.md
index ced462c..f7f3491 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -168,7 +168,7 @@ inference is ready. On an A100 with a warm HuggingFace cache:
**Monitor startup:**
```bash
-ssh ubuntu@<vm-public-ip> 'sudo docker logs -f vllm_qwen3 2>&1' \
+ssh ubuntu@<vm-public-ip> 'sudo docker logs -f vllm_qwen36_27b 2>&1' \
| grep -E "startup complete|Error|Loading|Downloading"
```
@@ -176,7 +176,7 @@ After `Application startup complete.`, the model responds immediately.
If the container crashes before that line, check for CUDA errors:
```bash
-ssh ubuntu@<vm-public-ip> 'sudo docker logs vllm_qwen3 2>&1 | grep -i "error\|cuda"'
+ssh ubuntu@<vm-public-ip> 'sudo docker logs vllm_qwen36_27b 2>&1 | grep -i "error\|cuda"'
```
A `CUDA error: operation not permitted` on the first engine process (pid visible in
diff --git a/README.md b/README.md
index 0c0df1b..cdb4df4 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
│ │ │ │ pane 0: pi-coder │ pane 1: pi-gemma4 │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ │ Pi │ Pi │ │ │ │
- │ │ │ │ Qwen3-Coder-Next │ Gemma 4 31B │ │ │ │
+ │ │ │ │ Qwen3.6 27B FP8 │ Gemma 4 31B │ │ │ │
│ │ │ └──────────┬────────────┘└────────────┬───────────┘ │ │ │
│ │ │ │ OpenAI API │ OpenAI API │ │ │
│ │ │ │ /v1/chat/completions │ /v1/chat/completions│ │ │
@@ -45,7 +45,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
│ hyperstack1.wg1 │ │ hyperstack2.wg1 │
│ │ │ │
│ vLLM :11434 │ │ vLLM :11434 │
- │ Qwen3-Coder-Next │ │ Gemma 4 31B IT │
+ │ Qwen3.6 27B FP8 │ │ Gemma 4 31B IT │
│ (MoE, AWQ-4bit) │ │ (dense, AWQ-4bit) │
└──────────────────────────┘ └──────────────────────────┘
```
@@ -167,7 +167,7 @@ Source `hyperstack.fish` or copy the abbreviations into your Fish config:
```fish
abbr pi-hyperstack pi --model hyperstack/openai/gpt-oss-120b
-abbr pi-hyperstack-coder pi --model hyperstack1/bullpoint/Qwen3-Coder-Next-AWQ-4bit
+abbr pi-hyperstack-coder pi --model hyperstack1/Qwen/Qwen3.6-27B-FP8
abbr pi-hyperstack-qwen36 pi --model hyperstack2/Qwen/Qwen3.6-27B-FP8
abbr pi-hyperstack-gemma4 pi --model hyperstack2/cyankiwi/gemma-4-31B-it-AWQ-4bit
```
@@ -176,7 +176,7 @@ Then launch a session after the VM(s) are up:
```fish
pi-hyperstack # GPT-OSS 120B on VM1
-pi-hyperstack-coder # Qwen3-Coder-Next on VM1
+pi-hyperstack-coder # Qwen3.6 27B FP8 on VM1
pi-hyperstack-qwen36 # Qwen3.6 27B FP8 on VM2
pi-hyperstack-gemma4 # Gemma 4 31B on VM2
```
@@ -188,7 +188,7 @@ Three providers are defined, one per setup, each pointing at its vLLM endpoint o
| Provider | Base URL | Primary model |
|----------|----------|---------------|
| `hyperstack` | `http://hyperstack.wg1:11434/v1` | GPT-OSS 120B (single-VM) |
-| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Qwen3-Coder-Next (default; presets in TOML) |
+| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Qwen3.6 27B FP8 (default; presets in TOML) |
| `hyperstack2` | `http://hyperstack2.wg1:11434/v1` | Gemma 4 31B (default; presets in TOML) |
All model presets from the TOML configs are registered under each provider, so any
@@ -255,7 +255,7 @@ No API key or account required. Uses DuckDuckGo's free HTML endpoint.
| Config file | Default model | WireGuard IP | Hostname |
|---|---|---|---|
-| `hyperstack-vm1.toml` | Qwen3-Coder-Next (AWQ-4bit) | `192.168.3.1` | `hyperstack1.wg1` |
+| `hyperstack-vm1.toml` | Qwen3.6 27B FP8 | `192.168.3.1` | `hyperstack1.wg1` |
| `hyperstack-vm2.toml` | Gemma 4 31B IT (AWQ-4bit) | `192.168.3.3` | `hyperstack2.wg1` |
Each VM has independent state files so they can be managed separately:
@@ -270,8 +270,8 @@ ruby hyperstack.rb --vm 2 status
Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:
```bash
-ruby hyperstack.rb --vm 1 model switch qwen3-coder-next
-ruby hyperstack.rb --vm 2 model switch qwen3-coder-next
+ruby hyperstack.rb --vm 1 model switch qwen36-27b
+ruby hyperstack.rb --vm 2 model switch qwen36-27b
```
Available presets (both VMs share the same set):
@@ -280,7 +280,7 @@ Available presets (both VMs share the same set):
|---|---|---|---|
| `gemma4-31b` | Gemma 4 31B IT (AWQ-4bit) | ~19 GB | 32K–128K (see TOML) |
| `nemotron-super` | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
-| `qwen3-coder-next` | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
+| `qwen36-27b` | Qwen3.6 27B FP8 | ~45 GB | 262K |
| `gpt-oss-120b` | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
| `gpt-oss-20b` | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
| `qwen25-coder-32b` | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
@@ -349,7 +349,7 @@ ruby hyperstack.rb test --vm 1
ruby hyperstack.rb test --vm 2
# Launch Pi coding agents — one per terminal
-pi-hyperstack-coder # fish abbreviation → Qwen3-Coder-Next on VM1
+pi-hyperstack-coder # fish abbreviation → Qwen3.6 27B FP8 on VM1
pi-hyperstack-qwen36 # fish abbreviation → Qwen3.6 27B FP8 on VM2
pi-hyperstack-gemma4 # fish abbreviation → Gemma 4 31B on VM2
@@ -361,8 +361,8 @@ ruby hyperstack.rb delete --vm both
```bash
# Switch the running vLLM container to a different model preset
-ruby hyperstack.rb --vm 1 model switch qwen3-coder-next
-ruby hyperstack.rb --vm 2 model switch qwen3-coder-next
+ruby hyperstack.rb --vm 1 model switch qwen36-27b
+ruby hyperstack.rb --vm 2 model switch qwen36-27b
```
See the [VM configuration](#vm-configuration) and [Switching models](#switching-models)
@@ -403,7 +403,7 @@ docker run -d \
--restart always \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:latest \
- --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+ --model Qwen/Qwen3.6-27B-FP8 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
@@ -445,7 +445,7 @@ curl -s http://localhost:11434/v1/models | python3 -m json.tool
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
- -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+ -d '{"model":"Qwen/Qwen3.6-27B-FP8",
"messages":[{"role":"user","content":"Hello"}],
"max_tokens":50}'
```
@@ -600,9 +600,9 @@ Search HuggingFace for vLLM-compatible quantized models:
## Performance characteristics
-Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:
+Measured on A100 80 GB PCIe (single GPU) with Qwen3.6 27B FP8:
-| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
+| Metric | vLLM (FP8) | Ollama (Q4_K_M) |
|--------|-------------------|-----------------|
| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
| Decode throughput | 40–99 tok/s | ~40 tok/s |
diff --git a/hyperstack-vm1.toml b/hyperstack-vm1.toml
index c6fb2df..75c313c 100644
--- a/hyperstack-vm1.toml
+++ b/hyperstack-vm1.toml
@@ -13,13 +13,13 @@ name_prefix = "hyperstack1"
hostname = "hyperstack1"
environment_name = "snonux-ollama"
-# A100-80GB single GPU for qwen3-coder-next (default); H100 fallback if n3-A100x1 unavailable.
+# A100-80GB single GPU for Qwen3.6 27B (default); H100 fallback if n3-A100x1 unavailable.
flavor_name = "n3-A100x1"
image_name = "Ubuntu Server 24.04 LTS R570 CUDA 12.8 with Docker"
assign_floating_ip = true
create_bootable_volume = false
enable_port_randomization = false
-labels = ["qwen3-coder-next", "wireguard"]
+labels = ["qwen36-27b", "wireguard"]
[ssh]
username = "ubuntu"
@@ -55,16 +55,16 @@ listen_host = "0.0.0.0:11434"
gpu_overhead_mb = 2000
num_parallel = 1
context_length = 32768
-pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
+pull_models = ["qwen36-27b", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
# vLLM serves one model via Docker on the OpenAI-compatible API.
-# VM1 defaults to qwen3-coder-next; use 'model switch' to load any other preset.
+# VM1 defaults to Qwen3.6 27B; use 'model switch' to load any other preset.
[vllm]
install = true
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+model = "Qwen/Qwen3.6-27B-FP8"
# HuggingFace model cache on ephemeral NVMe (fast; survives reboots on most providers).
hug_cache_dir = "/ephemeral/hug"
-container_name = "vllm_qwen3"
+container_name = "vllm_qwen36_27b"
max_model_len = 262144
gpu_memory_utilization = 0.92
tensor_parallel_size = 1
@@ -73,13 +73,16 @@ tool_call_parser = "qwen3_coder"
# Named model presets for 'ruby hyperstack.rb --vm 1 model switch <name>'.
# Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
-[vllm.presets.qwen3-coder-next]
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-container_name = "vllm_qwen3"
+# Qwen3.6-27B FP8 — dense 27B multimodal model with native 262K context.
+# Uses qwen3 reasoning parsing plus qwen3_coder tool calling on vLLM >=0.19.0.
+[vllm.presets.qwen36-27b]
+model = "Qwen/Qwen3.6-27B-FP8"
+container_name = "vllm_qwen36_27b"
max_model_len = 262144
gpu_memory_utilization = 0.92
tensor_parallel_size = 1
tool_call_parser = "qwen3_coder"
+extra_vllm_args = ["--reasoning-parser", "qwen3"]
# NVIDIA Nemotron-3-Super-120B-A12B AWQ 4-bit — hybrid Mamba+MoE (12B active / 120B total).
# Single-GPU (A100-80GB) config: tensor_parallel_size=1, context capped at 32k to fit in VRAM.
diff --git a/hyperstack-vm2.toml b/hyperstack-vm2.toml
index c3605ff..faa8054 100644
--- a/hyperstack-vm2.toml
+++ b/hyperstack-vm2.toml
@@ -55,7 +55,7 @@ listen_host = "0.0.0.0:11434"
gpu_overhead_mb = 2000
num_parallel = 1
context_length = 32768
-pull_models = ["qwen3-coder-next"]
+pull_models = ["qwen36-27b"]
# vLLM serves one model via Docker on the OpenAI-compatible API.
# VM2 defaults to Qwen3.6 27B; use 'model switch' to load any other preset.
@@ -102,14 +102,6 @@ docker_image = "vllm/vllm-openai:nightly"
pre_start_cmd = "pip install -q transformers==5.5.0 2>/dev/null"
extra_docker_env = ["CUDA_VISIBLE_DEVICES=0"]
-[vllm.presets.qwen3-coder-next]
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-container_name = "vllm_qwen3"
-max_model_len = 262144
-gpu_memory_utilization = 0.92
-tensor_parallel_size = 1
-tool_call_parser = "qwen3_coder"
-
# NVIDIA Nemotron-3-Super-120B-A12B AWQ 4-bit — hybrid Mamba+MoE (12B active / 120B total).
# ~60 GB weights on A100 80GB; ~13 GB remaining for KV cache at 0.92 utilisation.
# Uses NoPE so any context length is valid; capped at 131072 to keep KV cache within VRAM budget.
diff --git a/hypr.fish b/hypr.fish
index e2de7d2..60f8356 100644
--- a/hypr.fish
+++ b/hypr.fish
@@ -1,5 +1,5 @@
# Dual-VM setup (hyperstack-vm1/vm2.toml -> hyperstack1/2.wg1)
-abbr pi-hyperstack-coder pi --model hyperstack1/bullpoint/Qwen3-Coder-Next-AWQ-4bit
+abbr pi-hyperstack-coder pi --model hyperstack1/Qwen/Qwen3.6-27B-FP8
abbr pi-hyperstack-qwen36 pi --model hyperstack2/Qwen/Qwen3.6-27B-FP8
abbr pi-hyperstack-gemma4 pi --model hyperstack2/cyankiwi/gemma-4-31B-it-AWQ-4bit
abbr hyperstack-create ruby ~/git/hyperstack/hyperstack.rb create
diff --git a/lib/hyperstack/config.rb b/lib/hyperstack/config.rb
index ba143e7..7057b4f 100644
--- a/lib/hyperstack/config.rb
+++ b/lib/hyperstack/config.rb
@@ -85,9 +85,9 @@ module HyperstackVM
},
'vllm' => {
'install' => true,
- 'model' => 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
+ 'model' => 'Qwen/Qwen3.6-27B-FP8',
'hug_cache_dir' => '/ephemeral/hug',
- 'container_name' => 'vllm_qwen3',
+ 'container_name' => 'vllm_qwen36_27b',
'max_model_len' => 262_144,
'gpu_memory_utilization' => 0.92,
'tensor_parallel_size' => 1,
diff --git a/logo.svg b/logo.svg
index 6bf71b4..a405085 100644
--- a/logo.svg
+++ b/logo.svg
@@ -378,7 +378,7 @@
<tspan fill="#8b949e"> pi --model hyperstack2/qwen3</tspan>
</text>
<text x="1112" y="286" fill="#6e7681"> Connecting to hyperstack2.wg1…</text>
- <text x="1112" y="302" fill="#58a6ff"> » I am Qwen3-Coder, let's build!</text>
+ <text x="1112" y="302" fill="#58a6ff"> » I am Qwen3.6, let's build!</text>
<!-- Blinking cursor -->
<rect x="1112" y="322" width="8" height="14" fill="#58a6ff" opacity="0.8"/>
diff --git a/pi/agent/extensions/nemotron-tool-repair/index.ts b/pi/agent/extensions/nemotron-tool-repair/index.ts
index 9bb8f94..ae59a66 100644
--- a/pi/agent/extensions/nemotron-tool-repair/index.ts
+++ b/pi/agent/extensions/nemotron-tool-repair/index.ts
@@ -20,7 +20,7 @@ import type { ExtensionAPI, ExtensionContext } from "@mariozechner/pi-coding-age
const CUSTOM_API = "hyperstack-openai-completions-repaired";
const TARGET_PROVIDERS = new Set(["hyperstack1", "hyperstack2"]);
const NEMOTRON_MODEL_PATTERN = /NVIDIA-Nemotron-3-Super/i;
-// Matches all Qwen Coder variants (Qwen3-Coder-Next, Qwen3-Coder-30B, etc.)
+// Matches Qwen3 Coder variants (Qwen3-Coder-30B, etc.)
const QWEN_CODER_MODEL_PATTERN = /Qwen.*Coder/i;
const MODELS_JSON_PATH = path.resolve(
path.dirname(fileURLToPath(import.meta.url)),
diff --git a/pi/agent/models.json b/pi/agent/models.json
index 48cd0e9..a5e8200 100644
--- a/pi/agent/models.json
+++ b/pi/agent/models.json
@@ -14,8 +14,15 @@
"id": "openai/gpt-oss-120b",
"name": "GPT-OSS 120B [vm]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
@@ -23,17 +30,31 @@
"id": "openai/gpt-oss-20b",
"name": "GPT-OSS 20B [vm]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192
},
{
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192,
"compat": {
@@ -47,8 +68,15 @@
"id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
"name": "Gemma 4 31B IT [vm]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
@@ -56,8 +84,15 @@
"id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
"name": "Nemotron 3 Super 120B [vm]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192
},
@@ -65,8 +100,15 @@
"id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
"name": "Qwen2.5 Coder 32B [vm]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -74,8 +116,15 @@
"id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
"name": "Qwen3 Coder 30B [vm]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192,
"compat": {
@@ -89,8 +138,15 @@
"id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
"name": "DeepSeek-R1-Distill 32B [vm]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -98,8 +154,15 @@
"id": "Qwen/Qwen3-32B-AWQ",
"name": "Qwen3 32B [vm]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192,
"compat": {
@@ -113,8 +176,15 @@
"id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
"name": "Devstral Small 2507 [vm]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
}
@@ -134,8 +204,15 @@
"id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
"name": "Gemma 4 31B IT [vm1]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
@@ -143,17 +220,31 @@
"id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
"name": "Nemotron 3 Super 120B 1M [vm1]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 1048576,
"maxTokens": 8192
},
{
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm1]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm1]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192,
"compat": {
@@ -167,8 +258,15 @@
"id": "openai/gpt-oss-20b",
"name": "GPT-OSS 20B [vm1]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192
},
@@ -176,8 +274,15 @@
"id": "openai/gpt-oss-120b",
"name": "GPT-OSS 120B [vm1]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
@@ -185,8 +290,15 @@
"id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
"name": "Qwen2.5 Coder 32B [vm1]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -194,8 +306,15 @@
"id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
"name": "Qwen3 Coder 30B [vm1]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192,
"compat": {
@@ -209,8 +328,15 @@
"id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
"name": "DeepSeek-R1-Distill 32B [vm1]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -218,8 +344,15 @@
"id": "Qwen/Qwen3-32B-AWQ",
"name": "Qwen3 32B [vm1]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192,
"compat": {
@@ -233,8 +366,15 @@
"id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
"name": "Devstral Small 2507 [vm1]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
}
@@ -254,8 +394,15 @@
"id": "Qwen/Qwen3.6-27B-FP8",
"name": "Qwen3.6 27B FP8 [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192,
"compat": {
@@ -269,17 +416,31 @@
"id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
"name": "Gemma 4 31B IT [vm2]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
{
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm2]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192,
"compat": {
@@ -293,8 +454,15 @@
"id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
"name": "Nemotron 3 Super 120B [vm2]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 262144,
"maxTokens": 8192
},
@@ -302,8 +470,15 @@
"id": "openai/gpt-oss-20b",
"name": "GPT-OSS 20B [vm2]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192
},
@@ -311,8 +486,15 @@
"id": "openai/gpt-oss-120b",
"name": "GPT-OSS 120B [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 131072,
"maxTokens": 8192
},
@@ -320,8 +502,15 @@
"id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
"name": "Qwen2.5 Coder 32B [vm2]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -329,8 +518,15 @@
"id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
"name": "Qwen3 Coder 30B [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 65536,
"maxTokens": 8192,
"compat": {
@@ -344,8 +540,15 @@
"id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
"name": "DeepSeek-R1-Distill 32B [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
},
@@ -353,8 +556,15 @@
"id": "Qwen/Qwen3-32B-AWQ",
"name": "Qwen3 32B [vm2]",
"reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192,
"compat": {
@@ -368,8 +578,15 @@
"id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
"name": "Devstral Small 2507 [vm2]",
"reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
"contextWindow": 32768,
"maxTokens": 8192
}
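Judging by the `[vm2]` name suffixes, the `models.json` hunks above appear to leave the `hyperstack2` provider with two entries whose `id` is `Qwen/Qwen3.6-27B-FP8`: one that was already present as context, and one that replaces the old `bullpoint/Qwen3-Coder-Next-AWQ-4bit` entry. A small check along the following lines could flag that kind of duplicate. It is a sketch only: the helper name `duplicate_model_ids` is hypothetical, and the list is an abbreviated copy of the entries visible in the diff rather than the full file.

```python
from collections import Counter


def duplicate_model_ids(models: list[dict]) -> list[str]:
    """Return the model ids that occur more than once, sorted (hypothetical helper)."""
    counts = Counter(model["id"] for model in models)
    return sorted(model_id for model_id, n in counts.items() if n > 1)


# Abbreviated hyperstack2 entries as they appear after this diff is applied.
hyperstack2_models = [
    {"id": "Qwen/Qwen3.6-27B-FP8", "name": "Qwen3.6 27B FP8 [vm2]"},
    {"id": "cyankiwi/gemma-4-31B-it-AWQ-4bit", "name": "Gemma 4 31B IT [vm2]"},
    {"id": "Qwen/Qwen3.6-27B-FP8", "name": "Qwen3.6 27B FP8 [vm2]"},
    {
        "id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
        "name": "Nemotron 3 Super 120B [vm2]",
    },
]

print(duplicate_model_ids(hyperstack2_models))
```

Running the same function over each provider's model list in the real file would show whether the duplicate is actually there.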