| author | Paul Buetow <paul@buetow.org> | 2026-05-24 14:02:34 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-24 14:02:34 +0300 |
| commit | c8bd4d1e7a34ebf452d3d6c843d5cef785abe608 | |
| tree | ec1e6c19379c3ba86f6d80d90286eceae393b983 | |
| parent | f16f4b753b3bf317e6da79f479ff5f506ed34b47 | |
replace qwen3-coder-next with qwen3.6-27b across configs, docs, and tooling
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | AGENTS.md | 4 |
| -rw-r--r-- | README.md | 32 |
| -rw-r--r-- | hyperstack-vm1.toml | 21 |
| -rw-r--r-- | hyperstack-vm2.toml | 10 |
| -rw-r--r-- | hypr.fish | 2 |
| -rw-r--r-- | lib/hyperstack/config.rb | 4 |
| -rw-r--r-- | logo.svg | 2 |
| -rw-r--r-- | pi/agent/extensions/nemotron-tool-repair/index.ts | 2 |
| -rw-r--r-- | pi/agent/models.json | 353 |
9 files changed, 321 insertions, 109 deletions
@@ -168,7 +168,7 @@ inference is ready. On an A100 with a warm HuggingFace cache:
 **Monitor startup:**
 ```bash
-ssh ubuntu@<vm-public-ip> 'sudo docker logs -f vllm_qwen3 2>&1' \
+ssh ubuntu@<vm-public-ip> 'sudo docker logs -f vllm_qwen36_27b 2>&1' \
 | grep -E "startup complete|Error|Loading|Downloading"
 ```
@@ -176,7 +176,7 @@ After `Application startup complete.`, the model responds immediately.
 If the container crashes before that line, check for CUDA errors:
 ```bash
-ssh ubuntu@<vm-public-ip> 'sudo docker logs vllm_qwen3 2>&1 | grep -i "error\|cuda"'
+ssh ubuntu@<vm-public-ip> 'sudo docker logs vllm_qwen36_27b 2>&1 | grep -i "error\|cuda"'
 ```
 A `CUDA error: operation not permitted` on the first engine process (pid visible in
@@ -27,7 +27,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
  │ │ │ │ pane 0: pi-coder │ pane 1: pi-gemma4 │ │ │ │
  │ │ │ │ │ │ │ │ │
  │ │ │ │ Pi │ Pi │ │ │ │
- │ │ │ │ Qwen3-Coder-Next │ Gemma 4 31B │ │ │ │
+ │ │ │ │ Qwen3.6 27B FP8 │ Gemma 4 31B │ │ │ │
  │ │ │ └──────────┬────────────┘└────────────┬───────────┘ │ │ │
  │ │ │ │ OpenAI API │ OpenAI API │ │ │
  │ │ │ │ /v1/chat/completions │ /v1/chat/completions│ │ │
@@ -45,7 +45,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
  │ hyperstack1.wg1 │ │ hyperstack2.wg1 │
  │ │ │ │
  │ vLLM :11434 │ │ vLLM :11434 │
- │ Qwen3-Coder-Next │ │ Gemma 4 31B IT │
+ │ Qwen3.6 27B FP8 │ │ Gemma 4 31B IT │
  │ (MoE, AWQ-4bit) │ │ (dense, AWQ-4bit) │
  └──────────────────────────┘ └──────────────────────────┘
 ```
@@ -167,7 +167,7 @@ Source `hyperstack.fish` or copy the abbreviations into your Fish config:
 ```fish
 abbr pi-hyperstack pi --model hyperstack/openai/gpt-oss-120b
-abbr pi-hyperstack-coder pi --model hyperstack1/bullpoint/Qwen3-Coder-Next-AWQ-4bit
+abbr pi-hyperstack-coder pi --model hyperstack1/Qwen/Qwen3.6-27B-FP8
 abbr pi-hyperstack-qwen36 pi --model hyperstack2/Qwen/Qwen3.6-27B-FP8
 abbr pi-hyperstack-gemma4 pi --model hyperstack2/cyankiwi/gemma-4-31B-it-AWQ-4bit
 ```
@@ -176,7 +176,7 @@ Then launch a session after the VM(s) are up:
 ```fish
 pi-hyperstack # GPT-OSS 120B on VM1
-pi-hyperstack-coder # Qwen3-Coder-Next on VM1
+pi-hyperstack-coder # Qwen3.6 27B FP8 on VM1
 pi-hyperstack-qwen36 # Qwen3.6 27B FP8 on VM2
 pi-hyperstack-gemma4 # Gemma 4 31B on VM2
 ```
@@ -188,7 +188,7 @@ Three providers are defined, one per setup, each pointing at its vLLM endpoint o
 | Provider | Base URL | Primary model |
 |----------|----------|---------------|
 | `hyperstack` | `http://hyperstack.wg1:11434/v1` | GPT-OSS 120B (single-VM) |
-| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Qwen3-Coder-Next (default; presets in TOML) |
+| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Qwen3.6 27B FP8 (default; presets in TOML) |
 | `hyperstack2` | `http://hyperstack2.wg1:11434/v1` | Gemma 4 31B (default; presets in TOML) |
 All model presets from the TOML configs are registered under each provider, so any
@@ -255,7 +255,7 @@ No API key or account required. Uses DuckDuckGo's free HTML endpoint.
 | Config file | Default model | WireGuard IP | Hostname |
 |---|---|---|---|
-| `hyperstack-vm1.toml` | Qwen3-Coder-Next (AWQ-4bit) | `192.168.3.1` | `hyperstack1.wg1` |
+| `hyperstack-vm1.toml` | Qwen3.6 27B FP8 | `192.168.3.1` | `hyperstack1.wg1` |
 | `hyperstack-vm2.toml` | Gemma 4 31B IT (AWQ-4bit) | `192.168.3.3` | `hyperstack2.wg1` |
 Each VM has independent state files so they can be managed separately:
@@ -270,8 +270,8 @@ ruby hyperstack.rb --vm 2 status
 Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:
 ```bash
-ruby hyperstack.rb --vm 1 model switch qwen3-coder-next
-ruby hyperstack.rb --vm 2 model switch qwen3-coder-next
+ruby hyperstack.rb --vm 1 model switch qwen36-27b
+ruby hyperstack.rb --vm 2 model switch qwen36-27b
 ```
 Available presets (both VMs share the same set):
@@ -280,7 +280,7 @@ Available presets (both VMs share the same set):
 |---|---|---|---|
 | `gemma4-31b` | Gemma 4 31B IT (AWQ-4bit) | ~19 GB | 32K–128K (see TOML) |
 | `nemotron-super` | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
-| `qwen3-coder-next` | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
+| `qwen36-27b` | Qwen3.6 27B FP8 | ~45 GB | 262K |
 | `gpt-oss-120b` | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
 | `gpt-oss-20b` | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
 | `qwen25-coder-32b` | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
@@ -349,7 +349,7 @@ ruby hyperstack.rb test --vm 1
 ruby hyperstack.rb test --vm 2
 # Launch Pi coding agents — one per terminal
-pi-hyperstack-coder # fish abbreviation → Qwen3-Coder-Next on VM1
+pi-hyperstack-coder # fish abbreviation → Qwen3.6 27B FP8 on VM1
 pi-hyperstack-qwen36 # fish abbreviation → Qwen3.6 27B FP8 on VM2
 pi-hyperstack-gemma4 # fish abbreviation → Gemma 4 31B on VM2
@@ -361,8 +361,8 @@ ruby hyperstack.rb delete --vm both
 ```bash
 # Switch the running vLLM container to a different model preset
-ruby hyperstack.rb --vm 1 model switch qwen3-coder-next
-ruby hyperstack.rb --vm 2 model switch qwen3-coder-next
+ruby hyperstack.rb --vm 1 model switch qwen36-27b
+ruby hyperstack.rb --vm 2 model switch qwen36-27b
 ```
 See the [VM configuration](#vm-configuration) and [Switching models](#switching-models)
@@ -403,7 +403,7 @@ docker run -d \
  --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
- --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+ --model Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
@@ -445,7 +445,7 @@ curl -s http://localhost:11434/v1/models | python3 -m json.tool
 curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
- -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
+ -d '{"model":"Qwen/Qwen3.6-27B-FP8",
  "messages":[{"role":"user","content":"Hello"}],
  "max_tokens":50}'
 ```
@@ -600,9 +600,9 @@ Search HuggingFace for vLLM-compatible quantized models:
 ## Performance characteristics
-Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:
+Measured on A100 80 GB PCIe (single GPU) with Qwen3.6 27B FP8:

-| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
+| Metric | vLLM (FP8) | Ollama (Q4_K_M) |
 |--------|-------------------|-----------------|
 | Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
 | Decode throughput | 40–99 tok/s | ~40 tok/s |
diff --git a/hyperstack-vm1.toml b/hyperstack-vm1.toml
index c6fb2df..75c313c 100644
--- a/hyperstack-vm1.toml
+++ b/hyperstack-vm1.toml
@@ -13,13 +13,13 @@ name_prefix = "hyperstack1"
 hostname = "hyperstack1"
 environment_name = "snonux-ollama"
-# A100-80GB single GPU for qwen3-coder-next (default); H100 fallback if n3-A100x1 unavailable.
+# A100-80GB single GPU for Qwen3.6 27B (default); H100 fallback if n3-A100x1 unavailable.
 flavor_name = "n3-A100x1"
 image_name = "Ubuntu Server 24.04 LTS R570 CUDA 12.8 with Docker"
 assign_floating_ip = true
 create_bootable_volume = false
 enable_port_randomization = false
-labels = ["qwen3-coder-next", "wireguard"]
+labels = ["qwen36-27b", "wireguard"]
 [ssh]
 username = "ubuntu"
@@ -55,16 +55,16 @@ listen_host = "0.0.0.0:11434"
 gpu_overhead_mb = 2000
 num_parallel = 1
 context_length = 32768
-pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
+pull_models = ["qwen36-27b", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
 # vLLM serves one model via Docker on the OpenAI-compatible API.
-# VM1 defaults to qwen3-coder-next; use 'model switch' to load any other preset.
+# VM1 defaults to Qwen3.6 27B; use 'model switch' to load any other preset.
 [vllm]
 install = true
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
+model = "Qwen/Qwen3.6-27B-FP8"
 # HuggingFace model cache on ephemeral NVMe (fast; survives reboots on most providers).
 hug_cache_dir = "/ephemeral/hug"
-container_name = "vllm_qwen3"
+container_name = "vllm_qwen36_27b"
 max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
@@ -73,13 +73,16 @@ tool_call_parser = "qwen3_coder"
 # Named model presets for 'ruby hyperstack.rb --vm 1 model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
-[vllm.presets.qwen3-coder-next]
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-container_name = "vllm_qwen3"
+# Qwen3.6-27B FP8 — dense 27B multimodal model with native 262K context.
+# Uses qwen3 reasoning parsing plus qwen3_coder tool calling on vLLM >=0.19.0.
+[vllm.presets.qwen36-27b]
+model = "Qwen/Qwen3.6-27B-FP8"
+container_name = "vllm_qwen36_27b"
 max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
 tool_call_parser = "qwen3_coder"
+extra_vllm_args = ["--reasoning-parser", "qwen3"]
 # NVIDIA Nemotron-3-Super-120B-A12B AWQ 4-bit — hybrid Mamba+MoE (12B active / 120B total).
 # Single-GPU (A100-80GB) config: tensor_parallel_size=1, context capped at 32k to fit in VRAM.
diff --git a/hyperstack-vm2.toml b/hyperstack-vm2.toml
index c3605ff..faa8054 100644
--- a/hyperstack-vm2.toml
+++ b/hyperstack-vm2.toml
@@ -55,7 +55,7 @@ listen_host = "0.0.0.0:11434"
 gpu_overhead_mb = 2000
 num_parallel = 1
 context_length = 32768
-pull_models = ["qwen3-coder-next"]
+pull_models = ["qwen36-27b"]
 # vLLM serves one model via Docker on the OpenAI-compatible API.
 # VM2 defaults to Qwen3.6 27B; use 'model switch' to load any other preset.
@@ -102,14 +102,6 @@ docker_image = "vllm/vllm-openai:nightly"
 pre_start_cmd = "pip install -q transformers==5.5.0 2>/dev/null"
 extra_docker_env = ["CUDA_VISIBLE_DEVICES=0"]
-[vllm.presets.qwen3-coder-next]
-model = "bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-container_name = "vllm_qwen3"
-max_model_len = 262144
-gpu_memory_utilization = 0.92
-tensor_parallel_size = 1
-tool_call_parser = "qwen3_coder"
-
 # NVIDIA Nemotron-3-Super-120B-A12B AWQ 4-bit — hybrid Mamba+MoE (12B active / 120B total).
 # ~60 GB weights on A100 80GB; ~13 GB remaining for KV cache at 0.92 utilisation.
 # Uses NoPE so any context length is valid; capped at 131072 to keep KV cache within VRAM budget.
@@ -1,5 +1,5 @@
 # Dual-VM setup (hyperstack-vm1/vm2.toml -> hyperstack1/2.wg1)
-abbr pi-hyperstack-coder pi --model hyperstack1/bullpoint/Qwen3-Coder-Next-AWQ-4bit
+abbr pi-hyperstack-coder pi --model hyperstack1/Qwen/Qwen3.6-27B-FP8
 abbr pi-hyperstack-qwen36 pi --model hyperstack2/Qwen/Qwen3.6-27B-FP8
 abbr pi-hyperstack-gemma4 pi --model hyperstack2/cyankiwi/gemma-4-31B-it-AWQ-4bit
 abbr hyperstack-create ruby ~/git/hyperstack/hyperstack.rb create
diff --git a/lib/hyperstack/config.rb b/lib/hyperstack/config.rb
index ba143e7..7057b4f 100644
--- a/lib/hyperstack/config.rb
+++ b/lib/hyperstack/config.rb
@@ -85,9 +85,9 @@ module HyperstackVM
  },
  'vllm' => {
  'install' => true,
- 'model' => 'bullpoint/Qwen3-Coder-Next-AWQ-4bit',
+ 'model' => 'Qwen/Qwen3.6-27B-FP8',
  'hug_cache_dir' => '/ephemeral/hug',
- 'container_name' => 'vllm_qwen3',
+ 'container_name' => 'vllm_qwen36_27b',
  'max_model_len' => 262_144,
  'gpu_memory_utilization' => 0.92,
  'tensor_parallel_size' => 1,
@@ -378,7 +378,7 @@
  <tspan fill="#8b949e"> pi --model hyperstack2/qwen3</tspan>
  </text>
  <text x="1112" y="286" fill="#6e7681"> Connecting to hyperstack2.wg1…</text>
- <text x="1112" y="302" fill="#58a6ff"> » I am Qwen3-Coder, let's build!</text>
+ <text x="1112" y="302" fill="#58a6ff"> » I am Qwen3.6, let's build!</text>
  <!-- Blinking cursor -->
  <rect x="1112" y="322" width="8" height="14" fill="#58a6ff" opacity="0.8"/>
diff --git a/pi/agent/extensions/nemotron-tool-repair/index.ts b/pi/agent/extensions/nemotron-tool-repair/index.ts
index 9bb8f94..ae59a66 100644
--- a/pi/agent/extensions/nemotron-tool-repair/index.ts
+++ b/pi/agent/extensions/nemotron-tool-repair/index.ts
@@ -20,7 +20,7 @@ import type { ExtensionAPI, ExtensionContext } from "@mariozechner/pi-coding-age
 const CUSTOM_API = "hyperstack-openai-completions-repaired";
 const TARGET_PROVIDERS = new Set(["hyperstack1", "hyperstack2"]);
 const NEMOTRON_MODEL_PATTERN = /NVIDIA-Nemotron-3-Super/i;
-// Matches all Qwen Coder variants (Qwen3-Coder-Next, Qwen3-Coder-30B, etc.)
+// Matches Qwen3 Coder variants (Qwen3-Coder-30B, etc.)
 const QWEN_CODER_MODEL_PATTERN = /Qwen.*Coder/i;
 const MODELS_JSON_PATH = path.resolve(
  path.dirname(fileURLToPath(import.meta.url)),
diff --git a/pi/agent/models.json b/pi/agent/models.json
index 48cd0e9..a5e8200 100644
--- a/pi/agent/models.json
+++ b/pi/agent/models.json
@@ -14,8 +14,15 @@
  "id": "openai/gpt-oss-120b",
  "name": "GPT-OSS 120B [vm]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
@@ -23,17 +30,31 @@
  "id": "openai/gpt-oss-20b",
  "name": "GPT-OSS 20B [vm]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192
  },
  {
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192,
  "compat": {
@@ -47,8 +68,15 @@
  "id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
  "name": "Gemma 4 31B IT [vm]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
@@ -56,8 +84,15 @@
  "id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
  "name": "Nemotron 3 Super 120B [vm]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192
  },
@@ -65,8 +100,15 @@
  "id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
  "name": "Qwen2.5 Coder 32B [vm]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -74,8 +116,15 @@
  "id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
  "name": "Qwen3 Coder 30B [vm]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192,
  "compat": {
@@ -89,8 +138,15 @@
  "id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
  "name": "DeepSeek-R1-Distill 32B [vm]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -98,8 +154,15 @@
  "id": "Qwen/Qwen3-32B-AWQ",
  "name": "Qwen3 32B [vm]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192,
  "compat": {
@@ -113,8 +176,15 @@
  "id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
  "name": "Devstral Small 2507 [vm]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  }
@@ -134,8 +204,15 @@
  "id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
  "name": "Gemma 4 31B IT [vm1]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
@@ -143,17 +220,31 @@
  "id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
  "name": "Nemotron 3 Super 120B 1M [vm1]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 1048576,
  "maxTokens": 8192
  },
  {
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm1]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm1]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192,
  "compat": {
@@ -167,8 +258,15 @@
  "id": "openai/gpt-oss-20b",
  "name": "GPT-OSS 20B [vm1]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192
  },
@@ -176,8 +274,15 @@
  "id": "openai/gpt-oss-120b",
  "name": "GPT-OSS 120B [vm1]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
@@ -185,8 +290,15 @@
  "id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
  "name": "Qwen2.5 Coder 32B [vm1]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -194,8 +306,15 @@
  "id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
  "name": "Qwen3 Coder 30B [vm1]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192,
  "compat": {
@@ -209,8 +328,15 @@
  "id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
  "name": "DeepSeek-R1-Distill 32B [vm1]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -218,8 +344,15 @@
  "id": "Qwen/Qwen3-32B-AWQ",
  "name": "Qwen3 32B [vm1]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192,
  "compat": {
@@ -233,8 +366,15 @@
  "id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
  "name": "Devstral Small 2507 [vm1]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  }
@@ -254,8 +394,15 @@
  "id": "Qwen/Qwen3.6-27B-FP8",
  "name": "Qwen3.6 27B FP8 [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192,
  "compat": {
@@ -269,17 +416,31 @@
  "id": "cyankiwi/gemma-4-31B-it-AWQ-4bit",
  "name": "Gemma 4 31B IT [vm2]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
  {
- "id": "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
- "name": "Qwen3 Coder Next [vm2]",
+ "id": "Qwen/Qwen3.6-27B-FP8",
+ "name": "Qwen3.6 27B FP8 [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192,
  "compat": {
@@ -293,8 +454,15 @@
  "id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
  "name": "Nemotron 3 Super 120B [vm2]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 262144,
  "maxTokens": 8192
  },
@@ -302,8 +470,15 @@
  "id": "openai/gpt-oss-20b",
  "name": "GPT-OSS 20B [vm2]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192
  },
@@ -311,8 +486,15 @@
  "id": "openai/gpt-oss-120b",
  "name": "GPT-OSS 120B [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 131072,
  "maxTokens": 8192
  },
@@ -320,8 +502,15 @@
  "id": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
  "name": "Qwen2.5 Coder 32B [vm2]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -329,8 +518,15 @@
  "id": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
  "name": "Qwen3 Coder 30B [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 65536,
  "maxTokens": 8192,
  "compat": {
@@ -344,8 +540,15 @@
  "id": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
  "name": "DeepSeek-R1-Distill 32B [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  },
@@ -353,8 +556,15 @@
  "id": "Qwen/Qwen3-32B-AWQ",
  "name": "Qwen3 32B [vm2]",
  "reasoning": true,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192,
  "compat": {
@@ -368,8 +578,15 @@
  "id": "cyankiwi/Devstral-Small-2507-AWQ-4bit",
  "name": "Devstral Small 2507 [vm2]",
  "reasoning": false,
- "input": ["text"],
- "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
+ "input": [
+ "text"
+ ],
+ "cost": {
+ "input": 0,
+ "output": 0,
+ "cacheRead": 0,
+ "cacheWrite": 0
+ },
  "contextWindow": 32768,
  "maxTokens": 8192
  }

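A note on the `index.ts` hunk: `QWEN_CODER_MODEL_PATTERN` stays `/Qwen.*Coder/i`, while the model id it used to match on VM1 (`bullpoint/Qwen3-Coder-Next-AWQ-4bit`) is replaced by `Qwen/Qwen3.6-27B-FP8`, which contains no "Coder" substring. The commit does not say whether the repair extension is meant to apply to Qwen3.6, so the sketch below only illustrates what the two patterns match. The patterns and model ids are copied from the diff; `repairFamily` is a hypothetical helper written for this illustration and is not part of the repository.

```typescript
// Patterns copied verbatim from pi/agent/extensions/nemotron-tool-repair/index.ts.
const NEMOTRON_MODEL_PATTERN = /NVIDIA-Nemotron-3-Super/i;
const QWEN_CODER_MODEL_PATTERN = /Qwen.*Coder/i;

// Hypothetical helper (not in the repository): reports which pattern,
// if any, a model id matches.
function repairFamily(modelId: string): "nemotron" | "qwen-coder" | "none" {
  if (NEMOTRON_MODEL_PATTERN.test(modelId)) return "nemotron";
  if (QWEN_CODER_MODEL_PATTERN.test(modelId)) return "qwen-coder";
  return "none";
}

// Model ids taken from the diff: the removed default, the new default,
// and two presets that are unchanged by this commit.
const modelIds: string[] = [
  "bullpoint/Qwen3-Coder-Next-AWQ-4bit",
  "Qwen/Qwen3.6-27B-FP8",
  "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
  "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
];

for (const id of modelIds) {
  console.log(`${id} -> ${repairFamily(id)}`);
}
```

Under these patterns the new default id falls into the `none` branch, which is consistent with the narrowed comment in the hunk ("Matches Qwen3 Coder variants").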