update prompts

author: Paul Buetow <paul@buetow.org> 2026-06-03 23:26:23 +0300
committer: Paul Buetow <paul@buetow.org> 2026-06-03 23:26:23 +0300
commit: afd73bb974b321fea036d0b785a46168f79dc658 (patch)
tree: a98729f87a7cca85429ec7de065d43d9f63c595b
parent: 7f3b3d58527db7a5e6a167555e1526d6836d7fe4 (diff)
1 files changed, 225 insertions, 0 deletions
diff --git a/prompts/skills/llm-benchmark-comparison/SKILL.md b/prompts/skills/llm-benchmark-comparison/SKILL.md
new file mode 100644
index 0000000..7b55f6c
--- /dev/null
+++ b/prompts/skills/llm-benchmark-comparison/SKILL.md
@@ -0,0 +1,225 @@
+---
+name: llm-benchmark-comparison
+description: "Research and compare LLM model benchmarks (coding, agentic, reasoning, multimodal) across frontier and open-weight models. Produces a side-by-side comparison table with cost, context, modality, and per-benchmark scores; calls out ties, caveats, and 'what the numbers hide'. Use when asked to compare LLMs, benchmark models, rank models, or build a model-selection report. Triggers on: compare LLMs, LLM benchmark, model comparison, GPT vs Claude vs DeepSeek, which model is best, model leaderboard, SWE-Bench comparison, benchmark scores."
+---
+
+# LLM Benchmark Comparison
+
+A repeatable workflow for turning "compare these LLMs" into a sourced,
+side-by-side benchmark report. Output is a markdown table the user can paste
+into a blog post, RFC, or procurement doc.
+
+## When to Use
+
+- The user names **two or more models** and wants them compared on
+  benchmarks (e.g. "MiniMax M3 vs Kimi K2.6 vs GLM 5.1", "GPT-5.5 vs Claude
+  Opus 4.7", "DeepSeek V4 vs Qwen3.7-Max").
+- The user asks "which model is best at X" and you need to weigh multiple
+  benchmarks.
+- The user wants a model-selection report: benchmarks, pricing, context,
+  license, modalities — in one table.
+
+## Inputs
+
+Resolve these before searching; if the user gave you a list of models, use
+those. Otherwise ask which models to compare. Typical candidates from 2026:
+
+- **Closed frontier:** Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro
+- **Open-weight flagships (Chinese):** MiniMax M3, Kimi K2.6, GLM 5.1,
+  DeepSeek V4, Qwen3.7-Max
+- **Open-weight flagships (US):** Llama 4 Behemoth, Nemotron 3 Ultra
+
+## Workflow
+
+### 1. Identify the comparison set
+
+Confirm the model names, including version suffixes. Many labs ship a new
+point release every 1–3 months; `Kimi K2.6` is different from `Kimi K2.5` or
+`Kimi K2 Thinking`. Get the exact release the user means.
+
+### 2. Search the web for each model
+
+Prefer the **HTML DuckDuckGo endpoint** (`https://html.duckduckgo.com/html/?q=…`)
+because the JSON API and the JS site return empty in this environment. Brave
+Search (`https://search.brave.com/search?q=…`) is a useful fallback. Bing
+and Google both gate on CAPTCHA.
+
+Useful per-model queries:
+
+- `"<model>" benchmark scores`
+- `"<model>" SWE-Bench Pro` (the de-facto coding test in 2026)
+- `"<model>" pricing context window`
+- `"<model>" site:huggingface.co` (model card)
+- `"<model>" site:artificialanalysis.ai` (independent evals)
+- `"<model>" review vs` (head-to-head pieces)
+
+### 3. Pull authoritative sources
+
+In this order of trust:
+
+1. **Vendor blog / launch post** — gives headline numbers, but watch for
+   cherry-picking and self-reported harnesses.
+2. **Hugging Face model card** — usually has the most complete benchmark
+   table; check the license, params, context.
+3. **Artificial Analysis article** — independent evals (Intelligence
+   Index, GDPval-AA Elo, AA-Omniscience hallucination rate).
+4. **OpenRouter / LLM-Stats pricing pages** — current $/M token rates.
+5. **Independent reviews** — Lushbinary, OfficeChai, Analytics India,
+   Geeky Gadgets; useful for "what the numbers hide" and real-world
+   anecdote.
+
+For each model, capture:
+
+| Field | Where to find it |
+|---|---|
+| Lab + release date | Vendor blog or HF model card |
+| Architecture (dense/MoE, total/active params, attention type) | HF model card or technical report |
+| Context window | Vendor page, often in the model card |
+| Modalities | HF model card, README |
+| License | HF model card (top of page) |
+| **SWE-Bench Pro** | Almost always in launch blog |
+| **SWE-Bench Verified** | Vendor blog, leaderboards |
+| **Terminal-Bench 2.0 / 2.1** | Vendor blog, "agentic" tables |
+| **BrowseComp** | Agentic browsing tests; check vendor and AA |
+| **HLE (Humanity's Last Exam)** | With and without tools, often reported both |
+| **GPQA-Diamond** | Reasoning benchmark; check leaderboards |
+| **AIME 2026** | Math reasoning; check leaderboards |
+| **τ²-Bench / τ²-Bench Telecom** | Tool-use / agentic |
+| **MCP-Atlas** | Tool-use over MCP |
+| **AA Intelligence Index** | artificialanalysis.ai |
+| **AA-Omniscience hallucination** | artificialanalysis.ai |
+| **Input $/M, Output $/M** | OpenRouter, vendor pricing page, LLM-Stats |
+
+If a number is missing after two passes, mark it `n/a` — do not invent.
+
+### 4. Build the comparison table
+
+Use this structure (markdown, copy-paste-ready):
+
+```markdown
+| | **Model A** | **Model B** | **Model C** |
+|---|---|---|---|
+| Lab | ... | ... | ... |
+| Released | YYYY-MM-DD | ... | ... |
+| Architecture | ... | ... | ... |
+| Context | ... | ... | ... |
+| Modalities | ... | ... | ... |
+| License | ... | ... | ... |
+| Input $/M (promo/std) | ... | ... | ... |
+| Output $/M | ... | ... | ... |
+
+| Benchmark | Model A | Model B | Model C | Notes |
+|---|---:|---:|---:|---|
+| SWE-Bench Pro | ... | ... | ... | ... |
+| SWE-Bench Verified | ... | ... | ... | ... |
+| Terminal-Bench 2.x | ... | ... | ... | ... |
+| BrowseComp | ... | ... | ... | ... |
+| HLE (w/ tools) | ... | ... | ... | ... |
+| GPQA-Diamond | ... | ... | ... | ... |
+| AIME 2026 | ... | ... | ... | ... |
+| τ²-Bench Telecom | ... | ... | ... | ... |
+| MCP-Atlas | ... | ... | ... | ... |
+| AA Intelligence Index | ... | ... | ... | ... |
+```
+
+Right-align numeric columns (`---:`) so the digits line up. Round benchmark
+percentages to one decimal. Keep the Notes column short — one phrase, not
+a sentence.
+
+### 5. Write the analysis
+
+After the table, include these sections in order:
+
+1. **Quick reference** — a one-row-per-model summary table covering the
+   "what is it" fields (lab, release, arch, context, modalities, license,
+   price). Optional but useful when the user has 3+ models.
+2. **Headline benchmark table** — the long table above.
+3. **How to read this** — 4–8 bullets, each one a claim grounded in the
+   numbers:
+   - "Coding is effectively a tie at 58–59% on SWE-Bench Pro"
+   - "K2.6 is the most battle-tested for long-horizon work (13-hour
+     exchange-core rewrite, 5-day autonomous ops agent)"
+   - "M3's pitch is cheap 1M context, GLM 5.1's is breadth, K2.6's is
+     tool-use + reasoning"
+4. **Cost reality check** — a per-task worked example. Standard
+   workload: 500K input tokens + 100K output tokens. Compute
+   `(0.5 × input_$/M) + (0.1 × output_$/M)` per model. Reference Claude
+   Opus-class price as a baseline.
+5. **Caveats** — required. Cover at least:
+   - Vendor-published numbers (most are)
+   - Harness / scaffold differences (Claude Code, Terminus, OpenHands,
+     Mini-SWE-Agent) — same model scores differently across harnesses
+   - Token usage differences (K2.6 uses ~2× K2.5 tokens; M3's 1M context
+     costs more tokens per turn)
+   - Open-weights claim status (M3 weights + technical report promised
+     ~10 days post-launch, not day-one in June 2026)
+   - Independent verification status
+6. **Sources** — list the URLs you actually pulled from, not the
+   search-results page.
+
+### 6. Anti-patterns to avoid
+
+- **Don't average across benchmarks** ("average score 73%") — hides the
+  shape. SWE-Bench Pro 58% and AIME 95% are not commensurable.
+- **Don't quote vendor numbers without a harness caveat.** A 59% on
+  SWE-Bench Pro run with Claude Code scaffolding is not the same as 59%
+  with OpenHands.
+- **Don't pick a "winner" unless the user asks.** The whole point of the
+  comparison is that the right model depends on the workload: long
+  context vs. low cost vs. best tool use vs. strongest reasoning.
+- **Don't fabricate a score** to fill a cell. `n/a` is honest; "60%"
+  invented is not.
+- **Don't bury the cost.** Pricing is often the deciding factor for
+  agentic workloads and usually changes the ranking entirely.
+- **Don't ignore the "what the numbers hide" angle.** A Medium-style
+  "I tested both on 15 real tasks" piece usually finds that the
+  benchmark gap is much smaller than the practical gap (or vice versa).
+  Worth citing at least one such piece per comparison.
+
+### 7. Output format
+
+- Plain markdown, ready to paste.
+- Tables first, prose after.
+- One section per analysis point, no walls of text.
+- Sources at the bottom as a bulleted URL list, not inline links (easier
+  to copy and verify).
+
+## Quick benchmark glossary (2026)
+
+- **SWE-Bench Pro** — Real GitHub issues, harder subset of SWE-Bench.
+  Industry-standard coding test in 2026. Currently ~58% is the open-weight
+  SOTA bar.
+- **SWE-Bench Verified** — Human-verified easier subset. Top closed
+  models hit 80%+.
+- **Terminal-Bench 2.0 / 2.1** — Real command-line agent tasks.
+- **BrowseComp** — Autonomous web browsing + information retrieval.
+- **HLE (Humanity's Last Exam)** — Hardest reasoning benchmark; usually
+  reported with and without tools.
+- **τ²-Bench / τ²-Bench Telecom** — Tool-use in agentic loop, telecom
+  domain.
+- **MCP-Atlas** — Tool use over Model Context Protocol.
+- **GPQA-Diamond** — Graduate-level science Q&A.
+- **AIME** — Math competition problems.
+- **AA Intelligence Index** — Composite index from Artificial Analysis
+  combining many of the above; 54–57 is the current frontier band.
+- **AA-Omniscience** — Hallucination + abstention metric; lower
+  hallucination rate is better.
+- **GDPval-AA Elo** — General agentic performance on knowledge-work
+  tasks.
+
+## Example output shape
+
+A minimal good response is two tables + four short sections. A maximal
+good response is two tables + five sections + sources. Anything longer
+than that is padding.
+
+## References
+
+- `https://html.duckduckgo.com/html/?q=...` — primary search endpoint
+- `https://search.brave.com/search?q=...` — fallback search
+- `https://artificialanalysis.ai/articles/...` — independent evals
+- `https://huggingface.co/<org>/<model>` — model cards
+- `https://openrouter.ai/<provider>/<model>/benchmarks` — pricing +
+  benchmarks in one place
+- `https://llm-stats.com/home/models/<model>` — pricing + benchmark
+  snapshot
author	Paul Buetow <paul@buetow.org>	2026-06-03 23:26:23 +0300
committer	Paul Buetow <paul@buetow.org>	2026-06-03 23:26:23 +0300
commit	afd73bb974b321fea036d0b785a46168f79dc658 (patch)
tree	a98729f87a7cca85429ec7de065d43d9f63c595b
parent	7f3b3d58527db7a5e6a167555e1526d6836d7fe4 (diff)