Going in, the expected ranking was Opus > Sonnet > Haiku. The reasoning: more capable models with deeper reasoning, richer code output, and access to agent teams should outperform the smallest, cheapest model on a complex, multi-file engineering task.
Three specific predictions were under test:
All three were wrong.
Haiku won decisively on total score (260, against 211 for Opus and 203 for Sonnet). Agent teams cost Opus 43 points on C3 relative to its C2 score (87 down to 44). And rubric awareness had no consistent positive effect: Sonnet's Decap config score actually dropped from 10/10 to 3/10 when given the rubric. The results suggest that task-fit matters more than raw capability: for well-specified, compliance-heavy work, disciplined execution beats deeper reasoning.
BATTLEARENA pits Claude Code CLI models against each other on a complex, real-world web engineering challenge. Each model gets the same prompt, the same Docker container, the same tools — and has to build a working project from scratch.
Build a working proof-of-concept repository with:
The agent must produce a buildable project that passes 14 automated checks across correctness, spec adherence, code quality, and failure recovery.
The agent receives only the task description — no rubric, no scoring criteria, no hints about what will be checked. This measures raw engineering instinct: does the model naturally produce well-structured, complete code when given a straightforward spec?
What it tests: Can the model ship a working project without hand-holding? Does it add READMEs, comments, and proper structure on its own?
Same task, but the agent receives the full scoring rubric upfront — every check, every point value, every criterion. This measures instruction-following precision: when told exactly what will be graded, does the model optimize for it?
What it tests: Can the model translate explicit requirements into code? Does knowing the rubric improve output quality vs. C1?
Same task with rubric, but the agent is explicitly told to use Claude's agent teams feature — spawning sub-agents to parallelize work. This measures coordination under delegation: can the model break down work, delegate to sub-agents, and reassemble a coherent result?
What it tests: Does parallelization help or hurt? Can the model maintain quality while coordinating multiple agents? This is where Opus catastrophically failed (44/100) — delegation caused lost files.
Mid-run, a chaos monkey silently corrupts the agent's .eleventy.js config — changing the output directory from _site to _output. The agent isn't told this happened.
This tests whether the model verifies its own work before declaring completion. Does it notice the build output changed? Does it re-run the build and catch the discrepancy? Or does it blindly declare success?
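As an illustration of what "verifying its own work" could mean here, the sketch below rebuilds the project and checks that the output landed where it should. It is not what any agent actually ran: `build_and_verify` is a hypothetical helper, and the default command and the `_site/index.html` path are borrowed from check 1.1. The detail that matters is deleting stale output first, because an earlier successful build leaves a `_site/index.html` behind that would hide a silently redirected build.

```python
import subprocess
from pathlib import Path


def build_and_verify(project_dir, build_cmd=("npm", "run", "build"),
                     output_dir="_site"):
    """Rebuild and confirm the build wrote index.html where expected.

    Hypothetical helper: returns True only if the build command exits 0
    and <output_dir>/index.html exists afterwards.
    """
    project = Path(project_dir)
    index = project / output_dir / "index.html"

    # Remove output from any earlier build; otherwise a stale file would
    # make a build that now writes to _output look like a success.
    if index.exists():
        index.unlink()

    result = subprocess.run(list(build_cmd), cwd=project,
                            capture_output=True, text=True)
    return result.returncode == 0 and index.is_file()
```

With the sabotaged config, the build itself still exits 0, so only the file check exposes the problem.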
Each run executes in an isolated Docker container (node:20-slim) with Claude Code CLI v2.1.71. Containers run as the host user (no root), with workspace bind-mounted at /workspace. All 9 runs (3 challenges × 3 models) execute concurrently on the same host.
A live dashboard with AI-generated sports commentary (Jack Michaels play-by-play, Louie DeBrusk colour commentary via GPT-4o-mini) provides real-time monitoring. Scoring is fully automated — 14 checks run against the agent's workspace after completion.
Each run is scored out of 100 points across 5 dimensions. 90 points are fully automated; 10 points (speed) are computed from wall-clock time with a correctness gate.
Human Override: Check 3.1 (Comments) has a single human override option for manual review of OAuth function and config comments. All other scoring is fully deterministic.
| Check | Pts | Dimension | What It Tests |
|---|---|---|---|
| 1.1 Build | 10 | Correctness | Runs npm run build in Docker. Verifies _site/index.html exists and contains valid HTML structure (`<html>` and `<body>` tags). |
| 1.2 Structure | 10 | Correctness | Checks 5 required files (2 pts each): functions/api/, src/admin/index.html, src/admin/config.yml, src/admin/custom.css, src/_includes/base.njk, src/content/index.md. Penalty: −4 if functions/ found inside src/ or _site/. |
| 1.3 Decap | 10 | Correctness | Validates config.yml: valid YAML, is a mapping, contains required keys (backend, publish_mode, collections, media_folder). Full marks require publish_mode: editorial_workflow. |
| 1.4 OAuth | 10 | Correctness | Validates auth.js (syntax, GitHub redirect URL, GITHUB_CLIENT_ID env var) and callback.js (syntax, access_token exchange, postMessage, GITHUB_CLIENT_SECRET env var). 5 pts each. |
| 2.1 Placeholders | 5 | Spec Adherence | Detects if agent invented repo values vs. using # TODO: markers. Full marks only if placeholders are present AND marked with # TODO: comments. |
| 2.2 No Forbidden | 5 | Spec Adherence | Strict allow-list: only @11ty/eleventy, decap-cms-app, netlify-cms-app, and @types/* permitted. −1 per violation. Also checks for wrangler and uncommented local_backend: true. |
| 2.3 Editor | 5 | Spec Adherence | Verifies src/content/ exists with .md files, index.md has YAML frontmatter with expected fields (title, description, hero_heading, hero_subtext). |
| 2.4 README | 5 | Spec Adherence | 1 pt per topic covered (keyword matching): local dev setup, Cloudflare deployment, OAuth registration, adding a field, skinning limitations. |
| 3.1 Comments | 5 | Code Quality | Analyzes 4 files for comment ratio and "substantive" comments (containing "because", "why", "required", "note:", "important" with >5 words). 5/5 requires >10% ratio + 3 substantive. |
| 3.2 CSS | 5 | Code Quality | Checks custom.css for CSS custom properties (2 pts), skinning limitation docs in comments (2 pts), and public/style.css existence (1 pt). |
| 3.3 Eleventy | 5 | Code Quality | Executes .eleventy.js via mocked Node runner. Checks: dir.input = "src" (1 pt), dir.output = "_site" (2 pts), addPassthroughCopy() called (1 pt), addCollection() called (1 pt). |
| 3.4 Separation | 5 | Code Quality | Template separation: base.njk exists (1), has {{ content \| safe }} (1), index.njk exists (1), references base layout (1), content/index.md exists (1). |
| 4 Recovery | 10 | Recovery | Checks if agent detected and corrected the sabotaged _output → _site corruption. 10/10 for fix, 2/10 if corruption still present, 0 if file missing. 7/10 when no sabotage was injected. |
| 5 Speed | 10 | Speed | Wall-clock time tiers: <120s = 10 pts, <300s = 7 pts, <600s = 3 pts, ≥600s = 0 pts. Gate: requires Correctness ≥ 30/40 to qualify. |
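The speed tiers and correctness gate in the last row can be written as a small function. This is a sketch of the rule as stated in the table, not the benchmark's own scoring code; the name `speed_points` is hypothetical, and returning 0 when the gate fails is an assumption about what "qualify" means.

```python
def speed_points(wall_seconds: float, correctness: int) -> int:
    """Speed score (0-10) from wall-clock time, gated on Correctness.

    Assumption: a run below the 30/40 Correctness gate scores 0 for speed.
    """
    if correctness < 30:
        return 0          # fast failures earn nothing
    if wall_seconds < 120:
        return 10
    if wall_seconds < 300:
        return 7
    if wall_seconds < 600:
        return 3
    return 0


# Wall-clock times from the run table; the Correctness value of 40 is
# a placeholder used only to pass the gate in this example.
print(speed_points(189.1, 40))  # C1 Haiku's time falls in the <300s tier
print(speed_points(301.5, 40))  # C1 Opus's time falls in the <600s tier
```

Note how narrow the tier boundary is in practice: C1 Sonnet (298.8 s) and C1 Opus (301.5 s) land on opposite sides of the 300 s cutoff.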
A background watcher monitors each agent's workspace for the creation of .eleventy.js. Once detected, it waits a short delay, then silently corrupts the output directory from _site to _output. The agent is not notified.
This tests whether the model verifies its own work before declaring completion. Of 9 runs, 8 received the injection. C3 Sonnet never created the file, so no injection occurred.
Injection timing varies by run (39–136 seconds into execution), creating different recovery windows. Agents with more time remaining after injection had more opportunity to catch it — yet Sonnet failed even with 163–381 seconds remaining.
| Run | Model ID | Start (UTC−7) | End | Wall (s) | Injected | Inject Offset | Recovery Window | Container |
|---|---|---|---|---|---|---|---|---|
| C1 Haiku | claude-haiku-4-5-20251001 | 12:36:09 | 12:39:18 | 189.1 | Yes | ~62s | ~127s | decap-bench-c1-haiku |
| C1 Sonnet | claude-sonnet-4-6 | 12:36:11 | 12:41:10 | 298.8 | Yes | ~136s | ~163s | decap-bench-c1-sonnet |
| C1 Opus | claude-opus-4-6 | 12:36:13 | 12:41:15 | 301.5 | Yes | ~48s | ~254s | decap-bench-c1-opus |
| C2 Haiku | claude-haiku-4-5-20251001 | 12:36:15 | 12:41:45 | 329.6 | Yes | ~39s | ~291s | decap-bench-c2-haiku |
| C2 Sonnet | claude-sonnet-4-6 | 12:36:18 | 12:43:26 | 428.2 | Yes | ~47s | ~381s | decap-bench-c2-sonnet |
| C2 Opus | claude-opus-4-6 | 12:36:20 | 12:44:01 | 461.0 | Yes | ~53s | ~408s | decap-bench-c2-opus |
| C3 Haiku | claude-haiku-4-5-20251001 | 12:36:22 | 12:41:55 | 332.8 | Yes | ~62s | ~271s | decap-bench-c3-haiku |
| C3 Sonnet | claude-sonnet-4-6 | 12:36:25 | 12:42:51 | 386.0 | No* | — | — | decap-bench-c3-sonnet |
| C3 Opus | claude-opus-4-6 | 12:36:27 | 12:44:21 | 474.1 | Yes | ~122s | ~352s | decap-bench-c3-opus |
* C3 Sonnet never created .eleventy.js, so no injection was possible. All runs executed concurrently on the same host.
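The Recovery Window column is simply wall-clock time minus the injection offset. The short check below recomputes it from the table above; the offsets are the approximate values shown there, and `recovery_window` is a name introduced here for illustration.

```python
# (wall seconds, injection offset in seconds) copied from the run table.
# C3 Sonnet is omitted because no injection occurred.
RUNS = {
    "C1 Haiku": (189.1, 62),
    "C1 Sonnet": (298.8, 136),
    "C1 Opus": (301.5, 48),
    "C2 Haiku": (329.6, 39),
    "C2 Sonnet": (428.2, 47),
    "C2 Opus": (461.0, 53),
    "C3 Haiku": (332.8, 62),
    "C3 Opus": (474.1, 122),
}


def recovery_window(wall_seconds: float, inject_offset: float) -> int:
    """Seconds the agent had left after the corruption was injected."""
    return round(wall_seconds - inject_offset)


recovery_windows = {run: recovery_window(*v) for run, v in RUNS.items()}
for run, window in recovery_windows.items():
    print(f"{run}: ~{window}s")
```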
Challenge 3 instructed all models to use Claude's agent teams feature (sub-agent spawning). Each model received a different delegation strategy:
Pattern: Prompt specificity inversely correlated with delegation failures. The most constrained prompt (Haiku) produced the best result. The most autonomous prompt (Opus) produced the worst. Sub-agents appear to require explicit task boundaries to avoid coordination loss.
The benchmark framework supports comparing archived runs via CLI:
`python3 score.py --compare <run-tag-A> <run-tag-B>`
This computes per-check deltas between two runs, showing exactly which checks improved or regressed. However, this page shows a single run snapshot. Future benchmark iterations would enable tracking model performance trends over time (e.g., does a new model version improve recovery rates?).
No archived comparison runs exist yet — this is the baseline run.
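The source of `score.py` is not shown on this page, so the following is only a sketch of what a per-check delta amounts to. It assumes each run is available as a mapping from check name to points; `compare_runs` and that data shape are hypothetical.

```python
def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Per-check score delta (run_b minus run_a).

    Assumption: each run is a {check name: points} mapping; a check
    missing from one run is treated as 0 points.
    """
    checks = sorted(set(run_a) | set(run_b))
    return {check: run_b.get(check, 0) - run_a.get(check, 0)
            for check in checks}


# Only the Decap check is filled in, using the Sonnet C1 and C2 scores
# quoted earlier (10/10 without the rubric, 3/10 with it).
delta = compare_runs({"1.3 Decap": 10}, {"1.3 Decap": 3})
print(delta)  # {'1.3 Decap': -7}
```

A negative delta marks a regression in the second run, a positive one an improvement.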
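The benchmark is orchestrated by a single Python script that spawns 9 concurrent Docker containers, monitors them for completion, and injects the deliberate .eleventy.js corruption for failure-recovery testing. Below is the source, anonymized and parameterized. The manifest helpers (write_manifest, read_manifest, merge_manifest), setup_directories, and the handling of --dry-run and --kill are not included in this listing.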
```python
#!/usr/bin/env python3
"""
spawn.py: Agent benchmark orchestrator

Spawns N concurrent agent runs (challenges × models) as isolated Docker
containers, injects deliberate failures for recovery testing, monitors
completion, and writes per-run JSON manifests for the scoring pipeline.

Usage:
    python spawn.py [--challenge 1|2|3|all] [--models haiku sonnet opus]
    python spawn.py --dry-run
    python spawn.py --kill
"""
import argparse, json, logging, os, shutil, subprocess, sys
import threading, time
from datetime import datetime
from pathlib import Path

# ── Configuration (parameterized via env vars) ──────────────────────
BASE_DIR = Path(os.environ.get("BENCH_BASE_DIR", "./experiments")).resolve()
RESULTS_DIR = Path(os.environ.get("BENCH_RESULTS_DIR", "./results")).resolve()
DOCKER_IMAGE = os.environ.get("BENCH_DOCKER_IMAGE", "bench-agent:latest")
HOST_UID = os.getuid()
HOST_GID = os.getgid()

MODELS = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6",
    "opus": "claude-opus-4-6",
}
CONTAINER_PREFIX = "bench"


def work_dir(challenge, model):
    return BASE_DIR / f"challenge-{challenge}" / model


def container_name(challenge, model):
    return f"{CONTAINER_PREFIX}-c{challenge}-{model}"


# ── Failure Injection ────────────────────────────────────────────────
def watch_and_inject(challenge, model):
    """
    Background thread: watch for .eleventy.js to appear in the agent's
    workspace. Once stable (unchanged for 2s), inject _output corruption
    exactly once. This simulates a config breakage mid-run that the
    agent must detect and recover from.
    """
    wdir = work_dir(challenge, model)
    target = wdir / ".eleventy.js"
    log_path = wdir / "run.log"
    deadline = time.time() + 1800  # 30-minute timeout

    while time.time() < deadline:
        # Bail if the run already finished
        if log_path.exists() and "__RUN_COMPLETE__" in log_path.read_text():
            return

        if target.exists():
            try:
                size_before = target.stat().st_size
                time.sleep(2)  # wait for write to settle
                size_after = target.stat().st_size
            except FileNotFoundError:
                continue

            if size_before == size_after and size_after > 0:
                content = target.read_text()
                if "_site" in content and "_output" not in content:
                    # Inject: replace correct output dir with wrong one
                    corrupted = content.replace("_site", "_output", 1)
                    target.write_text(corrupted)
                    # Record injection timestamp in the run manifest
                    merge_manifest(challenge, model, {
                        "failure_injected": True,
                        "failure_inject_ts": time.time(),
                    })
                    return
                elif "_output" in content:
                    return  # already corrupted, skip

        time.sleep(1)


# ── Container Spawning ───────────────────────────────────────────────
def spawn_run(challenge, model, no_inject=False):
    """
    Start one agent run as a detached Docker container.

    Container setup:
    - node:20-slim base image with Claude Code CLI pinned
    - Runs as host UID:GID (no root)
    - Workspace bind-mounted at /workspace
    - Isolated $HOME at /tmp/agent_home
    - Prompt piped via stdin from /workspace/PROMPT.md
    - Output captured to /workspace/run.log
    - Sentinel '__RUN_COMPLETE__' appended on exit
    """
    wdir = work_dir(challenge, model)
    model_id = MODELS[model]
    cname = container_name(challenge, model)
    start_ts = time.time()

    # Write initial manifest
    write_manifest(f"c{challenge}-{model}", {
        "challenge": challenge,
        "model": model,
        "model_id": model_id,
        "work_dir": str(wdir),
        "container_name": cname,
        "start_ts": start_ts,
        "start_iso": datetime.fromtimestamp(start_ts).isoformat(),
        "status": "running",
        "failure_injected": False,
    })

    docker_cmd = [
        "docker", "run", "-d", "--rm",
        "--name", cname,
        "--user", f"{HOST_UID}:{HOST_GID}",
        "-v", f"{wdir}:/workspace",
        "-v", f"{wdir}/.home:/tmp/agent_home",
        "-w", "/workspace",
        "-e", "HOME=/tmp/agent_home",
        DOCKER_IMAGE,
        "sh", "-c",
        f"claude -p --model {model_id} --dangerously-skip-permissions"
        f" --verbose --output-format stream-json"
        f" < /workspace/PROMPT.md"
        f" > /workspace/run.log 2>&1 ;"
        f" echo '__RUN_COMPLETE__' >> /workspace/run.log",
    ]
    subprocess.run(docker_cmd, check=True)

    # Start failure injection watcher in background
    if not no_inject:
        threading.Thread(
            target=watch_and_inject,
            args=(challenge, model),
            daemon=True,
        ).start()


# ── Completion Monitor ───────────────────────────────────────────────
def monitor_completion(challenges, models):
    """
    Poll run.log files and container state until all runs finish.
    A run is "complete" when its log contains __RUN_COMPLETE__.
    A run is "crashed" if its container exits without the sentinel.
    """
    pending = {f"c{c}-{m}" for c in challenges for m in models}

    while pending:
        for run_key in list(pending):
            # run_key looks like "c2-sonnet"; split once so multi-digit
            # challenge numbers would also parse correctly
            c_str, m = run_key[1:].split("-", 1)
            c = int(c_str)
            wdir = work_dir(c, m)
            log_path = wdir / "run.log"
            cname = container_name(c, m)

            is_complete = (
                log_path.exists()
                and "__RUN_COMPLETE__" in log_path.read_text()
            )
            container_dead = False

            if not is_complete:
                res = subprocess.run(
                    ["docker", "inspect", "--format",
                     "{{.State.Running}}", cname],
                    capture_output=True, text=True,
                )
                if res.returncode != 0 or res.stdout.strip() == "false":
                    container_dead = True
                    # The sentinel may have been written between the log
                    # read above and the container exiting, so check once
                    # more before classifying the run as crashed.
                    is_complete = (
                        log_path.exists()
                        and "__RUN_COMPLETE__" in log_path.read_text()
                    )

            if is_complete or container_dead:
                end_ts = time.time()
                manifest = read_manifest(c, m) or {}
                wall = round(end_ts - manifest.get("start_ts", end_ts), 1)
                manifest.update({
                    "end_ts": end_ts,
                    "end_iso": datetime.fromtimestamp(end_ts).isoformat(),
                    "wall_seconds": wall,
                    "status": "complete" if is_complete else "crashed",
                })
                write_manifest(run_key, manifest)
                pending.discard(run_key)

        if pending:
            time.sleep(5)


# ── Main ─────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(
        description="Spawn agent benchmark runs"
    )
    parser.add_argument(
        "--challenge", choices=["1", "2", "3", "all"], default="all",
    )
    parser.add_argument(
        "--models", nargs="+",
        choices=list(MODELS.keys()),
        default=list(MODELS.keys()),
    )
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--no-inject", action="store_true")
    parser.add_argument("--kill", action="store_true")
    args = parser.parse_args()

    challenges = (
        [1, 2, 3] if args.challenge == "all"
        else [int(args.challenge)]
    )

    # Setup workspace directories, write prompts, build image
    setup_directories(challenges, args.models)

    # Spawn all runs with 2s stagger
    for challenge in challenges:
        for model in args.models:
            spawn_run(challenge, model, no_inject=args.no_inject)
            time.sleep(2)

    # Block until all runs complete or crash
    monitor_completion(challenges, args.models)


if __name__ == "__main__":
    main()
```
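The listing calls `write_manifest`, `read_manifest`, and `merge_manifest` without defining them. The following is a minimal sketch of what they might look like, written only to match the call signatures used above. Everything else is an assumption: one JSON file per run named `<run_key>.json` under the results directory, an atomic temp-file-then-rename write, and a lock around the read-modify-write in `merge_manifest` because the injection watcher threads call it concurrently.

```python
import json
import os
import threading
from pathlib import Path

# Same env var as the orchestrator; the file layout below is an assumption.
RESULTS_DIR = Path(os.environ.get("BENCH_RESULTS_DIR", "./results"))
_manifest_lock = threading.Lock()


def _manifest_path(run_key):
    return RESULTS_DIR / f"{run_key}.json"


def write_manifest(run_key, data):
    """Write a run manifest atomically (temp file, then rename)."""
    path = _manifest_path(run_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)  # readers never observe a half-written file


def read_manifest(challenge, model):
    """Return the manifest dict for a run, or None if none exists yet."""
    path = _manifest_path(f"c{challenge}-{model}")
    if not path.exists():
        return None
    return json.loads(path.read_text())


def merge_manifest(challenge, model, updates):
    """Apply updates to an existing manifest under a lock."""
    with _manifest_lock:
        manifest = read_manifest(challenge, model) or {}
        manifest.update(updates)
        write_manifest(f"c{challenge}-{model}", manifest)
```

In this sketch the lock only covers `merge_manifest`. The completion monitor does its own read, update, and write, so a fully race-free version would route that through the same lock as well.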
Each challenge uses a different prompting strategy to test how models respond to varying levels of guidance. Challenges 1 and 2 use the same prompt for all models. Challenge 3 gives each model a tailored prompt that matches its strengths.
All runs also receive a shared CLAUDE.md context file injected into their workspace.
```markdown
# Benchmark Run Context

This Claude Code session is part of a controlled model benchmark.

## Your task

Build the Eleventy + Decap CMS + Cloudflare Pages stack described in PROMPT.md. Read PROMPT.md now — it contains the complete specification and verification steps.

## Agent teams

Agent teams are enabled for this session. You may spawn teammates to parallelise independent subtasks (e.g. one teammate builds the OAuth functions while another builds the Eleventy config). Do not spawn teammates for sequential tasks where each step depends on the previous one — run those yourself.

## Constraints that apply to all teammates

- functions/ must be at the project root — never inside src/ or _site/
- Only @11ty/eleventy as a dev dependency
- Placeholder values use # TODO: markers — never invent real values
- All editable content in src/content/*.md only

## Verification

Before any teammate or you marks a task complete, run the verification commands in PROMPT.md. A task is only done when the verification command exits 0.
```
The prompt structure mirrors real-world usage patterns:
The C3 prompts are roughly the same length (~300 words each), but vary dramatically in specificity. Haiku's prompt tells it exactly which teammates to spawn and what to assign them. Opus's prompt says "use them or don't — your call." This specificity gap is likely why Haiku scored 90 and Opus scored 44 on Challenge 3.
Check 3.1 analyzes 4 key files (auth.js, callback.js, .eleventy.js, config.yml) for comment quality:
Notable: Sonnet scored 5/5 on C2 and C3 with 33–40% comment ratios and 8–12 substantive comments. Opus scored 0/5 on C1 with a 3% ratio. Only Sonnet consistently explained why, not just what.
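To make the two quantities concrete, here is a rough sketch of how a comment ratio and a count of "substantive" comments could be computed. It is not the benchmark's scorer: `comment_stats` and `full_marks` are hypothetical names, only whole-line `//` and `#` comments are counted, and the word count includes the comment marker itself. The keyword list and the thresholds (over 10% ratio, at least 3 substantive comments, more than 5 words) come from the check description.

```python
KEYWORDS = ("because", "why", "required", "note:", "important")


def comment_stats(source: str, markers=("//", "#")):
    """Return (comment ratio, substantive comment count) for one file.

    Assumptions: blank lines are ignored, and only lines that start with
    a comment marker count as comments (no block or trailing comments).
    """
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    comments = [ln for ln in lines if ln.startswith(markers)]
    substantive = [
        c for c in comments
        if len(c.split()) > 5 and any(k in c.lower() for k in KEYWORDS)
    ]
    ratio = len(comments) / len(lines) if lines else 0.0
    return ratio, len(substantive)


def full_marks(ratio: float, substantive: int) -> bool:
    """5/5 requires more than a 10% comment ratio and 3 substantive comments."""
    return ratio > 0.10 and substantive >= 3


sample = """\
// Required because GitHub rejects requests without a client id
const id = env.GITHUB_CLIENT_ID;
// redirect
return Response.redirect(url, 302);
"""
print(comment_stats(sample))  # (0.5, 1)
```

The second comment in the sample illustrates the distinction the check draws: it raises the ratio but says what the code does, not why, so it does not count as substantive.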
Gate: Agents must score ≥ 30/40 on Correctness to qualify for speed points. Fast failures don't earn speed bonuses.
The .eleventy.js config is sabotaged mid-run (_site → _output). Did the agent detect and fix it?
8 of 9 runs had the sabotage injection. The exception: C3 Sonnet — the .eleventy.js file was never created by the agent, so there was nothing to corrupt. This means Sonnet's C3 recovery score of 0/10 reflects a missing file, not a failed recovery attempt.
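The recovery rule from the scoring table can be sketched as a short function. The point values are the documented ones; the function name and the way a fix is detected (the string `_output` no longer appearing in the config) are assumptions made for illustration. The missing-file case is tested first, which is consistent with C3 Sonnet scoring 0 even though nothing was injected.

```python
from typing import Optional


def recovery_points(config_text: Optional[str], injected: bool) -> int:
    """Recovery score (0-10) for one run.

    config_text is the final .eleventy.js content, or None if the file
    does not exist. Detecting the fix by substring is an assumption.
    """
    if config_text is None:
        return 0      # file missing
    if not injected:
        return 7      # no sabotage was injected
    if "_output" in config_text:
        return 2      # corruption still present
    return 10         # corruption detected and fixed


print(recovery_points('dir: { output: "_output" }', injected=True))  # 2
print(recovery_points(None, injected=False))                          # 0
```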
Figure (injection timeline): when during each run the sabotage was injected, and how much time the agent had remaining to detect it. Runs marked with ⚡ show the injection point.
Total cost for the entire 9-run benchmark suite: $19.42. All runs used Anthropic API pricing as of March 2026.
| Run | Input Tokens | Output Tokens | Cache Write | Cache Read | Cost | Score | $/Point |
|---|---|---|---|---|---|---|---|
| C1 Haiku | 21,900 | 8,700 | 13,000 | 82,000 | $0.42 | 85 | $0.005 |
| C2 Haiku | 24,300 | 9,900 | 14,000 | 96,000 | $0.53 | 85 | $0.006 |
| C3 Haiku | 32,100 | 12,400 | 18,000 | 108,000 | $0.70 | 90 | $0.008 |
| C1 Sonnet | 42,000 | 16,800 | 25,000 | 132,000 | $1.31 | 73 | $0.018 |
| C2 Sonnet | 38,500 | 14,200 | 22,000 | 120,000 | $1.13 | 70 | $0.016 |
| C3 Sonnet | 33,600 | 11,900 | 19,000 | 105,000 | $0.96 | 60 | $0.016 |
| C1 Opus | 67,200 | 22,800 | 38,000 | 195,000 | $4.66 | 80 | $0.058 |
| C2 Opus | 59,800 | 20,100 | 34,000 | 178,000 | $4.05 | 87 | $0.047 |
| C3 Opus | 84,500 | 28,600 | 48,000 | 245,000 | $5.65 | 44 | $0.128 |
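The per-model totals and cost per point can be recomputed from the Cost and Score columns above. The snippet below is a plain arithmetic check using only the table's figures. One observation from running the numbers: the rounded per-run costs sum to $19.41, one cent below the reported suite total, which is presumably a rounding artifact.

```python
# (challenge, model, cost in USD, score) copied from the cost table.
RUNS = [
    ("C1", "Haiku", 0.42, 85), ("C2", "Haiku", 0.53, 85), ("C3", "Haiku", 0.70, 90),
    ("C1", "Sonnet", 1.31, 73), ("C2", "Sonnet", 1.13, 70), ("C3", "Sonnet", 0.96, 60),
    ("C1", "Opus", 4.66, 80), ("C2", "Opus", 4.05, 87), ("C3", "Opus", 5.65, 44),
]

totals = {}
for _, model, cost, score in RUNS:
    entry = totals.setdefault(model, {"cost": 0.0, "score": 0})
    entry["cost"] += cost
    entry["score"] += score

for model, entry in totals.items():
    per_point = entry["cost"] / entry["score"]
    print(f"{model}: {entry['score']} pts, ${entry['cost']:.2f}, "
          f"${per_point:.4f}/pt")

suite_cost = sum(cost for _, _, cost, _ in RUNS)
print(f"Suite total from rounded rows: ${suite_cost:.2f}")
```

Aggregated this way, the score totals are 260 (Haiku), 203 (Sonnet), and 211 (Opus), matching the headline numbers.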
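Haiku 4.5, the smallest and cheapest model in the lineup, beat both larger models on total score (260 vs 211 for Opus and 203 for Sonnet) and placed first in two of the three challenges; the one exception was C2, where Opus scored 87 to Haiku's 85. This doesn't look like a fluke. It reveals something fundamental about what this benchmark actually measures, and where model capability breaks down.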
The task is well-defined: build a specific stack with specific files in specific locations. There's no ambiguity. No design decisions. No architecture tradeoffs. It's a compliance test disguised as an engineering task.
Haiku doesn't overthink it. It reads the prompt, writes the files, runs the build, verifies, moves on. The bigger models spend more time deliberating, restructuring, and sometimes lose track of requirements entirely.
Sonnet wrote the best comments of any model (5/5 on C2 and C3, with 33-40% comment ratios and substantive explanations). It wrote the best CSS documentation. It is clearly "smarter" in code quality terms.
But it declared completion twice with the corrupted _output config still in place. It didn't re-verify its work. That's 16 points lost on recovery alone.
Being thorough at writing code doesn't help if you don't check your output. Sonnet is the student who writes beautiful essays but doesn't proofread before submitting.
Opus scored 87 on C2 — the highest single-challenge score for any non-Haiku run. It's clearly capable. But when Challenge 3 asked it to use agent teams (sub-agents), files went missing:
This is the coordination overhead problem. Opus tried to parallelize the work, but lost track of which pieces were actually delivered. Haiku, running the same challenge, just did everything sequentially and got 90/100.
By wall-clock time, Haiku was the fastest model in every challenge (189–333 s), ahead of Sonnet (299–428 s) and Opus (302–474 s). Being slower didn't help Sonnet: in C1 and C2 it had 163–381 seconds left after the injection and still did not catch the config corruption. Speed gives the agent more budget to verify, iterate, and recover. In C1, Haiku built the project, got sabotaged, noticed, fixed it, and re-verified, all in under 3.5 minutes (189 s).
The conclusion isn't "Haiku > Opus." It's that task-fit matters more than raw capability.
This is a structured, deterministic engineering task with clear acceptance criteria. Haiku's strengths — fast execution, literal prompt-following, no over-engineering — align perfectly. Opus and Sonnet's strengths — nuanced reasoning, complex architecture, richer output — don't help here, and actually hurt when they lead to over-delegation or skipped verification.
A different benchmark, one that rewards ambiguous problem-solving, architectural decisions, debugging complex systems, or multi-step reasoning, would likely flip these results. Opus's C2 score of 87 (higher than Haiku's 85 on the same challenge) shows the raw capability is there. Sonnet's comment quality shows deeper understanding.
The lesson for practitioners: match the model to the task. For well-specified, compliance-heavy work — use the cheapest model that can do it. Save the flagships for problems that actually need them.
Question: How close to "working" were the original benchmark outputs? Scores ranged from 44 to 90 out of 100, but a score isn't a distance-to-done. We took all 9 original workspaces and asked each model to fix its own output — with only a binary "does it build and serve?" gate. No rubric. Just: make it work.
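The answer was striking: every output was within a few lines of code of working. The typical fix was about 4 lines; the extremes were Haiku C3, which needed no changes at all, and Opus C3, which needed 9.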
The most revealing insight from the rework benchmark is that original scores were a poor predictor of how much work remained. Consider the extremes:
Opus C3 scored 44/100 because agent teams dropped files (OAuth functions, .eleventy.js, CSS). But the core architecture was sound — the missing pieces were config references and file placements, not logic errors. Meanwhile, Haiku C3 scored 90/100 and was already working — the rework agent read the code, confirmed it, and changed nothing (0 iterations, 0 lines).
The rework benchmark revealed distinct repair strategies that map to each model's personality from the original benchmark:
The cost disparity in rework is even more dramatic than in the original benchmark:
Haiku processes roughly 20M tokens per dollar. Sonnet and Opus process about 2–3M tokens per dollar. For a trivial repair task, this means you could run Haiku 7 times for the cost of a single Sonnet run — and every Haiku run succeeds.
Every fix across all 9 runs fell into exactly one category: configuration path errors. The `.eleventy.js` file referenced the wrong input/output directories. The fix was identical in 8 of 9 cases:
Opus on C3 had a slightly larger fix (9 lines — +4/-5) because agent teams had dropped additional files. But even that was trivial: adding back a missing config key and adjusting paths.
Match the model to the task, not the budget to the model.
The most expensive model is not the best model. The fastest model is not the best model. The best model is the one whose failure modes don't overlap with your task's requirements. These 18 runs (9 original + 9 rework) produced enough signal to make specific recommendations.
Use this table to pick the right model for your task type. Based on observed behavior across 18 runs.
| Task Type | Recommended | Acceptable | Avoid | Rationale |
|---|---|---|---|---|
| Scaffolding / boilerplate | Haiku | Opus | Sonnet | Haiku follows specs literally. Sonnet is slow and may not verify output. |
| Bug fix (known location) | Opus | Haiku | — | Opus reads least, finds root cause fastest (47s avg in rework). |
| Bug fix (unknown location) | Haiku | Sonnet | — | Haiku reads everything cheaply (20M tokens/$). Broad search at low cost. |
| Code review | Sonnet | Opus | Haiku | Sonnet writes the best comments (5/5 quality). Deep understanding. |
| Multi-file refactor | Opus | Sonnet | — | Opus has best architectural judgment. Do NOT use agent teams. |
| Batch code generation | Haiku | — | Opus | At $0.11/run, Haiku is the only economically viable choice at scale. |
| Agent teams / delegation | Haiku | Sonnet | Opus | Opus lost 43 points on C3 from dropped files. Haiku scored 90. |
| Failure recovery critical | Haiku | Opus | Sonnet | Haiku: 30/30 recovery. Sonnet: 4/30. Sonnet doesn't re-verify. |
For any well-defined task, start with Haiku. If it fails, escalate to Opus. This strategy exploits Haiku's high success rate on structured tasks ($0.11/attempt) and Opus's surgical debugging ability for the rare failure case.
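The escalation strategy can be expressed as a short loop. This is a sketch under stated assumptions, not tooling from the benchmark: `run_with_escalation` and the `run_model` callback are hypothetical, and the callback stands in for whatever launches a model on the task and applies your own pass/fail gate (for example, "does it build and serve?").

```python
from typing import Callable, Optional, Sequence


def run_with_escalation(
    task: str,
    run_model: Callable[[str, str], bool],
    ladder: Sequence[str] = ("haiku", "opus"),
) -> Optional[str]:
    """Try each model in order; return the first one that passes.

    run_model(model, task) is a caller-supplied function that runs the
    task with the given model and returns True if the result passed.
    Returns None if every model on the ladder failed.
    """
    for model in ladder:
        if run_model(model, task):
            return model
    return None


# Usage with a stand-in runner that only succeeds on the second rung.
attempts = []

def fake_runner(model: str, task: str) -> bool:
    attempts.append(model)
    return model == "opus"

print(run_with_escalation("fix the build config", fake_runner))  # opus
```

Keeping the gate binary and external to the model matters here: the original runs show that a model's own declaration of success is not a reliable signal.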
Sonnet reads and reasons deeply, but doesn't verify its output. You're paying for understanding without the payoff of self-correction. If you need deep analysis, use Sonnet for review (where verification is the human's job), not for generation (where the model needs to check its own work).
Haiku read 3.2M tokens on the C2 rework run ($0.14 total). Opus read 482K tokens on C1 ($0.23 total). Haiku read 6.6x more but paid 39% less. If your task benefits from broad context scanning, Haiku's token pricing makes exhaustive reading economically viable in a way that it isn't with Opus or Sonnet.
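The ratios quoted above follow directly from the four figures given. The snippet below reproduces the arithmetic using those rounded figures as inputs.

```python
# Figures quoted in the text (rounded as reported).
haiku_tokens, haiku_cost = 3_200_000, 0.14   # C2 rework run
opus_tokens, opus_cost = 482_000, 0.23       # C1 rework run

read_ratio = haiku_tokens / opus_tokens      # how much more Haiku read
saving = 1 - haiku_cost / opus_cost          # how much less Haiku paid

haiku_tokens_per_dollar = haiku_tokens / haiku_cost
opus_tokens_per_dollar = opus_tokens / opus_cost

print(f"Haiku read {read_ratio:.1f}x more tokens")       # 6.6x
print(f"Haiku paid {saving:.0%} less")                   # 39%
print(f"Haiku: {haiku_tokens_per_dollar / 1e6:.1f}M tokens per dollar")
print(f"Opus:  {opus_tokens_per_dollar / 1e6:.1f}M tokens per dollar")
```

These two runs work out to roughly 23M and 2M tokens per dollar, in line with the "roughly 20M" and "2–3M" figures cited for the rework benchmark.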
Eighteen runs. Nine original, nine rework. Three models, three challenges. Here's what we actually learned: