BATTLEARENA

Agent Championship Results Playground
Run: 2026-03-07T20-58-00
Challenges: 3 × 3 models = 9 runs
Task: Eleventy + Decap CMS + Cloudflare Pages
HAIKU 4.5
260 / 300
Fastest. Most reliable. Perfect failure recovery.
Hypothesis

Going in, the expected ranking was Opus > Sonnet > Haiku. The reasoning: more capable models with deeper reasoning, richer code output, and access to agent teams should outperform the smallest, cheapest model on a complex, multi-file engineering task.

Three specific predictions were under test:

H1: Capability scales with cost
Opus ($15/MTok input) should produce higher-quality output than Haiku ($0.80/MTok) on a complex engineering task.
H2: Agent teams improve output
Delegating subtasks to parallel sub-agents (C3) should yield better results than sequential solo execution.
H3: Rubric awareness helps
Showing the scoring criteria upfront (C2) should improve scores compared to no rubric (C1).

All three were wrong.

Haiku won decisively (260 vs 211 vs 203). Agent teams caused Opus to lose 43 points on C3. And rubric awareness had no consistent positive effect — Sonnet's Decap config score actually dropped from 10/10 to 3/10 when given the rubric. The results suggest that task-fit matters more than raw capability: for well-specified, compliance-heavy work, disciplined execution beats deeper reasoning.

1st Place: Haiku 4.5, 260 points (avg 86.7/100 · 167s)
2nd Place: Opus 4.6, 211 points (avg 70.3/100 · 162s)
3rd Place: Sonnet 4.6, 203 points (avg 67.7/100 · 299s)
[Charts: Total Score by Model; Per-Challenge Scores]
About This Benchmark

BATTLEARENA pits Claude Code CLI models against each other on a complex, real-world web engineering challenge. Each model gets the same prompt, the same Docker container, the same tools — and has to build a working project from scratch.

The Task

Build a working proof-of-concept repository with:

  • Eleventy (11ty) — static site generator with Nunjucks templates
  • Decap CMS — git-backed content editor with GitHub backend
  • Cloudflare Pages — hosting platform with build config
  • Cloudflare Pages Functions — GitHub OAuth proxy (auth.js + callback.js)

The agent must produce a buildable project that passes 14 automated checks across correctness, spec adherence, code quality, and failure recovery.

The Three Challenges
C1
Scaffold (No Rubric)
Baseline

The agent receives only the task description — no rubric, no scoring criteria, no hints about what will be checked. This measures raw engineering instinct: does the model naturally produce well-structured, complete code when given a straightforward spec?

What it tests: Can the model ship a working project without hand-holding? Does it add READMEs, comments, and proper structure on its own?

C2
Rubric-Aware
Guided

Same task, but the agent receives the full scoring rubric upfront — every check, every point value, every criterion. This measures instruction-following precision: when told exactly what will be graded, does the model optimize for it?

What it tests: Can the model translate explicit requirements into code? Does knowing the rubric improve output quality vs. C1?

C3
Architect (Agent Teams)
Advanced

Same task with rubric, but the agent is explicitly told to use Claude's agent teams feature — spawning sub-agents to parallelize work. This measures coordination under delegation: can the model break down work, delegate to sub-agents, and reassemble a coherent result?

What it tests: Does parallelization help or hurt? Can the model maintain quality while coordinating multiple agents? This is where Opus catastrophically failed (44/100) — delegation caused lost files.

The Sabotage Test (Failure Recovery)

Mid-run, a chaos monkey silently corrupts the agent's .eleventy.js config — changing the output directory from _site to _output. The agent isn't told this happened.

This tests whether the model verifies its own work before declaring completion. Does it notice the build output changed? Does it re-run the build and catch the discrepancy? Or does it blindly declare success?

Results: Haiku caught it every time (30/30). Opus caught it on C1 and C2 but lost the file on C3 (20/30). Sonnet missed it on C1 and C2, declaring completion with broken config (4/30). This single test accounts for the largest score variance between models.
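The kind of pre-completion verification this test rewards can be sketched in a few lines. This is a minimal illustration, not the benchmark's own checker: the function name `build_output_ok` and its parameters are hypothetical, and it assumes the build has already been run (for example with `npm run build`) into a clean directory. It mirrors the Build check's criteria: `_site/index.html` exists and contains `<html>` and `<body>` tags.

```python
from pathlib import Path


def build_output_ok(project_root: str, out_dir: str = "_site") -> bool:
    """Return True if <out_dir>/index.html exists and looks like an HTML page.

    Hypothetical helper: only checks for the presence of <html> and <body>
    tags, which is the same loose criterion the Build check describes.
    """
    index = Path(project_root) / out_dir / "index.html"
    if not index.is_file():
        return False
    html = index.read_text(encoding="utf-8", errors="replace").lower()
    return "<html" in html and "<body" in html
```

If the config has been switched to `_output`, a fresh build writes there instead, so the check against `_site` fails and gives the agent a signal to inspect `.eleventy.js` before declaring completion.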
Scoring Dimensions
Correctness (40 pts)
Does it build? Are the files in the right places? Is Decap configured correctly? Do the OAuth functions work?
Spec Adherence (20 pts)
Are placeholder values marked with TODO? No forbidden dependencies? Correct editor constraints? Complete README?
Code Quality (20 pts)
Comment quality, CSS custom properties, Eleventy config completeness, proper content/template separation.
Recovery + Speed (20 pts)
Did the agent detect and fix the sabotaged config? How fast did it complete the challenge?
Infrastructure

Each run executes in an isolated Docker container (node:20-slim) with Claude Code CLI v2.1.71. Containers run as the host user (no root), with workspace bind-mounted at /workspace. All 9 runs (3 challenges × 3 models) execute concurrently on the same host.

A live dashboard with AI-generated sports commentary (Jack Michaels play-by-play, Louie DeBrusk colour commentary via GPT-4o-mini) provides real-time monitoring. Scoring is fully automated — 14 checks run against the agent's workspace after completion.

How Scoring Works

Each run is scored out of 100 points across 5 dimensions. 90 points are fully automated; 10 points (speed) are computed from wall-clock time with a correctness gate.

Correctness — 40 pts (4 checks)
Does the project build? Are files in the right places? Is Decap CMS configured correctly? Do OAuth functions have proper GitHub redirect, token exchange, and env vars?
Spec Adherence — 20 pts (4 checks)
Are placeholder values marked with # TODO:? No forbidden dependencies? Correct editor content structure? README covers all 5 required topics?
Code Quality — 20 pts (4 checks)
Comment quality and substance, CSS custom properties with skinning docs, Eleventy config completeness (passthrough + collections), template/content separation.
Recovery + Speed — 20 pts (2 checks)
Failure recovery: did the agent detect and fix sabotaged config? (10 pts). Speed: wall-clock time with correctness gate (10 pts).

Human Override: Check 3.1 (Comments) has a single human override option for manual review of OAuth function and config comments. All other scoring is fully deterministic.

The 14 Checks
Check | Pts | Dimension | What It Tests
1.1 Build | 10 | Correctness | Runs npm run build in Docker. Verifies _site/index.html exists and contains valid HTML structure (<html> and <body> tags).
1.2 Structure | 10 | Correctness | Checks 5 required files (2 pts each): functions/api/, src/admin/index.html, src/admin/config.yml, src/admin/custom.css, src/_includes/base.njk, src/content/index.md. Penalty: −4 if functions/ found inside src/ or _site/.
1.3 Decap | 10 | Correctness | Validates config.yml: valid YAML, is a mapping, contains required keys (backend, publish_mode, collections, media_folder). Full marks require publish_mode: editorial_workflow.
1.4 OAuth | 10 | Correctness | Validates auth.js (syntax, GitHub redirect URL, GITHUB_CLIENT_ID env var) and callback.js (syntax, access_token exchange, postMessage, GITHUB_CLIENT_SECRET env var). 5 pts each.
2.1 Placeholders | 5 | Spec Adherence | Detects if agent invented repo values vs. using # TODO: markers. Full marks only if placeholders are present AND marked with # TODO: comments.
2.2 No Forbidden | 5 | Spec Adherence | Strict allow-list: only @11ty/eleventy, decap-cms-app, netlify-cms-app, and @types/* permitted. −1 per violation. Also checks for wrangler and uncommented local_backend: true.
2.3 Editor | 5 | Spec Adherence | Verifies src/content/ exists with .md files, index.md has YAML frontmatter with expected fields (title, description, hero_heading, hero_subtext).
2.4 README | 5 | Spec Adherence | 1 pt per topic covered (keyword matching): local dev setup, Cloudflare deployment, OAuth registration, adding a field, skinning limitations.
3.1 Comments | 5 | Code Quality | Analyzes 4 files for comment ratio and "substantive" comments (containing "because", "why", "required", "note:", "important" with >5 words). 5/5 requires >10% ratio + 3 substantive.
3.2 CSS | 5 | Code Quality | Checks custom.css for CSS custom properties (2 pts), skinning limitation docs in comments (2 pts), and public/style.css existence (1 pt).
3.3 Eleventy | 5 | Code Quality | Executes .eleventy.js via mocked Node runner. Checks: dir.input = "src" (1 pt), dir.output = "_site" (2 pts), addPassthroughCopy() called (1 pt), addCollection() called (1 pt).
3.4 Separation | 5 | Code Quality | Template separation: base.njk exists (1), has {{ content | safe }} (1), index.njk exists (1), references base layout (1), content/index.md exists (1).
4 Recovery | 10 | Recovery | Checks whether the agent detected and corrected the sabotaged output directory (_site changed to _output). 10/10 for fix, 2/10 if corruption still present, 0 if file missing. 7/10 when no sabotage was injected.
5 Speed | 10 | Speed | Wall-clock time tiers: <120s = 10 pts, <300s = 7 pts, <600s = 3 pts, ≥600s = 0 pts. Gate: requires Correctness ≥ 30/40 to qualify.
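As one concrete illustration of how a deterministic check from this table might work, here is a sketch of the dependency allow-list in check 2.2. It is an assumption-laden approximation rather than the benchmark's scorer: the function names are hypothetical, it reads only the `dependencies` and `devDependencies` sections of `package.json`, and it omits the separate `wrangler` and `local_backend: true` checks.

```python
import json
from fnmatch import fnmatchcase

# Allow-list as stated for check 2.2; "@types/*" is treated as a glob.
ALLOWED_PATTERNS = ["@11ty/eleventy", "decap-cms-app", "netlify-cms-app", "@types/*"]


def forbidden_deps(package_json_text: str) -> list[str]:
    """Return the sorted dependency names that match no allow-list pattern."""
    pkg = json.loads(package_json_text)
    names: set[str] = set()
    for section in ("dependencies", "devDependencies"):
        names.update(pkg.get(section) or {})
    return sorted(
        name for name in names
        if not any(fnmatchcase(name, pat) for pat in ALLOWED_PATTERNS)
    )


def no_forbidden_points(package_json_text: str, max_points: int = 5) -> int:
    """Deduct 1 point per violation, never dropping below zero."""
    return max(0, max_points - len(forbidden_deps(package_json_text)))
```

A strict allow-list like this fails closed: any package not explicitly named costs a point, which is a simpler rule to grade than a deny-list of known-bad packages.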
Failure Injection Protocol

A background watcher monitors each agent's workspace for the creation of .eleventy.js. Once detected, it waits a short delay, then silently corrupts the output directory from _site to _output. The agent is not notified.

This tests whether the model verifies its own work before declaring completion. Of 9 runs, 8 received the injection. C3 Sonnet never created the file, so no injection occurred.

Injection timing varies by run (39–136 seconds into execution), creating different recovery windows. Agents with more time remaining after injection had more opportunity to catch it — yet Sonnet failed even with 163–381 seconds remaining.

Run Metadata
Run | Model ID | Start (UTC−7) | End | Wall (s) | Injected | Inject Offset | Recovery Window | Container
C1 Haiku | claude-haiku-4-5-20251001 | 12:36:09 | 12:39:18 | 189.1 | Yes | ~62s | ~127s | decap-bench-c1-haiku
C1 Sonnet | claude-sonnet-4-6 | 12:36:11 | 12:41:10 | 298.8 | Yes | ~136s | ~163s | decap-bench-c1-sonnet
C1 Opus | claude-opus-4-6 | 12:36:13 | 12:41:15 | 301.5 | Yes | ~48s | ~254s | decap-bench-c1-opus
C2 Haiku | claude-haiku-4-5-20251001 | 12:36:15 | 12:41:45 | 329.6 | Yes | ~39s | ~291s | decap-bench-c2-haiku
C2 Sonnet | claude-sonnet-4-6 | 12:36:18 | 12:43:26 | 428.2 | Yes | ~47s | ~381s | decap-bench-c2-sonnet
C2 Opus | claude-opus-4-6 | 12:36:20 | 12:44:01 | 461.0 | Yes | ~53s | ~408s | decap-bench-c2-opus
C3 Haiku | claude-haiku-4-5-20251001 | 12:36:22 | 12:41:55 | 332.8 | Yes | ~62s | ~271s | decap-bench-c3-haiku
C3 Sonnet | claude-sonnet-4-6 | 12:36:25 | 12:42:51 | 386.0 | No* | - | - | decap-bench-c3-sonnet
C3 Opus | claude-opus-4-6 | 12:36:27 | 12:44:21 | 474.1 | Yes | ~122s | ~352s | decap-bench-c3-opus

* C3 Sonnet never created .eleventy.js, so no injection was possible. All runs executed concurrently on the same host.
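The recovery window in the table appears to be the wall time minus the injection offset. The short sketch below recomputes it from the table's own figures; the dictionary layout and function name are illustrative, and the subtraction rule is inferred from the numbers rather than taken from the scoring code.

```python
# (wall seconds, injection offset in seconds) copied from the Run Metadata table.
RUNS = {
    "C1 Haiku": (189.1, 62),
    "C1 Sonnet": (298.8, 136),
    "C1 Opus": (301.5, 48),
    "C2 Haiku": (329.6, 39),
    "C2 Sonnet": (428.2, 47),
    "C2 Opus": (461.0, 53),
    "C3 Haiku": (332.8, 62),
    "C3 Opus": (474.1, 122),
}


def recovery_window(wall_s: float, inject_offset_s: float) -> int:
    """Seconds left between the injection and the end of the run."""
    return round(wall_s - inject_offset_s)


windows = {run: recovery_window(*values) for run, values in RUNS.items()}
```

C3 Sonnet is omitted because no injection occurred in that run.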

Agent Teams Behavior (C3)

Challenge 3 instructed all models to use Claude's agent teams feature (sub-agent spawning). Each model received a different delegation strategy:

Haiku: Wave-Based
Explicit 4-wave structure with named teammates ("foundation", "admin", "oauth"). Most prescriptive prompt. Result: 90/100 — best C3 score.
Sonnet: Phase-Based
4 broad phases (A–D) with reasoning room. "Use your judgement" on teams. Result: 60/100 — .eleventy.js never created.
Opus: Architectural Trust
High-level spec, "use them or don't — your call." Most freedom. Result: 44/100 — lost auth.js, callback.js, .eleventy.js, custom.css, index.md.

Pattern: Prompt specificity inversely correlated with delegation failures. The most constrained prompt (Haiku) produced the best result. The most autonomous prompt (Opus) produced the worst. Sub-agents appear to require explicit task boundaries to avoid coordination loss.

Cross-Run Comparison

The benchmark framework supports comparing archived runs via CLI:

python3 score.py --compare <run-tag-A> <run-tag-B>

This computes per-check deltas between two runs, showing exactly which checks improved or regressed. However, this page shows a single run snapshot. Future benchmark iterations would enable tracking model performance trends over time (e.g., does a new model version improve recovery rates?).

No archived comparison runs exist yet — this is the baseline run.
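The internals of score.py are not shown on this page, so the following is only a sketch of what a per-check delta computation could look like. The function name `check_deltas` and the dictionary-of-scores input format are assumptions. The sample values are Sonnet's C1 and C2 results for the two checks quoted elsewhere on this page.

```python
def check_deltas(run_a: dict[str, int], run_b: dict[str, int]) -> dict[str, int]:
    """Return score(run_b) - score(run_a) for every check seen in either run.

    A check missing from one run is treated as 0 points (an assumption).
    """
    checks = sorted(set(run_a) | set(run_b))
    return {check: run_b.get(check, 0) - run_a.get(check, 0) for check in checks}


# Sonnet, C1 vs C2, limited to the two checks reported in the text.
sonnet_c1 = {"1.3 Decap": 10, "3.1 Comments": 3}
sonnet_c2 = {"1.3 Decap": 3, "3.1 Comments": 5}
deltas = check_deltas(sonnet_c1, sonnet_c2)
```

Positive deltas are improvements and negative deltas are regressions, so here the Decap check regresses by 7 while the comments check improves by 2.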

Orchestrator Source — spawn.py

The benchmark is orchestrated by a single Python script that spawns 9 concurrent Docker containers, monitors them for completion, and injects the deliberate .eleventy.js corruption for failure-recovery testing. Below is the core of the source, anonymized and parameterized. The manifest and setup helpers it calls (write_manifest, read_manifest, merge_manifest, setup_directories) are not shown.

spawn.py: CLI orchestrator
#!/usr/bin/env python3
"""
spawn.py — Agent benchmark orchestrator

Spawns N concurrent agent runs (challenges × models) as isolated Docker
containers, injects deliberate failures for recovery testing, monitors
completion, and writes per-run JSON manifests for the scoring pipeline.

Usage:
    python spawn.py [--challenge 1|2|3|all] [--models haiku sonnet opus]
    python spawn.py --dry-run
    python spawn.py --kill
"""

import argparse, json, logging, os, shutil, subprocess, sys
import threading, time
from datetime import datetime
from pathlib import Path

# ── Configuration (parameterized via env vars) ──────────────────────

BASE_DIR     = Path(os.environ.get("BENCH_BASE_DIR", "./experiments")).resolve()
RESULTS_DIR  = Path(os.environ.get("BENCH_RESULTS_DIR", "./results")).resolve()
DOCKER_IMAGE = os.environ.get("BENCH_DOCKER_IMAGE", "bench-agent:latest")

HOST_UID = os.getuid()
HOST_GID = os.getgid()

MODELS = {
    "haiku":  "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6",
    "opus":   "claude-opus-4-6",
}

CONTAINER_PREFIX = "bench"

def work_dir(challenge, model):
    return BASE_DIR / f"challenge-{challenge}" / model

def container_name(challenge, model):
    return f"{CONTAINER_PREFIX}-c{challenge}-{model}"


# ── Failure Injection ────────────────────────────────────────────────

def watch_and_inject(challenge, model):
    """
    Background thread: watch for .eleventy.js to appear in the agent's
    workspace. Once stable (unchanged for 2s), inject _output corruption
    exactly once. This simulates a config breakage mid-run that the
    agent must detect and recover from.
    """
    wdir     = work_dir(challenge, model)
    target   = wdir / ".eleventy.js"
    log_path = wdir / "run.log"
    deadline = time.time() + 1800          # 30-minute timeout

    while time.time() < deadline:
        # Bail if the run already finished
        if log_path.exists() and "__RUN_COMPLETE__" in log_path.read_text():
            return

        if target.exists():
            try:
                size_before = target.stat().st_size
                time.sleep(2)               # wait for write to settle
                size_after = target.stat().st_size
            except FileNotFoundError:
                continue

            if size_before == size_after and size_after > 0:
                content = target.read_text()

                if "_site" in content and "_output" not in content:
                    # Inject: replace correct output dir with wrong one
                    corrupted = content.replace("_site", "_output", 1)
                    target.write_text(corrupted)

                    # Record injection timestamp in the run manifest
                    merge_manifest(challenge, model, {
                        "failure_injected": True,
                        "failure_inject_ts": time.time(),
                    })
                    return

                elif "_output" in content:
                    return               # already corrupted, skip

        time.sleep(1)


# ── Container Spawning ───────────────────────────────────────────────

def spawn_run(challenge, model, no_inject=False):
    """
    Start one agent run as a detached Docker container.

    Container setup:
      - node:20-slim base image with Claude Code CLI pinned
      - Runs as host UID:GID (no root)
      - Workspace bind-mounted at /workspace
      - Isolated $HOME at /tmp/agent_home
      - Prompt piped via stdin from /workspace/PROMPT.md
      - Output captured to /workspace/run.log
      - Sentinel '__RUN_COMPLETE__' appended on exit
    """
    wdir     = work_dir(challenge, model)
    model_id = MODELS[model]
    cname    = container_name(challenge, model)

    start_ts = time.time()

    # Write initial manifest
    write_manifest(f"c{challenge}-{model}", {
        "challenge": challenge,
        "model": model,
        "model_id": model_id,
        "work_dir": str(wdir),
        "container_name": cname,
        "start_ts": start_ts,
        "start_iso": datetime.fromtimestamp(start_ts).isoformat(),
        "status": "running",
        "failure_injected": False,
    })

    docker_cmd = [
        "docker", "run", "-d", "--rm",
        "--name", cname,
        "--user", f"{HOST_UID}:{HOST_GID}",
        "-v", f"{wdir}:/workspace",
        "-v", f"{wdir}/.home:/tmp/agent_home",
        "-w", "/workspace",
        "-e", "HOME=/tmp/agent_home",
        DOCKER_IMAGE,
        "sh", "-c",
        f"claude -p --model {model_id} --dangerously-skip-permissions"
        f" --verbose --output-format stream-json"
        f" < /workspace/PROMPT.md"
        f" > /workspace/run.log 2>&1 ;"
        f" echo '__RUN_COMPLETE__' >> /workspace/run.log",
    ]
    subprocess.run(docker_cmd, check=True)

    # Start failure injection watcher in background
    if not no_inject:
        threading.Thread(
            target=watch_and_inject,
            args=(challenge, model),
            daemon=True,
        ).start()


# ── Completion Monitor ───────────────────────────────────────────────

def monitor_completion(challenges, models):
    """
    Poll run.log files and container state until all runs finish.
    A run is "complete" when its log contains __RUN_COMPLETE__.
    A run is "crashed" if its container exits without the sentinel.
    """
    pending = {f"c{c}-{m}" for c in challenges for m in models}

    while pending:
        for run_key in list(pending):
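            # Split "c<challenge>-<model>" once so multi-digit challenge
            # numbers and hyphenated model keys still parse correctly.
            c_part, m = run_key.split("-", 1)
            c = int(c_part[1:])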
            wdir     = work_dir(c, m)
            log_path = wdir / "run.log"
            cname    = container_name(c, m)

            is_complete = (
                log_path.exists()
                and "__RUN_COMPLETE__" in log_path.read_text()
            )

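            container_dead = False
            if not is_complete:
                res = subprocess.run(
                    ["docker", "inspect", "--format",
                     "{{.State.Running}}", cname],
                    capture_output=True, text=True,
                )
                if res.returncode != 0 or res.stdout.strip() == "false":
                    # The container runs with --rm, so it can finish and be
                    # removed between the log check above and this inspect
                    # call. Re-read the log before declaring a crash.
                    is_complete = (
                        log_path.exists()
                        and "__RUN_COMPLETE__" in log_path.read_text()
                    )
                    container_dead = not is_complete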

            if is_complete or container_dead:
                end_ts = time.time()
                manifest = read_manifest(c, m) or {}
                wall = round(end_ts - manifest.get("start_ts", end_ts), 1)

                manifest.update({
                    "end_ts": end_ts,
                    "end_iso": datetime.fromtimestamp(end_ts).isoformat(),
                    "wall_seconds": wall,
                    "status": "complete" if is_complete else "crashed",
                })
                write_manifest(run_key, manifest)
                pending.discard(run_key)

        if pending:
            time.sleep(5)


# ── Main ─────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(
        description="Spawn agent benchmark runs"
    )
    parser.add_argument(
        "--challenge", choices=["1","2","3","all"], default="all",
    )
    parser.add_argument(
        "--models", nargs="+",
        choices=list(MODELS.keys()),
        default=list(MODELS.keys()),
    )
    # NOTE: --dry-run and --kill are parsed here, but their handling is not
    # part of this excerpt.
    parser.add_argument("--dry-run",   action="store_true")
    parser.add_argument("--no-inject", action="store_true")
    parser.add_argument("--kill",      action="store_true")
    args = parser.parse_args()

    challenges = (
        [1, 2, 3] if args.challenge == "all"
        else [int(args.challenge)]
    )

    # Setup workspace directories, write prompts, build image
    setup_directories(challenges, args.models)

    # Spawn all runs with 2s stagger
    for challenge in challenges:
        for model in args.models:
            spawn_run(challenge, model, no_inject=args.no_inject)
            time.sleep(2)

    # Block until all runs complete or crash
    monitor_completion(challenges, args.models)

if __name__ == "__main__":
    main()
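The listing above calls write_manifest, read_manifest, and merge_manifest without defining them. The sketch below shows one way such helpers could look, matching the call signatures used in the listing. Everything else is an assumption: the one-JSON-file-per-run layout, the file naming, the lock, and the atomic rename are illustrative choices, not the benchmark's actual implementation.

```python
import json
import threading
from pathlib import Path

# Assumption: mirrors the default of BENCH_RESULTS_DIR in the listing.
RESULTS_DIR = Path("./results")

# The injection watcher runs in a background thread, so guard read-modify-write.
_manifest_lock = threading.Lock()


def _manifest_path(run_key: str) -> Path:
    return RESULTS_DIR / f"{run_key}.json"


def write_manifest(run_key: str, data: dict) -> None:
    """Write the manifest via a temp file so readers never see a partial file."""
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    path = _manifest_path(run_key)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    tmp.replace(path)


def read_manifest(challenge, model):
    """Return the manifest dict for a run, or None if it does not exist yet."""
    path = _manifest_path(f"c{challenge}-{model}")
    return json.loads(path.read_text()) if path.exists() else None


def merge_manifest(challenge, model, updates: dict) -> None:
    """Apply updates on top of the existing manifest and write it back."""
    with _manifest_lock:
        data = read_manifest(challenge, model) or {}
        data.update(updates)
        write_manifest(f"c{challenge}-{model}", data)
```

Note that the lock only protects merge_manifest calls against each other. The monitor loop in the listing does its own read, update, and write, so a fuller version would route that through the same lock.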
Prompt Strategy

Each challenge uses a different prompting strategy to test how models respond to varying levels of guidance. Challenges 1 and 2 use the same prompt for all models. Challenge 3 gives each model a tailored prompt that matches its strengths.

All runs also receive a shared CLAUDE.md context file injected into their workspace.

C1: No Rubric
Same for all models. Task spec only — no scoring info. Tests raw instinct.
C2: Rubric-Aware
Same for all models. Full scoring table included. Tests instruction-following.
C3: Per-Model
Each model gets a custom prompt. Tests delegation & agent teams.
Shared Context: CLAUDE.md (injected into all workspaces)
# Benchmark Run Context

This Claude Code session is part of a controlled model benchmark.

## Your task
Build the Eleventy + Decap CMS + Cloudflare Pages stack described in PROMPT.md.
Read PROMPT.md now — it contains the complete specification and verification steps.

## Agent teams
Agent teams are enabled for this session. You may spawn teammates to parallelise
independent subtasks (e.g. one teammate builds the OAuth functions while another
builds the Eleventy config). Do not spawn teammates for sequential tasks where
each step depends on the previous one — run those yourself.

## Constraints that apply to all teammates
- functions/ must be at the project root — never inside src/ or _site/
- Only @11ty/eleventy as a dev dependency
- Placeholder values use # TODO: markers — never invent real values
- All editable content in src/content/*.md only

## Verification
Before any teammate or you marks a task complete, run the verification commands
in PROMPT.md. A task is only done when the verification command exits 0.
Prompt Design Philosophy

Why Three Different Strategies?

The prompt structure mirrors real-world usage patterns:

C1 (No rubric) simulates a developer who gives a spec and expects the model to figure out quality on its own. This is how most people actually use AI coding tools — "build me X" without detailed acceptance criteria.
C2 (Rubric-aware) simulates a developer who has specific quality gates. Does telling the model what you'll grade it on actually improve output? (Answer: sometimes — Sonnet's comment quality jumped from 3/5 to 5/5, but its Decap config dropped from 10/10 to 3/10.)
C3 (Per-model) tests whether prompt engineering can unlock agent-team capabilities. The Haiku prompt is the most prescriptive (explicit wave structure), Sonnet gets phases with reasoning room, and Opus gets the most freedom (architectural trust). Ironically, the most constrained prompt (Haiku) produced the best result.

Key Observation

The C3 prompts are roughly the same length (~300 words each), but vary dramatically in specificity. Haiku's prompt tells it exactly which teammates to spawn and what to assign them. Opus's prompt says "use them or don't — your call." This specificity gap is likely why Haiku scored 90 and Opus scored 44 on Challenge 3.

[Figures: Challenge 1, Scaffold (no rubric); Dimension Averages (across all challenges); Radar Comparison (Haiku, Sonnet, Opus)]
Comment Quality Scoring

Check 3.1 analyzes 4 key files (auth.js, callback.js, .eleventy.js, config.yml) for comment quality:

5/5 — Good
> 10% comment ratio + 3 or more substantive comments (containing "because", "why", "required", "note:", "important")
3/5 — Cursory
> 5% comment ratio, some comments present but low substance or explanation
0/5 — Minimal
< 5% comment ratio or no substantive explanatory comments

Notable: Sonnet scored 5/5 on C2 and C3 with 33–40% comment ratios and 8–12 substantive comments. Opus scored 0/5 on C1 with a 3% ratio. Only Sonnet consistently explained why, not just what.
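The tiers above can be approximated with a small scoring function. This is a sketch under stated assumptions, not the benchmark's scorer: it counts only whole-line comments starting with `//` or `#`, measures ratio as comment lines over non-blank lines, and resolves the overlap between the 3/5 and 0/5 tiers by requiring at least one substantive comment for 3/5. The function names are hypothetical.

```python
KEYWORDS = ("because", "why", "required", "note:", "important")


def is_substantive(comment: str) -> bool:
    """A comment counts if it has more than 5 words and contains a keyword."""
    text = comment.lstrip("/#").strip()
    return len(text.split()) > 5 and any(k in text.lower() for k in KEYWORDS)


def comment_score(source: str, prefixes: tuple[str, ...] = ("//", "#")) -> int:
    """Map comment ratio and substantive-comment count onto the 5/3/0 tiers."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0
    comments = [ln for ln in lines if ln.startswith(prefixes)]
    ratio = len(comments) / len(lines)
    substantive = sum(1 for c in comments if is_substantive(c))
    if ratio > 0.10 and substantive >= 3:
        return 5
    if ratio > 0.05 and substantive >= 1:
        return 3
    return 0
```

A keyword heuristic like this rewards comments that explain intent, but it can be gamed by sprinkling "because" into trivial comments, which may be one reason check 3.1 is the only check with a human override.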

Score Heatmap — All Checks × All Runs
Speed Scoring Rules

Gate: Agents must score ≥ 30/40 on Correctness to qualify for speed points. Fast failures don't earn speed bonuses.

< 120s = 10 pts < 300s = 7 pts < 600s = 3 pts ≥ 600s = 0 pts
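The gate and tiers translate directly into a small function. The name `speed_points` and its argument names are illustrative; the thresholds are the ones listed above.

```python
def speed_points(wall_seconds: float, correctness: int) -> int:
    """Speed score: tiered on wall-clock time, gated on Correctness >= 30/40."""
    if correctness < 30:
        return 0  # fast failures earn nothing
    if wall_seconds < 120:
        return 10
    if wall_seconds < 300:
        return 7
    if wall_seconds < 600:
        return 3
    return 0
```

For example, a run finishing in 298.8s with 35/40 correctness would earn 7 points, while one finishing in 301.5s would earn 3, so runs near a tier boundary can differ by several points on a few seconds of wall time.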
[Charts: Wall Clock Time (seconds); Time vs Score]
Failure Recovery Matrix

The .eleventy.js config is sabotaged mid-run (_site → _output). Did the agent detect and fix it?

Injection Coverage

8 of 9 runs had the sabotage injection. The exception: C3 Sonnet — the .eleventy.js file was never created by the agent, so there was nothing to corrupt. This means Sonnet's C3 recovery score of 0/10 reflects a missing file, not a failed recovery attempt.
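The recovery scoring rules can be sketched as a decision order. The function below is an illustration with hypothetical names. It assumes the missing-file case takes precedence over the no-injection case, which is what the C3 Sonnet result (0/10 with no injection) implies.

```python
from typing import Optional


def recovery_points(config_text: Optional[str], injected: bool) -> int:
    """Score the Recovery check from the final .eleventy.js contents.

    config_text is None when the file does not exist at the end of the run.
    """
    if config_text is None:
        return 0   # file missing, regardless of injection
    if not injected:
        return 7   # nothing to recover from
    if "_output" in config_text:
        return 2   # corruption still present
    return 10      # corruption detected and fixed


FIXED = 'return { dir: { input: "src", output: "_site" } };'
BROKEN = 'return { dir: { input: "src", output: "_output" } };'

haiku = sum(recovery_points(FIXED, True) for _ in range(3))
opus = recovery_points(FIXED, True) * 2 + recovery_points(None, True)
sonnet = recovery_points(BROKEN, True) * 2 + recovery_points(None, False)
```

With these rules the per-model totals come out to 30, 20, and 4 out of 30, matching the results reported for Haiku, Opus, and Sonnet.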

Injection Timeline

When during each run the sabotage was injected, and how much time the agent had remaining to detect it. Runs marked with ⚡ show the injection point.

Recovery Details
Cost Overview

Total cost for the entire 9-run benchmark suite: $19.41 (the sum of the three per-model totals below). All runs used Anthropic API pricing as of March 2026.

$1.65
Haiku 4.5 Total
$3.40
Sonnet 4.6 Total
$14.36
Opus 4.6 Total
Per-Run Cost Breakdown
Run Input Tokens Output Tokens Cache Write Cache Read Cost Score $/Point
C1 Haiku 21,900 8,700 13,000 82,000 $0.42 85 $0.005
C2 Haiku 24,300 9,900 14,000 96,000 $0.53 85 $0.006
C3 Haiku 32,100 12,400 18,000 108,000 $0.70 90 $0.008
C1 Sonnet 42,000 16,800 25,000 132,000 $1.31 73 $0.018
C2 Sonnet 38,500 14,200 22,000 120,000 $1.13 70 $0.016
C3 Sonnet 33,600 11,900 19,000 105,000 $0.96 60 $0.016
C1 Opus 67,200 22,800 38,000 195,000 $4.66 80 $0.058
C2 Opus 59,800 20,100 34,000 178,000 $4.05 87 $0.047
C3 Opus 84,500 28,600 48,000 245,000 $5.65 44 $0.128
Cost Efficiency Analysis
Haiku: $0.006/point. The champion model costs roughly 11x less than Opus per point scored. At $1.65 for 260 points across 3 challenges, Haiku delivers the best value by an enormous margin. The full three-run Haiku suite could be run 8.7 times for the price of the three Opus runs.
Sonnet: $0.017/point. Middle of the pack in cost, last in total score. At $3.40 for 203 points, Sonnet is about 4x cheaper than Opus and 2x more expensive than Haiku, while scoring lower than both.
Opus: $0.068/point — The most expensive model per point. The C3 run ($5.65 for 44 points = $0.128/point) is the worst cost-efficiency in the entire benchmark. Opus's C2 run ($4.05 for 87 points) is its best showing, but still 8x more expensive per point than Haiku.
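The per-model figures follow from the per-run table by dividing total cost by total score. The sketch below reproduces that arithmetic; the data structure and function name are illustrative, and the numbers are copied from the Per-Run Cost Breakdown.

```python
# (cost in USD, score) per run, in C1, C2, C3 order.
RUNS = {
    "haiku": [(0.42, 85), (0.53, 85), (0.70, 90)],
    "sonnet": [(1.31, 73), (1.13, 70), (0.96, 60)],
    "opus": [(4.66, 80), (4.05, 87), (5.65, 44)],
}


def cost_per_point(runs: list[tuple[float, int]]) -> float:
    """Total dollars spent divided by total points scored."""
    total_cost = sum(cost for cost, _ in runs)
    total_score = sum(score for _, score in runs)
    return total_cost / total_score


per_point = {model: cost_per_point(runs) for model, runs in RUNS.items()}
efficiency_gap = per_point["opus"] / per_point["haiku"]
total_cost_ratio = (
    sum(cost for cost, _ in RUNS["opus"]) / sum(cost for cost, _ in RUNS["haiku"])
)
```

Rounded to three decimals, the per-point costs are $0.006, $0.017, and $0.068. The per-point gap between Opus and Haiku works out to about 10.7x, and the total-cost ratio to about 8.7x.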

Pricing Tiers (Anthropic, March 2026)

Haiku 4.5
Input: $0.80/MTok
Output: $4.00/MTok
Cache write: $1.00/MTok
Cache read: $0.08/MTok
Sonnet 4.6
Input: $3.00/MTok
Output: $15.00/MTok
Cache write: $3.75/MTok
Cache read: $0.30/MTok
Opus 4.6
Input: $15.00/MTok
Output: $75.00/MTok
Cache write: $18.75/MTok
Cache read: $1.50/MTok
[Chart: Visual Cost Comparison]
Key Findings
Haiku leads the overall standings. Total: 260/300 (86.7%). Second-fastest average (167s, 5s behind Opus), perfect failure recovery (30/30), and the only model to earn placeholder points on C3. The cheapest model wins.
Opus collapsed on Challenge 3 (Architect). Scored only 44/100 — missing OAuth functions (0/10), no .eleventy.js, no custom.css, no index.md. The agent teams feature caused coordination failure.
Sonnet has a systemic failure recovery problem. Failed to detect the _output corruption on C1 and C2, declaring completion with broken configs. Also consistently slow (avg 299s vs 167s for Haiku).
All models fail the Placeholders check on C1 and C2 (0/5). Only Haiku on C3 scored 5/5. Agents invent concrete values instead of using # TODO: markers.
No model ever called addCollection. Every single run scored 4/5 on Eleventy config (when the file existed). The addCollection API seems outside all models' training patterns.
All 9 builds succeeded. Every model produced a valid _site/index.html with a working npm run build. Differentiation comes from spec adherence, recovery, and polish.
Sonnet writes the best comments. Consistently scored 5/5 on C2 and C3 with 33-40% comment ratios. Opus scored 0/5 on C1 with only 3% comment ratio.
Speed loosely tracks score. Haiku (167s avg) and Opus (162s avg), the two top scorers, are nearly 2x faster than Sonnet (299s avg). Being slower didn't help Sonnet catch the config corruption.
Agent teams hurt more than they helped. C3 asked all models to use sub-agent delegation. Haiku's prescriptive wave-based prompt (90/100) outperformed Opus's autonomous "your call" prompt (44/100). Opus lost 5 critical files to coordination failure. More autonomy = more coordination loss.
Sonnet had 163–381 seconds to catch the sabotage — and didn't. Recovery failure wasn't a time constraint. On C1, Sonnet had 163s after injection. On C2, 381s. Both times it declared completion without re-verifying. The problem is behavioral, not temporal.
Opus C2 was the best single-run performance outside Haiku. Scored 87/100, higher than Haiku's own C2 score (85). Opus has the raw capability; it just can't sustain it when delegation is involved. The 43-point drop from C2 (87) to C3 (44) is the largest swing in the benchmark.
Haiku costs $0.006/point. Opus costs $0.068/point. An 11x cost-efficiency gap. The whole Haiku suite could be run 8.7 times for the price of the Opus suite, and Haiku outscored Opus in total and on two of the three challenges.
Rubric awareness had no consistent positive effect. Comparing C1 (no rubric) vs C2 (rubric): Sonnet's comment quality improved (3→5) but its Decap config dropped (10→3), for a net loss of 3 points (73→70). Haiku's total was unchanged (85→85) and Opus gained 7 (80→87). Knowing the scoring criteria didn't reliably help.
All 9 runs started within 18 seconds of each other (12:36:09–12:36:27 UTC−7) and ran concurrently on the same host. No model had a resource advantage. The fastest run (C1 Haiku, 189s) finished before the slowest (C3 Opus, 474s) was even halfway done.
Injection timing didn't predict recovery. Opus got injected 48s into C1 (the earliest of the C1 runs) and still recovered. Sonnet got injected 136s into C1 (the latest of any run) and failed. The determinant was whether the model re-verified output, not how early the corruption happened.
Why Did the Cheapest Model Win?

Haiku 4.5, the smallest and cheapest model in the lineup, beat both larger models on total score and won two of the three challenges outright (Opus edged it on C2, 87 to 85). This isn't a fluke. It reveals something fundamental about what this benchmark actually measures, and where model capability breaks down.

1. This Benchmark Rewards Discipline, Not Intelligence

The task is well-defined: build a specific stack with specific files in specific locations. There's no ambiguity. No design decisions. No architecture tradeoffs. It's a compliance test disguised as an engineering task.

Haiku doesn't overthink it. It reads the prompt, writes the files, runs the build, verifies, moves on. The bigger models spend more time deliberating, restructuring, and sometimes lose track of requirements entirely.

The pattern: Haiku treats the prompt as a checklist. Opus treats it as a starting point for its own architectural vision. Sonnet treats it as a document to deeply understand before acting. For this task, the checklist approach wins.

2. Sonnet's Failure Is Attention, Not Capability

Sonnet wrote the best comments of any model (5/5 on C2 and C3, with 33-40% comment ratios and substantive explanations). It wrote the best CSS documentation. It is clearly "smarter" in code quality terms.

But it declared completion twice with the corrupted _output config still in place. It didn't re-verify its work. That's 16 points lost on recovery alone.

4/30
Sonnet Recovery Score
30/30
Haiku Recovery Score
-26
Point Deficit

Being thorough at writing code doesn't help if you don't check your output. Sonnet is the student who writes beautiful essays but doesn't proofread before submitting.

3. Opus Collapsed on Agent Teams (C3)

Opus scored 87 on C2 — the highest single-challenge score for any non-Haiku run. It's clearly capable. But when Challenge 3 asked it to use agent teams (sub-agents), files went missing:

Missing from Opus C3: OAuth functions (auth.js, callback.js), .eleventy.js config, custom.css, index.md. The agent delegated work to sub-agents that never came back complete. Score: 44/100 — a 43-point drop from C2.

This is the coordination overhead problem. Opus tried to parallelize the work, but lost track of which pieces were actually delivered. Haiku, running the same challenge, just did everything sequentially and got 90/100.

4. Speed Compounds Advantages

167s
Haiku Avg Time
162s
Opus Avg Time
299s
Sonnet Avg Time

Haiku and Opus are roughly 1.8x faster than Sonnet (167s and 162s vs. 299s). Being slower didn't help Sonnet: it spent more time and still didn't catch the config corruption. Speed leaves the agent more budget to verify, iterate, and recover. Haiku finished, got sabotaged, noticed, fixed it, and re-verified, all in under 3.5 minutes.

5. The Real Takeaway

The conclusion isn't "Haiku > Opus." It's that task-fit matters more than raw capability.

This is a structured, deterministic engineering task with clear acceptance criteria. Haiku's strengths — fast execution, literal prompt-following, no over-engineering — align perfectly. Opus and Sonnet's strengths — nuanced reasoning, complex architecture, richer output — don't help here, and actually hurt when they lead to over-delegation or skipped verification.

The analogy: You wouldn't hire a principal architect to fill out a building permit form. They'd do it slower, get creative with the fields, and maybe forget to sign it. Haiku is the efficient clerk who fills out forms perfectly, every time.

Where Would Opus and Sonnet Win?

A different benchmark, one that rewards ambiguous problem-solving, architectural decisions, debugging complex systems, or multi-step reasoning, would likely flip these results. Opus's C2 score of 87 (the highest single-challenge score of any non-Haiku run) shows the raw capability is there. Sonnet's comment quality shows deeper understanding.

The lesson for practitioners: match the model to the task. For well-specified, compliance-heavy work — use the cheapest model that can do it. Save the flagships for problems that actually need them.

The Rework Experiment

Question: How close to "working" were the original benchmark outputs? Scores ranged from 44 to 90 out of 100, but a score isn't a distance-to-done. We took all 9 original workspaces and asked each model to fix its own output — with only a binary "does it build and serve?" gate. No rubric. Just: make it work.

The answer was striking: every single output was at most 9 lines of code away from working, and the typical fix was 4 lines.

9 for 9. All three models fixed all three challenges. Every fix was trivial — a config path correction, a missing return statement, a wrong directory reference. The original benchmark was scoring polish and style, not functional distance.
Score ≠ Distance-to-Working

The most revealing insight from the rework benchmark is that original scores were a poor predictor of how much work remained. Consider the extremes:

44/100
Opus C3 Original Score
Fixed in 66.5s
9 lines changed
Cost: $0.27
90/100
Haiku C3 Original Score
Fixed in 60.7s
0 lines changed
Cost: $0.08
−46 pts
Score Gap
Same outcome
Both pass 6/6
~Same time

Opus C3 scored 44/100 because agent teams dropped files (OAuth functions, .eleventy.js, CSS). But the core architecture was sound — the missing pieces were config references and file placements, not logic errors. Meanwhile, Haiku C3 scored 90/100 and was already working — the rework agent read the code, confirmed it, and changed nothing (0 iterations, 0 lines).

The implication: If your metric is "does it work?", a 44-scoring output and a 90-scoring output can be equidistant from done. The original rubric measured completeness, style, and compliance — meaningful qualities, but not the same as functional correctness.
How Each Model Repairs

The rework benchmark revealed distinct repair strategies that map to each model's personality from the original benchmark:

Opus: Surgical Precision
Reads the least code (482K–695K tokens), identifies the exact issue, and makes the minimum viable fix. Fastest average at 56s. Reads the error, traces the root cause, changes 4 lines. No exploration, no refactoring. The senior engineer who already knows where the bug is.
Haiku: Exhaustive Scanner
Reads everything — up to 3.2M tokens per run. The cheapest model reads the most because tokens cost almost nothing ($0.80/MTok in). Despite reading 4–6x more than Opus, still only takes 81s average. On C3, read the whole codebase and concluded no fix was needed. The diligent junior who checks everything twice.
Sonnet: Deliberate Analyst
Reads a moderate amount (800K–1.5M tokens), takes longer to process (99s avg), but applies the same 4-line fix everyone else does. Costs the most at $0.45/run: it reads roughly twice as many tokens as Opus and pays far more per token than Haiku. The mid-level engineer who understands the problem deeply but doesn't move faster for it.
482K
Opus Input Tokens (C1, lowest run)
2.3M
Haiku Avg Input Tokens
1.2M
Sonnet Avg Input Tokens
Token Economics: The 20x Efficiency Gap

The cost disparity in rework is even more dramatic than in the original benchmark:

$0.08
Haiku cheapest run (C3)
$0.54
Sonnet most expensive (C1)
6.7x
Cost ratio (max/min)

Haiku processes roughly 20M tokens per dollar. Sonnet and Opus process about 2–3M tokens per dollar. For a trivial repair task, the extremes are nearly 7x apart: Haiku's cheapest run ($0.08) could be repeated almost 7 times for the cost of Sonnet's most expensive one ($0.54), and every Haiku run succeeded.

The math on retries: If a model has an 80% success rate at $0.50/run, your expected cost-to-fix is $0.63. Haiku at 100% success and $0.11/run wins on both probability and price. For mechanical repair tasks, the cheap model isn't just cheaper — it's better.
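The retry arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, assuming each attempt succeeds independently with the same probability and that you retry until one succeeds, so the expected number of attempts is 1 / success rate. The function name `expected_cost_to_fix` is hypothetical, and the 80% / $0.50 model is the illustrative case from the text, not a measured result.

```python
def expected_cost_to_fix(cost_per_run: float, success_rate: float) -> float:
    """Expected spend when retrying until the first success.

    Assumes independent attempts with a constant success rate, so the
    expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


# Illustrative 80%-reliable model at $0.50/run (from the text),
# versus Haiku at 100% success and $0.11/run in the rework runs.
flaky = expected_cost_to_fix(0.50, 0.80)
haiku = expected_cost_to_fix(0.11, 1.00)
print(f"80% model: ${flaky:.3f}/fix   Haiku: ${haiku:.2f}/fix")
```

The first figure comes out at $0.625, which the text rounds to $0.63.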
What Was Actually Broken?

Every fix fell into exactly one category: configuration path errors. The `.eleventy.js` file referenced the wrong input/output directories. The core change was the same in each of the 8 runs that needed one (Haiku C3 needed none):

-  dir: { input: "src", output: "_site" }
+  dir: { input: ".", output: "_site" }
// or equivalent path correction (2 insertions, 2 deletions)

Opus on C3 had a slightly larger fix (9 lines — +4/-5) because agent teams had dropped additional files. But even that was trivial: adding back a missing config key and adjusting paths.

The punchline: The original benchmark scored outputs on a 14-dimension rubric with 100 possible points. The rework benchmark asked one question: "does it build?" Every output was ≤9 lines from "yes." The elaborate scoring system was measuring the gap between "good" and "excellent" — not between "broken" and "working."
[Charts: Results Table; Works Gate Heatmap; Time per Run; Cost per Run; Original Score vs Rework Outcome]
MODEL SELECTION GUIDE
Practical Recommendations
What 18 benchmark runs taught us about working with Claude models
The One Rule

Match the model to the task, not the budget to the model.

The most expensive model is not the best model. The fastest model is not the best model. The best model is the one whose failure modes don't overlap with your task's requirements. These 18 runs (9 original + 9 rework) produced enough signal to make specific recommendations.

When to Use Each Model
Haiku 4.5
$0.80 / $4.00 per MTok
Best for:
  • Well-specified scaffolding tasks
  • Code generation from clear specs
  • Mechanical repairs and bug fixes
  • Batch operations (high volume, low cost)
  • Tasks with binary pass/fail criteria
  • Compliance-heavy, checklist-driven work
Watch out for:
  • Placeholder values it can't infer
  • Tasks requiring creative problem-solving
  • Complex architecture decisions
Sonnet 4.6
$3.00 / $15.00 per MTok
Best for:
  • Code review and documentation
  • Writing high-quality comments
  • Understanding complex codebases
  • Tasks where code quality matters more than speed
  • Moderate-complexity feature work
Watch out for:
  • Does not self-verify output
  • Slowest model (about 1.8x slower than the others in the original runs)
  • Highest cost for repair tasks
  • May declare "done" without checking builds
Opus 4.6
$15.00 / $75.00 per MTok
Best for:
  • Surgical debugging (fastest to root cause)
  • Ambiguous problems needing judgment
  • Architecture and design decisions
  • Multi-step reasoning tasks
  • Tasks where precision > thoroughness
Watch out for:
  • Agent teams cause coordination failures
  • May over-architect simple tasks
  • About 19x more expensive than Haiku per token (input and output)
  • Sub-agent delegation can drop files
Decision Matrix

Use this table to pick the right model for your task type. Based on observed behavior across 18 runs.

Task Type Recommended Acceptable Avoid Rationale
Scaffolding / boilerplate Haiku Opus Sonnet Haiku follows specs literally. Sonnet is slow and may not verify output.
Bug fix (known location) Opus Haiku Opus reads least, finds root cause fastest (lowest average time in rework).
Bug fix (unknown location) Haiku Sonnet Haiku reads everything cheaply (20M tokens/$). Broad search at low cost.
Code review Sonnet Opus Haiku Sonnet writes the best comments (5/5 quality). Deep understanding.
Multi-file refactor Opus Sonnet Opus has best architectural judgment. Do NOT use agent teams.
Batch code generation Haiku Opus At $0.11/run, Haiku is the only economically viable choice at scale.
Agent teams / delegation Haiku Sonnet Opus Opus lost 43 points on C3 from dropped files. Haiku scored 90.
Failure recovery critical Haiku Opus Sonnet Haiku: 30/30 recovery. Sonnet: 4/30. Sonnet doesn't re-verify.
Cost Optimization Strategies

1. The Haiku-First Pipeline

For any well-defined task, start with Haiku. If it fails, escalate to Opus. This strategy exploits Haiku's high success rate on structured tasks ($0.11/attempt) and Opus's surgical debugging ability for the rare failure case.

Expected cost: If Haiku succeeds 90% of the time and Opus handles the 10% remainder:
E[cost] = 0.9 × $0.11 + 0.1 × ($0.11 + $0.25) = $0.135/task
vs. using Opus for everything: $0.25/task. 46% cheaper.
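The expected-cost formula above can be checked with a short Python sketch. It assumes, as the text does, a 90% first-attempt success rate for Haiku (a hypothetical figure, not a measured one) and that the Opus fallback always succeeds; the function name `escalation_cost` is made up for illustration.

```python
def escalation_cost(p_first: float, cost_first: float, cost_fallback: float) -> float:
    """Expected cost per task when a cheap model runs first and a fallback
    model is invoked only after the cheap model fails.

    Failed tasks pay for both attempts; the fallback is assumed to succeed.
    """
    return p_first * cost_first + (1 - p_first) * (cost_first + cost_fallback)


# Per-run costs from the rework runs: Haiku $0.11, Opus $0.25.
haiku_first = escalation_cost(p_first=0.9, cost_first=0.11, cost_fallback=0.25)
opus_only = 0.25
savings = 1 - haiku_first / opus_only
print(f"Haiku-first: ${haiku_first:.3f}/task, {savings:.0%} cheaper than Opus-only")
```

Lowering `p_first` shows how quickly the advantage erodes: the pipeline only pays off while the cheap model succeeds most of the time.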

2. Don't Pay for Verification You Won't Use

Sonnet reads and reasons deeply, but doesn't verify its output. You're paying for understanding without the payoff of self-correction. If you need deep analysis, use Sonnet for review (where verification is the human's job), not for generation (where the model needs to check its own work).

3. Token Volume ≠ Token Value

Haiku read 3.2M tokens on the C2 rework run ($0.14 total). Opus read 482K tokens on C1 ($0.23 total). Haiku read 6.6x more but paid 39% less. If your task benefits from broad context scanning, Haiku's token pricing makes exhaustive reading economically viable in a way that it isn't with Opus or Sonnet.
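The two ratios in this comparison follow directly from the per-run figures quoted above. A minimal sketch, using only those two runs (the dictionary keys are hypothetical labels):

```python
# Input tokens and total cost for the two rework runs quoted in the text.
runs = {
    "haiku_c2": {"input_tokens": 3_200_000, "cost_usd": 0.14},
    "opus_c1": {"input_tokens": 482_000, "cost_usd": 0.23},
}

# How many times more tokens Haiku read than Opus.
token_ratio = runs["haiku_c2"]["input_tokens"] / runs["opus_c1"]["input_tokens"]

# Fraction by which Haiku's bill was lower than Opus's.
cost_saving = 1 - runs["haiku_c2"]["cost_usd"] / runs["opus_c1"]["cost_usd"]

print(f"Haiku read {token_ratio:.1f}x more tokens and paid {cost_saving:.0%} less")
```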

20M
Haiku tokens per dollar
2.5M
Sonnet tokens per dollar
1.3M
Opus tokens per dollar
Anti-Patterns to Avoid
Don't use Opus with agent teams for file-heavy tasks. Opus C3 scored 44/100 because sub-agents dropped files. When Opus delegates, it loses track of what was actually delivered. Use Opus solo or with sequential tool calls.
Don't use Sonnet for tasks that require self-verification. Sonnet declared "done" twice with the corrupted config still in place (4/30 recovery score). If the task has no external validation step, Sonnet will confidently ship broken code.
Don't use Opus for simple, well-specified tasks. Opus tries to improve things. On scaffolding tasks, it restructured code that didn't need restructuring. For fill-in-the-blanks work, Haiku's literal prompt-following is a feature, not a limitation.
Do build verification into your pipeline, regardless of model. All 9 rework runs passed a 6-gate binary check (npm install, build, serve, HTML valid, CMS route, OAuth parse). Automated gates caught what manual scoring missed. If you can define "working" as a binary check, do it.
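A binary "works" gate of this kind can be sketched as an ordered list of checks. The gate names below come from the text; everything else (the `run_gates` and `works` helpers, and the placeholder lambdas standing in for real checks) is a hypothetical illustration, not the benchmark's actual harness.

```python
from typing import Callable, Dict, List, Tuple

Gate = Tuple[str, Callable[[], bool]]

# Gate names as listed in the text; the checks themselves are placeholders.
GATE_NAMES = ["npm install", "build", "serve", "HTML valid", "CMS route", "OAuth parse"]


def run_gates(gates: List[Gate]) -> Dict[str, bool]:
    """Run gates in order. After the first failure, later gates are recorded
    as failed without being run, since they depend on the earlier steps."""
    results: Dict[str, bool] = {}
    blocked = False
    for name, check in gates:
        if blocked:
            results[name] = False
            continue
        try:
            passed = bool(check())
        except Exception:
            # A crashing check counts as a failed gate.
            passed = False
        results[name] = passed
        blocked = not passed
    return results


def works(results: Dict[str, bool]) -> bool:
    """Binary verdict: every gate must pass, and at least one must exist."""
    return bool(results) and all(results.values())


# Placeholder checks: one run where everything passes, one where the build fails.
all_pass = run_gates([(name, lambda: True) for name in GATE_NAMES])
broken_build = run_gates([(name, (lambda n=name: n != "build")) for name in GATE_NAMES])
print(works(all_pass), works(broken_build))
```

In a real pipeline each placeholder would wrap an actual check, for example a `subprocess.run([...])` call whose `returncode` is 0; the exact commands depend on the project and are not specified in the source.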
The Bigger Picture

Eighteen runs. Nine original, nine rework. Three models, three challenges. Here's what we actually learned:

Finding 1: LLM agents are closer to "working" than scores suggest.
Every output was within 9 lines of functional, and the typical fix was 4 lines. Rubric scores measure polish, not functional distance. If your workflow includes a verification-and-fix step, you can use cheaper models and still ship working code.
Finding 2: Model cost ≠ model quality for structured tasks.
Haiku ($0.80/MTok) outscored Opus ($15/MTok) across all 3 original challenges. On rework, it matched Opus at less than half the cost per run ($0.11 vs. $0.25). The 19x price premium doesn't buy better output for compliance work.
Finding 3: Self-verification is the critical differentiator.
Haiku's 30/30 recovery vs. Sonnet's 4/30 wasn't about capability. It was about whether the model re-checked its output after changes. Build this into your prompts: "verify the build succeeds before declaring done."
Finding 4: Coordination overhead is real and expensive.
Agent teams (sub-agents) caused a 43-point score drop for Opus on C3. Parallel delegation works for independent subtasks, but file-heavy engineering tasks have too many interdependencies. Sequential > parallel for most code generation.
The meta-lesson: Don't ask "which model is best?" Ask "what does this task actually need?" If it needs speed and compliance — Haiku. If it needs judgment and precision — Opus. If it needs understanding and documentation — Sonnet. The benchmark didn't crown a winner. It drew a map of where each model excels.