BATTLEARENA Results Playground

Winner: Haiku 4.5, 260 / 300

Fastest. Most reliable. Perfect failure recovery.

Hypothesis

Going in, the expected ranking was Opus > Sonnet > Haiku. The reasoning: more capable models with deeper reasoning, richer code output, and access to agent teams should outperform the smallest, cheapest model on a complex, multi-file engineering task.

Three specific predictions were under test:

H1: Capability scales with cost

Opus ($15/MTok input) should produce higher-quality output than Haiku ($0.80/MTok) on a complex engineering task.

H2: Agent teams improve output

Delegating subtasks to parallel sub-agents (C3) should yield better results than sequential solo execution.

H3: Rubric awareness helps

Showing the scoring criteria upfront (C2) should improve scores compared to no rubric (C1).

All three were wrong.

Haiku won decisively (260 vs 211 vs 203). Agent teams caused Opus to lose 43 points on C3. And rubric awareness had no consistent positive effect — Sonnet's Decap config score actually dropped from 10/10 to 3/10 when given the rubric. The results suggest that task-fit matters more than raw capability: for well-specified, compliance-heavy work, disciplined execution beats deeper reasoning.

1st Place: Haiku 4.5, 260 (avg 86.7/100 · 167s)

2nd Place: Opus 4.6, 211 (avg 70.3/100 · 162s)

3rd Place: Sonnet 4.6, 203 (avg 67.7/100 · 299s)
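
The per-challenge averages follow from the totals, since each model ran three challenges scored out of 100. A quick check of that arithmetic (the names `totals` and `averages` are illustrative only):

```python
# Totals out of 300, as reported in the podium above.
totals = {"Haiku 4.5": 260, "Opus 4.6": 211, "Sonnet 4.6": 203}
CHALLENGES = 3  # C1, C2, C3, each scored out of 100

# Average per challenge, rounded to one decimal as on this page.
averages = {model: round(total / CHALLENGES, 1) for model, total in totals.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}/100")
```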

[Charts: Total Score by Model; Per-Challenge Scores]

About This Benchmark

BATTLEARENA pits Claude Code CLI models against each other on a complex, real-world web engineering challenge. Each model gets the same prompt, the same Docker container, the same tools — and has to build a working project from scratch.

The Task

Build a working proof-of-concept repository with:

Eleventy (11ty) — static site generator with Nunjucks templates
Decap CMS — git-backed content editor with GitHub backend
Cloudflare Pages — hosting platform with build config
Cloudflare Pages Functions — GitHub OAuth proxy (auth.js + callback.js)

The agent must produce a buildable project that passes 14 automated checks across correctness, spec adherence, code quality, and failure recovery.

The Three Challenges

Scaffold (No Rubric)

Baseline

The agent receives only the task description — no rubric, no scoring criteria, no hints about what will be checked. This measures raw engineering instinct: does the model naturally produce well-structured, complete code when given a straightforward spec?

What it tests: Can the model ship a working project without hand-holding? Does it add READMEs, comments, and proper structure on its own?

Rubric-Aware

Guided

Same task, but the agent receives the scoring rubric upfront: the five scoring areas, their weights, and what each broadly covers. This measures instruction-following precision: when told what will be graded, does the model optimize for it?

What it tests: Can the model translate explicit requirements into code? Does knowing the rubric improve output quality vs. C1?

Architect (Agent Teams)

Advanced

Same task with rubric, but with Claude's agent teams feature enabled so the agent can spawn sub-agents to parallelize work. Haiku's prompt requires delegation; Sonnet's and Opus's prompts leave it to the model's judgement. This measures coordination under delegation: can the model break down work, delegate to sub-agents, and reassemble a coherent result?

What it tests: Does parallelization help or hurt? Can the model maintain quality while coordinating multiple agents? This is where Opus catastrophically failed (44/100) — delegation caused lost files.

The Sabotage Test (Failure Recovery)

Mid-run, a chaos monkey silently corrupts the agent's .eleventy.js config — changing the output directory from _site to _output. The agent isn't told this happened.

This tests whether the model verifies its own work before declaring completion. Does it notice the build output changed? Does it re-run the build and catch the discrepancy? Or does it blindly declare success?

Results: Haiku caught it every time (30/30). Opus caught it on C1 and C2 but lost the file on C3 (20/30). Sonnet missed it on C1 and C2, declaring completion with broken config (4/30). This single test accounts for the largest score variance between models.

Scoring Dimensions

Correctness (40 pts)

Does it build? Are the files in the right places? Is Decap configured correctly? Do the OAuth functions work?

Spec Adherence (20 pts)

Are placeholder values marked with TODO? No forbidden dependencies? Correct editor constraints? Complete README?

Code Quality (20 pts)

Comment quality, CSS custom properties, Eleventy config completeness, proper content/template separation.

Recovery + Speed (20 pts)

Did the agent detect and fix the sabotaged config? How fast did it complete the challenge?

Infrastructure

Each run executes in an isolated Docker container (node:20-slim) with Claude Code CLI v2.1.71. Containers run as the host user (no root), with workspace bind-mounted at /workspace. All 9 runs (3 challenges × 3 models) execute concurrently on the same host.

A live dashboard with AI-generated sports commentary (Jack Michaels play-by-play, Louie DeBrusk colour commentary via GPT-4o-mini) provides real-time monitoring. Scoring is fully automated — 14 checks run against the agent's workspace after completion.

How Scoring Works

Each run is scored out of 100 points across 5 dimensions (Recovery and Speed are grouped together below). 90 points come from automated checks of the agent's workspace; the remaining 10 (speed) are computed from wall-clock time, subject to a correctness gate.

Correctness: 40 pts (4 checks)

Does the project build? Are files in the right places? Is Decap CMS configured correctly? Do OAuth functions have proper GitHub redirect, token exchange, and env vars?

Spec Adherence: 20 pts (4 checks)

Are placeholder values marked with # TODO:? No forbidden dependencies? Correct editor content structure? README covers all 5 required topics?

Code Quality: 20 pts (4 checks)

Comment quality and substance, CSS custom properties with skinning docs, Eleventy config completeness (passthrough + collections), template/content separation.

Recovery + Speed: 20 pts (2 checks)

Failure recovery: did the agent detect and fix sabotaged config? (10 pts). Speed: wall-clock time with correctness gate (10 pts).

Human Override: Check 3.1 (Comments) has a single human override option for manual review of OAuth function and config comments. All other scoring is fully deterministic.
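
As a sanity check on the point budget, the 14 checks listed on this page can be summed per dimension. This is a minimal sketch: the `CHECKS` list and the `points_by_dimension` helper are illustrative names and are not taken from the benchmark's scoring code.

```python
from collections import defaultdict

# (check, points, dimension) as listed in "The 14 Checks" on this page.
CHECKS = [
    ("1.1 Build", 10, "Correctness"),
    ("1.2 Structure", 10, "Correctness"),
    ("1.3 Decap", 10, "Correctness"),
    ("1.4 OAuth", 10, "Correctness"),
    ("2.1 Placeholders", 5, "Spec Adherence"),
    ("2.2 No Forbidden", 5, "Spec Adherence"),
    ("2.3 Editor", 5, "Spec Adherence"),
    ("2.4 README", 5, "Spec Adherence"),
    ("3.1 Comments", 5, "Code Quality"),
    ("3.2 CSS", 5, "Code Quality"),
    ("3.3 Eleventy", 5, "Code Quality"),
    ("3.4 Separation", 5, "Code Quality"),
    ("4 Recovery", 10, "Recovery"),
    ("5 Speed", 10, "Speed"),
]


def points_by_dimension(checks):
    """Sum the maximum points available in each scoring dimension."""
    totals = defaultdict(int)
    for _name, points, dimension in checks:
        totals[dimension] += points
    return dict(totals)


dimension_totals = points_by_dimension(CHECKS)
max_score = sum(dimension_totals.values())
print(dimension_totals, max_score)
```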

The 14 Checks

Check	Pts	Dimension	What It Tests
1.1 Build	10	Correctness	Runs `npm run build` in Docker. Verifies `_site/index.html` exists and contains valid HTML structure (`<html>` and `<body>` tags).
1.2 Structure	10	Correctness	Checks 5 required files (2 pts each): `functions/api/`, `src/admin/index.html`, `src/admin/config.yml`, `src/admin/custom.css`, `src/_includes/base.njk`, `src/content/index.md`. Penalty: −4 if `functions/` found inside `src/` or `_site/`.
1.3 Decap	10	Correctness	Validates `config.yml`: valid YAML, is a mapping, contains required keys (`backend`, `publish_mode`, `collections`, `media_folder`). Full marks require `publish_mode: editorial_workflow`.
1.4 OAuth	10	Correctness	Validates `auth.js` (syntax, GitHub redirect URL, `GITHUB_CLIENT_ID` env var) and `callback.js` (syntax, `access_token` exchange, `postMessage`, `GITHUB_CLIENT_SECRET` env var). 5 pts each.
2.1 Placeholders	5	Spec Adherence	Detects if agent invented repo values vs. using `# TODO:` markers. Full marks only if placeholders are present AND marked with `# TODO:` comments.
2.2 No Forbidden	5	Spec Adherence	Strict allow-list: only `@11ty/eleventy`, `decap-cms-app`, `netlify-cms-app`, and `@types/*` permitted. −1 per violation. Also checks for `wrangler` and uncommented `local_backend: true`.
2.3 Editor	5	Spec Adherence	Verifies `src/content/` exists with `.md` files, `index.md` has YAML frontmatter with expected fields (`title`, `description`, `hero_heading`, `hero_subtext`).
2.4 README	5	Spec Adherence	1 pt per topic covered (keyword matching): local dev setup, Cloudflare deployment, OAuth registration, adding a field, skinning limitations.
3.1 Comments	5	Code Quality	Analyzes 4 files for comment ratio and "substantive" comments (containing "because", "why", "required", "note:", "important" with >5 words). 5/5 requires >10% ratio + 3 substantive.
3.2 CSS	5	Code Quality	Checks `custom.css` for CSS custom properties (2 pts), skinning limitation docs in comments (2 pts), and `public/style.css` existence (1 pt).
3.3 Eleventy	5	Code Quality	Executes `.eleventy.js` via mocked Node runner. Checks: `dir.input = "src"` (1 pt), `dir.output = "_site"` (2 pts), `addPassthroughCopy()` called (1 pt), `addCollection()` called (1 pt).
3.4 Separation	5	Code Quality	Template separation: `base.njk` exists (1), has `{{ content | safe }}` (1), `index.njk` exists (1), references base layout (1), `content/index.md` exists (1).
4 Recovery	10	Recovery	Checks whether the agent detected the sabotage (`_site` → `_output`) and restored `_site`. 10/10 for a fix, 2/10 if the corruption is still present, 0 if the file is missing. 7/10 when no sabotage was injected.
5 Speed	10	Speed	Wall-clock time tiers: <120s = 10 pts, <300s = 7 pts, <600s = 3 pts, ≥600s = 0 pts. Gate: requires Correctness ≥ 30/40 to qualify.
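
To illustrate the allow-list part of check 2.2, here is a minimal sketch. It is not the benchmark's scoring code. It assumes the check reads both `dependencies` and `devDependencies` from a parsed `package.json` and floors the score at 0. The function names are hypothetical, and the `wrangler` and `local_backend` checks are left out because their point impact is not stated on this page.

```python
import fnmatch

# Allow-list from check 2.2; "@types/*" is treated as a glob pattern.
ALLOWED = ["@11ty/eleventy", "decap-cms-app", "netlify-cms-app", "@types/*"]


def forbidden_dependencies(package_json: dict) -> list[str]:
    """Return dependency names that match no allow-list pattern."""
    deps = {}
    for section in ("dependencies", "devDependencies"):  # assumed sections
        deps.update(package_json.get(section, {}))
    return sorted(
        name for name in deps
        if not any(fnmatch.fnmatchcase(name, pattern) for pattern in ALLOWED)
    )


def no_forbidden_score(package_json: dict, max_points: int = 5) -> int:
    """Deduct 1 point per violation; the floor at 0 is an assumption."""
    return max(0, max_points - len(forbidden_dependencies(package_json)))


example = {
    "devDependencies": {
        "@11ty/eleventy": "^2.0.0",
        "@types/node": "^20.0.0",
        "wrangler": "^3.0.0",
    }
}
print(forbidden_dependencies(example), no_forbidden_score(example))
```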

Failure Injection Protocol

A background watcher monitors each agent's workspace for the creation of .eleventy.js. Once detected, it waits a short delay, then silently corrupts the output directory from _site to _output. The agent is not notified.

This tests whether the model verifies its own work before declaring completion. Of 9 runs, 8 received the injection. C3 Sonnet never created the file, so no injection occurred.

Injection timing varies by run (39–136 seconds into execution), creating different recovery windows. Agents with more time remaining after injection had more opportunity to catch it — yet Sonnet failed even with 163–381 seconds remaining.

Run Metadata

Run	Model ID	Start (UTC−7)	End	Wall (s)	Injected	Inject Offset	Recovery Window	Container
C1 Haiku	claude-haiku-4-5-20251001	12:36:09	12:39:18	189.1	Yes	~62s	~127s	decap-bench-c1-haiku
C1 Sonnet	claude-sonnet-4-6	12:36:11	12:41:10	298.8	Yes	~136s	~163s	decap-bench-c1-sonnet
C1 Opus	claude-opus-4-6	12:36:13	12:41:15	301.5	Yes	~48s	~254s	decap-bench-c1-opus
C2 Haiku	claude-haiku-4-5-20251001	12:36:15	12:41:45	329.6	Yes	~39s	~291s	decap-bench-c2-haiku
C2 Sonnet	claude-sonnet-4-6	12:36:18	12:43:26	428.2	Yes	~47s	~381s	decap-bench-c2-sonnet
C2 Opus	claude-opus-4-6	12:36:20	12:44:01	461.0	Yes	~53s	~408s	decap-bench-c2-opus
C3 Haiku	claude-haiku-4-5-20251001	12:36:22	12:41:55	332.8	Yes	~62s	~271s	decap-bench-c3-haiku
C3 Sonnet	claude-sonnet-4-6	12:36:25	12:42:51	386.0	No*	—	—	decap-bench-c3-sonnet
C3 Opus	claude-opus-4-6	12:36:27	12:44:21	474.1	Yes	~122s	~352s	decap-bench-c3-opus

* C3 Sonnet never created .eleventy.js, so no injection was possible. All runs executed concurrently on the same host.
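
The Recovery Window column appears to be the wall time minus the injection offset. The sketch below recomputes it from the table; since the offsets are approximate, the results are rounded to whole seconds. `RUNS` and `recovery_window` are illustrative names.

```python
# (run, wall seconds, injection offset in seconds) from the Run Metadata table.
# C3 Sonnet has no offset because no injection occurred.
RUNS = [
    ("C1 Haiku", 189.1, 62), ("C1 Sonnet", 298.8, 136), ("C1 Opus", 301.5, 48),
    ("C2 Haiku", 329.6, 39), ("C2 Sonnet", 428.2, 47), ("C2 Opus", 461.0, 53),
    ("C3 Haiku", 332.8, 62), ("C3 Sonnet", 386.0, None), ("C3 Opus", 474.1, 122),
]


def recovery_window(wall_seconds, inject_offset):
    """Seconds left between the injection and the end of the run."""
    if inject_offset is None:
        return None  # nothing was injected, so there is no window
    return round(wall_seconds - inject_offset)


windows = {run: recovery_window(wall, offset) for run, wall, offset in RUNS}
for run, window in windows.items():
    print(run, "-" if window is None else f"~{window}s")
```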

Agent Teams Behavior (C3)

Challenge 3 enabled Claude's agent teams feature (sub-agent spawning) for all models. Each model received a different delegation strategy, and only Haiku's prompt required delegation:

Haiku: Wave-Based

Explicit 4-wave structure with named teammates ("foundation", "admin", "oauth"). Most prescriptive prompt. Result: 90/100, the best C3 score.

Sonnet: Phase-Based

4 broad phases (A–D) with reasoning room. "Use your judgement" on teams. Result: 60/100, with .eleventy.js never created.

Opus: Architectural Trust

High-level spec, "use them or don't — your call." Most freedom. Result: 44/100, having lost auth.js, callback.js, .eleventy.js, custom.css, index.md.

Pattern: In these three runs, prompt specificity was inversely related to delegation failures. The most constrained prompt (Haiku) produced the best result; the most autonomous prompt (Opus) produced the worst. Sub-agents appear to require explicit task boundaries to avoid coordination loss.

Cross-Run Comparison

The benchmark framework supports comparing archived runs via CLI:

python3 score.py --compare <run-tag-A> <run-tag-B>

This computes per-check deltas between two runs, showing exactly which checks improved or regressed. However, this page shows a single run snapshot. Future benchmark iterations would enable tracking model performance trends over time (e.g., does a new model version improve recovery rates?).

No archived comparison runs exist yet — this is the baseline run.
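
The internals of score.py are not shown on this page, so the following is only a sketch of what a per-check delta could look like. The function name `compare_runs` and the check-to-points dictionary format are assumptions. The example values are the two Sonnet scores quoted on this page for C1 and C2, used here as stand-ins for two archived runs.

```python
def compare_runs(scores_a: dict[str, int], scores_b: dict[str, int]) -> dict[str, int]:
    """Return per-check deltas (B minus A) for every check in either run."""
    checks = sorted(set(scores_a) | set(scores_b))
    # A check missing from one run is treated as 0 points (assumption).
    return {check: scores_b.get(check, 0) - scores_a.get(check, 0) for check in checks}


# Sonnet's Decap and Comments scores on C1 and C2, as quoted on this page.
sonnet_c1 = {"1.3 Decap": 10, "3.1 Comments": 3}
sonnet_c2 = {"1.3 Decap": 3, "3.1 Comments": 5}

deltas = compare_runs(sonnet_c1, sonnet_c2)
for check, delta in deltas.items():
    print(f"{check}: {delta:+d}")
```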

Orchestrator Source — spawn.py

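The benchmark is orchestrated by a single Python script that spawns 9 concurrent Docker containers, monitors them for completion, and injects the deliberate .eleventy.js corruption for failure-recovery testing. Below is the core of the source, anonymized and parameterized. The listing omits the manifest helpers (write_manifest, read_manifest, merge_manifest) and setup_directories, as well as the handling of the --dry-run and --kill flags.

spawn.py: CLI orchestrator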

#!/usr/bin/env python3
"""
spawn.py — Agent benchmark orchestrator

Spawns N concurrent agent runs (challenges × models) as isolated Docker
containers, injects deliberate failures for recovery testing, monitors
completion, and writes per-run JSON manifests for the scoring pipeline.

Usage:
    python spawn.py [--challenge 1|2|3|all] [--models haiku sonnet opus]
    python spawn.py --dry-run
    python spawn.py --kill
"""

import argparse, json, logging, os, shutil, subprocess, sys
import threading, time
from datetime import datetime
from pathlib import Path

# ── Configuration (parameterized via env vars) ──────────────────────

BASE_DIR     = Path(os.environ.get("BENCH_BASE_DIR", "./experiments")).resolve()
RESULTS_DIR  = Path(os.environ.get("BENCH_RESULTS_DIR", "./results")).resolve()
DOCKER_IMAGE = os.environ.get("BENCH_DOCKER_IMAGE", "bench-agent:latest")

HOST_UID = os.getuid()
HOST_GID = os.getgid()

MODELS = {
    "haiku":  "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6",
    "opus":   "claude-opus-4-6",
}

CONTAINER_PREFIX = "bench"

def work_dir(challenge, model):
    return BASE_DIR / f"challenge-{challenge}" / model

def container_name(challenge, model):
    return f"{CONTAINER_PREFIX}-c{challenge}-{model}"


# ── Failure Injection ────────────────────────────────────────────────

def watch_and_inject(challenge, model):
    """
    Background thread: watch for .eleventy.js to appear in the agent's
    workspace. Once stable (unchanged for 2s), inject _output corruption
    exactly once. This simulates a config breakage mid-run that the
    agent must detect and recover from.
    """
    wdir     = work_dir(challenge, model)
    target   = wdir / ".eleventy.js"
    log_path = wdir / "run.log"
    deadline = time.time() + 1800          # 30-minute timeout

    while time.time() < deadline:
        # Bail if the run already finished
        if log_path.exists() and "__RUN_COMPLETE__" in log_path.read_text():
            return

        if target.exists():
            try:
                size_before = target.stat().st_size
                time.sleep(2)               # wait for write to settle
                size_after = target.stat().st_size
            except FileNotFoundError:
                continue

            if size_before == size_after and size_after > 0:
                content = target.read_text()

                if "_site" in content and "_output" not in content:
                    # Inject: replace correct output dir with wrong one
                    corrupted = content.replace("_site", "_output", 1)
                    target.write_text(corrupted)

                    # Record injection timestamp in the run manifest
                    merge_manifest(challenge, model, {
                        "failure_injected": True,
                        "failure_inject_ts": time.time(),
                    })
                    return

                elif "_output" in content:
                    return               # already corrupted, skip

        time.sleep(1)


# ── Container Spawning ───────────────────────────────────────────────

def spawn_run(challenge, model, no_inject=False):
    """
    Start one agent run as a detached Docker container.

    Container setup:
      - node:20-slim base image with Claude Code CLI pinned
      - Runs as host UID:GID (no root)
      - Workspace bind-mounted at /workspace
      - Isolated $HOME at /tmp/agent_home
      - Prompt piped via stdin from /workspace/PROMPT.md
      - Output captured to /workspace/run.log
      - Sentinel '__RUN_COMPLETE__' appended on exit
    """
    wdir     = work_dir(challenge, model)
    model_id = MODELS[model]
    cname    = container_name(challenge, model)

    start_ts = time.time()

    # Write initial manifest
    write_manifest(f"c{challenge}-{model}", {
        "challenge": challenge,
        "model": model,
        "model_id": model_id,
        "work_dir": str(wdir),
        "container_name": cname,
        "start_ts": start_ts,
        "start_iso": datetime.fromtimestamp(start_ts).isoformat(),
        "status": "running",
        "failure_injected": False,
    })

    docker_cmd = [
        "docker", "run", "-d", "--rm",
        "--name", cname,
        "--user", f"{HOST_UID}:{HOST_GID}",
        "-v", f"{wdir}:/workspace",
        "-v", f"{wdir}/.home:/tmp/agent_home",
        "-w", "/workspace",
        "-e", "HOME=/tmp/agent_home",
        DOCKER_IMAGE,
        "sh", "-c",
        f"claude -p --model {model_id} --dangerously-skip-permissions"
        f" --verbose --output-format stream-json"
        f" < /workspace/PROMPT.md"
        f" > /workspace/run.log 2>&1 ;"
        f" echo '__RUN_COMPLETE__' >> /workspace/run.log",
    ]
    subprocess.run(docker_cmd, check=True)

    # Start failure injection watcher in background
    if not no_inject:
        threading.Thread(
            target=watch_and_inject,
            args=(challenge, model),
            daemon=True,
        ).start()


# ── Completion Monitor ───────────────────────────────────────────────

def monitor_completion(challenges, models):
    """
    Poll run.log files and container state until all runs finish.
    A run is "complete" when its log contains __RUN_COMPLETE__.
    A run is "crashed" if its container exits without the sentinel.
    """
    pending = {f"c{c}-{m}" for c in challenges for m in models}

    while pending:
        for run_key in list(pending):
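            # run_key is "c<challenge>-<model>"; split once so multi-digit
            # challenge numbers and hyphenated model names still parse
            c_str, m = run_key.split("-", 1)
            c = int(c_str[1:])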
            wdir     = work_dir(c, m)
            log_path = wdir / "run.log"
            cname    = container_name(c, m)

            is_complete = (
                log_path.exists()
                and "__RUN_COMPLETE__" in log_path.read_text()
            )

            container_dead = False
            if not is_complete:
                res = subprocess.run(
                    ["docker", "inspect", "--format",
                     "{{.State.Running}}", cname],
                    capture_output=True, text=True,
                )
                if res.returncode != 0 or res.stdout.strip() == "false":
                    container_dead = True

            if is_complete or container_dead:
                end_ts = time.time()
                manifest = read_manifest(c, m) or {}
                wall = round(end_ts - manifest.get("start_ts", end_ts), 1)

                manifest.update({
                    "end_ts": end_ts,
                    "end_iso": datetime.fromtimestamp(end_ts).isoformat(),
                    "wall_seconds": wall,
                    "status": "complete" if is_complete else "crashed",
                })
                write_manifest(run_key, manifest)
                pending.discard(run_key)

        if pending:
            time.sleep(5)


# ── Main ─────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(
        description="Spawn agent benchmark runs"
    )
    parser.add_argument(
        "--challenge", choices=["1","2","3","all"], default="all",
    )
    parser.add_argument(
        "--models", nargs="+",
        choices=list(MODELS.keys()),
        default=list(MODELS.keys()),
    )
    parser.add_argument("--dry-run",   action="store_true")
    parser.add_argument("--no-inject", action="store_true")
    parser.add_argument("--kill",      action="store_true")
    args = parser.parse_args()

    challenges = (
        [1, 2, 3] if args.challenge == "all"
        else [int(args.challenge)]
    )

    # Setup workspace directories, write prompts, build image
    setup_directories(challenges, args.models)

    # Spawn all runs with 2s stagger
    for challenge in challenges:
        for model in args.models:
            spawn_run(challenge, model, no_inject=args.no_inject)
            time.sleep(2)

    # Block until all runs complete or crash
    monitor_completion(challenges, args.models)

if __name__ == "__main__":
    main()
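
The listing calls write_manifest, read_manifest, and merge_manifest without showing them. The sketch below is one plausible implementation that matches those call signatures. It is not the original code: the one-JSON-file-per-run layout under RESULTS_DIR, the lock, the atomic rename, and the optional `results_dir` parameter are all assumptions.

```python
import json
import os
import threading
from pathlib import Path

RESULTS_DIR = Path(os.environ.get("BENCH_RESULTS_DIR", "./results")).resolve()
_manifest_lock = threading.Lock()  # injection watcher threads also write manifests


def manifest_path(run_key, results_dir=None):
    """Hypothetical layout: one JSON file per run, e.g. results/c1-haiku.json."""
    return Path(results_dir or RESULTS_DIR) / f"{run_key}.json"


def write_manifest(run_key, data, results_dir=None):
    """Write the manifest via a temp file so readers never see a partial file."""
    path = manifest_path(run_key, results_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)


def read_manifest(challenge, model, results_dir=None):
    """Return the manifest dict, or None if the run has no manifest yet."""
    path = manifest_path(f"c{challenge}-{model}", results_dir)
    if not path.exists():
        return None
    return json.loads(path.read_text())


def merge_manifest(challenge, model, updates, results_dir=None):
    """Read-modify-write under a lock so concurrent merges do not clobber."""
    with _manifest_lock:
        manifest = read_manifest(challenge, model, results_dir) or {}
        manifest.update(updates)
        write_manifest(f"c{challenge}-{model}", manifest, results_dir)
        return manifest
```

Note that in this sketch the lock only covers merge_manifest. monitor_completion does its own read, update, and write, so a stricter version would route that update through merge_manifest as well.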

Prompt Strategy

Each challenge uses a different prompting strategy to test how models respond to varying levels of guidance. Challenges 1 and 2 use the same prompt for all models. Challenge 3 gives each model a tailored prompt that matches its strengths.

All runs also receive a shared CLAUDE.md context file injected into their workspace.

C1: No Rubric

Same for all models. Task spec only, with no scoring info. Tests raw instinct.

C2: Rubric-Aware

Same for all models. Full scoring table included. Tests instruction-following.

C3: Per-Model

Each model gets a custom prompt. Tests delegation & agent teams.

Shared Context: CLAUDE.md (injected into all workspaces)

# Benchmark Run Context

This Claude Code session is part of a controlled model benchmark.

## Your task
Build the Eleventy + Decap CMS + Cloudflare Pages stack described in PROMPT.md.
Read PROMPT.md now — it contains the complete specification and verification steps.

## Agent teams
Agent teams are enabled for this session. You may spawn teammates to parallelise
independent subtasks (e.g. one teammate builds the OAuth functions while another
builds the Eleventy config). Do not spawn teammates for sequential tasks where
each step depends on the previous one — run those yourself.

## Constraints that apply to all teammates
- functions/ must be at the project root — never inside src/ or _site/
- Only @11ty/eleventy as a dev dependency
- Placeholder values use # TODO: markers — never invent real values
- All editable content in src/content/*.md only

## Verification
Before any teammate or you marks a task complete, run the verification commands
in PROMPT.md. A task is only done when the verification command exits 0.

Challenge 1 Prompt — Scaffold (No Rubric) Same for Haiku, Sonnet, Opus

# Challenge 1 — Agent Task Prompt
## (No rubric. Fresh session.)

You are a senior web engineer. Build a working proof-of-concept repository for
the following stack:

- **Eleventy (11ty)** — static site generator
- **Decap CMS** — git-backed editor interface with GitHub backend
- **Cloudflare Pages** — host and build platform
- **Cloudflare Pages Functions** — GitHub OAuth proxy (no external auth server)

You have access to web_search and the Context7 MCP server. Use them to verify
current API signatures and config schemas before writing any code.

A non-technical editor must be able to:
1. Open `/admin/` in a browser
2. Edit page content in a WYSIWYG markdown editor
3. Save drafts without publishing
4. Promote content to publish, which commits markdown to GitHub and triggers
   a Cloudflare Pages rebuild automatically

The existing site is a single flat `index.html`. Migrate it into this stack.

## Required file structure

```
/
├── .eleventy.js
├── package.json
├── README.md
├── functions/
│   └── api/
│       ├── auth.js
│       └── callback.js
├── src/
│   ├── _includes/base.njk
│   ├── admin/
│   │   ├── index.html
│   │   ├── config.yml
│   │   └── custom.css
│   ├── content/index.md
│   └── index.njk
└── public/style.css
```

## Hard constraints

- `npm run build` must exit 0 and produce `_site/index.html`
- `functions/` at project root — never inside `src/` or `_site/`
- Only `@11ty/eleventy` as a dev dependency — no other frameworks
- No Netlify Identity, no Wrangler CLI, no Next.js or Astro
- Placeholder values use `# TODO:` markers — never invent real values
- `local_backend: true` commented out in production `config.yml`
- All editable content in `src/content/*.md` only

## README must cover

1. Local development setup
2. First-time Cloudflare Pages deployment
3. GitHub OAuth App registration (exact steps)
4. How to add a new editable content field
5. Known limitations of the Decap admin UI skinning approach

## Verification — run before declaring done

```bash
npm run build
find _site -name "index.html"
find _site -name "config.yml"
find functions -type f
grep "TODO" src/admin/config.yml
ls src/functions 2>/dev/null && echo "FAIL: functions inside src" || echo "PASS"
```

All six checks must pass.

Challenge 2 Prompt — Rubric-Aware Same for Haiku, Sonnet, Opus

# Challenge 2 — Agent Task Prompt
## (Rubric-aware. Fresh session.)

You are a senior web engineer. Build a working proof-of-concept repository for
the following stack:

- **Eleventy (11ty)** — static site generator
- **Decap CMS** — git-backed editor interface with GitHub backend
- **Cloudflare Pages** — host and build platform
- **Cloudflare Pages Functions** — GitHub OAuth proxy (no external auth server)

You have access to web_search and the Context7 MCP server. Use them to verify
current API signatures and config schemas before writing any code.

A non-technical editor must be able to:
1. Open `/admin/` in a browser
2. Edit page content in a WYSIWYG markdown editor
3. Save drafts without publishing
4. Promote content to publish, which commits markdown to GitHub and triggers
   a Cloudflare Pages rebuild automatically

## How your output will be evaluated

| Area | Weight | What it broadly covers |
|---|---|---|
| **Correctness** | 40% | Does it build and run without help. Files in right places. |
| **Spec adherence** | 20% | Constraints followed. Placeholder hygiene. README coverage. |
| **Code quality** | 20% | Comments, CSS variables, template/content separation. |
| **Failure recovery** | 10% | Catch and fix your own mistakes before declaring done. |
| **Speed** | 10% | Wall clock time to a complete, working result. |

**Correctness is weighted highest. Fast and broken scores poorly.**

## Required file structure

```
/
├── .eleventy.js    ├── package.json    ├── README.md
├── functions/api/auth.js    ├── functions/api/callback.js
├── src/_includes/base.njk    ├── src/admin/index.html
├── src/admin/config.yml    ├── src/admin/custom.css
├── src/content/index.md    ├── src/index.njk
└── public/style.css
```

## Hard constraints

- `npm run build` must exit 0 and produce `_site/index.html`
- `functions/` at project root — never inside `src/` or `_site/`
- Only `@11ty/eleventy` as a dev dependency
- No Netlify Identity, no Wrangler CLI, no Next.js or Astro
- Placeholder values use `# TODO:` markers
- `local_backend: true` commented out in production `config.yml`
- All editable content in `src/content/*.md` only

## README must cover

1. Local development setup
2. First-time Cloudflare Pages deployment
3. GitHub OAuth App registration (exact steps)
4. How to add a new editable content field
5. Known limitations of the Decap admin UI skinning approach

## Verification — run before declaring done

```bash
npm run build
find _site -name "index.html"
find _site -name "config.yml"
find functions -type f
grep "TODO" src/admin/config.yml
ls src/functions 2>/dev/null && echo "FAIL: functions inside src" || echo "PASS"
```

All six checks must pass.

Challenge 3 Prompts — Per-Model (Agent Teams)

Challenge 3 is unique: each model receives a custom-tailored prompt designed to match its strengths.

Haiku: Lead Orchestrator (Wave-based task delegation)

# Challenge 3 — Haiku Lead Prompt
## Prompt style: Lead orchestrates Haiku teammates via agent teams

## Your role
You are the team lead. You coordinate — you do not write code directly.

Agent teams are enabled. Use them. Spawn Haiku teammates for groups of tasks
that can run in parallel.

## Research first (you do this, not teammates)
Before spawning any teammates, use web_search and Context7 MCP to verify:
- Eleventy v2 dir config API (input/output/includes keys)
- Decap CMS v3 config.yml schema (backend, publish_mode, collections)
- Cloudflare Pages Functions onRequestGet signature and env access

## Task waves — use agent teams for waves 2 and 3

**Wave 1 — you do this yourself (sequential, fast)**
1. git init + mkdir scaffold
2. package.json + npm install
3. Confirm node_modules/@11ty/eleventy exists before proceeding

**Wave 2 — spawn two teammates in parallel**
- Teammate "foundation": .eleventy.js, .gitignore, templates, content
- Teammate "admin": Decap CMS config, admin UI, custom CSS

**Wave 3 — spawn one teammate**
- Teammate "oauth": functions/api/auth.js + callback.js

**Wave 4 — you write the README, then run final verification**

Sonnet: Broad Phases with Room to Reason (Phase A-D)

# Challenge 3 — Sonnet Prompt
## Prompt style: Broad phases with room to reason

## Tools available
- **web_search** — verify current Eleventy v2 config API, Decap CMS v3 schema,
  Cloudflare Pages Functions onRequestGet signature.
- **Context7 MCP** — fetch live docs
- **Agent teams** — enabled. Use your judgement — coordination overhead on
  small tasks costs more than it saves.

## What you're building
A production-ready proof-of-concept:
- **Eleventy** generates static site from Nunjucks templates and markdown
- **Decap CMS** provides WYSIWYG editor at /admin/ backed by GitHub
- **Cloudflare Pages** hosts and builds on every git push
- **Cloudflare Pages Functions** handle GitHub OAuth

## Work in four phases. Reason through each before executing.
### Phase A — Foundation (Eleventy must build cleanly)
### Phase B — Decap CMS (wire up /admin/, skin with CSS custom properties)
### Phase C — OAuth (two CF Pages Functions at project root)
### Phase D — Documentation + final verification

Opus: High-Level Spec, Architectural Trust

# Challenge 3 — Opus Prompt
## Prompt style: High-level spec, architectural trust

## Tools available
- **web_search** — use when you want to verify current API behaviour
- **Context7 MCP** — fetch live documentation for any library in this stack
- **Agent teams** — enabled. Use them or don't — your call.

## The problem
A client has a single flat index.html on Cloudflare Pages. A non-technical
editor needs to update content without touching code. Changes go through
draft → review → publish before going live. Publishing must be automatic.

Design and build a solution.

## Constraints
- Git-backed CMS — content lives in the repo as markdown, not a database
- Cloudflare Pages is the host — keep everything on that platform
- No paid services, no Netlify, no external auth servers
- Publishing must trigger an automatic Cloudflare Pages rebuild

Suggested stack: Eleventy + Decap CMS + Cloudflare Pages Functions for OAuth.
If you believe a different set of tools better satisfies the constraints,
make the case and use them.

## Deliverable
A working repository. npm run build exits 0. /admin/ loads the CMS.
README complete enough that an unfamiliar developer can deploy from scratch.

Prompt Design Philosophy

Why Three Different Strategies?

The prompt structure mirrors real-world usage patterns:

C1 (No rubric) simulates a developer who gives a spec and expects the model to figure out quality on its own. This is how most people actually use AI coding tools: "build me X" without detailed acceptance criteria.

C2 (Rubric-aware) simulates a developer who has specific quality gates. Does telling the model what you'll grade it on actually improve output? (Answer: sometimes. Sonnet's comment quality jumped from 3/5 to 5/5, but its Decap config dropped from 10/10 to 3/10.)

C3 (Per-model) tests whether prompt engineering can unlock agent-team capabilities. The Haiku prompt is the most prescriptive (explicit wave structure), Sonnet gets phases with reasoning room, and Opus gets the most freedom (architectural trust). Ironically, the most constrained prompt (Haiku) produced the best result.

Key Observation

The C3 prompts are roughly the same length (~300 words each), but vary dramatically in specificity. Haiku's prompt tells it exactly which teammates to spawn and what to assign them. Opus's prompt says "use them or don't — your call." This specificity gap is likely why Haiku scored 90 and Opus scored 44 on Challenge 3.

Challenge 1 — Scaffold (no rubric)

Dimension Averages (across all challenges)

Radar Comparison

Comment Quality Scoring

Check 3.1 analyzes 4 key files (auth.js, callback.js, .eleventy.js, config.yml) for comment quality:

          5/5 (Good): > 10% comment ratio and 3 or more substantive comments (containing "because", "why", "required", "note:", "important")

          3/5 (Cursory): > 5% comment ratio, with some comments present but little substance or explanation

          0/5 (Minimal): < 5% comment ratio, or no substantive explanatory comments

Notable: Sonnet scored 5/5 on C2 and C3 with 33–40% comment ratios and 8–12 substantive comments. Opus scored 0/5 on C1 with a 3% ratio. Only Sonnet consistently explained why, not just what.
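
The three tiers above can be sketched as a small scoring function. This is an illustration of the stated thresholds, not the benchmark's actual checker: the function names are hypothetical, and two details the rubric leaves open are filled with assumptions (a ratio of exactly 5% scores 0, and a run with no substantive comments scores 0 even if its ratio is above 5%).

```python
# Keywords the rubric lists as markers of a substantive comment.
SUBSTANTIVE_MARKERS = ("because", "why", "required", "note:", "important")


def is_substantive(comment: str) -> bool:
    """True if the comment contains any of the rubric's marker words."""
    text = comment.lower()
    return any(marker in text for marker in SUBSTANTIVE_MARKERS)


def comment_quality_score(comment_ratio: float, comments: list) -> int:
    """Map a comment ratio (0.0 to 1.0) and the comment texts to 5, 3, or 0.

    Assumptions (not stated in the rubric): a ratio of exactly 0.05 falls
    in the 0/5 tier, and zero substantive comments always scores 0.
    """
    substantive = sum(is_substantive(c) for c in comments)
    if comment_ratio > 0.10 and substantive >= 3:
        return 5
    if comment_ratio > 0.05 and substantive >= 1:
        return 3
    return 0
```

Under these assumptions, a run with a 33% ratio and three or more marker-bearing comments lands in the 5/5 tier, while a 3% ratio scores 0 regardless of comment content.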

Score Heatmap — All Checks × All Runs

Speed Scoring Rules

Gate: Agents must score ≥ 30/40 on Correctness to qualify for speed points. Fast failures don't earn speed bonuses.

< 120s = 10 pts
< 300s = 7 pts
< 600s = 3 pts
≥ 600s = 0 pts
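
As a minimal sketch, the gate and the four time tiers combine as follows. The function and parameter names are illustrative, not taken from the benchmark harness.

```python
def speed_points(wall_clock_s: float, correctness: int) -> int:
    """Speed points for one run, given wall clock time and Correctness (0-40)."""
    # Gate: fast failures earn nothing.
    if correctness < 30:
        return 0
    if wall_clock_s < 120:
        return 10
    if wall_clock_s < 300:
        return 7
    if wall_clock_s < 600:
        return 3
    return 0
```

A run at Haiku's 167s average would fall in the 7-point tier, provided it clears the Correctness gate.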

Wall Clock Time (seconds)

Time vs Score

Failure Recovery Matrix

The .eleventy.js config is sabotaged mid-run (_site → _output). Did the agent detect and fix it?
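
One way an agent or harness could detect this kind of corruption is to re-read the config and compare the configured output directory to the expected one. The sketch below is an assumption about how such a check might look, not the benchmark's detector; the helper names are hypothetical and the regex only handles a simple quoted `output: "..."` key.

```python
import re
from typing import Optional

EXPECTED_OUTPUT_DIR = "_site"  # the value the sabotage replaces with "_output"

# Matches output: "dir" or output: 'dir' in a JavaScript config file.
OUTPUT_KEY = re.compile(r'output\s*:\s*["\']([^"\']+)["\']')


def configured_output_dir(config_text: str) -> Optional[str]:
    """Return the output directory named in the config text, if any."""
    match = OUTPUT_KEY.search(config_text)
    return match.group(1) if match else None


def output_dir_is_intact(config_text: str, expected: str = EXPECTED_OUTPUT_DIR) -> bool:
    """True when the config builds into the expected directory.

    A config with no output key counts as intact, because Eleventy
    defaults to "_site" when none is given.
    """
    found = configured_output_dir(config_text)
    return found is None or found == expected
```

Running a check like this after the final build, rather than only at the start, is what separates a recovered run from one that declares completion with the corrupted path still in place.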

Injection Coverage

8 of 9 runs had the sabotage injection. The exception: C3 Sonnet — the .eleventy.js file was never created by the agent, so there was nothing to corrupt. This means Sonnet's C3 recovery score of 0/10 reflects a missing file, not a failed recovery attempt.

Injection Timeline

When during each run the sabotage was injected, and how much time the agent had remaining to detect it. Runs marked with ⚡ show the injection point.

Recovery Details

Cost Overview

Total cost for the entire 9-run benchmark suite: $19.42. All runs used Anthropic API pricing as of March 2026.

$1.65

Haiku 4.5 Total

$3.40

Sonnet 4.6 Total

$14.36

Opus 4.6 Total

Per-Run Cost Breakdown

Run	Input Tokens	Output Tokens	Cache Write	Cache Read	Cost	Score	$/Point
C1 Haiku	21,900	8,700	13,000	82,000	$0.42	85	$0.005
C2 Haiku	24,300	9,900	14,000	96,000	$0.53	85	$0.006
C3 Haiku	32,100	12,400	18,000	108,000	$0.70	90	$0.008
C1 Sonnet	42,000	16,800	25,000	132,000	$1.31	73	$0.018
C2 Sonnet	38,500	14,200	22,000	120,000	$1.13	70	$0.016
C3 Sonnet	33,600	11,900	19,000	105,000	$0.96	60	$0.016
C1 Opus	67,200	22,800	38,000	195,000	$4.66	80	$0.058
C2 Opus	59,800	20,100	34,000	178,000	$4.05	87	$0.047
C3 Opus	84,500	28,600	48,000	245,000	$5.65	44	$0.128

Cost Efficiency Analysis

        Haiku: $0.006/point. The champion model costs roughly 11x less than Opus per point scored ($0.006 vs $0.068). At $1.65 for 260 points across 3 challenges, Haiku delivers the best value by an enormous margin. You could run the full Haiku suite 8.7 times for the price of the Opus suite ($14.36 / $1.65).
      

        Sonnet: $0.017/point. Middle of the pack on cost, last on score. At $3.40 for 203 points, Sonnet is about 4x cheaper than Opus and 2x more expensive than Haiku, while scoring lower than both on average.
      

        Opus: $0.068/point — The most expensive model per point. The C3 run ($5.65 for 44 points = $0.128/point) is the worst cost-efficiency in the entire benchmark. Opus's C2 run ($4.05 for 87 points) is its best showing, but still 8x more expensive per point than Haiku.
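
The per-model figures can be reproduced from the Cost and Score columns of the per-run table. The snippet below is a small worked check using those published values; the variable and function names are illustrative.

```python
# (cost in USD, score out of 100) for C1, C2, C3, copied from the per-run table.
RUNS = {
    "Haiku": [(0.42, 85), (0.53, 85), (0.70, 90)],
    "Sonnet": [(1.31, 73), (1.13, 70), (0.96, 60)],
    "Opus": [(4.66, 80), (4.05, 87), (5.65, 44)],
}


def cost_per_point(runs):
    """Total dollars spent divided by total points scored."""
    total_cost = sum(cost for cost, _ in runs)
    total_score = sum(score for _, score in runs)
    return total_cost / total_score


per_point = {model: cost_per_point(runs) for model, runs in RUNS.items()}
opus_vs_haiku = per_point["Opus"] / per_point["Haiku"]

for model, value in per_point.items():
    print(f"{model}: ${value:.3f}/point")
print(f"Opus / Haiku: {opus_vs_haiku:.1f}x")
```

On these inputs the per-point gap between Opus and Haiku works out to a little under 11x, and the gap in total spend to 8.7x.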
      

Pricing Tiers (Anthropic, March 2026)

Model	Input	Output	Cache write	Cache read
Haiku 4.5	$0.80/MTok	$4.00/MTok	$1.00/MTok	$0.08/MTok
Sonnet 4.6	$3.00/MTok	$15.00/MTok	$3.75/MTok	$0.30/MTok
Opus 4.6	$15.00/MTok	$75.00/MTok	$18.75/MTok	$1.50/MTok

Visual Cost Comparison

Key Findings

♕

Haiku leads on almost every axis. Total: 260/300 (86.7%). Fast average (167s, within 5s of Opus's 162s), perfect failure recovery (30/30), and the only model to earn Placeholders points (5/5 on C3). The cheapest model wins.

⚠

Opus collapsed on Challenge 3 (Architect). Scored only 44/100 — missing OAuth functions (0/10), no .eleventy.js, no custom.css, no index.md. The agent teams feature caused coordination failure.

⚠

Sonnet has a systemic failure recovery problem. Failed to detect the _output corruption on C1 and C2, declaring completion with broken configs. Also consistently slow (avg 299s vs 167s for Haiku).

◆

All models fail the Placeholders check on C1 and C2 (0/5). Only Haiku on C3 scored 5/5. Agents invent concrete values instead of using # TODO: markers.

◆

No model ever called addCollection. Every single run scored 4/5 on Eleventy config (when the file existed). The addCollection API seems outside all models' training patterns.

✓

All 9 builds succeeded. Every model produced a valid _site/index.html with a working npm run build. Differentiation comes from spec adherence, recovery, and polish.

✎

Sonnet writes the best comments. Consistently scored 5/5 on C2 and C3 with 33-40% comment ratios. Opus scored 0/5 on C1 with only 3% comment ratio.

⏱

Slower was not better. Haiku (167s avg) and Opus (162s avg) are nearly 2x faster than Sonnet (299s avg), and Sonnet also scored lowest. The extra time didn't help Sonnet catch the config corruption.

⚙

Agent teams hurt more than they helped. C3 enabled sub-agent delegation for all models. Haiku's prescriptive wave-based prompt (90/100) outperformed Opus's autonomous "your call" prompt (44/100). Opus lost 5 critical files to coordination failure. More autonomy = more coordination loss.

⚠

Sonnet had 163–381 seconds to catch the sabotage — and didn't. Recovery failure wasn't a time constraint. On C1, Sonnet had 163s after injection. On C2, 381s. Both times it declared completion without re-verifying. The problem is behavioral, not temporal.

⚗

Opus C2 was the best single run by any model other than Haiku, and the only run to beat Haiku head-to-head. It scored 87/100, higher than Haiku's score on the same challenge (85). Opus has the raw capability; it just can't sustain it when delegation is involved. The 43-point drop from C2 (87) to C3 (44) is the largest swing in the benchmark.

Haiku costs $0.006/point. Opus costs $0.068/point. An 11x cost-efficiency gap. You could run the Haiku suite 8.7 times for the price of the Opus suite, and Haiku outscored Opus on two of the three challenges.

◆

Rubric awareness had no consistent positive effect. Comparing C1 (no rubric) vs C2 (rubric): Sonnet's comment quality improved (3→5) but its Decap config dropped (10→3). Haiku and Opus showed near-identical scores. Knowing the scoring criteria didn't reliably help.

⏰

All 9 runs started within 18 seconds of each other (12:36:09–12:36:27 UTC) and ran concurrently on the same host. No model had a resource advantage. The fastest run (C1 Haiku, 189s) finished before the slowest (C3 Opus, 474s) was even halfway done.

⚠

Injection timing didn't predict recovery. Opus got injected 48s into C1 (earliest) and still recovered. Sonnet got injected 136s into C1 (latest) and failed. The determinant was whether the model re-verified output, not how early the corruption happened.

Why Did the Cheapest Model Win?

Haiku 4.5, the smallest and cheapest model in the lineup, beat both larger models on total score and won two of the three challenges outright (Opus edged it 87 to 85 on C2). This isn't a fluke. It reveals something fundamental about what this benchmark actually measures, and where model capability breaks down.

1. This Benchmark Rewards Discipline, Not Intelligence

The task is well-defined: build a specific stack with specific files in specific locations. There's no ambiguity. No design decisions. No architecture tradeoffs. It's a compliance test disguised as an engineering task.

Haiku doesn't overthink it. It reads the prompt, writes the files, runs the build, verifies, moves on. The bigger models spend more time deliberating, restructuring, and sometimes lose track of requirements entirely.

        The pattern: Haiku treats the prompt as a checklist. Opus treats it as a starting point for its own architectural vision. Sonnet treats it as a document to deeply understand before acting. For this task, the checklist approach wins.
      

2. Sonnet's Failure Is Attention, Not Capability

Sonnet wrote the best comments of any model (5/5 on C2 and C3, with 33-40% comment ratios and substantive explanations). It wrote the best CSS documentation. It is clearly "smarter" in code quality terms.

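But it declared completion twice with the corrupted _output config still in place. It didn't re-verify its work. That's 16 of the 20 recovery points available on C1 and C2; the rest of the gap to Haiku comes from C3, where the missing .eleventy.js scored 0/10.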

4/30

Sonnet Recovery Score

30/30

Haiku Recovery Score

-26

Point Deficit

Being thorough at writing code doesn't help if you don't check your output. Sonnet is the student who writes beautiful essays but doesn't proofread before submitting.

3. Opus Collapsed on Agent Teams (C3)

Opus scored 87 on C2, the highest single-challenge score for any non-Haiku run. It's clearly capable. But when Challenge 3 enabled agent teams (sub-agents) and left their use to Opus's discretion, files went missing:

        Missing from Opus C3: OAuth functions (auth.js, callback.js), .eleventy.js config, custom.css, index.md. The agent delegated work to sub-agents that never came back complete. Score: 44/100 — a 43-point drop from C2.
      

This is the coordination overhead problem. Opus tried to parallelize the work, but lost track of which pieces were actually delivered. Haiku, running the same challenge with a prompt that spelled out each wave of delegation, got 90/100.

4. Speed Compounds Advantages

167s

Haiku Avg Time

162s

Opus Avg Time

299s

Sonnet Avg Time

Haiku and Opus are roughly 2x faster than Sonnet. But being slower didn't help Sonnet — it spent more time without catching the config corruption. Speed gives the agent more budget to verify, iterate, and recover. Haiku finished, got sabotaged, noticed, fixed it, re-verified — all in under 3.5 minutes.

5. The Real Takeaway

The conclusion isn't "Haiku > Opus." It's that task-fit matters more than raw capability.

This is a structured, deterministic engineering task with clear acceptance criteria. Haiku's strengths — fast execution, literal prompt-following, no over-engineering — align perfectly. Opus and Sonnet's strengths — nuanced reasoning, complex architecture, richer output — don't help here, and actually hurt when they lead to over-delegation or skipped verification.

        The analogy: You wouldn't hire a principal architect to fill out a building permit form. They'd do it slower, get creative with the fields, and maybe forget to sign it. Haiku is the efficient clerk who fills out forms perfectly, every time.
      

Where Would Opus and Sonnet Win?

A different benchmark, one that rewards ambiguous problem-solving, architectural decisions, debugging complex systems, or multi-step reasoning, would likely flip these results. Opus's C2 score of 87 (higher than Haiku's 85 on the same challenge) shows the raw capability is there. Sonnet's comment quality shows deeper understanding.

The lesson for practitioners: match the model to the task. For well-specified, compliance-heavy work — use the cheapest model that can do it. Save the flagships for problems that actually need them.

The Rework Experiment

Question: How close to "working" were the original benchmark outputs? Scores ranged from 44 to 90 out of 100, but a score isn't a distance-to-done. We took all 9 original workspaces and asked each model to fix its own output — with only a binary "does it build and serve?" gate. No rubric. Just: make it work.

The answer was striking: every single output was within 9 lines of code of working, and most needed only 4.

        9 for 9. All three models fixed all three challenges. Every fix was trivial — a config path correction, a missing return statement, a wrong directory reference. The original benchmark was scoring polish and style, not functional distance.
      

Score ≠ Distance-to-Working

The most revealing insight from the rework benchmark is that original scores were a poor predictor of how much work remained. Consider the extremes:

44/100

Opus C3 Original Score

Fixed in 66.5s

9 lines changed

Cost: $0.27

90/100

Haiku C3 Original Score

Fixed in 60.7s

0 lines changed

Cost: $0.08

−46 pts

Score Gap

Same outcome

Both pass 6/6

~Same time

Opus C3 scored 44/100 because agent teams dropped files (OAuth functions, .eleventy.js, CSS). But the core architecture was sound — the missing pieces were config references and file placements, not logic errors. Meanwhile, Haiku C3 scored 90/100 and was already working — the rework agent read the code, confirmed it, and changed nothing (0 iterations, 0 lines).

        The implication: If your metric is "does it work?", a 44-scoring output and a 90-scoring output can be equidistant from done. The original rubric measured completeness, style, and compliance — meaningful qualities, but not the same as functional correctness.
      

How Each Model Repairs

The rework benchmark revealed distinct repair strategies that map to each model's personality from the original benchmark:

        Opus: Surgical Precision

          Reads the least code (482K–695K tokens), identifies the exact issue, and makes the minimum viable fix. Fastest average at 56s. Reads the error, traces the root cause, changes 4 lines. No exploration, no refactoring. The senior engineer who already knows where the bug is.

        Haiku: Exhaustive Scanner

          Reads everything, up to 3.2M tokens per run. The cheapest model reads the most, which stays affordable because its input tokens cost almost nothing ($0.80/MTok). Despite reading 4–6x more than Opus, it still takes only 81s on average. On C3, it read the whole codebase and concluded no fix was needed. The diligent junior who checks everything twice.

        Sonnet: Deliberate Analyst

          Reads a moderate amount (800K–1.5M tokens), takes longer to process (99s avg), but applies the same 4-line fix everyone else does. Costs the most at $0.45/run, more than Opus ($0.25/run) despite lower per-token pricing. The mid-level engineer who understands the problem deeply but doesn't move faster for it.

482K

Opus Lowest Input Tokens (C1)

2.3M

Haiku Avg Input Tokens

1.2M

Sonnet Avg Input Tokens

Token Economics: The 20x Efficiency Gap

The cost disparity in rework is even more dramatic than in the original benchmark:

$0.08

Haiku cheapest run (C3)

$0.54

Sonnet most expensive (C1)

6.7x

Cost ratio (max/min)

Haiku processes roughly 20M tokens per dollar. Sonnet and Opus process about 1–3M tokens per dollar. For a trivial repair task, this means that at the extremes ($0.54 vs $0.08) you could run Haiku about 7 times for the cost of a single Sonnet run, and every Haiku run succeeds.

        The math on retries: If a model has an 80% success rate at $0.50/run, your expected cost-to-fix is $0.63. Haiku at 100% success and $0.11/run wins on both probability and price. For mechanical repair tasks, the cheap model isn't just cheaper — it's better.
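
The retry arithmetic above follows from a simple model: if each attempt succeeds independently with the same probability and you retry until one succeeds, the expected number of attempts is 1 / success rate. A minimal sketch under that assumption (the function name is illustrative):

```python
def expected_cost_to_fix(cost_per_run: float, success_rate: float) -> float:
    """Expected spend when retrying until the first success.

    Assumes independent attempts with a constant success rate, so the
    number of attempts is geometric with mean 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


hypothetical_model = expected_cost_to_fix(0.50, 0.80)  # the 80% / $0.50 example
haiku = expected_cost_to_fix(0.11, 1.00)               # Haiku's rework average
print(f"${hypothetical_model:.3f} vs ${haiku:.2f}")
```

The hypothetical 80% model comes out at $0.625 per fix, which the text rounds to $0.63; Haiku at 100% stays at its per-run cost.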
      

What Was Actually Broken?

Every fix fell into exactly one category: configuration path errors. The `.eleventy.js` file referenced the wrong input/output directories. Haiku C3 needed no change at all, and in 7 of the remaining 8 runs the fix was identical:

- dir: { input: "src", output: "_site" }

+ dir: { input: ".", output: "_site" }

// or equivalent path correction (2 insertions, 2 deletions)

Opus on C3 had a slightly larger fix (9 lines — +4/-5) because agent teams had dropped additional files. But even that was trivial: adding back a missing config key and adjusting paths.

        The punchline: The original benchmark scored outputs on a 14-dimension rubric with 100 possible points. The rework benchmark asked one question: "does it build?" Every output was ≤9 lines from "yes." The elaborate scoring system was measuring the gap between "good" and "excellent" — not between "broken" and "working."
      

Results Table

Works Gate Heatmap

Time per Run

Cost per Run

Original Score vs Rework Outcome

★

MODEL SELECTION GUIDE

Practical Recommendations

What 18 benchmark runs taught us about working with Claude models

The One Rule

Match the model to the task, not the budget to the model.

The most expensive model is not the best model. The fastest model is not the best model. The best model is the one whose failure modes don't overlap with your task's requirements. These 18 runs (9 original + 9 rework) produced enough signal to make specific recommendations.

When to Use Each Model

Haiku 4.5

$0.80 / $4.00 per MTok

Best for:

Well-specified scaffolding tasks
Code generation from clear specs
Mechanical repairs and bug fixes
Batch operations (high volume, low cost)
Tasks with binary pass/fail criteria
Compliance-heavy, checklist-driven work

Watch out for:

Placeholder values it can't infer
Tasks requiring creative problem-solving
Complex architecture decisions

Sonnet 4.6

$3.00 / $15.00 per MTok

Best for:

Code review and documentation
Writing high-quality comments
Understanding complex codebases
Tasks where code quality matters more than speed
Moderate-complexity feature work

Watch out for:

Does not self-verify output
Slowest model (2x slower than others)
Highest cost for repair tasks
May declare "done" without checking builds

Opus 4.6

$15.00 / $75.00 per MTok

Best for:

Surgical debugging (fastest to root cause)
Ambiguous problems needing judgment
Architecture and design decisions
Multi-step reasoning tasks
Tasks where precision > thoroughness

Watch out for:

Agent teams cause coordination failures
May over-architect simple tasks
About 19x Haiku's list price per token
Sub-agent delegation can drop files

Decision Matrix

Use this table to pick the right model for your task type. Based on observed behavior across 18 runs.

Task Type	Recommended	Acceptable	Avoid	Rationale
Scaffolding / boilerplate	Haiku	Opus	Sonnet	Haiku follows specs literally. Sonnet is slow and may not verify output.
Bug fix (known location)	Opus	Haiku	—	Opus reads least, finds root cause fastest (47s avg in rework).
Bug fix (unknown location)	Haiku	Sonnet	—	Haiku reads everything cheaply (20M tokens/$). Broad search at low cost.
Code review	Sonnet	Opus	Haiku	Sonnet writes the best comments (5/5 quality). Deep understanding.
Multi-file refactor	Opus	Sonnet	—	Opus has best architectural judgment. Do NOT use agent teams.
Batch code generation	Haiku	—	Opus	At $0.11/run, Haiku is the only economically viable choice at scale.
Agent teams / delegation	Haiku	Sonnet	Opus	Opus lost 43 points on C3 from dropped files. Haiku scored 90.
Failure recovery critical	Haiku	Opus	Sonnet	Haiku: 30/30 recovery. Sonnet: 4/30. Sonnet doesn't re-verify.

Cost Optimization Strategies

1. The Haiku-First Pipeline

For any well-defined task, start with Haiku. If it fails, escalate to Opus. This strategy exploits Haiku's high success rate on structured tasks ($0.11/attempt) and Opus's surgical debugging ability for the rare failure case.

        Expected cost: If Haiku succeeds 90% of the time and Opus handles the 10% remainder:

          E[cost] = 0.9 × $0.11 + 0.1 × ($0.11 + $0.25) = $0.135/task
        
        vs. using Opus for everything: $0.25/task. 46% cheaper.
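
The same calculation as a small function, for trying other success rates. It assumes, as the formula above does, that the fallback model always succeeds and that a failed first attempt is still paid for; the names are illustrative.

```python
def escalation_cost(first_cost: float, first_success: float, fallback_cost: float) -> float:
    """Expected cost per task when a cheap model goes first.

    Success costs only the first attempt; failure costs the first
    attempt plus one fallback run (assumed to always succeed).
    """
    on_success = first_success * first_cost
    on_failure = (1 - first_success) * (first_cost + fallback_cost)
    return on_success + on_failure


haiku_first = escalation_cost(first_cost=0.11, first_success=0.9, fallback_cost=0.25)
opus_only = 0.25
savings = 1 - haiku_first / opus_only
print(f"${haiku_first:.3f}/task, {savings:.0%} cheaper than Opus-only")
```

Because the expression simplifies to first_cost + (1 - first_success) × fallback_cost, the Haiku-first pipeline stays cheaper than Opus-only whenever Haiku's success rate is above $0.11 / $0.25 = 44%.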

2. Don't Pay for Verification You Won't Use

Sonnet reads and reasons deeply, but doesn't verify its output. You're paying for understanding without the payoff of self-correction. If you need deep analysis, use Sonnet for review (where verification is the human's job), not for generation (where the model needs to check its own work).

3. Token Volume ≠ Token Value

Haiku read 3.2M tokens on the C2 rework run ($0.14 total). Opus read 482K tokens on C1 ($0.23 total). Haiku read 6.6x more but paid 39% less. If your task benefits from broad context scanning, Haiku's token pricing makes exhaustive reading economically viable in a way that it isn't with Opus or Sonnet.

20M

Haiku tokens per dollar

2.5M

Sonnet tokens per dollar

1.3M

Opus tokens per dollar

Anti-Patterns to Avoid

⚠

Don't use Opus with agent teams for file-heavy tasks. Opus C3 scored 44/100 because sub-agents dropped files. When Opus delegates, it loses track of what was actually delivered. Use Opus solo or with sequential tool calls.

⚠

Don't use Sonnet for tasks that require self-verification. Sonnet declared "done" twice with the corrupted config still in place (4/30 recovery score). If the task has no external validation step, Sonnet will confidently ship broken code.

⚠

Don't use Opus for simple, well-specified tasks. Opus tries to improve things. On scaffolding tasks, it restructured code that didn't need restructuring. For fill-in-the-blanks work, Haiku's literal prompt-following is a feature, not a limitation.

✓

Do build verification into your pipeline, regardless of model. All 9 rework runs passed a 6-gate binary check (npm install, build, serve, HTML valid, CMS route, OAuth parse). Automated gates caught what manual scoring missed. If you can define "working" as a binary check, do it.
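
A binary gate check of this kind can be sketched as a list of named checks that must all pass. The runner below is an assumption about the general shape, not the benchmark's harness: the gate names follow the six listed above, but the check bodies are placeholders you would replace with real commands (for example, a `subprocess.run` call whose return code is compared to 0).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    name: str
    passed: bool


def run_gates(gates: List[Tuple[str, Callable[[], bool]]]) -> List[GateResult]:
    """Run every gate; a gate that raises counts as a failure."""
    results = []
    for name, check in gates:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(GateResult(name, passed))
    return results


def works(results: List[GateResult]) -> bool:
    """The binary verdict: every gate passed."""
    return all(result.passed for result in results)


# Placeholder checks (hypothetical); each would wrap a real command or probe.
GATES = [
    ("npm install", lambda: True),
    ("build", lambda: True),
    ("serve", lambda: True),
    ("HTML valid", lambda: True),
    ("CMS route", lambda: True),
    ("OAuth parse", lambda: True),
]

results = run_gates(GATES)
print(f"{sum(r.passed for r in results)}/{len(results)} gates passed")
```

Running every gate instead of stopping at the first failure keeps the per-gate results available for a heatmap-style report.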

The Bigger Picture

Eighteen runs. Nine original, nine rework. Three models, three challenges. Here's what we actually learned:

          Finding 1: LLM agents are closer to "working" than scores suggest.

          Every output was 0–9 lines from functional (most needed 4). Rubric scores measure polish, not functional distance. If your workflow includes a verification-and-fix step, you can use cheaper models and still ship working code.

          Finding 2: Model cost ≠ model quality for structured tasks.

          Haiku ($0.80/MTok) outscored Opus ($15/MTok) on total score and on 2 of the 3 original challenges. On rework, it matched Opus at less than half the cost ($0.11 vs $0.25 per run). The 19x price premium doesn't buy better output for compliance work.

          Finding 3: Self-verification is the critical differentiator.

          Haiku's 30/30 recovery vs. Sonnet's 4/30 wasn't about capability. It was about whether the model re-checked its output after changes. Build this into your prompts: "verify the build succeeds before declaring done."

          Finding 4: Coordination overhead is real and expensive.

          Agent teams (sub-agents) caused a 43-point score drop for Opus on C3. Parallel delegation works for independent subtasks, but file-heavy engineering tasks have too many interdependencies. Sequential > parallel for most code generation.

        The meta-lesson: Don't ask "which model is best?" Ask "what does this task actually need?" If it needs speed and compliance — Haiku. If it needs judgment and precision — Opus. If it needs understanding and documentation — Sonnet. The benchmark didn't crown a winner. It drew a map of where each model excels.