Terminal-Bench 2.0 · 89 tasks · open data

Breaking the model cost continuum

Beating Sonnet with Haiku, and Opus with Sonnet.

Not with a bigger model — with a harness.

For the last two years, the most reliable way to improve a coding agent has been to pay more for the model. That is no longer true.

On Terminal-Bench 2.0, pre.dev + Haiku 4.5 (41.6%) outperforms Claude Code + Sonnet 4.5 (38.2%). pre.dev + Sonnet 4.6 (56.2%) outperforms Claude Code + Opus 4.5 (53.9%). In both cases, we drop a model tier and finish ahead.

The agent's architecture matters more than its weights.

the result

Beating Opus with Sonnet

pre.dev + Sonnet 4.6 vs Claude Code + Opus 4.5 — one tier down, +2.3 pts ahead
Terminal-Bench 2.0 · pass@1 · n=89

Beating Sonnet with Haiku

pre.dev + Haiku 4.5 outscores Claude Code at both the Haiku and Sonnet 4.5 tiers
Terminal-Bench 2.0 · pass@1 · n=89

Both systems ran on the harbor reference harness against the Terminal-Bench 2.0 task set (n=89, pass@1, identical task IDs). pre.dev numbers are our single-trial runs. Claude Code (CC) numbers are derived from CC's own public Terminal-Bench 2.0 submissions on tbench.ai: for each task we take the chronologically first of CC's published trials and compute pass@1 from that single trial.

The mean across all of CC's published trials per task matches each published leaderboard accuracy to four decimal places. The scraper, cached pages, and parsed JSON are public.
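
The two aggregations described above (first published trial per task, and the mean across all published trials) can be sketched as follows. This is a minimal illustration, not the published scraper: the record layout `(task_id, timestamp, passed)` and the function names are assumptions, and the sample data is invented purely to exercise the code. The mean is taken per task and then averaged across tasks, which is also an assumption about how the leaderboard aggregates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trial records: (task_id, ISO timestamp, passed). Not real data.
trials = [
    ("task-a", "2025-01-02T10:00:00", True),
    ("task-a", "2025-01-03T10:00:00", False),
    ("task-b", "2025-01-02T11:00:00", False),
    ("task-b", "2025-01-01T09:00:00", True),
    ("task-c", "2025-01-01T12:00:00", False),
    ("task-c", "2025-01-02T12:00:00", False),
]

def group_by_task(records):
    """Group trials by task and sort each task's trials chronologically."""
    by_task = defaultdict(list)
    for task_id, ts, passed in records:
        by_task[task_id].append((datetime.fromisoformat(ts), passed))
    for runs in by_task.values():
        runs.sort(key=lambda run: run[0])
    return by_task

def first_trial_pass_at_1(records):
    """pass@1 using only the chronologically first trial of each task."""
    by_task = group_by_task(records)
    return sum(runs[0][1] for runs in by_task.values()) / len(by_task)

def mean_over_trials(records):
    """Per-task mean pass rate, averaged across tasks."""
    by_task = group_by_task(records)
    per_task = [sum(p for _, p in runs) / len(runs) for runs in by_task.values()]
    return sum(per_task) / len(per_task)

print(first_trial_pass_at_1(trials), mean_over_trials(trials))
```

On real data, the second function is the one to compare against the published leaderboard accuracy, and the first is the one behind the Claude Code numbers quoted here.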

the crossings
the crossing · pass@1
56.2% pre.dev w/ Sonnet 4.6 beats 53.9% Claude Code w/ Opus 4.5
+2.3 pts at pass@1 · tier-skip

the haiku tier · pass@1
41.6% pre.dev w/ Haiku 4.5 beats 27.0% Claude Code w/ Haiku 4.5
+14.6 pts at pass@1 · same model

41.6% pre.dev w/ Haiku 4.5 beats 38.2% Claude Code w/ Sonnet 4.5
+3.4 pts at pass@1 · tier-skip

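
The gaps quoted above can be checked with a few lines of arithmetic. The sketch below copies the percentages from the text and, as an assumption, converts each one to the nearest whole number of tasks out of n=89; those implied counts are a derivation from the rounded percentages, not figures published in the source.

```python
N_TASKS = 89  # n=89, as stated in the text

# Reported pass@1 percentages, copied from the text.
reported = {
    "pre.dev + Sonnet 4.6": 56.2,
    "Claude Code + Opus 4.5": 53.9,
    "pre.dev + Haiku 4.5": 41.6,
    "Claude Code + Sonnet 4.5": 38.2,
    "Claude Code + Haiku 4.5": 27.0,
}

def implied_passes(pct, n=N_TASKS):
    """Nearest whole number of passed tasks consistent with a rounded percentage."""
    return round(pct * n / 100)

def gap_points(a, b):
    """Difference between two reported scores, in percentage points."""
    return round(reported[a] - reported[b], 1)

for name, pct in reported.items():
    print(f"{name}: {pct}% ~ {implied_passes(pct)}/{N_TASKS} tasks")

print(gap_points("pre.dev + Sonnet 4.6", "Claude Code + Opus 4.5"))
print(gap_points("pre.dev + Haiku 4.5", "Claude Code + Haiku 4.5"))
print(gap_points("pre.dev + Haiku 4.5", "Claude Code + Sonnet 4.5"))
```

Under that rounding assumption, the Sonnet-vs-Opus crossing corresponds to a difference of two tasks out of 89.
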
why pass@1

One trial per task. No best-of-N.

The Terminal-Bench leaderboard convention is the mean over five trials. We chose pass@1 because it matches what an agent operator actually pays for. If you ship one PR per task in production, your accuracy is pass@1, not pass@5. Reporting over k trials smooths the variance but also hides the cost: the bill for five trials is five times bigger than the single number on the chart suggests.

The configurations here would score higher with multiple trials. We just don't think you should pay for that hidden margin.
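
To make the pass@1 versus pass@k distinction concrete, here is a small sketch using the widely used unbiased pass@k estimator from the Codex evaluation paper (Chen et al., 2021). The function names and the example counts are illustrative assumptions and do not come from either system's results.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k for one task, given n trials with c passes.

    pass@k is the probability that at least one of k sampled trials passes.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_rate(n, c):
    """Plain mean over n trials; this equals the pass@1 estimate."""
    return c / n

# Illustrative task: 2 passes out of 5 trials.
print(pass_at_k(5, 2, 1))    # one attempt per task
print(pass_at_k(5, 2, 5))    # best of five attempts
print(mean_pass_rate(5, 2))  # mean over the five trials
```

For this illustrative task, a single attempt is estimated to succeed 40% of the time while best-of-five succeeds every time, and the five attempts cost five times as much to run.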

what changes between the bars

Three habits the agent practices on every task.

It plans before it codes.
The agent extracts a structured blueprint — milestones, subtasks, acceptance criteria — before any file is touched. Smaller models get traction they wouldn't have from a single prompt.
It has multiple ways to execute.
The agent has access to multiple methods of execution — a todo dependency graph for orchestrated work, RLM-based parallel analysis, and programmatic tool use for tight loops.
It verifies before passing.
A blind verifier re-runs the criteria against the agent's output and disagrees freely. Most of our false positives die here.
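
The first two habits (a structured blueprint, and a todo dependency graph for orchestrated work) could be represented roughly as below. This is a hedged sketch, not pre.dev's implementation: the `Milestone` and `Subtask` classes, their fields, and the example tasks are all hypothetical, and the ordering simply uses Python's standard-library `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Subtask:
    id: str
    description: str
    acceptance_criteria: list[str]
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Milestone:
    title: str
    subtasks: list[Subtask]

def execution_order(milestones):
    """Return subtask ids in an order that respects every dependency."""
    graph = {s.id: set(s.depends_on) for m in milestones for s in m.subtasks}
    return list(TopologicalSorter(graph).static_order())

# Hypothetical blueprint, written before any file is touched.
blueprint = [
    Milestone("Build", [
        Subtask("write-parser", "Implement the parser", ["parses sample input"]),
        Subtask("write-tests", "Add unit tests", ["tests cover the parser"],
                depends_on=["write-parser"]),
    ]),
    Milestone("Ship", [
        Subtask("run-tests", "Run the suite", ["all tests pass"],
                depends_on=["write-parser", "write-tests"]),
    ]),
]

order = execution_order(blueprint)
print(order)
```

Keeping acceptance criteria on each subtask means a later verification step has something concrete to check against, independent of how the subtask was executed.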

The model still does the reasoning. The architecture decides what reasoning to ask for, when, and against what.
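
The blind-verification habit described above can be illustrated with a toy gate. In this sketch, which is an assumption about the general shape and not the actual verifier, criteria are plain predicates over the agent's output, and the agent's own claim of success is never passed to the checks. All names and the sample outputs are hypothetical.

```python
from typing import Callable, Dict

Criterion = Callable[[dict], bool]

def blind_verify(output: dict, criteria: Dict[str, Criterion]) -> Dict[str, bool]:
    """Evaluate each criterion against the output alone."""
    return {name: bool(check(output)) for name, check in criteria.items()}

def accept(agent_claims_done: bool, output: dict, criteria: Dict[str, Criterion]) -> bool:
    """Pass only if the agent claims completion AND every criterion holds."""
    return agent_claims_done and all(blind_verify(output, criteria).values())

# Hypothetical acceptance criteria for a task that must produce out.txt.
criteria = {
    "out.txt exists": lambda o: "out.txt" in o["files"],
    "exit code is zero": lambda o: o["exit_code"] == 0,
}

good = {"files": {"out.txt": "hello\n"}, "exit_code": 0}
bad = {"files": {}, "exit_code": 0}  # agent says done, but the file is missing

print(accept(True, good, criteria))
print(accept(True, bad, criteria))
```

The second call is the false-positive case: the agent reports success, the verifier disagrees, and the task is not marked as passed.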

open trajectories

Every decision the agent made is in the repo.

Every harbor jobs/<run>/<trial>/ directory from both runs is public, with the full trajectory.json and manifest.json per task. You can replay any decision the agent made.
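
A minimal way to walk such a directory tree is sketched below. It assumes only what the text states, a `jobs/<run>/<trial>/` layout containing `trajectory.json` and `manifest.json`, and makes no assumption about the schema inside either file. The function name and the local `jobs` path are hypothetical.

```python
import json
from pathlib import Path

def iter_trials(jobs_root):
    """Yield (run, trial, trajectory, manifest) for each <run>/<trial>/ directory
    under jobs_root that contains both JSON files."""
    root = Path(jobs_root)
    for trial_dir in sorted(root.glob("*/*")):
        trajectory_path = trial_dir / "trajectory.json"
        manifest_path = trial_dir / "manifest.json"
        if not (trial_dir.is_dir() and trajectory_path.is_file() and manifest_path.is_file()):
            continue  # skip incomplete or unrelated entries
        trajectory = json.loads(trajectory_path.read_text(encoding="utf-8"))
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        yield trial_dir.parent.name, trial_dir.name, trajectory, manifest

if __name__ == "__main__":
    # "jobs" is a placeholder for wherever the public runs were downloaded.
    for run, trial, trajectory, manifest in iter_trials("jobs"):
        print(run, trial, type(trajectory).__name__, type(manifest).__name__)
```

From there, each loaded trajectory can be inspected step by step using whatever structure the published files actually have.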

We're not asking you to take the chart on faith.

what we didn't claim

The limits of the result, in plain sight.

  • Some task classes — heavy compiler debugging, multi-branch git surgery — still favor brute reasoning over orchestration in our results.
  • We hit the same infrastructure timeouts everyone hits at the Terminal-Bench task budget.

We're confident in the pattern, not in any single number.

the takeaway

You don't need a frontier lab to beat one.

The next 5× of agent capability isn't only on the lab's roadmap. It's also on yours. Which curve you're on is up to you.