Both systems ran on the harbor reference harness against the Terminal-Bench 2.0 task set (n=89, pass@1, identical task IDs). pre.dev numbers are our single-trial runs. Claude Code numbers are derived from CC's own public Terminal-Bench 2.0 submissions on tbench.ai: for each task we take the chronologically-first of CC's published trials and compute pass@1 from that single trial.
Mean across all of CC's published trials per task matches each published leaderboard accuracy to four decimal places. The scraper, cached pages, and parsed JSON are public.
The Terminal-Bench leaderboard convention is mean over five trials. We chose pass@1 because it matches what an agent operator actually pays for. If you ship one PR per task in production, your accuracy is pass@1, not pass@5. Reporting pass@k smooths the variance but also smooths the cost — it makes the bill the user pays five times bigger than the number on the chart.
The configurations here would score higher with multiple trials. We just don't think you should pay for that hidden margin.
The model still does the reasoning. The architecture decides what reasoning to ask for, when, and against what.
Every harbor jobs/<run>/<trial>/ from both runs is public, with the full trajectory.json and manifest.json per task. You can replay any decision the agent made.
We're not asking you to take the chart on faith.
We're confident in the pattern, not in any single number.
The next 5× of agent capability isn't only on the lab's roadmap. It's also on yours. Which curve you're on is up to you.