Benchmark leaderboards measure task completion under lab conditions, not the compounding step failures that sink agents in production. Here is the math, and a blueprint for an eval harness that actually predicts reliability.
Every few weeks a new model tops SWE-bench or GAIA or WebArena, and the accompanying blog post implies the model is now meaningfully closer to being trustworthy for real work. Then teams wire that model into an agent, point it at a real codebase or a real ticket queue, and watch it fail in ways the benchmark never suggested were possible. This isn't a benchmark quality problem in the sense of "the tasks are too easy" — most of these suites are genuinely hard. It's a structural mismatch between what single-task benchmarks measure and what determines whether an agent survives contact with production.
The thesis of this post: task-completion accuracy on a fixed benchmark is a weak predictor of production reliability because it hides the thing that actually kills agents — compounding per-step error over long, branching trajectories in a nondeterministic environment. If you want a number that predicts whether your agent will still be trustworthy at step 40 of a real task, you have to measure something benchmarks don't report by default: trajectory-level survival, not task-level pass/fail.
Most popular agent benchmarks report a single aggregate number: percent of tasks solved. That number is an average over trajectories of wildly different lengths, and averaging hides the exponential decay that governs multi-step success.
Consider an agent with a per-step success rate of 95% — genuinely good, better than most production tool-calling setups achieve today. Over a single 3-step task, the probability of a clean run is 0.95³ ≈ 0.857, which looks fine on a leaderboard. But production agentic workflows — a multi-file refactor, a customer-support resolution that touches five backend systems, a research task that chains a dozen searches and reads — routinely run 20, 40, or 100 steps.
| Steps in trajectory | P(clean run) at 95% per-step success | P(clean run) at 99% per-step success |
|---|---|---|
| 3 | 85.7% | 97.0% |
| 10 | 59.9% | 90.4% |
| 20 | 35.8% | 81.8% |
| 50 | 7.7% | 60.5% |
| 100 | 0.6% | 36.6% |
The gap between 95% and 99% per-step reliability looks trivial in a single-step benchmark cell. Over 50 steps it is the difference between an agent that finishes cleanly about 60% of the time and one that finishes cleanly less than 8% of the time. This is why a model that tops a benchmark built from short, well-scoped tasks can still be unusable in a long-horizon production agent: the benchmark never exercised the exponent.
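The table can be reproduced, and inverted, with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is the same simplification the table makes. The function names are illustrative and do not come from any library.

```python
import math


def survival_probability(per_step_success: float, steps: int) -> float:
    """P(every step succeeds), assuming independent steps with equal success rates."""
    return per_step_success ** steps


def required_per_step_success(target_survival: float, steps: int) -> float:
    """Per-step success rate needed to reach a target clean-run probability."""
    return target_survival ** (1.0 / steps)


def max_steps_for_target(per_step_success: float, target_survival: float) -> int:
    """Longest trajectory whose clean-run probability stays at or above the target."""
    return math.floor(math.log(target_survival) / math.log(per_step_success))


# Print the same rows as the table: steps, then clean-run odds at 95% and 99%.
for n in (3, 10, 20, 50, 100):
    print(n, f"{survival_probability(0.95, n):.1%}", f"{survival_probability(0.99, n):.1%}")
```

Under these assumptions, `max_steps_for_target(0.95, 0.5)` returns 13: a 95%-per-step agent drops below even odds of a clean run once a trajectory exceeds 13 steps.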
Most published benchmarks skew short precisely because short tasks are easier to verify, cheaper to run at scale, and produce cleaner leaderboards. SWE-bench tasks average a handful of file edits. WebArena and Mind2Web episodes are usually under 15 actions. GAIA's harder tier gets closer to real multi-step research, which is part of why scores on it are so much lower across every model family — it's measuring something closer to the exponent, not just the base rate.
It helps to be precise about what each popular suite is and isn't testing, because teams often treat "SWE-bench score" as a proxy for "good coding agent" when it's really a proxy for a narrower thing.
| Benchmark | Task type | Typical trajectory length | What it's blind to |
|---|---|---|---|
| SWE-bench (Verified) | Fix a GitHub issue in a real repo | 5–15 tool calls | Long-horizon planning, tool flakiness, ambiguous specs |
| WebArena / Mind2Web | Browser task completion | 5–20 actions | Multi-session state, auth flows, adversarial page changes |
| GAIA | Multi-hop research/reasoning | 10–40 actions | Cost/latency tradeoffs, partial-credit recovery |
| τ-bench (tau-bench) | Tool-use + policy adherence in a simulated business domain | 10–30 tool calls | Real API nondeterminism, rate limits, schema drift |
| Internal production traces | Whatever the job requires | Often 50+ | (This is the ground truth the others approximate) |
None of these are bad benchmarks — they're useful for what they isolate. The mistake is reading "78% on SWE-bench Verified" as "this agent is 78% reliable," when the number is really "this agent resolves 78% of a curated set of short, well-specified, single-repo bug fixes under generous retry budgets." Production tasks are rarely that well-specified, and production environments are rarely that stable.
Tool determinism. Benchmark environments are usually sandboxed and deterministic — the same API call returns the same shape every time. Production tools drift: a third-party API adds a field, a rate limit kicks in mid-trajectory, an MCP server times out and returns a partial response. An agent that never had to handle a malformed tool result during eval has no learned behavior for it in production.
Retry masking. Many benchmark harnesses allow multiple attempts per task, or score "pass@k" instead of pass@1, then report the more flattering number. Production agents usually get one shot at a stateful action — you can't "retry" a database write or a sent email the way you retry a benchmark rollout. A pass@5 number silently assumes a rollback capability that doesn't exist for side-effecting tools.
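To see how much a pass@k number can flatter an agent, here is a rough sketch. It assumes the k attempts are independent and that each succeeds with the pass@1 probability, which real rollouts only approximate. `pass_at_k` is an illustrative helper, not a function from any benchmark harness, and the 40% figure is made up.

```python
def pass_at_k(pass_at_1: float, k: int) -> float:
    """P(at least one of k attempts succeeds), assuming independent attempts."""
    return 1.0 - (1.0 - pass_at_1) ** k


# Hypothetical agent that solves a task 40% of the time on a single attempt.
for k in (1, 3, 5):
    print(f"pass@{k}: {pass_at_k(0.40, k):.1%}")
```

With these made-up numbers, an agent that succeeds 40% of the time on one attempt reports roughly 92% at pass@5. For a side-effecting tool with no rollback, only the pass@1 figure describes what production will see.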
Distribution shift in the task itself. Benchmark tasks are static and vetted. Real user requests are underspecified, sometimes contradictory, and drift over time as the surrounding product changes. An agent tuned to a fixed benchmark distribution can overfit to the phrasing and structure of that benchmark's task templates rather than to the underlying skill.
A single pass/fail label for a task collapses all of this into one bit. The "plausible-looking failure" branch — where the agent completes every step, produces a coherent-looking result, and is simply wrong — is the most dangerous one in production, and it's exactly the branch that a binary benchmark score can't distinguish from a clean success unless the verifier is unusually rigorous.
The fix isn't to abandon standard benchmarks — they're still useful for model selection and regression testing. It's to add a second layer that scores trajectories, not just outcomes, against your own production tool surface.
Three changes matter most:
1. Score per-step and per-trajectory, not just per-task. Log every tool call, its latency, whether it errored, and whether the agent's next action was a sane response to that error. A task can "pass" while masking three tool failures the agent happened to route around — that's signal, not noise, and it should show up in the eval report.
2. Replay against your real tools, including their failure modes. If your agent calls internal APIs, a database, or MCP servers, the eval harness should hit staging versions of those same integrations — including their real rate limits, real schema, and real latency distribution — rather than a mocked, always-succeeds stand-in. An agent that only ever practiced against a perfect mock of, say, a utility API has no calibrated behavior for what to do when that same call (currency conversion, timestamp math, a regex check) returns a 429 or a slightly different payload shape. This is also why exposing tools through a stable, well-typed interface — an MCP server or a documented REST API — pays off during eval, not just in production: it gives you one seam to record and replay against.
3. Inject faults deliberately. Take your real trajectory logs and mutate them: truncate a tool response, delay it, return a subtly wrong value instead of an error. Measure whether the agent notices, and what fraction of the time noticing turns into correct recovery versus confidently wrong output. If you track only one metric from this post, track this one: recovery rate under injected tool faults, not benchmark accuracy under ideal conditions.
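The fault injection in step 3 could be sketched roughly as follows. Everything here is an assumption made for illustration: the `{"status": ..., "body": ...}` response shape, the three mutators, and the helper names are hypothetical, and the "delay" fault is left out because it depends on how your harness schedules tool calls.

```python
import copy
import random
from typing import Any, Callable, Dict, List, Tuple

# Assumed shape of a recorded tool response: {"status": int, "body": dict}.
ToolResponse = Dict[str, Any]


def truncate_body(resp: ToolResponse, rng: random.Random) -> ToolResponse:
    """Keep only the first half of the body's keys, mimicking a partial response."""
    out = copy.deepcopy(resp)
    keys = list(out["body"])
    out["body"] = {k: out["body"][k] for k in keys[: len(keys) // 2]}
    return out


def rate_limit(resp: ToolResponse, rng: random.Random) -> ToolResponse:
    """Replace the response with an explicit HTTP 429 error."""
    return {"status": 429, "body": {"error": "rate limit exceeded"}}


def skew_number(resp: ToolResponse, rng: random.Random) -> ToolResponse:
    """Perturb one numeric field by 10% while keeping the success status."""
    out = copy.deepcopy(resp)
    numeric = [
        k for k, v in out["body"].items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:  # responses without numeric fields pass through unchanged
        key = rng.choice(numeric)
        out["body"][key] = out["body"][key] * 1.1
    return out


MUTATORS: Dict[str, Callable[[ToolResponse, random.Random], ToolResponse]] = {
    "truncate": truncate_body,
    "rate_limit": rate_limit,
    "skew": skew_number,
}


def inject_fault(resp: ToolResponse, fault: str, rng: random.Random) -> ToolResponse:
    """Return a mutated copy of a recorded response; the original is left intact."""
    return MUTATORS[fault](resp, rng)


def recovery_rate_by_fault(outcomes: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of injected faults, per fault type, that the agent recovered from."""
    totals: Dict[str, int] = {}
    recovered: Dict[str, int] = {}
    for fault, ok in outcomes:
        totals[fault] = totals.get(fault, 0) + 1
        recovered[fault] = recovered.get(fault, 0) + int(ok)
    return {fault: recovered[fault] / totals[fault] for fault in totals}
```

Reporting recovery per fault type is a deliberate choice in this sketch. An agent may handle a loud 429 well and still accept a silently skewed number, and a single pooled rate would hide that difference.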
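A minimal version of the per-trajectory scoring loop, applied to one logged trace:

```python
def score_trajectory(trace):
    """Summarize one logged trajectory: outcome, tool errors, and recovery."""
    steps = trace.steps
    n = len(steps)
    # A trajectory hard-fails if any step was marked unrecoverable.
    hard_fail = any(s.status == "unrecoverable" for s in steps)
    tool_errors = [s for s in steps if s.tool_error]
    # A tool error counts as recovered when the agent's next action was a sane response.
    recovered = [s for s in tool_errors if s.next_step_status == "sane"]
    tool_error_rate = len(tool_errors) / max(n, 1)
    return {
        "task_passed": trace.final_verdict == "pass",
        "trajectory_length": n,
        "tool_error_rate": tool_error_rate,
        # None (rather than 0.0) when there were no tool errors to recover from.
        "recovery_rate": len(recovered) / len(tool_errors) if tool_errors else None,
        "hard_failure": hard_fail,
        # Counts only tool errors as step failures.
        "implied_step_success_rate": 1 - tool_error_rate,
    }
```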
Run `score_trajectory` over a batch of trajectories and you get a distribution of implied per-step success rates that you can plug back into the compounding-error math from earlier. That is a far more honest predictor of "will this agent still work at step 60" than any single leaderboard percentage.
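One way to do that plugging-in is sketched below. It assumes each score is a dict with the `trajectory_length` and `implied_step_success_rate` keys that `score_trajectory` returns, and it again treats steps as independent. `project_survival` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def project_survival(scores: List[Dict[str, Any]], horizon: int) -> Dict[str, float]:
    """Pool per-step success over all logged steps, then compound it to `horizon` steps."""
    total_steps = sum(s["trajectory_length"] for s in scores)
    if total_steps == 0:
        raise ValueError("no steps logged")
    # Weight each trajectory's implied rate by its length so that short,
    # clean runs do not dominate the estimate.
    successes = sum(
        s["implied_step_success_rate"] * s["trajectory_length"] for s in scores
    )
    pooled_rate = successes / total_steps
    return {
        "pooled_step_success_rate": pooled_rate,
        "projected_survival": pooled_rate ** horizon,
    }
```

Because the implied rate counts only tool errors as step failures, the projection is best read as an optimistic bound: reasoning mistakes that raise no tool error are not in it.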
A benchmark score tells you how an agent performs on tasks that were curated to be gradeable. It does not tell you how the same agent behaves when a real tool times out on step 34 of a 60-step job, because that scenario was mocked away or never long enough to occur. If you're choosing a model or shipping an agent into production, treat published benchmarks as a coarse filter for capability, then build a second, smaller eval that replays your actual tools with their actual failure modes and scores recovery rate, not just pass/fail. The number that should worry you isn't "we scored 3 points lower than last quarter's model on GAIA" — it's "our measured per-step success rate implies a 40% chance of silent failure by step 50," because that's the number your users will actually experience.