Suppose you run two agents on the same set of tool-use tasks and they land on the same end-task success rate. The natural read is that they are about equally good. That read is often wrong, and the way it is wrong matters if your next move is to improve one of them.
End-task success is a single bit per task: did the agent reach the goal. When it says no, it does not say which capability broke. So two agents with identical success rates can be failing for entirely different reasons, and the aggregate number hides exactly the thing you need in order to fix them.
I ran into this directly while building a calibrated judge for tool-use trajectories. The judge scores each trajectory on five axes (tool selection, argument validity, sequencing, result interpretation, termination) rather than a single pass/fail. Here is the per-agent, per-axis failure rate from a run across four frontier agents on matched retail tasks, where "failure" means at least one judge marked that axis incorrect.
| Axis | claude-3-7-sonnet | gpt-4.1 | gpt-4.1-mini | o4-mini |
|---|---|---|---|---|
| tool_selection | 1/5 (20%) | 1/5 (20%) | 2/5 (40%) | 0/5 (0%) |
| argument_validity | 0/5 (0%) | 0/5 (0%) | 1/5 (20%) | 0/5 (0%) |
| sequencing | 1/5 (20%) | 0/5 (0%) | 2/5 (40%) | 0/5 (0%) |
| result_interpretation | 2/5 (40%) | 2/5 (40%) | 2/5 (40%) | 0/5 (0%) |
| termination | 0/5 (0%) | 0/5 (0%) | 1/5 (20%) | 1/5 (20%) |
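For readers who want to see the tally itself, here is a minimal sketch of how cells like the ones above could be computed from raw judge verdicts. The record shape `(task_id, axis, judge_id, is_correct)` and the function name `failure_profile` are assumptions made for illustration, not the judge's actual interface, and the sample verdicts are invented. It also assumes every task is judged on every axis.

```python
from collections import defaultdict

AXES = (
    "tool_selection",
    "argument_validity",
    "sequencing",
    "result_interpretation",
    "termination",
)

def failure_profile(verdicts):
    """Return {axis: (failed_tasks, total_tasks)} for one agent.

    verdicts: iterable of (task_id, axis, judge_id, is_correct) tuples
    (a hypothetical record shape). A task counts as failed on an axis
    if at least one judge marked that axis incorrect.
    """
    tasks = set()
    failed = defaultdict(set)
    for task_id, axis, _judge_id, is_correct in verdicts:
        tasks.add(task_id)
        if not is_correct:
            failed[axis].add(task_id)  # a single dissenting judge is enough
    return {axis: (len(failed[axis]), len(tasks)) for axis in AXES}

# Invented verdicts for two tasks and two judges (not data from the run).
sample = [
    ("task-1", "sequencing", "judge-a", True),
    ("task-1", "sequencing", "judge-b", False),
    ("task-2", "sequencing", "judge-a", True),
    ("task-2", "sequencing", "judge-b", True),
]
profile = failure_profile(sample)
print(profile["sequencing"])  # (1, 2)
```

The "at least one judge" rule is deliberately strict: it trades some false alarms for not missing a failure that only one judge caught.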
Read the table column by column and the story is not "some agents are better than others." It is that each agent has a shape.
claude-3-7-sonnet and gpt-4.1 look like twins until you reach sequencing, where claude-3-7-sonnet slips on 1 of 5 and gpt-4.1 does not slip at all. They match on the other four axes and diverge on exactly one. If you only had a success rate, you would never see it.
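That comparison is mechanical once the profiles exist. A small sketch, using the failure counts from the table for these two agents; the helper name `divergent_axes` is made up for this example.

```python
AXES = [
    "tool_selection",
    "argument_validity",
    "sequencing",
    "result_interpretation",
    "termination",
]

# Failure counts out of 5, copied from the table above.
claude_3_7_sonnet = dict(zip(AXES, [1, 0, 1, 2, 0]))
gpt_4_1 = dict(zip(AXES, [1, 0, 0, 2, 0]))

def divergent_axes(a, b):
    """Return {axis: (count_a, count_b)} for axes where two profiles differ."""
    return {axis: (a[axis], b[axis]) for axis in a if a[axis] != b[axis]}

diff = divergent_axes(claude_3_7_sonnet, gpt_4_1)
print(diff)  # {'sequencing': (1, 0)}
```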
gpt-4.1-mini is the only agent that fails argument validity at all. That one row is the most discriminating signal in the table: it separates the weakest agent from the rest more cleanly than any aggregate would.
result_interpretation is everyone's problem except o4-mini: three of the four agents fail it 40 percent of the time. If you were collecting post-training data across this whole set, that is the single highest-return axis to target, and it is invisible until you decompose.
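One way to surface that axis is to pool failures across agents and rank. The sketch below uses the counts from the table; `pooled_by_axis` is an illustrative helper, and ranking by raw pooled count is an assumption about what "highest-return" means here.

```python
# Failure counts out of 5 per agent, copied from the table above.
FAILURES = {
    "claude-3-7-sonnet": {"tool_selection": 1, "argument_validity": 0,
                          "sequencing": 1, "result_interpretation": 2,
                          "termination": 0},
    "gpt-4.1": {"tool_selection": 1, "argument_validity": 0,
                "sequencing": 0, "result_interpretation": 2,
                "termination": 0},
    "gpt-4.1-mini": {"tool_selection": 2, "argument_validity": 1,
                     "sequencing": 2, "result_interpretation": 2,
                     "termination": 1},
    "o4-mini": {"tool_selection": 0, "argument_validity": 0,
                "sequencing": 0, "result_interpretation": 0,
                "termination": 1},
}

def pooled_by_axis(failures):
    """Sum failures per axis across all agents, highest count first."""
    totals = {}
    for profile in failures.values():
        for axis, count in profile.items():
            totals[axis] = totals.get(axis, 0) + count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranking = pooled_by_axis(FAILURES)
print(ranking[0])  # ('result_interpretation', 6)
```

Pooled over the 20 agent-task pairs, result_interpretation accounts for 6 failures, ahead of tool_selection at 4.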
o4-mini is strong, but not uniformly. Perfect on four axes, one termination slip. "It is the best agent" is true and useless. "Its only remaining weakness is knowing when to stop" is actionable.
The point is not these specific numbers. It is the altitude. A success rate answers "is it good." A failure profile answers "what would I fix first," which is the question you actually have once an agent is in front of you. Two agents with the same score but different failure modes are the normal case, not the exception.
This is a slice of a larger piece on building the judge that produces these profiles, including how the judges are calibrated so the per-axis numbers are trustworthy. If you want the methodology, read the full write-up.