Which Capability Actually Broke? A Calibrated 5-Axis Judge for Agent Tool Use

Agent evaluation today gives you one of two signals, and neither answers the question you actually have.

The first is an end-task success rate. It tells you the agent reached the goal, or it did not. When it fails 38 percent of retail tasks, the success rate cannot tell you which capability of the agent broke. The second is LLM-as-judge: a frontier model graded by another frontier model. It is increasingly common, but it is usually treated as a black box. A single judge call produces a single verdict, and you have no defense against that judge's own biases, prompt-design noise, or the self-favoring tendency a model shows when grading others from its own family.

The question a practitioner actually wants to ask is narrower and more useful: given an agent that fails some fraction of tasks, which capability should I prioritize fixing?

This post describes a framework I built to answer that. It decomposes tool-use correctness into five axes, grades each one with judges from different model families, and takes calibration seriously enough that the resulting numbers mean something. The contribution is methodological rigor, not a new technique. Per-axis decomposition, cross-family judging, and judge calibration are all known ideas with published precedent. What is missing publicly is a clean implementation that treats calibration as a first-class concern. This is one.

The code, rubrics, and runnable examples are at github.com/brihat9135/agent-judge-calibration.

Five axes of tool-use correctness

Tool-use correctness is not a single property. A failure on a tool-using task can trace to at least five distinct capability breakdowns, and the framework grades each one independently.

Tool selection. Did the agent pick the right tool for the step? Calling a refund tool when the user asked to modify an order is a tool-selection failure, regardless of whether the arguments were well-formed.
Argument validity. Given the chosen tool, were the arguments well-formed and grounded in the trajectory context? An identifier fabricated from training memory rather than read from a prior tool result is an argument-validity failure.
Sequencing. Across the trajectory, were tools called in a sensible order? Processing a refund before verifying return eligibility is a sequencing failure, even if each individual call was correct in isolation.
Result interpretation. After each tool returned, did the agent read the result correctly and act on it? An agent that misreads a "$1,619.34 refund" as "$95.08" makes a result-interpretation error.
Termination. Did the agent stop at the right time? Both stopping early (leaving the user's needs unresolved) and going too long (calling tools after the user signaled completion) are termination failures.

These axes are deliberately not orthogonal. The same step can implicate more than one: a wrong tool with fabricated arguments surfaces on both tool selection and argument validity. The decomposition is useful precisely because a single end-task failure usually traces to a specific subset of axes rather than all of them. A per-axis breakdown across many trajectories tells you which capability to target.

One design choice matters a lot here. Each axis rubric is conditional on its peers' scope, with explicit out-of-scope declarations. The argument-validity rubric, for example, explicitly defers tool-choice questions to tool selection and result-reading questions to result interpretation. Without those declarations, judges bleed outcome-orientation across the axes. I will show that happening, and show how I fixed it.

The pipeline

The framework runs as a pipeline from a benchmark trajectory to a per-axis diagnostic. A trajectory and the five axis rubrics fan out to two judges from different model families, currently Anthropic Claude Sonnet 4.6 and OpenAI GPT-5. Each judge produces a structured verdict per axis. The verdicts persist as JSONL and aggregate into three outputs: inter-judge agreement per axis, a per-agent failure profile across the five axes, and a worked-example trace showing which capability broke. When the two judges disagree on an axis, that disagreement feeds back into the rubric: the response to low agreement is to refine the rubric, not to dismiss the disagreement.

I demonstrate it on the retail trajectories shipped with tau-bench v1.0.0, using the benchmark's reference trajectories from four frontier agents (Claude 3.7 Sonnet, GPT-4.1, GPT-4.1-mini, o4-mini). The framework runs no agents of its own, only judge calls, so it adds modest cost on top of any benchmark that exposes trajectory data.

Cross-family judges, structured outputs

Cross-family rather than cross-model is the discipline. Two grades from the same family share architectural blind spots, training-data overlap, and prompt-sensitivity patterns. Cross-family disagreement is the signal the framework relies on.

Each judge call is forced to produce structured output through the model's native mechanism (Anthropic's forced tool use, OpenAI's strict JSON schema). The schema requires four fields:

Field	Type	Behavior
`verdict`	enum: correct, incorrect, uncertain, not_applicable	The judgment
`cited_step_indices`	list[int]	Trajectory step indices grounding the verdict
`rationale`	string	Explanation referencing the cited steps
`confidence`	float in [0, 1]	The judge's confidence in this verdict

The cited_step_indices field is load-bearing. Without it, an LLM judge can produce a plausible-sounding rationale that is entirely disconnected from the trajectory. Requiring step citations forces grounding: a judge that cites step 50 on a 26-step trajectory is detectably hallucinating. In practice both judges consistently cite valid step indices, and I observed no hallucinated citations across the production run.

Three senses of calibration

"Calibration" means three different things here, and the methodology treats each one separately.

Prompt calibration. Rubric prompts are tuned on a held-out set of trajectories and locked before any evaluation set is touched. The numbers below come from the locked prompts run on trajectories the prompts were not iterated against. This is the same discipline production ML eval uses to avoid overfitting to a sample.

Cross-family calibration. For every (trajectory, axis) pair, both judges score independently, and inter-judge agreement is reported per axis on every batch run. High agreement plus high confidence is the strongest signal the framework produces. Low agreement, especially at low confidence, flags an axis or trajectory where the eval itself is uncertain.

Human-grounded calibration. Inter-judge agreement says two judges agree. It does not say either one matches what a human would say. The planned next step is to hand-label a sample of around 50 trajectories on all five axes and report per-axis judge-versus-human agreement. Until that lands, the headline numbers should be read as "the eval system is internally consistent," not "the eval system is correct."

These layers compose. Without prompt calibration, headline numbers are inflated. Without cross-family calibration, single-family biases go undetected. Without human-grounded calibration, two confident-but-wrong judges that agree produce a confidently wrong eval system. Each layer is necessary and none is sufficient alone. This work implements the first two and explicitly flags the third as a gap.

Calibration in action: the argument-validity loop

This is the part that shows the methodology has bite. The argument-validity rubric went through three iterations, each driven by data, each captured as a separate commit. The trail is auditable.

v1: high-confidence disagreement

The v1 rubric defined argument validity as "well-formed and semantically faithful," and asked whether "the arguments accurately represent what the user asked, rather than a plausible-looking substitute." That was outcome-oriented language. I ran it on sim 14 of the tau-bench retail GPT-4.1 reference simulations, a multi-step trajectory whose final outcome did not match the user's request.

Claude returned incorrect at confidence 0.95. GPT-5 returned correct at confidence 0.97. High confidence on both sides, opposite verdicts. The rationales explain why:

Claude (v1): "the agent passed order_id: '#W2575533' which is correct, however this results in cancelling the entire order rather than just the Garden Hose. The user only wanted to cancel the hose."

GPT-5 (v1): "All tool calls used correct identifiers and values consistent with prior tool results and the user's intent: user lookup by provided name/zip, fetching listed orders, returning the exact item IDs."

Both readings were defensible against v1's framing. Claude read "faithful to the user's stated intent" as outcome-oriented (the arguments led to the wrong outcome). GPT-5 read it as schema-oriented (the arguments were syntactically valid for the called tool, with grounded identifiers). The disagreement traced to genuine rubric ambiguity, not judge error.

v2: narrow the scope, partial convergence

In v2 I stated explicitly that the axis is conditional on tool choice, added an "explicitly NOT in scope" section pointing outcome questions to tool selection and result-reading to result interpretation, and dropped the problem phrase. Re-running on sim 14:

	v1	v2
Claude verdict	incorrect	incorrect
Claude confidence	0.95	0.60
GPT-5 verdict	correct	correct
GPT-5 confidence	0.97	0.96
Agreement	no	no

GPT-5 did not move. Claude's confidence dropped 35 points but the verdict held at incorrect. Its v2 rationale showed it reasoning about the rubric's intent: "the agent's argument to cancel_pending_order was grounded in a real order ID, but the decision to cancel the full order reflects a mismatch, though this may be a tool_selection issue." The judge was caught between the rubric instruction and the trajectory's outcome.

That is a useful finding on its own. Rubric instructions can move a judge's confidence substantially without flipping its verdict. Bare-verdict agreement is a brittle metric. A confidence-weighted version would have registered v2 as real progress that bare matching missed.

v3: worked examples, and the judges converge

In v3 I added three worked examples directly to the rubric, anchoring the verdict pattern:

Example A. The agent calls cancel_pending_order(order_id="X") with a valid order_id from a prior tool result. This tool cancels the whole order, but the user only wanted one item cancelled. Verdict: CORRECT. The argument is schema-valid and grounded. The fact that the chosen tool is too coarse for the user's intent is a tool_selection failure, scored there, not here.

The closing instruction added a directive: the verdict you assign must match the worked-example pattern, even if the trajectory's overall outcome was bad. Re-running on sim 14:

	v1	v2	v3
Claude verdict	incorrect (0.95)	incorrect (0.60)	correct (0.98)
GPT-5 verdict	correct (0.97)	correct (0.96)	correct (0.96)
Agreement	no	no	yes
Cited-step overlap	minimal	partial	strong (both cite 4, 6, 8, 15, 20)

Worked examples were the necessary anchor. Verbal scope declarations alone were not. The full commit history is public (commits 40fd06e, d26e30b, 12ff3dc), so anyone can trace the iteration step by step.

Three iterations is the calibration loop in action. The cross-family disagreement was the signal, the rubric refinement was the response, and re-running was the verification. Rubric design is not a one-shot exercise, and the methodology is built to support that kind of iteration as new trajectories surface new ambiguities.

A failure traced end to end

To show what the framework produces in practice, here is one trajectory in full. Sim 14 of the tau-bench retail GPT-4.1 run (task 28) is a multi-action customer-service interaction. The user asks the agent to return five items from a recent order, and to cancel only the garden hose from a separate pending order. The benchmark's end-task reward for this trajectory is 0.0, a failure.

The 26-step trajectory shows the agent look up the user by name and zip (step 4), fetch user details (step 6), retrieve orders in parallel (step 8), confirm intent (step 13), and then at step 15 call cancel_pending_order(order_id="#W2575533"). That is the critical step. At step 19 the tool returns a full refund of $1,619.34 for the entire pending order, which contained the hose and four other items. At step 22 the agent reports to the user a "cancellation refund: $95.08," which is only the hose's price.

Two errors compound. The agent chose a whole-order cancel tool when the user wanted a per-item action, then misread the tool's response and told the user only the hose was cancelled. The customer was misinformed about the actual state of their orders.

End-task success says: task failed, score 0. It cannot say more. The five-axis framework, both judges scoring under the v3 rubrics, produces this:

Axis	Both judges	Confidence (Claude / GPT-5)	What it means
tool_selection	incorrect	0.90 / 0.86	Whole-order cancel was wrong for a per-item intent
argument_validity	correct	0.98 / 0.95	The order_id was schema-valid and grounded
sequencing	correct	0.98 / 0.90	Context was gathered before any state change
result_interpretation	incorrect	0.97 / 0.93	Reported $95.08 when the tool returned $1,619.34
termination	correct	0.99 / 0.93	Ended cleanly after user confirmation

This is the diagnostic the reward alone cannot produce. The failure traces specifically to tool selection plus result interpretation. Sequencing was fine. Termination was fine. The agent did the right things in the right order with valid arguments. It picked the wrong tool, then misreported the consequence. For improvement, the implication is concrete: tightening this agent's tool selection would address the failure, and improving its sequencing or termination would not.

The two judges also ground on the same evidence. On tool selection both cite step 15, the destructive call. On result interpretation both pair step 19 (the tool result) with step 22 (the mis-report). They are converging on the same evidence, not landing on the same verdict by coincidence.

Results

I ran the framework on 20 retail trajectories spanning four frontier agents (5 trajectories per agent on the same tau-bench v1.0.0 retail task IDs: 28, 50, 67, 68, 74), under both judges across all five axes. The full 200-verdict run completed with zero errors.

Inter-judge agreement per axis

Axis	n=20 inter-judge agreement
tool_selection	19/20 (95%)
argument_validity	19/20 (95%)
sequencing	18/20 (90%)
result_interpretation	17/20 (85%)
termination	18/20 (90%)

Three things stand out. First, argument validity at 95 percent agreement at n=20 is strong evidence the v3 refinement was not a one-trajectory artifact. It holds across four distinct agent families. Second, result interpretation is the noisiest axis at 85 percent, consistent with what I saw at smaller scale, where Claude's confidence on the disagreeing verdicts ran low. This is the axis where the rubric or the trajectory is genuinely borderline more often. Third, the 5 to 15 percent disagreement across axes is exactly what the methodology should surface. The 100 percent agreement I saw at smoke-run scale was a small-sample artifact; n=20 gives the more truthful, partially-bounded numbers.

Per-agent failure profiles

The most actionable output is the per-agent, per-axis failure rate. The metric here is "any judge returned incorrect," which is what a practitioner would use to spot capability gaps. The stricter consensus rate (both judges incorrect) is in the supplementary results file.

Axis	claude-3-7-sonnet	gpt-4.1	gpt-4.1-mini	o4-mini
tool_selection	1/5 (20%)	1/5 (20%)	2/5 (40%)	0/5 (0%)
argument_validity	0/5 (0%)	0/5 (0%)	1/5 (20%)	0/5 (0%)
sequencing	1/5 (20%)	0/5 (0%)	2/5 (40%)	0/5 (0%)
result_interpretation	2/5 (40%)	2/5 (40%)	2/5 (40%)	0/5 (0%)
termination	0/5 (0%)	0/5 (0%)	1/5 (20%)	1/5 (20%)

Four findings come straight off the table.

o4-mini is the strongest agent on this sample. Perfect on four of five axes, with one termination failure. Success metrics alone would tell you it is strong; the decomposition tells you exactly which capability is still imperfect and which are not.

gpt-4.1-mini is the weakest, and the only agent here to show argument-validity failures. That column is the most discriminating one in the table, separating gpt-4.1-mini from the other three.

claude-3-7-sonnet and gpt-4.1 have nearly identical profiles, except for sequencing. Two agents of similar overall quality differ in one specific capability that the decomposition surfaces. This is the "similar success rates, different failure modes" claim, shown empirically.

Result interpretation is the most universal failure mode for the three non-o4-mini agents, each at 40 percent. If you were prioritizing post-training data collection across these agents, result interpretation would be the highest-ROI target on this sample.

Two notes on the judges themselves

GPT-5 reports systematically lower confidence than Claude on the same trajectories. Across runs, GPT-5 confidence in the 0.6 to 0.95 range tracks Claude confidence in the 0.85 to 0.98 range. Confidence cannot be compared directly across families without normalization, so the framework treats it as a per-judge quantity and does not aggregate it across families.

Both judges are non-deterministic across repeated calls. In one case (sim 14, tool selection, Claude) the same model with the same prompt produced correct at 0.98 on one call and incorrect at 0.90 on a later call. This is a known property of LLM judges, and it is the strongest argument for the cross-judge methodology: a single judge call in isolation is not reliable enough to act on. Consensus across families plus high confidence is.

Where this sits, and what it does not yet have

Public agent-eval tooling falls into a few layers. Eval frameworks like Inspect, OpenAI Evals, and promptfoo are infrastructure; how to calibrate a judge is left to the user. Judge libraries like DeepEval and RAGAS provide model-graded scorers but do not make calibration discipline a first-class feature. Benchmark-bundled evaluators (tau-bench, BFCL, AgentBench, WebArena, SWE-bench) are end-task scorers that do not decompose failure into capability axes. SaaS observability tools offer tracing plus evaluation, but cross-judge agreement and per-axis decomposition are not their focus. The contribution here is a small, focused, open-source artifact whose differentiation is methodology rigor rather than breadth. It is something any of those frameworks could adopt, or that a practitioner can fork directly.

I will name the limits explicitly, because honest scope-flagging is part of the rigor I am arguing for.

Sample size. Headline numbers come from n=20. The next target is n=50 to n=100 across multiple domains.
Human-grounded calibration is not implemented yet. The current work has prompt calibration and cross-family calibration but not the human-labeled layer. Until it lands, the strongest honest claim is internal consistency, not correctness.
Single benchmark. Results are on tau-bench retail. The framework is benchmark-agnostic by design (anything representable as role-typed messages with tool calls and results is compatible), but breadth confirmation is future work.
Rubric design is not fully systematic yet. Only argument validity has been stressed by cross-judge contention. The other four axes may reveal similar ambiguities under more runs. The rubrics are living artifacts in versioned source control, not final products.

The actual argument

End-task success rates are necessary but not sufficient for understanding agent failure. The five-axis decomposition surfaces which capability broke when an agent failed, and cross-family judge agreement provides a credibility layer that single-judge methodology does not. The rigor lives in the calibration discipline, not in the technique, which builds on published precedent.

What this work argues for is less a new tool than a discipline. When you put an LLM in the judge seat, calibrate it. When you only have one judge, get a second from a different family. When you publish a methodology, publish the iterations that produced it. The hardest part of LLM-as-judge work is not implementing it. It is taking calibration seriously enough that the verdicts mean something.

Code, rubrics, and runnable examples: github.com/brihat9135/agent-judge-calibration.