When you use an LLM to judge another model's work, the failure you should worry about is not the judge being wrong. It is two judges from different families being confidently wrong in opposite directions. That is a signal that your rubric is ambiguous, and no amount of prompt politeness fixes it. You have to find the ambiguity and remove it.
Here is a concrete case from building a judge for tool-use trajectories. One axis, argument validity, asks a narrow question: given the tool the agent chose, were the arguments well-formed and grounded in the trajectory. The first version of the rubric phrased it as "well-formed and semantically faithful," and asked whether the arguments represented "what the user asked, rather than a plausible-looking substitute."
That sounds reasonable. It was not. Run on a trajectory where the agent used a valid order ID but the resulting action did not match the user's intent, two judges split hard:
Claude: the argument is correct, but it cancels the whole order rather than the one item the user wanted, so this is incorrect (confidence 0.95).
GPT-5: all tool calls used correct identifiers consistent with prior results and the user's intent, so this is correct (confidence 0.97).
Both readings are defensible against that wording. One judge read "faithful to intent" as being about the outcome. The other read it as being about the arguments themselves. The disagreement was not noise. It was the rubric asking two questions at once.
What followed was three iterations, each driven by the disagreement.
v2: narrow the scope. I added an explicit conditionality clause ("this axis assumes the tool choice and asks only about the arguments") and an "explicitly not in scope" section pointing outcome questions to a different axis. Re-running:
| v1 | v2 | |
|---|---|---|
| Claude | incorrect (0.95) | incorrect (0.60) |
| GPT-5 | correct (0.97) | correct (0.96) |
| Agreement | no | no |
Look at what moved. The verdicts did not change, but Claude's confidence fell 35 points. The judge was clearly wrestling with the new instruction and the bad outcome at the same time. That is worth noticing on its own: a rubric edit can move a judge's confidence a lot without flipping its verdict, which means bare agree/disagree metrics understate progress. v2 was real improvement that a verdict-only view called a total miss.
v3: stop describing, start showing. Verbal scope was not enough, so I added three worked examples directly to the rubric, each pairing a concrete situation with the correct verdict and a one-line reason:
The agent calls a whole-order cancel with a valid order ID, but the user wanted one item cancelled. Verdict: CORRECT. The argument is schema-valid and grounded. The tool being too coarse for the intent is a tool-selection failure, scored elsewhere.
Plus a closing directive: the verdict must match the worked-example pattern, even when the overall outcome was bad. Re-running:
| v1 | v2 | v3 | |
|---|---|---|---|
| Claude | incorrect (0.95) | incorrect (0.60) | correct (0.98) |
| GPT-5 | correct (0.97) | correct (0.96) | correct (0.96) |
| Agreement | no | no | yes |
They converged, at high confidence, and started citing the same trajectory steps for the same reasons. The worked examples were the thing that did it. Verbal scope declarations moved the judge partway; a concrete example anchored it the rest of the way.
Three lessons I took from this:
- Cross-family disagreement is a gift. Two judges from the same family share blind spots and will agree while both being wrong. Disagreement across families is what surfaces the ambiguity. Treat it as the bug report.
- Worked examples beat descriptions. An instruction tells the judge what to value. An example shows it what to output. The second is far more reliable when the line is subtle.
- Confidence is data, not decoration. A verdict that holds while confidence drops 35 points is a rubric that is working but not done. Watch the confidence, not just the verdict.
None of this is exotic. It is just taking the judge seriously as something you calibrate, the same way you would calibrate any other measurement instrument before trusting its readings.
This is a piece of a longer write-up on the full five-axis judge and its calibration, including the inter-judge agreement numbers at scale. If you want the whole thing, it is here.