notes & essays

Writing

Usually on evaluation and how to know whether a system actually works.

Two Agents, the Same Score, Different Failures
2026-06-09
Aggregate success rate tells you an agent failed. It will not tell you that two agents with the same score fail in completely different ways. A short look at why per-axis failure profiles are the more useful measure.
Writing a Judge Rubric Two Models Can Agree On
2026-06-09
When two LLM judges from different families disagree at high confidence, the problem is usually the rubric, not the judges. Here is the three-iteration loop that got them to converge, and the one change that actually did it.
Which Capability Actually Broke? A Calibrated 5-Axis Judge for Agent Tool Use
2026-06-04
End-task success tells you an agent failed, not why. Here is a 5-axis decomposition of tool-use correctness, graded by cross-family LLM judges, with the calibration discipline that makes the verdicts mean something.