Comparative Advantage: Calibrated Pairwise Win Rates from LLM Judge Scores
Causal Judge Evaluation (CJE; Landesberg, 2025) formalizes a workflow that is already widespread in practice: score every response with a cheap LLM judge, label a small slice with a trusted oracle, learn the judge-to-oracle mapping, and evaluate at scale with calibrated uncertainty. However, calibration is only as informative as the