LLM-as-Judge Evaluators
- The paper demonstrates that Balanced Accuracy and Youden’s J uniquely capture prevalence-gap preservation critical for reliable LLM judge selection.
- LLM-as-Judge evaluators use large language models to automatically assess model behaviors; selecting among candidate judges is most delicate under class imbalance, where the choice of validation metric determines whether model-to-model comparisons are robust.
- Empirical case studies and simulations show that BA/J outperform traditional metrics such as Accuracy and F1, yielding actionable protocols for real-world deployments.
LLMs are now widely employed as automated classifiers of model behaviors—ranging from safety violations to task performance rates—through paradigms collectively termed “LLM-as-Judge” (LaJ) evaluators. In prevalence estimation tasks that underpin core NLP benchmarks, deployability analyses, and policy setting, the statistical metric used to select and validate LLM-based judges is fundamental for credible model-to-model comparisons. The manuscript “Balanced Accuracy: The Right Metric for Evaluating LLM Judges—Explained through Youden’s J statistic” rigorously analyzes the landscape of prevalence metrics and demonstrates, both theoretically and empirically, that Balanced Accuracy (BA) and Youden’s J statistic (J) are uniquely appropriate for LaJ settings, especially under strong class imbalance. The following sections give a structured account of definitions, theoretical arguments, empirical evidence, practical protocols, and applied recommendations for the use of LaJ evaluators anchored in balanced accuracy (Collot et al., 8 Dec 2025).
1. Formal Metrics for Judge Evaluation
Let N denote the number of labeled instances in a golden validation set, with TP, FP, TN, and FN denoting counts of true/false positives/negatives for a given candidate judge model. The principal classification metrics utilized in the literature are:
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / N |
| Precision | TP / (TP + FP) |
| Recall (TPR) | TP / (TP + FN) |
| Specificity (TNR) | TN / (TN + FP) |
| F1 Score | 2 · Precision · Recall / (Precision + Recall) |
| Youden's J | TPR + TNR − 1 |
| Balanced Accuracy | (TPR + TNR) / 2 |
Balanced Accuracy and Youden's J are linearly related via BA = (J + 1) / 2.
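These definitions are straightforward to collect into a small helper. The sketch below is illustrative (the function name and returned dictionary layout are assumptions, not from the paper); the assertion checks the linear relation BA = (J + 1) / 2:

```python
def judge_metrics(tp, fp, tn, fn):
    """Compute the metrics above from a candidate judge's confusion counts."""
    tpr = tp / (tp + fn)                    # Recall / sensitivity
    tnr = tn / (tn + fp)                    # Specificity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * tpr / (precision + tpr)
    j = tpr + tnr - 1                       # Youden's J
    ba = (tpr + tnr) / 2                    # Balanced Accuracy
    assert abs(ba - (j + 1) / 2) < 1e-12    # linear relation BA = (J + 1) / 2
    return {"accuracy": accuracy, "precision": precision, "recall": tpr,
            "specificity": tnr, "f1": f1, "j": j, "ba": ba}
```

For example, `judge_metrics(tp=76, fp=150, tn=850, fn=24)` gives TPR = 0.76 and TNR = 0.85, hence BA = 0.805 and J = 0.61.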
2. Theoretical Justification: Why Balanced Accuracy and Youden's J
The core conceptual argument is that the goal of a judge in LaJ pipelines is to measure the true prevalence gap between models as faithfully as possible, independent of class balance. Let two models differ in true prevalence by Δp. A candidate judge with sensitivity s (TPR) and false-positive rate f (FPR) will report an apparent prevalence difference of Δp̂ = (s − f) · Δp = J · Δp. Therefore, J (and by linearity, Balanced Accuracy) exactly captures the “scaling slope” by which the judge propagates true prevalence gaps.
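This scaling identity follows from the fact that a judge's apparent positive rate at true prevalence p is TPR · p + FPR · (1 − p), so the prevalence-dependent terms cancel when taking a difference. A numeric check with illustrative operating points (the numbers below are assumptions, not from the paper):

```python
def apparent_positive_rate(p, tpr, fpr):
    # Fraction of items the judge flags positive when true prevalence is p.
    return tpr * p + fpr * (1 - p)

tpr, fpr = 0.76, 0.15          # illustrative judge operating point
j = tpr - fpr                  # Youden's J = TPR - FPR
p1, p2 = 0.20, 0.08            # true prevalences of two hypothetical models
true_gap = p1 - p2
apparent_gap = (apparent_positive_rate(p1, tpr, fpr)
                - apparent_positive_rate(p2, tpr, fpr))
assert abs(apparent_gap - j * true_gap) < 1e-12  # gap is scaled by exactly J
```

The judge reports a gap of 0.61 × 0.12 ≈ 0.073 regardless of where the two prevalences sit, which is why J is the relevant selection criterion.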
Critical properties:
- Prevalence independence: TPR and TNR are conditional on the true class and invariant to the marginal class distribution.
- Class symmetry: BA/J weight both classes equally, penalizing errors on the minority and majority classes identically.
In contrast,
- Precision degrades catastrophically when positives are rare.
- Accuracy can be trivially maximized by always predicting the majority class.
- F1, and its macro variant, ignore true negatives and can be volatile under class imbalance.
No other common metric cleanly encodes the judge's ability to preserve true between-model prevalence gaps under imbalance.
3. Empirical Evidence: Case Studies and Simulation
Empirical studies, including two real-world violation-detection tasks and a Monte Carlo simulation, show that BA/J reliably select the judge that best preserves prevalence gaps.
Case Study 1: Policy Violation Detection (8.3% prevalence)
- Judge A: Precision=0.32, Recall=0.76, Specificity=0.85, J=0.61, F1=0.45, Macro-F1=0.68, Accuracy=0.85, BA=0.81
- Judge B: Precision=0.41, Recall=0.57, Specificity=0.92, J=0.49, F1=0.47, Macro-F1=0.71, Accuracy=0.90, BA=0.75
Standard metrics (Accuracy, F1, Macro-F1) incorrectly select Judge B. Only BA/J rank Judge A, the judge faithful on the rare class, higher.
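The rank flip can be reproduced from the reported recall/specificity pairs alone, since at prevalence p, Accuracy equals p · TPR + (1 − p) · TNR while BA weights both classes equally. A sketch using only the published operating points:

```python
p = 0.083  # positive-class prevalence in Case Study 1
judges = {"A": (0.76, 0.85), "B": (0.57, 0.92)}  # (recall, specificity)

results = {}
for name, (tpr, tnr) in judges.items():
    results[name] = {
        "ba": (tpr + tnr) / 2,                # class-symmetric
        "accuracy": p * tpr + (1 - p) * tnr,  # dominated by the ~92% negative class
    }

# Accuracy prefers Judge B, but BA correctly prefers Judge A.
assert results["B"]["accuracy"] > results["A"]["accuracy"]
assert results["A"]["ba"] > results["B"]["ba"]
```

Judge B's higher specificity buys it a higher Accuracy purely because negatives dominate the golden set, even though it misses far more true violations.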
Case Study 2: 20% prevalence
Analogous pattern; only BA/J select the correct judge.
Simulation: 100,000 scenarios, 3 judges × 5 models
- Balanced Accuracy: success rate=0.752, mean rank-gap=0.033 (lowest)
- Macro-F1: 0.707, 0.049
- Accuracy: 0.675, 0.067
- F1: 0.617, 0.094
Selecting by Balanced Accuracy yields the highest probability of identifying the rank-faithful judge with minimal deviation when errant selections occur.
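A stripped-down version of this selection experiment already exhibits the effect. The sketch below is far simpler than the paper's setup (fewer scenarios, one fixed prevalence, and "rank-faithful" reduced to "highest true TPR + TNR"; all parameters are illustrative): each metric is estimated on a sampled golden set, and we count how often its argmax judge coincides with the judge whose true J is highest.

```python
import random

def selection_success_rates(n_scenarios=2000, n_golden=500, prevalence=0.1, seed=0):
    """How often does each metric, computed on a finite golden set,
    pick the judge with the highest true TPR + TNR (highest true J)?"""
    rng = random.Random(seed)
    hits = {"ba": 0, "accuracy": 0, "f1": 0}
    n_pos = round(n_golden * prevalence)
    n_neg = n_golden - n_pos
    for _ in range(n_scenarios):
        judges = [(rng.uniform(0.5, 0.95), rng.uniform(0.5, 0.95))  # true (TPR, TNR)
                  for _ in range(3)]
        best_true = max(range(3), key=lambda i: sum(judges[i]))
        scores = {"ba": [], "accuracy": [], "f1": []}
        for tpr, tnr in judges:
            tp = sum(rng.random() < tpr for _ in range(n_pos))  # sampled confusion counts
            tn = sum(rng.random() < tnr for _ in range(n_neg))
            fp = n_neg - tn
            est_tpr, est_tnr = tp / n_pos, tn / n_neg
            prec = tp / (tp + fp) if tp + fp else 0.0
            scores["ba"].append((est_tpr + est_tnr) / 2)
            scores["accuracy"].append((tp + tn) / n_golden)
            scores["f1"].append(2 * prec * est_tpr / (prec + est_tpr)
                                if prec + est_tpr else 0.0)
        for metric, vals in scores.items():
            if max(range(3), key=vals.__getitem__) == best_true:
                hits[metric] += 1
    return {m: h / n_scenarios for m, h in hits.items()}
```

On a typical run of this toy version, selecting by BA recovers the best judge more often than Accuracy or F1, mirroring the ordering reported in the paper.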
4. Practical Protocol for LaJ Evaluator Selection Using Balanced Accuracy
a. Golden-set construction: Collect a representative validation set (1,000–2,000 items) with expert/consensus labels. Class balance is ideal but not required.
b. Confusion matrix computation: For every candidate judge, compute TP, FP, TN, FN.
c. Balanced Accuracy calculation: For each candidate, BA = (TP / (TP + FN) + TN / (TN + FP)) / 2.
d. Threshold selection (if outputs are continuous): Maximize Youden's J on a hold-out or tuning set for optimal discrimination.
e. Selection: Rank judges by Balanced Accuracy. If the task is extremely recall-sensitive, consider inspecting TPR/FPR directly.
f. Multi-class tasks: Compute macro-averaged Balanced Accuracy, BA_macro = (1 / C) · Σ_c Recall_c, where C is the number of classes and Recall_c is the per-class recall.
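Steps (d) and (f) can be sketched in a few lines. Both helpers below are illustrative assumptions, not code from the paper; the threshold search is a plain sweep over the observed scores:

```python
def best_threshold_by_j(scores, labels):
    """Step (d): sweep candidate thresholds and keep the one maximizing
    Youden's J = TPR - FPR on a held-out tuning set (labels are 0/1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j

def macro_balanced_accuracy(per_class_recalls):
    """Step (f): macro-averaged BA is the mean of the C per-class recalls."""
    return sum(per_class_recalls) / len(per_class_recalls)
```

For instance, `best_threshold_by_j([0.1, 0.2, 0.35, 0.4, 0.8, 0.9], [0, 0, 0, 1, 1, 1])` returns `(0.4, 1.0)`, the perfectly separating threshold.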
5. Implications for Prevalence Estimation and Model Comparison
- Balanced Accuracy (or J) is the only scalar metric that guarantees prevalence-gap preservation, a required property for downstream model comparison or evaluation release gating.
- Reporting per-class TPR/TNR or a confusion matrix with BA aids transparency and trust.
- Relying exclusively on metrics such as Accuracy, F1, Precision, or Macro-F1 under class imbalance can cause systematic judge-selection errors, leading to over- or under-estimation of actual prevalence and, correspondingly, flawed model rankings.
6. Recommendations for LLM-as-Judge (LaJ) Practice
- Adopt Balanced Accuracy (or Youden's J) as the primary metric for any LaJ system used for prevalence estimation, particularly where class distributions are skewed or shifting.
- Always publish the full confusion matrix or per-class recall rates for any reported judge’s evaluation.
- For continuous scoring judges, tune operating thresholds to maximize J on held-out validation sets (i.e., ROC-curve optimization).
- For safety-critical or imbalanced settings, avoid defaulting to legacy metrics that may mask bias or dampen prevalence resolution.
- In large-scale deployments, ensure that the golden set is sufficiently large (on the order of 1,000–2,000 labels); marginal returns diminish beyond this point.
By grounding model selection in Balanced Accuracy, researchers and practitioners align judge choice with theoretically justified, empirically robust measures that guarantee faithful, prevalence-independent comparisons in LLM benchmarking and risk assessment pipelines (Collot et al., 8 Dec 2025).