LLM-as-Judge Evaluators
- The paper demonstrates that Balanced Accuracy and Youden’s J uniquely capture prevalence-gap preservation critical for reliable LLM judge selection.
- LLM-as-Judge evaluators use large language models to automatically assess model behaviors; selecting among candidate judges is most delicate under class imbalance, where the choice of validation metric determines whether model-to-model comparisons are robust.
- Empirical case studies and simulations show that BA/J outperform traditional metrics such as Accuracy and F1, yielding actionable protocols for real-world deployments.
LLMs are now widely employed as automated classifiers of model behaviors—ranging from safety violations to task performance rates—through paradigms collectively termed “LLM-as-Judge” (LaJ) evaluators. In prevalence estimation tasks that underpin core NLP benchmarks, deployability analyses, and policy setting, the statistical metric used to select and validate LLM-based judges is fundamental for credible model-to-model comparisons. The manuscript “Balanced Accuracy: The Right Metric for Evaluating LLM Judges—Explained through Youden’s J statistic” rigorously analyzes the landscape of prevalence metrics and demonstrates, both theoretically and empirically, that Balanced Accuracy (BA) and Youden’s J statistic (J) are uniquely appropriate for LaJ settings, especially under strong class imbalance. The following sections give a structured account of definitions, theoretical arguments, empirical evidence, practical protocols, and applied recommendations for the use of LaJ evaluators anchored in balanced accuracy (Collot et al., 8 Dec 2025).
1. Formal Metrics for Judge Evaluation
Let N denote the number of labeled instances in a golden validation set, with TP, FP, TN, and FN denoting counts of true/false positives/negatives for a given candidate judge model. The principal classification metrics utilized in the literature are:
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / N |
| Precision | TP / (TP + FP) |
| Recall (TPR) | TP / (TP + FN) |
| Specificity (TNR) | TN / (TN + FP) |
| F1 Score | 2 · Precision · Recall / (Precision + Recall) |
| Youden's J | TPR + TNR − 1 |
| Balanced Accuracy | (TPR + TNR) / 2 |
Balanced Accuracy and Youden's J are linearly related via BA = (J + 1) / 2.
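These definitions are straightforward to collect into a small helper. The sketch below is illustrative (the function name and returned dictionary layout are assumptions, not from the paper); the assertion checks the linear relation BA = (J + 1) / 2:

```python
def judge_metrics(tp, fp, tn, fn):
    """Compute the metrics above from a candidate judge's confusion counts."""
    tpr = tp / (tp + fn)                    # Recall / sensitivity
    tnr = tn / (tn + fp)                    # Specificity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * tpr / (precision + tpr)
    j = tpr + tnr - 1                       # Youden's J
    ba = (tpr + tnr) / 2                    # Balanced Accuracy
    assert abs(ba - (j + 1) / 2) < 1e-12    # linear relation BA = (J + 1) / 2
    return {"accuracy": accuracy, "precision": precision, "recall": tpr,
            "specificity": tnr, "f1": f1, "j": j, "ba": ba}
```

For example, `judge_metrics(tp=76, fp=150, tn=850, fn=24)` gives TPR = 0.76 and TNR = 0.85, hence BA = 0.805 and J = 0.61.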
2. Theoretical Justification: Why Balanced Accuracy and Youden's J
The core conceptual argument is that the goal of a judge in LaJ pipelines is to measure the true prevalence gap between models as faithfully as possible, independent of class balance. Let two models differ in true prevalence by Δp. A candidate judge with sensitivity s (TPR) and false-positive rate f (FPR) will report an apparent prevalence difference of Δp̂ = (s − f) · Δp = J · Δp. Therefore, J (and by linearity, Balanced Accuracy) exactly captures the “scaling slope” by which the judge propagates true prevalence gaps.
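This scaling identity follows from the fact that a judge's apparent positive rate at true prevalence p is TPR · p + FPR · (1 − p), so the prevalence-dependent terms cancel when taking a difference. A numeric check with illustrative operating points (the numbers below are assumptions, not from the paper):

```python
def apparent_positive_rate(p, tpr, fpr):
    # Fraction of items the judge flags positive when true prevalence is p.
    return tpr * p + fpr * (1 - p)

tpr, fpr = 0.76, 0.15          # illustrative judge operating point
j = tpr - fpr                  # Youden's J = TPR - FPR
p1, p2 = 0.20, 0.08            # true prevalences of two hypothetical models
true_gap = p1 - p2
apparent_gap = (apparent_positive_rate(p1, tpr, fpr)
                - apparent_positive_rate(p2, tpr, fpr))
assert abs(apparent_gap - j * true_gap) < 1e-12  # gap is scaled by exactly J
```

The judge reports a gap of 0.61 × 0.12 ≈ 0.073 regardless of where the two prevalences sit, which is why J is the relevant selection criterion.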
Critical properties:
- Prevalence independence: TPR and TNR are conditional on the true class and invariant to the marginal class distribution.
- Class symmetry: BA/J weight both classes equally, penalizing errors on the minority and majority classes identically.
In contrast,
- Precision degrades catastrophically when positives are rare.
- Accuracy can be trivially maximized by always predicting the majority class.
- F1, and its macro variant, ignore true negatives and can be volatile under class imbalance.
No other common metric cleanly encodes the judge's ability to preserve true between-model prevalence gaps under imbalance.
3. Empirical Evidence: Case Studies and Simulation
Empirical studies, including two real-world violation-detection tasks and a Monte Carlo simulation, show that BA/J reliably select the judge that best preserves prevalence gaps.
Case Study 1: Policy Violation Detection (8.3% prevalence)
- Judge A: Precision=0.32, Recall=0.76, Specificity=0.85, J=0.61, F1=0.45, Macro-F1=0.68, Accuracy=0.85, BA=0.81
- Judge B: Precision=0.41, Recall=0.57, Specificity=0.92, J=0.49, F1=0.47, Macro-F1=0.71, Accuracy=0.90, BA=0.75
Standard metrics (Accuracy, F1, Macro-F1) incorrectly select Judge B. Only BA/J rank Judge A, the judge faithful on the rare class, higher.
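The rank flip can be reproduced from the reported recall/specificity pairs alone, since at prevalence p, Accuracy equals p · TPR + (1 − p) · TNR while BA weights both classes equally. A sketch using only the published operating points:

```python
p = 0.083  # positive-class prevalence in Case Study 1
judges = {"A": (0.76, 0.85), "B": (0.57, 0.92)}  # (recall, specificity)

results = {}
for name, (tpr, tnr) in judges.items():
    results[name] = {
        "ba": (tpr + tnr) / 2,                # class-symmetric
        "accuracy": p * tpr + (1 - p) * tnr,  # dominated by the ~92% negative class
    }

# Accuracy prefers Judge B, but BA correctly prefers Judge A.
assert results["B"]["accuracy"] > results["A"]["accuracy"]
assert results["A"]["ba"] > results["B"]["ba"]
```

Judge B's higher specificity buys it a higher Accuracy purely because negatives dominate the golden set, even though it misses far more true violations.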
Case Study 2: 20% prevalence
Analogous pattern; only BA/J select the correct judge.
Simulation: 100,000 scenarios, 3 judges × 5 models
- Balanced Accuracy: success rate=0.752, mean rank-gap=0.033 (lowest)
- Macro-F1: 0.707, 0.049
- Accuracy: 0.675, 0.067
- F1: 0.617, 0.094
Selecting by Balanced Accuracy yields the highest probability of identifying the rank-faithful judge with minimal deviation when errant selections occur.
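A stripped-down version of this selection experiment already exhibits the effect. The sketch below is far simpler than the paper's setup (fewer scenarios, one fixed prevalence, and "rank-faithful" reduced to "highest true TPR + TNR"; all parameters are illustrative): each metric is estimated on a sampled golden set, and we count how often its argmax judge coincides with the judge whose true J is highest.

```python
import random

def selection_success_rates(n_scenarios=2000, n_golden=500, prevalence=0.1, seed=0):
    """How often does each metric, computed on a finite golden set,
    pick the judge with the highest true TPR + TNR (highest true J)?"""
    rng = random.Random(seed)
    hits = {"ba": 0, "accuracy": 0, "f1": 0}
    n_pos = round(n_golden * prevalence)
    n_neg = n_golden - n_pos
    for _ in range(n_scenarios):
        judges = [(rng.uniform(0.5, 0.95), rng.uniform(0.5, 0.95))  # true (TPR, TNR)
                  for _ in range(3)]
        best_true = max(range(3), key=lambda i: sum(judges[i]))
        scores = {"ba": [], "accuracy": [], "f1": []}
        for tpr, tnr in judges:
            tp = sum(rng.random() < tpr for _ in range(n_pos))  # sampled confusion counts
            tn = sum(rng.random() < tnr for _ in range(n_neg))
            fp = n_neg - tn
            est_tpr, est_tnr = tp / n_pos, tn / n_neg
            prec = tp / (tp + fp) if tp + fp else 0.0
            scores["ba"].append((est_tpr + est_tnr) / 2)
            scores["accuracy"].append((tp + tn) / n_golden)
            scores["f1"].append(2 * prec * est_tpr / (prec + est_tpr)
                                if prec + est_tpr else 0.0)
        for metric, vals in scores.items():
            if max(range(3), key=vals.__getitem__) == best_true:
                hits[metric] += 1
    return {m: h / n_scenarios for m, h in hits.items()}
```

On a typical run of this toy version, selecting by BA recovers the best judge more often than Accuracy or F1, mirroring the ordering reported in the paper.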
4. Practical Protocol for LaJ Evaluator Selection Using Balanced Accuracy
a. Golden-set construction: Collect a representative validation set (1,000–2,000 items) with expert/consensus labels. Class balance is ideal but not required.
b. Confusion matrix computation: For every candidate judge, compute TP, FP, TN, FN.
c. Balanced Accuracy calculation: For each candidate, BA = (TP / (TP + FN) + TN / (TN + FP)) / 2.
d. Threshold selection (if outputs are continuous): Maximize Youden's J on a hold-out or tuning set for optimal discrimination.
e. Selection: Rank judges by Balanced Accuracy. If the task is extremely recall-sensitive, consider inspecting TPR/FPR directly.
f. Multi-class tasks: Compute macro-averaged Balanced Accuracy, BA_macro = (1 / C) · Σ_c Recall_c, where C is the number of classes and Recall_c is the per-class recall.
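Steps (d) and (f) can be sketched in a few lines. Both helpers below are illustrative assumptions, not code from the paper; the threshold search is a plain sweep over the observed scores:

```python
def best_threshold_by_j(scores, labels):
    """Step (d): sweep candidate thresholds and keep the one maximizing
    Youden's J = TPR - FPR on a held-out tuning set (labels are 0/1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j

def macro_balanced_accuracy(per_class_recalls):
    """Step (f): macro-averaged BA is the mean of the C per-class recalls."""
    return sum(per_class_recalls) / len(per_class_recalls)
```

For instance, `best_threshold_by_j([0.1, 0.2, 0.35, 0.4, 0.8, 0.9], [0, 0, 0, 1, 1, 1])` returns `(0.4, 1.0)`, the perfectly separating threshold.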
5. Implications for Prevalence Estimation and Model Comparison
- Balanced Accuracy (or J) is the only scalar metric that guarantees prevalence-gap preservation, a required property for downstream model comparison or evaluation release gating.
- Reporting per-class TPR/TNR or a confusion matrix with BA aids transparency and trust.
- Relying exclusively on metrics such as Accuracy, F1, Precision, or Macro-F1 under class imbalance can cause systematic judge-selection errors, leading to over- or under-estimation of actual prevalence and, correspondingly, flawed model rankings.
6. Recommendations for LLM-as-Judge (LaJ) Practice
- Adopt Balanced Accuracy (or Youden's J) as the primary metric for any LaJ system used for prevalence estimation, particularly where class distributions are skewed or shifting.
- Always publish the full confusion matrix or per-class recall rates for any reported judge’s evaluation.
- For continuous scoring judges, tune operating thresholds to maximize J on held-out validation sets (i.e., ROC-curve optimization).
- For safety-critical or imbalanced settings, avoid defaulting to legacy metrics that may mask bias or dampen prevalence resolution.
- In large-scale deployments, ensure that the golden set is sufficiently large (on the order of 1,000–2,000 labels); marginal returns diminish beyond this point.
By grounding model selection in Balanced Accuracy, researchers and practitioners align judge choice with theoretically justified, empirically robust measures that guarantee faithful, prevalence-independent comparisons in LLM benchmarking and risk assessment pipelines (Collot et al., 8 Dec 2025).