
How to Correctly Report LLM-as-a-Judge Evaluations

Published 26 Nov 2025 in cs.LG, cs.CL, stat.AP, and stat.ML | (2511.21140v1)

Abstract: LLMs are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.

Summary

  • The paper introduces a bias correction framework that adjusts for specificity and sensitivity errors in LLM evaluations.
  • It develops statistically valid confidence intervals incorporating uncertainties from both test and calibration data.
  • The study proposes an adaptive calibration strategy to optimally allocate sample data, reducing bias and annotation costs.

Correct Statistical Practices for LLM-as-a-Judge Evaluation

Motivation and Problem Overview

The widespread substitution of human evaluation with LLM-based automated judgment in NLP and AI tasks offers scalability and efficiency. However, the inherent noise in LLM judges, stemming from their imperfect specificity (q_0) and sensitivity (q_1), induces systematic bias in commonly reported accuracy metrics. The naive estimator—typically the raw fraction of "correct" labels assigned by the LLM—yields a biased estimate of true model accuracy with respect to human ground-truth. This bias can be positive or negative depending on the underlying error profile of the judge and the region of ground-truth accuracy. Consequently, both overstatement and understatement of system-level model performance can occur, contaminating benchmarking, model selection, and longitudinal comparison.

Figure 1: Illustration of the bias in the naive LLM-as-a-Judge estimator \hat{p} and the effectiveness of bias correction across the space of true accuracy \theta.

Bias Characterization and Statistical Correction Framework

The paper formalizes the bias as a function of the judge’s error rates, using parameters q_0 (specificity) and q_1 (sensitivity). The expected observed proportion \mathbb{E}[\hat{p}] deviates from the true accuracy \theta except in the limiting case of a perfect judge (q_0 = q_1 = 1). Specifically:

\mathbb{E}[\hat{p}] = (q_0 + q_1 - 1)\theta + (1 - q_0)

This relationship reveals a critical threshold at \theta^* = (1 - q_0)/(2 - q_0 - q_1), at which the bias changes sign: the naive estimator overestimates \theta below this threshold and underestimates it above.

The authors raise a central issue: direct reporting of \hat{p} is statistically invalid unless the LLM judge's confusion profile is perfectly known and invertible. While calibrating for q_0 and q_1 is standard in prevalence estimation [rogan1978estimating], such corrections are largely absent from the LLM-based evaluation literature. Moreover, when q_0 and q_1 are themselves only estimated (from a finite calibration set), prior work did not address confidence-interval construction that incorporates all relevant sources of uncertainty.

To resolve this, the paper proposes a plug-in estimator:

\hat{\theta} = \frac{\hat{p} + \hat{q}_0 - 1}{\hat{q}_0 + \hat{q}_1 - 1}

where the calibration sample yields the empirical estimates \hat{q}_0 and \hat{q}_1. For sufficiently large calibration sizes, this adjusted estimator is shown to have uniformly lower absolute bias than the naive estimator.
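As a concrete illustration, the plug-in correction can be sketched in a few lines of Python (an illustrative sketch, not the authors' released package; the function name and the truncation behavior are assumptions drawn from the description above):

```python
def corrected_accuracy(p_hat, q0_hat, q1_hat):
    """Rogan-Gladen-style bias correction for an imperfect LLM judge.

    p_hat  : fraction of test items the judge marked "correct"
    q0_hat : estimated specificity (judge accuracy on truly incorrect items)
    q1_hat : estimated sensitivity (judge accuracy on truly correct items)
    """
    denom = q0_hat + q1_hat - 1.0
    if denom <= 0:
        # Judge is no better than chance; the correction is ill-posed.
        raise ValueError("q0 + q1 <= 1: correction is not identifiable")
    theta = (p_hat + q0_hat - 1.0) / denom
    # Truncate to the valid probability range, as advised in the paper.
    return min(1.0, max(0.0, theta))

# Example: judge says 62% correct, with q0 = 0.8 and q1 = 0.9:
# theta = (0.62 + 0.8 - 1) / (0.8 + 0.9 - 1) = 0.42 / 0.7 = 0.6
```

The guard on q_0 + q_1 - 1 matters in practice: near-chance judges make the denominator tiny and the corrected estimate unstable.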

Uncertainty Quantification and Confidence Interval Construction

A major technical contribution is the construction of statistically valid confidence intervals for \theta that jointly incorporate uncertainty from the test data (estimation of \hat{p}) and the calibration data (estimation of \hat{q}_0, \hat{q}_1). The resulting interval has the form:

\tilde{\theta} + d\tilde{\theta} \pm z_\alpha \sqrt{ \frac{ \tilde{p}(1 - \tilde{p})/\tilde{n} + (1 - \tilde{\theta})^2 \tilde{q}_0(1 - \tilde{q}_0)/\tilde{m}_0 + \tilde{\theta}^2 \tilde{q}_1(1 - \tilde{q}_1)/\tilde{m}_1 }{ (\tilde{q}_0 + \tilde{q}_1 - 1)^2 } }

where the tilde denotes "add-two" Wald-adjusted proportions [agresti2000simple]. This adjustment accounts for finite-sample bias and improves coverage; the authors advise truncating values to the [0,1] interval as necessary.
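Assuming raw counts for the test and calibration sets, the interval can be sketched as follows (illustrative only: the function and variable names are invented, and the small-sample correction term d\tilde{\theta} from the displayed formula is omitted for simplicity):

```python
import math

def judged_accuracy_ci(k, n, k0, m0, k1, m1, z=1.96):
    """Sketch of a Wald-type CI for judge-corrected accuracy.

    k  : test items the judge marked "correct", out of n
    k0 : calibration negatives the judge marked "incorrect", out of m0
    k1 : calibration positives the judge marked "correct", out of m1
    """
    # "Add two successes and two failures" adjusted proportions and sizes.
    p, nn = (k + 2) / (n + 4), n + 4
    q0, mm0 = (k0 + 2) / (m0 + 4), m0 + 4
    q1, mm1 = (k1 + 2) / (m1 + 4), m1 + 4

    denom = q0 + q1 - 1.0           # must be positive for a usable judge
    theta = (p + q0 - 1.0) / denom  # plug-in point estimate

    var = (p * (1 - p) / nn
           + (1 - theta) ** 2 * q0 * (1 - q0) / mm0
           + theta ** 2 * q1 * (1 - q1) / mm1) / denom ** 2
    half = z * math.sqrt(var)
    # Truncate to [0, 1], as the authors advise.
    return max(0.0, theta - half), min(1.0, theta + half)
```

Note how all three sample sizes (n, m_0, m_1) appear in the variance: shrinking any of them widens the interval, which is exactly the joint-uncertainty behavior described above.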

Figure 2: The relationship between calibration dataset size and the confidence interval width under various test set accuracies shows diminishing returns and guides allocation strategy.

Adaptive Calibration and Optimal Sample Allocation

Another practical issue addressed is how to allocate the calibration budget. The authors find that the two components—calibration for specificity (m_0, for negatives) and calibration for sensitivity (m_1, for positives)—contribute asymmetrically to variance depending on the empirical accuracy and LLM error rates. They derive a closed-form optimal allocation ratio for (m_0, m_1) under asymptotic analysis:

m_0^* \approx (1/\hat{p} - 1) \sqrt{\kappa} \cdot m_1 \quad \text{where} \quad \kappa = \frac{1-\hat{q}_0}{1-\hat{q}_1}

This is implemented in an adaptive pilot-based procedure that collects initial calibration data, estimates the variance ratio across conditions, and allocates the remainder optimally to minimize the final confidence interval width.
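The closed-form split can be turned into a simple budget-allocation helper (a sketch under the paper's approximation; the function name is illustrative, and it assumes \hat{q}_1 < 1 so that \kappa is finite):

```python
import math

def allocate_calibration(budget, p_hat, q0_hat, q1_hat):
    """Split a calibration budget between negatives (m0, for specificity)
    and positives (m1, for sensitivity) using the closed-form ratio
    m0* ~= (1/p_hat - 1) * sqrt(kappa) * m1, kappa = (1 - q0)/(1 - q1).
    Assumes q1_hat < 1 (otherwise kappa is undefined)."""
    kappa = (1.0 - q0_hat) / (1.0 - q1_hat)
    ratio = (1.0 / p_hat - 1.0) * math.sqrt(kappa)  # target m0 / m1
    m1 = round(budget / (1.0 + ratio))
    m0 = budget - m1
    return m0, m1
```

For example, with a symmetric judge (q_0 = q_1) and \hat{p} = 0.5 the rule reduces to an even split, while a higher \hat{p} shifts budget toward the (scarcer) truly incorrect items' counterpart, the positives.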

Figure 3: Monte Carlo validation demonstrates unbiasedness of the adjusted estimate, high coverage of the proposed interval, and interval length reduction from adaptive allocation.

Empirical Validation

Extensive Monte Carlo simulations show that the bias-adjusted estimator \hat{\theta} recovers the true accuracy across the spectrum of \theta, while the naive estimator’s bias can be large and of either sign. Empirical coverage of the proposed intervals holds at the desired nominal levels. The adaptive allocation strategy achieves consistently shorter intervals for fixed calibration effort, thus reducing the total annotation burden.
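A minimal simulation in the same spirit can be written in a few lines (not the paper's experimental code; parameters are chosen for illustration, and the true q_0, q_1 are treated as known so only the bias correction itself is exercised):

```python
import random

def simulate_mean_estimates(theta=0.4, q0=0.7, q1=0.9, n=2000, reps=200, seed=0):
    """Toy Monte Carlo: average the naive judge score and the bias-corrected
    estimate over many simulated test sets."""
    rng = random.Random(seed)
    naive = corrected = 0.0
    for _ in range(reps):
        hits = 0
        for _ in range(n):
            truly_correct = rng.random() < theta
            # Judge errs with prob 1-q1 on positives and 1-q0 on negatives.
            says_correct = (rng.random() < q1) if truly_correct else (rng.random() >= q0)
            hits += says_correct
        p_hat = hits / n
        naive += p_hat / reps
        corrected += (p_hat + q0 - 1) / (q0 + q1 - 1) / reps
    return naive, corrected

# With theta = 0.4: E[p_hat] = 0.6 * 0.4 + 0.3 = 0.54, so the naive score
# overstates accuracy, while the corrected mean concentrates near 0.4.
```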

Implications and Future Considerations

Practically, these corrections are essential for sound benchmarking in any context using automated LLM evaluation, such as model-vs-model comparison, judging improvements, or automated leaderboard construction. The continuing evolution of LLM architectures and corresponding judge models with shifting specificity/sensitivity profiles amplifies the need for statistically sound reporting procedures, particularly as differences in judged gains may lie within the margin of error induced by bias and uncertainty in the judge itself.

Theoretically, this work extends prevalence adjustment literature—traditionally situated in epidemiology—to the AI evaluation paradigm, offering a data-driven methodology for fair and transparent reporting. Future extensions could consider multi-label classification, dependency across samples, or joint estimation when the same judge model is used for both calibration and testing datasets, as well as Bayesian formulations for small data regimes.

The algorithm and codebase enable direct integration of this framework in LLM evaluation pipelines. The methods presented provide a robust statistical foundation for LLM-as-a-Judge evaluation moving forward.

Conclusion

This work establishes robust, statistically principled estimators and confidence intervals for LLM-as-a-Judge evaluations. By providing bias adjustment and proper uncertainty quantification that reflect the dual noise sources from test and calibration data, along with guidance on sample allocation, it enables more reliable and interpretable use of LLM-based model evaluation. Adoption of these methods will enhance the rigor of future reporting and facilitate trustworthy comparison in empirical LLM research.

Reference:

"How to Correctly Report LLM-as-a-Judge Evaluations" (2511.21140)


Explain it Like I'm 14

Overview

This paper looks at a common way people evaluate AI models: using an LLM as a “judge” to decide if answers are correct. That’s fast and cheap, but LLM judges sometimes make mistakes. Those mistakes can make the reported accuracy of a model too high or too low. The paper shows a simple, practical way to fix this bias and to report honest uncertainty (how sure we are) about the final accuracy. It also explains how to collect just the right kind of extra data to make the uncertainty smaller.

Key Questions

The paper focuses on three easy-to-understand questions:

  • How can we adjust scores from an imperfect LLM judge so the accuracy isn’t biased?
  • How can we build a fair confidence interval (a “margin of error”) that reflects all sources of uncertainty?
  • How should we choose the mix of calibration examples to reduce that margin of error efficiently?

How Did They Do It?

The LLM judge’s two types of accuracy

Think of the LLM judge like a referee:

  • Sensitivity (q_1): When an answer is truly correct, how often does the judge say “correct”?
  • Specificity (q_0): When an answer is truly incorrect, how often does the judge say “incorrect”?

If the judge is imperfect (which it usually is), simply reporting the fraction of answers the judge marked “correct” will be biased.

Calibrating the judge

To understand the judge’s mistakes, the authors use a small “calibration” dataset where humans have already decided what’s truly correct or incorrect. On that set, they measure:

  • \hat{q}_1: the judge’s accuracy on truly correct answers
  • \hat{q}_0: the judge’s accuracy on truly incorrect answers

These numbers tell us how the judge behaves in both situations.

Correcting the score

Suppose on your big test set, the judge says a fraction \hat{p} of answers are “correct.” If the judge were perfect, you could trust \hat{p}. But because the judge makes errors, the paper uses a classical correction to estimate the true accuracy \theta:

\hat{\theta} = \frac{\hat{p} + \hat{q}_0 - 1}{\hat{q}_0 + \hat{q}_1 - 1}

  • \hat{p}: what the judge said on the test set
  • \hat{q}_0, \hat{q}_1: what you measured on the calibration set

This “plug-in” formula removes the judge’s bias. If the judge is very reliable, the correction is small; if the judge is less reliable, the correction matters more.
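A tiny worked example with made-up numbers shows how the correction moves the score:

```python
# Toy example: the judge marks 70% of test answers "correct" (p_hat = 0.70),
# and calibration gives q0_hat = 0.8, q1_hat = 0.9.
p_hat, q0_hat, q1_hat = 0.70, 0.8, 0.9
theta_hat = (p_hat + q0_hat - 1) / (q0_hat + q1_hat - 1)
# theta_hat = 0.5 / 0.7, about 0.714: the true accuracy is a bit higher
# than the raw 70%, because this judge misses some truly correct answers.
```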

Confidence intervals (the margin of error)

A good report needs a confidence interval (CI) to say how sure we are about \hat{\theta}. Here, uncertainty comes from two places:

  1. The test set (we only saw a sample of all possible questions).
  2. The calibration set (we estimated the judge’s accuracies).

The paper combines both sources to build a single CI around \hat{\theta}, using a well-established method for proportions that stays reliable even when sample sizes aren’t huge. In short: the CI is wider when either the test set is small or the calibration set is small, and it shrinks as you collect more data.

Choosing calibration sizes wisely (adaptive allocation)

Not all calibration examples help equally. The paper shows how to split your calibration budget between:

  • examples that are truly incorrect (to estimate q_0), and
  • examples that are truly correct (to estimate q_1),

so that the final CI is as short as possible. The idea:

  • If the judge more often mislabels incorrect answers, spend more calibration on truly incorrect cases (improve \hat{q}_0).
  • If the test set has few correct answers, that also changes which side you should focus on.

They propose a simple “pilot” step: collect a small, balanced calibration sample, estimate \hat{q}_0 and \hat{q}_1, and then allocate the rest of your calibration budget to the side that will shrink the CI most.

Main Results

  • The raw judge score (just “percent marked correct”) is biased when the judge isn’t perfect. It can overstate performance at low true accuracy and understate performance at high true accuracy.
  • The plug-in correction above removes most of that bias, giving a much more accurate estimate of the true accuracy against human judgement.
  • The new confidence interval properly reflects uncertainty from both datasets (test and calibration) and achieves the intended coverage (for example, a 95% CI really covers the true accuracy about 95% of the time in simulations).
  • The adaptive calibration strategy (how you split examples between truly correct and truly incorrect) consistently makes confidence intervals shorter compared to a simple 50/50 split.
  • There’s a ready-to-use Python implementation so researchers can apply this easily.

Why It Matters

As more papers use LLMs to judge other models, reported improvements could be inflated or deflated by judge errors, not real progress. This paper gives a clear, practical fix:

  • Correct the judge’s bias with a simple formula.
  • Report a confidence interval that accounts for uncertainty from both the test and calibration data.
  • Use an adaptive calibration plan to reduce the margin of error efficiently.

Following these steps makes model evaluations fairer, more trustworthy, and easier to compare across studies.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what the paper leaves missing, uncertain, or unexplored, to guide future research.

  • Binary-only setting: Extend the framework beyond binary “correct/incorrect” judgments to multi-class labels, graded/continuous scores (e.g., scales, rubric sub-scores), and pairwise preferences (win rates), including corrected estimators and confidence intervals for each.
  • Constancy of judge accuracy: Investigate models where judge sensitivity/specificity (q_1, q_0) vary with instance features, prompt templates, domains, or the system under evaluation; develop stratified or covariate-conditional calibration and corrections.
  • Distribution shift: Quantify and correct for shifts between calibration and test distributions (including output-style shift across different SUTs), and provide sensitivity analyses or transportability guarantees when q_0, q_1 do not transfer.
  • Dependence between datasets: Relax the assumption that test and calibration data are independent and identically distributed; derive variance/CI formulas that are robust to clustering, temporal dependence, or shared-source correlation.
  • Judge drift over time: Develop procedures for monitoring and updating q_0, q_1 under model updates, API changes, prompt instruction revisions, or temperature changes, including sequential/online recalibration with guarantees.
  • Imperfect “ground truth”: Model human label noise and annotator disagreement (e.g., latent class or Dawid–Skene-style models) and propagate this uncertainty through the bias correction and interval construction.
  • Subgroup fairness: Estimate and report subgroup-specific q_0, q_1 (e.g., by topic, demographic attributes, input language), provide subgroup-aware corrections, and characterize fairness trade-offs in corrected accuracy.
  • Comparative evaluation: Provide corrected estimators and CIs for differences between systems, rankings across many systems, and paired testing designs, accounting for correlation induced by using the same judge.
  • Pairwise and ranking metrics: Generalize the approach to preference data (A/B tests, tournament win rates, rank aggregation), deriving misclassification-aware corrections and valid uncertainty quantification.
  • Small-sample and boundary behavior: Analyze finite-sample coverage of the proposed Wald-type CI, especially when sample sizes are small, q_0, q_1 are extreme (near 0 or 1), or corrected estimates sit near the [0,1] boundaries; compare with exact, profile-likelihood, bootstrap, or Bayesian intervals.
  • Truncation effects: Quantify the impact of truncating corrected estimates/CIs to [0,1] on bias and coverage; evaluate alternatives (e.g., logit-transformed intervals or constrained optimization).
  • Overdispersion and correlated judge errors: Address overdispersion (e.g., beta-binomial/clustered errors) and topic/item-level dependencies that can make standard binomial variance understate uncertainty; propose cluster-robust or hierarchical variance estimators.
  • Edge cases and stability: Provide diagnostics and safeguards when q_0 + q_1 \le 1 or close to 1 (ill-conditioned denominator), including regularization/shrinkage for q estimates, conservative intervals, or decision rules for abandoning LLM-judge use.
  • Optimal allocation beyond approximations: Analyze the adaptive allocation rule without the “q_0, q_1 close to 1” assumption; provide non-asymptotic bounds and robustness to using biased \hat{p} and noisy pilot estimates in the allocation.
  • Practical implementability of allocation: Address how to achieve target m_0, m_1 when the true label z is unknown before annotation; design sequential labeling workflows (e.g., screen-then-confirm) and quantify cost/benefit.
  • Budgeted design: Develop closed-form/sample-size calculators for meeting target CI widths under joint constraints on test size n and calibration sizes m_0, m_1; incorporate monetary/time costs and propose optimal budget allocations.
  • Prompting and protocol sensitivity: Study how judge prompt design, rubric specificity, chain-of-thought visibility, and multi-turn judging affect q_0, q_1, and how to standardize or calibrate across protocols.
  • Multiple judges and ensembling: Extend to settings with multiple LLM judges or human–LLM hybrids (e.g., majority vote, stacking), including joint estimation of judge-specific q’s and ensemble-level corrections/intervals.
  • Multi-criteria evaluation: Generalize to vector-valued outcomes (e.g., helpfulness, harmlessness, factuality), including joint corrections, multiplicity control, and multivariate uncertainty reporting.
  • Strategic behavior and adversarial robustness: Model systems that exploit judge weaknesses; design evaluation protocols and corrections minimizing worst-case bias under targeted attacks.
  • Real-world validation: Apply the method to public LLM benchmarks with human ground truth to assess external validity, judge/version drift, subgroup effects, and the practical gains of adaptive allocation.
  • Reuse and transfer of calibration: Determine when a calibration obtained on task A can be safely reused on task B or system B; formalize criteria and tests for calibration portability.
  • Software robustness: Specify numerical stability safeguards (e.g., handling near-singular denominators, extreme proportions), default priors/shrinkage for q estimation, and reproducibility standards in the provided implementation.

Practical Applications

Overview

The paper introduces a practical, statistically grounded framework for using LLMs as evaluators (“LLM-as-a-Judge”). It provides:

  • A bias-corrected estimator for true accuracy using the judge’s sensitivity and specificity.
  • A confidence interval that accounts for uncertainty from both test and calibration datasets.
  • An adaptive calibration-sample allocation algorithm to minimize interval length under fixed budgets.
  • A ready-to-use Python implementation: https://github.com/UW-Madison-Lee-Lab/LLM-judge-reporting.

Below are concrete applications across sectors and settings, organized by immediate and long-term horizons. Each item notes likely tools/workflows and key dependencies that impact feasibility.

Immediate Applications

These applications can be deployed now using the paper’s plug-in estimator, confidence intervals, and adaptive calibration design.

  • Industry (Software): Calibrated LLM evaluator for code-generation quality gates
    • Use case: Replace raw LLM-judge pass rates with bias-adjusted accuracy and 95% CI to gate model deployments (e.g., “merge if corrected accuracy ≥ target and CI width ≤ threshold”).
    • Workflow:
    • 1) Collect a small, human-labeled calibration set split into “correct” and “incorrect” code solutions.
    • 2) Compute corrected accuracy and CI with the provided Python package.
    • 3) Apply adaptive calibration to minimize CI length under budget.
    • Dependencies/assumptions: Stable prompts/judge behavior; calibration labels reflect the target domain; binary correctness mapping; independence between test and calibration datasets.
  • Industry (Trust & Safety): Confidence-aware LLM moderation filters
    • Use case: Calibrate an LLM that flags harmful content; report corrected precision/recall proxies via q_0, q_1; set thresholds with CI-aware risk tolerance.
    • Tools: The plug-in estimator + CI to set operational thresholds; dashboards to track judge drift and recalibration needs.
    • Dependencies: Representative “harmful” vs “non-harmful” calibration labels; drift monitoring (judge reliability can change with new prompts/content); class imbalance handled via adaptive allocation.
  • Industry (Product/UX Analytics): A/B testing of LLM features with bias-adjusted evaluation
    • Use case: Compare model variants using corrected accuracy and overlapping CIs; avoid rollouts driven by judge bias.
    • Workflow: Integrate the estimator into experimentation platforms (e.g., MLflow/W&B metrics), log corrected metrics and CI.
    • Dependencies: Sufficient calibration size for desired CI length; consistent test distribution across variants.
  • Academia (Benchmarking & Papers): Correct reporting of LLM-as-a-Judge results
    • Use case: Replace naive “LLM says X% correct” with bias-adjusted estimates and CIs in publications and benchmarks.
    • Tools: GitHub implementation; methods section specifying judge q_0, q_1, calibration sampling plan, CI construction, and adaptive allocation.
    • Dependencies: Availability of human-labeled calibration data; consistent judge prompts; clearly documented binary label mapping and truncation to [0,1].
  • Benchmark Hosts/Leaderboards: Confidence-aware leaderboards for judged tasks
    • Use case: Display corrected accuracy and uncertainty (CI width) for submissions judged by LLMs.
    • Tools: “Confidence-aware leaderboard” module using the estimator and CI; flags for insufficient calibration (e.g., q_0 + q_1 \le 1).
    • Dependencies: Governance for minimum calibration standards; periodic re-calibration; clear reporting of judge parameters and sample sizes.
  • Education: Calibrated auto-grading for short answers/code exercises
    • Use case: When an LLM acts as a grader, report corrected pass rates and CI; adapt calibration effort to reduce uncertainty for high-stakes assessments.
    • Tools: Instructor dashboard integrating adaptive calibration allocation and CI reporting; policy for escalation to human grading when CI is too wide.
    • Dependencies: Ground-truth rubric consistency; domain shift across assignments; student fairness considerations (documented judge reliability).
  • Finance & Legal (Compliance Review): Evaluating LLM summaries/classifications with calibrated judges
    • Use case: Use an LLM judge to assess whether generated reports meet compliance criteria; report corrected compliance rates with CI for auditability.
    • Tools: Calibration on representative “compliant”/“non-compliant” documents; CI-aware thresholds for release decisions.
    • Dependencies: High-quality, expert-labeled calibration data; tight domain focus; periodic reassessment under regulatory changes.
  • Daily Life (LLM App Builders and Open-Source Maintainers): Trustworthy evaluation of app responses
    • Use case: For productivity or Q&A apps that internally rely on an LLM judge, present corrected accuracy and uncertainty to users and contributors.
    • Tools: Lightweight integration of the Python package; small pilot calibration to set initial judge parameters; adaptive allocation to improve confidence quickly.
    • Dependencies: Willingness to collect a small number of human labels; maintaining stable judge prompts; clarity on what “correct/incorrect” means for the app.

Long-Term Applications

These require further research, standardization, scaling, or tooling maturation.

  • Standards & Policy: Reporting requirements for LLM-as-a-Judge
    • Use case: Publishers, funders, and standards bodies (e.g., ISO/IEC JTC 1/SC 42, NIST, ACM/ACL) adopt guidance mandating bias-adjusted estimates and CIs when LLMs are evaluators.
    • Potential outcomes: Template sections in papers/model cards; minimum calibration budgets; judge drift monitoring policies.
    • Dependencies: Community consensus; tooling support in benchmark platforms; reviewer training.
  • Generalized Judging Beyond Binary: Graded rubrics, multi-class, and pairwise preferences
    • Use case: Extend methods to ordinal scores (rubrics), multi-label tasks, and preference judgments (A vs B).
    • Potential tools: Generalized estimators and CIs for non-binary judgments; preference calibration where “sensitivity/specificity” map to Bradley–Terry or Thurstone models.
    • Dependencies: Statistical extensions; new calibration designs; robust assumptions for more complex judgment types.
  • Judge Optimization & Selection: Choosing or training judges to maximize q_0 + q_1 and minimize uncertainty
    • Use case: Evaluate multiple LLM judges/prompts; select the one with best calibrated reliability or ensemble judges for improved specificity/sensitivity.
    • Potential products: “Judge selection” toolkits; prompt governance and judge ensembling; meta-evaluation dashboards.
    • Dependencies: Cost/latency constraints; reliable estimation of judge parameters across domains; fairness considerations.
  • Continuous/Online Calibration with Drift Detection
    • Use case: Maintain CI quality under changing distributions (topics, languages, adversarial inputs) by streaming calibration and automated alerts when q_0, q_1 shift.
    • Potential workflows: Active learning to pick calibration samples likely to reduce CI; periodic recalibration cycles.
    • Dependencies: Monitoring infrastructure; labeled feedback loops; budget for ongoing human labeling.
  • Cross-Domain and Multilingual Fairness Audits
    • Use case: Calibrate judges across demographics/languages/domains; report domain-specific corrected metrics and CIs to surface misalignment and bias.
    • Potential tools: Stratified calibration and reporting; fairness dashboards overlaying corrected metrics.
    • Dependencies: Representative calibration sampling; governance for subgroup reporting; ethical review.
  • Platform Integrations: EvaluationOps in ML toolchains
    • Use case: First-class support in ML platforms (MLflow, Weights & Biases, Hugging Face Evaluate) for bias-adjusted metrics and CI logging.
    • Potential products: “Calibrated Judge SDK,” “Confidence-aware evaluation” plugins; CI width as a deployment KPI.
    • Dependencies: API design; standardized metadata (judge prompt, calibration sizes, q_0, q_1 estimates); organizational buy-in.
  • Procurement and Regulatory Compliance
    • Use case: Government/enterprise procurement requires calibrated reporting for LLM-evaluated benchmarks; regulators accept CI-aware evidence for safety claims.
    • Potential outcomes: Contractual clauses; compliance checklists; third-party audits leveraging corrected estimates.
    • Dependencies: Legal frameworks; auditor expertise; sector-specific calibration requirements (e.g., healthcare safety thresholds).
  • Cost-Aware Calibration Planning and Marketplaces
    • Use case: Optimize calibration budgets across tasks/products; marketplaces for expert calibration labeling; automated allocation beyond the paper’s pilot-based heuristic.
    • Potential tools: Budget planners that target CI length; integrated active sampling; shared calibration pools.
    • Dependencies: Labeler availability; task variation; standard contracts and data governance.

Key Assumptions and Dependencies (Cross-Cutting)

  • Calibration data quality: Requires human-verified labels that reflect the evaluation domain.
  • Stability: Judge accuracy (q_0, q_1) depends on prompts, instructions, and context; changes require recalibration.
  • Independence: Test and calibration datasets should be independent; leakage biases estimates.
  • Binary mapping: The current estimator assumes binary correctness. Non-binary judgments need methodological extensions.
  • Sample sizes: Desired CI width dictates calibration budget; adaptive allocation helps but cannot replace insufficient labeling.
  • Distribution shift: If the test distribution differs materially from calibration, corrected estimates may misrepresent true accuracy; drift monitoring is important.
  • Transparency: Report all judge settings, calibration sizes, and CI construction details to ensure reproducibility and trust.

By adopting the paper’s methods and tooling, organizations can improve the reliability, transparency, and comparability of LLM-based evaluations today, while laying the groundwork for standardized, fairness-aware judging practices in the future.

Glossary

  • Adaptive allocation: A procedure that allocates calibration samples between label types to minimize confidence-interval length using pilot estimates. "we introduce an adaptive allocation procedure in Algorithm~\ref{alg:allocation}."
  • Adjusted Wald interval: A corrected Wald confidence interval for proportions that adds pseudo-counts to improve coverage, often summarized as “add two successes and two failures.” "following the ``add two successes and two failures'' adjusted Wald interval approach"
  • Asymptotic variance: The variance of an estimator in the large-sample limit, often derived via the delta method to construct approximate confidence intervals. "the asymptotic variance of \hat{\theta} is"
  • Bias-adjusted estimator: An estimator that corrects bias due to judge misclassification by incorporating sensitivity and specificity (e.g., Rogan–Gladen style). "producing the bias-adjusted estimator \hat{\theta}."
  • Binomial distribution: The probability distribution modeling the number of successes in a fixed number of independent Bernoulli trials; used to derive variances of proportion estimates. "Because p follows a binomial distribution"
  • Calibration dataset: A labeled dataset with human ground-truth used to estimate the judge’s sensitivity and specificity parameters (q_1, q_0). "they can be estimated from a calibration dataset with ground-truth labels"
  • Confidence interval: An interval estimate that quantifies uncertainty about a parameter (here, true accuracy \theta), incorporating both test and calibration randomness. "and construct a confidence interval that accounts for uncertainty arising from both the test and calibration datasets"
  • Coverage probability: The proportion of times a confidence interval contains the true parameter across repeated samples, used to assess interval reliability. "Fig.~\ref{fig:mc-simulation-cp} reports the empirical coverage probability of the confidence interval in~\eqref{eq:ci}."
  • Delta method: A technique to approximate the variance of a function of estimators using a Taylor expansion, central to deriving the variance of \hat{\theta}. "By applying the delta method~\citep{dorfman1938note, ver2012invented}, the asymptotic variance of \hat{\theta} is"
  • Error ratio: The ratio of error probabilities across label types, typically (1-\tilde{q}_0)/(1-\tilde{q}_1), used to guide optimal calibration allocation. "Compute the estimated error–ratio: \hat{\kappa} \gets (1 - \tilde{q}_0)/(1 - \tilde{q}_1)."
  • Indicator function: A function that returns 1 if a condition holds and 0 otherwise, used to define empirical accuracies. "where \mathbf{1}\{\cdot\} denotes the indicator function."
  • Law of total probability: A rule that decomposes probabilities across a partition, used here to relate observed judge predictions to true accuracy and judge accuracies. "By the law of total probability, we have"
  • LLM-as-a-Judge: An evaluation paradigm where an LLM is used as the judge in place of humans to label correctness or quality. "LLM-as-a-Judge evaluation involves two sources of uncertainty:"
  • Misclassification bias: Bias in an estimated accuracy arising from a judge’s false positives/negatives, which the adjustment aims to correct. "which exactly corrects for the LLM’s misclassification bias."
  • Monte Carlo simulation: A computational experiment using repeated random sampling to assess estimator bias, coverage, and interval length. "Monte Carlo simulation for estimating \theta under an imperfect LLM judge with (q_0, q_1) = (0.7, 0.9)."
  • Pilot calibration sample: A small initial calibration set used to estimate judge accuracies and inform adaptive allocation of the full calibration budget. "The algorithm begins by collecting a small pilot calibration sample (e.g., m_{\mathrm{pilot}} = 10 for each label type)"
  • Plug-in estimator: An estimator obtained by substituting estimated nuisance parameters (e.g., \hat{q}_0, \hat{q}_1) into a correction formula. "derive a simple plug-in estimator that corrects the resulting bias"
  • Prevalence estimation: Estimating the true proportion of positives corrected for imperfect tests, providing the classical basis for the bias adjustment. "a classical result from prevalence estimation \citep{rogan1978estimating} provides an exact adjustment."
  • Quantile (standard normal): The value cutting off a specified upper tail probability of the standard normal distribution, used to set confidence-level multipliers. "z_{\alpha} denotes the (1-\alpha/2) quantile of the standard normal distribution"
  • Sensitivity: The judge’s true positive rate, \Pr(\hat{Z}=1\mid Z=1), i.e., probability of marking a truly correct item as correct. "where q_1 and q_0 are LLM’s sensitivity and specificity."
  • Specificity: The judge’s true negative rate, \Pr(\hat{Z}=0\mid Z=0), i.e., probability of marking a truly incorrect item as incorrect. "where q_1 and q_0 are LLM’s sensitivity and specificity."
  • Wald-type confidence interval: A confidence interval based on a normal approximation using the estimator’s variance, here applied after adjustments to improve coverage. "yields the Wald-type confidence interval shown in \eqref{eq:ci}"
