Measuring all the noises of LLM Evals

Published 24 Dec 2025 in cs.LG, cs.AI, cs.CL, and stat.ML | (2512.21326v1)

Abstract: Separating signal from noise is central to experimental science. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.

Summary

  • The paper quantifies distinct noise components in LLM evaluations by decomposing prediction and data variance using an all-pairs paired method.
  • It demonstrates that prediction noise typically dominates over data noise, and that averaging outputs can significantly boost statistical power.
  • The method provides reproducible and scalable metrics for model comparison through careful paired analysis, validated across diverse benchmarks.

Comprehensive Noise Analysis in LLM Evaluations

Introduction

"Measuring all the noises of LLM Evals" (2512.21326) addresses a central issue in experimental science: demarcating signal from noise within LLM evaluation setups. Unlike typical ML models or physical experiments, LLMs are inherently stochastic, introducing unique noise characteristics. The paper delineates and precisely quantifies three noise sources: prediction noise from stochastic sampling, data noise from finite test set sampling, and their combined total noise, with careful variance decomposition. The work's main contribution is the all-pairs paired method, which computes all noise components exhaustively for every LLM pair, providing practitioners with principled and reproducible means of drawing statistical inferences from LLM evals.

Noise Decomposition and Statistical Power

The authors adopt the law of total variance, rigorously separating data variance (due to question sampling) from prediction variance (due to model stochasticity on a fixed prompt). Notably, LLMs differ from classical classifiers: under temperature sampling, their outputs for a fixed prompt are distributed rather than deterministic. Thus, repeated inference on the same example yields a distribution of outputs, allowing direct empirical measurement of prediction noise.
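The decomposition can be illustrated with simulated correctness scores. This is a minimal sketch, not the paper's code: it assumes a hypothetical setup where each question has a latent per-question accuracy and each of K repeated predictions is a Bernoulli draw, then verifies that the law of total variance holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N questions, K repeated predictions per question.
# Each question i has a latent per-question accuracy p[i]; each repeated
# prediction is a Bernoulli(p[i]) correctness score.
N, K = 1000, 50
p = rng.beta(2, 2, size=N)                          # latent per-question accuracies
scores = rng.binomial(1, p[:, None], size=(N, K))   # N x K correctness matrix

q_means = scores.mean(axis=1)   # E[score | question]
q_vars = scores.var(axis=1)     # Var[score | question]

data_var = q_means.var()        # Var_q E[score | q]   (data noise)
pred_var = q_vars.mean()        # E_q Var[score | q]   (prediction noise)
total_var = scores.var()        # variance over all N*K scores

# Law of total variance: total = data + prediction (exact with ddof=0)
print(data_var, pred_var, total_var)
```

With population (ddof=0) variances and equal-sized groups, the identity holds exactly, not just in expectation, which makes it a convenient sanity check for any noise-decomposition pipeline.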

The authors make a substantive claim: in almost all standard LLM evals, the paired prediction noise is typically higher than paired data noise. This is empirically validated across millions of model-question pairs. As a result:

  • Reducing prediction noise (via output averaging or majority voting) provides much greater statistical power than merely increasing dataset size.
  • Paired analysis (comparing models' differences on identical questions) massively reduces data noise when models are similar.

The implication is that with careful averaging and paired statistical tests, much finer-grained comparisons between models are possible, contrary to prior assumptions that only test set size or unpaired accuracy metrics matter.
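The variance reduction from pairing can be sketched numerically. The following simulation (my own illustration, with hypothetical parameters) models two similar LLMs whose per-question accuracies share a common difficulty component; the paired difference then has far lower variance than the sum of the two unpaired variances.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 2000, 20

# Hypothetical: two similar models share most per-question difficulty.
base = rng.beta(2, 2, size=N)
pA = np.clip(base + 0.02 + 0.01 * rng.standard_normal(N), 0.0, 1.0)
pB = np.clip(base + 0.01 * rng.standard_normal(N), 0.0, 1.0)

# Per-question accuracy, averaging K predictions to tame prediction noise.
accA = rng.binomial(K, pA) / K
accB = rng.binomial(K, pB) / K

# Variance of the mean difference: paired vs. (hypothetical) unpaired.
paired_var = (accA - accB).var(ddof=1) / N
unpaired_var = accA.var(ddof=1) / N + accB.var(ddof=1) / N

print(paired_var, unpaired_var)
```

Because the shared difficulty component cancels in the per-question difference, the paired estimator needs far fewer questions to resolve the same effect size.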

All-Pairs Paired Analysis

The paper introduces a scalable, general approach: for a suite of LLMs, for each model pair, prediction, data, and total variances are computed using extensive repeated sampling per question. The findings are highly regular:

  • Each evaluation set exhibits a characteristic, predictable total paired noise curve as a function of model accuracy.
  • This regularity enables universal heuristic rules such as Var[A − B] ≈ Var[A] = p_A(1 − p_A), where p_A is model A's per-question accuracy, eliminating the need for bespoke significance testing per experiment.

By leveraging "all-pairs" analysis, practitioners can precompute, share, and interpret robust statistical error bars applicable to any new model, provided question-level metrics are available.
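The heuristic rule above lends itself to a one-line significance check. This is a sketch of how a practitioner might apply it (the function name and example numbers are my own, not from the paper): treat p_A(1 − p_A) as the per-question variance of the paired difference and form a z-score over N questions.

```python
import math

def paired_diff_zscore(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Heuristic z-score for a paired accuracy difference, using the
    rule of thumb Var[A - B] ~ p_A * (1 - p_A) per question."""
    per_q_var = acc_a * (1.0 - acc_a)
    se = math.sqrt(per_q_var / n_questions)
    return (acc_a - acc_b) / se

# Hypothetical numbers: 72% vs. 70% on a 1000-question eval.
z = paired_diff_zscore(0.72, 0.70, 1000)
print(z)   # ~1.4, i.e., not significant at the conventional 1.96 threshold
```

The appeal is that no per-experiment bootstrap is needed: the error bar follows directly from the reported accuracy and the eval size.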

Methodological Comparison and Validation

The authors show their estimator is consistent with both bootstrap resampling and the sign test: for paired difference evaluation, all three yield the same conclusions, shifting experimental focus from per-experiment bootstrap error bars to the global noise-curve computation promoted in this work.

Moreover, the paper addresses estimator bias/correction for finite sample sizes (questions per pool and predictions per question), with empirical investigations confirming accuracy and revealing that small-K corrections are critical for proper data variance estimation when averaging predictions per question.
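The small-K correction follows from the fact that the variance of a K-prediction question mean is data variance plus prediction variance divided by K, so the naive variance of per-question means overestimates data variance. A minimal sketch (my own simulation, with hypothetical parameters) showing the bias and its removal:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 5000, 4   # small K: only a few predictions per question

p = rng.beta(2, 2, size=N)
scores = rng.binomial(1, p[:, None], size=(N, K))

q_means = scores.mean(axis=1)
pred_var = scores.var(axis=1, ddof=1).mean()   # unbiased within-question variance

# Var(q_means) = data_var + pred_var / K, so subtract the inflation term.
naive_data_var = q_means.var(ddof=1)           # inflated by pred_var / K
corrected = naive_data_var - pred_var / K      # small-K correction

true_data_var = p.var()                        # ground truth, known here
print(naive_data_var, corrected, true_data_var)
```

At K = 4 the naive estimate is roughly double the truth in this simulation, which is why the paper's finite-sample correction matters whenever predictions per question are averaged.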

Empirical Patterns and Beta Theory

The paper provides compelling empirical evidence—on benchmarks like SWE-bench, MATH500, and CRUXEval—that the total paired noise as a function of accuracy matches the variance predicted by a simple Beta model, i.e., per-question accuracy follows Beta(p, 1 − p). This regularity is robust across evals, temperatures, and LLM families, and it holds except on specific edge cases (e.g., deterministic models, settings with extremely low temperature, or pathologically imbalanced datasets).
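A useful property of the Beta(p, 1 − p) model is that its mean is p and its variance is p(1 − p)/2, i.e., half the Bernoulli variance p(1 − p). The following sketch (my own check, not the paper's code) verifies both facts by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)

# If per-question accuracy q ~ Beta(p, 1 - p), then E[q] = p and
# Var[q] = p(1-p) / ((p + (1-p))^2 * (p + (1-p) + 1)) = p(1-p)/2.
for p in (0.3, 0.5, 0.8):
    samples = rng.beta(p, 1 - p, size=200_000)
    print(p, samples.mean(), samples.var(), p * (1 - p) / 2)
```

The closed-form variance is what makes the total-noise curve a predictable function of accuracy alone.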

Importantly, the prediction noise is typically much larger than the data noise at typical generation temperatures. When lowering temperature, data noise may increase relative to prediction noise, but the total noise curve remains predictable and largely unchanged.

Discussion: Implications and Recommendations

Practical Implications:

  • Practitioners can use the provided universal noise curves to assign significance levels to their results without custom analysis.
  • Controlled experiments between highly similar models benefit most from prediction averaging.
  • For leaderboard reporting, error bars should be constructed from all-pairs paired analysis, not by comparison with an arbitrary baseline.

Theoretical Implications:

  • The findings suggest data-driven regularities in LLM evals: similarity in training procedures leads to high inter-model per-question accuracy correlation, keeping data noise low for paired analysis.
  • Prediction noise dominance arises because LLM architectures, training data, and objective functions tightly constrain per-question performance variation.

Limitations and Caveats:

  • In scenarios where a single or extremely small number of critical questions are tested, the noise model may be less reliable, and more information per question will be necessary for statistical reliability.
  • Effects related to distributional shift, adversarial data, or domain-specific clustering are not extensively analyzed in this work.

Recommendations:

  • Whenever feasible, practitioners should average over multiple outputs per question, especially in controlled model comparison settings.
  • Meta-analysis across heterogeneous evals should aggregate per-eval z-scores with carefully chosen weights, not simply aggregate raw accuracy scores or confidence intervals.
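One standard way to aggregate per-eval z-scores is the weighted Stouffer method; the paper's exact weighting scheme may differ, so the sketch below (with hypothetical weights and z-values) is illustrative only.

```python
import math

def stouffer_combined_z(z_scores, weights):
    """Weighted Stouffer combination of independent per-eval z-scores:
    z = sum(w_i * z_i) / sqrt(sum(w_i^2))."""
    num = sum(w * z for w, z in zip(weights, z_scores))
    den = math.sqrt(sum(w * w for w in weights))
    return num / den

# Hypothetical: three evals, each with a modest individual effect.
z = stouffer_combined_z([1.2, 0.9, 1.5], weights=[1.0, 0.5, 1.0])
print(z)   # individually insignificant effects can combine to significance
```

Note that naively averaging accuracies across evals with different sizes and noise levels would weight them incorrectly, which is exactly the pitfall the recommendation warns against.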

Future Directions

Potential directions include extending the methodology and analysis to non-correctness LLM metrics (e.g., preference-based human evaluation, generative or open-ended outputs), quantifying noise in fine-tuned or highly regularized models, and further characterizing the robustness and breakdowns of the Beta-theory noise approximation under various novel eval paradigms.

Conclusion

"Measuring all the noises of LLM Evals" establishes a refined and validated framework for analyzing experimental variance in LLM evaluations. The work demonstrates that LLM evals exhibit highly regular and predictable noise structure, where prediction noise dominates, and paired averaging is essential for statistical sensitivity. The all-pairs paired method, along with universal noise curves, provides a rigorous statistical basis for future LLM comparison, reporting, and meta-analysis, moving the community toward more reliable and reproducible benchmarks and model selection criteria.
