All-Pairs Paired Method
- The All-Pairs Paired Method is a framework that decomposes variance into data noise and prediction noise for robust model evaluation.
- It employs systematic paired comparisons across all model pairs to precisely estimate uncertainty and improve statistical power.
- The method integrates bias correction, power analysis, and error estimation to optimize experiment design and detect subtle differences.
The All-Pairs Paired Method is a statistically rigorous framework for quantifying noise and conducting significance testing in model evaluation, especially for LLMs. It systematically decomposes variance into interpretable sources—data noise and prediction noise—by applying paired analysis across all model pairs, thereby enabling practitioners to reliably estimate uncertainty and optimize evaluation protocols for statistical power (Wang, 24 Dec 2025).
1. Noise Decomposition and the Law of Total Variance
The All-Pairs Paired Method starts with a formal variance decomposition. Suppose $A(x, s)$ denotes the metric output (e.g., accuracy, score) of model $A$ on question $x$ and stochastic sample $s$. If evaluating $N$ questions with $K$ independent samples per question (with $x$ drawn from the empirical question set and $s$ denoting stochastic seeds), then:
- Prediction noise quantifies variation from sampling $s$: $\E_x\bigl[\Var_s A(x, s)\bigr]$
- Data noise measures the variation in expected metric values across questions: $\Var_x\bigl[\E_s A(x, s)\bigr]$
The total variance under the law of total variance is
$\Var_{x, s}[A] = \Var_x[\E_s A(x, s)] + \E_x[\Var_s A(x, s)]$
The method precisely estimates both terms, isolating how much noise arises from finite question sampling (data noise) and from stochastic model behavior (prediction noise) (Wang, 24 Dec 2025).
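The decomposition above can be estimated directly from a per-question score matrix. A minimal sketch (function name and synthetic data are illustrative, not from the source):

```python
import numpy as np

def decompose_noise(scores):
    """Split the variance of an (N questions x K samples) score matrix
    into data noise and prediction noise via the law of total variance."""
    n, k = scores.shape
    row_means = scores.mean(axis=1)           # estimates E_s A(x, s) per question
    row_vars = scores.var(axis=1, ddof=1)     # estimates Var_s A(x, s) per question
    prediction_noise = row_vars.mean()        # E_x[Var_s A(x, s)]
    # The raw variance of row means overestimates Var_x[E_s A(x, s)]
    # by prediction_noise / K, hence the small-K bias correction.
    data_noise = row_means.var(ddof=1) - prediction_noise / k
    return data_noise, prediction_noise

# Synthetic check: question-level means with variance 0.2^2 = 0.04,
# within-question sampling noise with variance 0.5^2 = 0.25.
rng = np.random.default_rng(0)
mu = rng.normal(0.7, 0.2, size=5000)
scores = mu[:, None] + rng.normal(0.0, 0.5, size=(5000, 8))
data_noise, prediction_noise = decompose_noise(scores)
```

On the synthetic data the two estimates recover the planted variances (about 0.04 and 0.25), illustrating why the bias correction matters: without it, data noise would absorb a $0.25/8$ share of prediction noise.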
2. Estimation Algorithms and All-Pairs Computation
For a single model $A$, scores are organized in an $N \times K$ matrix $A_{ij}$. Row means and row variances are computed as $\mu_i^A = \frac{1}{K}\sum_{j} A_{ij}$ and $\sigma_i^{2,A} = \frac{1}{K-1}\sum_{j} (A_{ij} - \mu_i^A)^2$.
A bias correction of $\frac{1}{K}\,\text{mean}_i(\sigma_i^{2,A})$ is subtracted from the variance of the row means when estimating data noise, since for small $K$ the raw variance of row means overestimates it.
For a model pair $(A, B)$, the paired difference $A(x, s) - B(x, s)$ enables all variance and covariance computations:
- Paired total variance: $\widehat{\Var}_{x, s}[A-B] = \text{var}(A_{ij}) + \text{var}(B_{ij}) - 2\,\text{cov}(\mu_i^A, \mu_i^B)$
- Paired data variance: $\widehat{\Var}_x[\E_s(A-B)] = \text{var}(\mu_i^A - \mu_i^B) - \frac{1}{K}\,\text{mean}_i\bigl(\sigma_i^{2,A} + \sigma_i^{2,B}\bigr)$
- Paired prediction variance: $\widehat{\E}_x[\Var_s(A-B)] = \text{mean}_i\bigl(\sigma_i^{2,A} + \sigma_i^{2,B}\bigr)$
The method computes these for all model pairs, storing results in three symmetric matrices. This provides a complete characterization of model comparison noise in the evaluation (Wang, 24 Dec 2025).
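A minimal sketch of the pair-level estimators (function name and synthetic data are illustrative; the total is formed as data plus prediction variance, which the law of total variance makes equivalent to the covariance form):

```python
import numpy as np

def paired_variances(a, b):
    """Paired variance decomposition for two (N x K) score matrices
    evaluated on the same questions (rows aligned)."""
    k = a.shape[1]
    mu_a, mu_b = a.mean(axis=1), b.mean(axis=1)
    # With independent samples, Var_s(A - B) = Var_s A + Var_s B per question.
    pred = (a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)).mean()
    # Variance of per-question mean differences, minus the small-K
    # bias correction pred / K, estimates the paired data variance.
    data = np.var(mu_a - mu_b, ddof=1) - pred / k
    total = data + pred
    return total, data, pred

# Two hypothetical models sharing question difficulty: pairing should
# cancel almost all data noise in the difference.
rng = np.random.default_rng(1)
mu = rng.normal(0.6, 0.2, size=4000)
a = mu[:, None] + rng.normal(0.0, 0.3, size=(4000, 6))
b = mu[:, None] + 0.05 + rng.normal(0.0, 0.3, size=(4000, 6))
total, data, pred = paired_variances(a, b)
```

Because the two synthetic models share the same question difficulties, the paired data variance comes out near zero while the prediction variance stays near $2 \times 0.09$, showing the covariance subtraction at work.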
3. Practical Workflow and Pseudocode
Given $M$ models, $N$ questions, and $K$ samples per question:
- For each model $m$, precompute question-level means $\mu_i^m$ and variances $\sigma_i^{2,m}$.
- Compute the bias-correction term $\frac{1}{K}\,\text{mean}_i(\sigma_i^{2,m})$ for each model.
- For each unordered model pair $(A, B)$:
- Compute total, data, and prediction variances as above.
- Convert variances into the standard error of the mean difference for each pair: $\mathrm{SE}_{A-B} = \sqrt{\widehat{\Var}_{x,s}[\bar A_K - \bar B_K]\,/\,N}$.
This procedure enables downstream statistical tests and confidence interval construction directly from per-model score arrays.
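The workflow above can be sketched end to end; the function name and synthetic models are illustrative, and the per-question variance of mean differences is used directly since it already equals data noise plus prediction noise over $K$:

```python
import numpy as np

def all_pairs_se(models):
    """Standard error of the mean difference for every unordered model
    pair, given a dict name -> (N x K) score matrix on shared questions."""
    names = sorted(models)
    se = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            a, b = models[name_a], models[name_b]
            n, k = a.shape
            diff_means = a.mean(axis=1) - b.mean(axis=1)
            # Per-question variance of the mean difference equals
            # data noise + prediction noise / K, so SE = sqrt(var / N).
            se[(name_a, name_b)] = np.sqrt(np.var(diff_means, ddof=1) / n)
    return se

# Three hypothetical models on 2500 shared questions, K = 4 samples each.
rng = np.random.default_rng(2)
mu = rng.normal(0.5, 0.2, size=2500)
models = {name: mu[:, None] + rng.normal(0.0, 0.3, size=(2500, 4))
          for name in ("m1", "m2", "m3")}
se = all_pairs_se(models)
```

Each of the three pairs gets its own standard error, ready for downstream $z$-tests and confidence intervals.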
4. Significance Testing and Impact of Averaging
For two models $A, B$, the difference of means over questions is approximately Normally distributed (CLT), permitting $z$-tests: $z = (\bar A - \bar B)\,/\,\mathrm{SE}_{A-B}$, compared against standard Normal critical values.
A key result is how averaging samples per question impacts noise:
$\Var_{x, s}[\bar A_K - \bar B_K] = \Var_x[\E_s(A - B)] + \frac{1}{K}\,\E_x[\Var_s(A - B)]$
This structure implies that increasing $K$ reduces prediction noise by a factor of $1/K$, shrinking the standard error and thus increasing statistical power, while data noise remains unaffected. When paired prediction noise is the dominant term, even modest increases in $K$ can yield substantial gains in sensitivity. This directly impacts the minimum detectable effect size for a given $N$ (Wang, 24 Dec 2025).
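A small deterministic illustration of this effect (the noise levels 0.02 and 0.10 are assumed, not from the source):

```python
from math import sqrt

# Illustrative (assumed) noise levels for a model pair: paired data
# noise 0.02, paired prediction noise 0.10, evaluated on N = 500 questions.
DATA_NOISE, PRED_NOISE, N = 0.02, 0.10, 500

def se_mean_diff(k):
    """SE of the mean difference when K samples are averaged per question:
    Var[Abar_K - Bbar_K] = data + pred / K, divided by N for the mean."""
    return sqrt((DATA_NOISE + PRED_NOISE / k) / N)

def z_stat(delta, k):
    """z statistic for an observed mean difference delta."""
    return delta / se_mean_diff(k)

# Raising K shrinks only the prediction term: K = 1 -> 5 cuts the
# per-question variance from 0.12 to 0.04, boosting z for the same delta.
z1, z5 = z_stat(0.01, 1), z_stat(0.01, 5)
```

Here the standard error falls from about 0.0155 at $K = 1$ to about 0.0089 at $K = 5$, while the data-noise floor $\sqrt{0.02/500}$ caps any further gain from averaging.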
5. Power Analysis and Sample Size Calculation
To plan evaluations capable of detecting a target mean difference $\delta$ with significance level $\alpha$ and power $1-\beta$, the required number of questions is computed as:
$N = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma_{\mathrm{tot}}^2}{\delta^2}$
with $\sigma_{\mathrm{tot}}^2 = \Var_x[\E_s(A-B)] + \frac{1}{K}\,\E_x[\Var_s(A-B)]$. Example parameterizations confirm substantial reductions in $N$ when prediction noise is controlled via averaging (e.g., roughly a fivefold reduction in $N$ when increasing $K$ from $1$ to $5$ given typical noise ratios) (Wang, 24 Dec 2025).
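The sample-size formula is easy to implement with standard Normal quantiles; the function name and the 0.01/0.20 noise ratio below are assumed for illustration:

```python
from math import ceil
from statistics import NormalDist

def required_questions(delta, data_noise, pred_noise, k=1,
                       alpha=0.05, power=0.8):
    """Questions needed to detect mean difference delta with a two-sided
    level-alpha z-test at the given power (standard power formula)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.8
    sigma_tot_sq = data_noise + pred_noise / k  # paired total variance
    return ceil((z_a + z_b) ** 2 * sigma_tot_sq / delta ** 2)

# Assumed prediction-dominated noise: data 0.01, prediction 0.20.
n_k1 = required_questions(delta=0.02, data_noise=0.01, pred_noise=0.20, k=1)
n_k5 = required_questions(delta=0.02, data_noise=0.01, pred_noise=0.20, k=5)
```

With prediction noise dominating, moving from $K = 1$ to $K = 5$ cuts the required question count by slightly more than a factor of four, matching the fivefold-reduction regime described above.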
6. Methodological Considerations and Best Practices
- Pairing is always advantageous: When comparing models on the same questions, paired analysis leverages shared data, reducing variance via covariance subtraction.
- Multiple samples per question ($K > 1$) are critical: direct estimation and subsequent reduction of prediction noise increases statistical power.
- Bias correction for small $K$: omitting the correction systematically overestimates data noise; it is essential for a valid decomposition.
- Procedure/metric dependence: Averaging may alter intended evaluation (e.g., majority-vote vs. mean accuracy); practitioners must exactly match measurement to evaluation protocol.
- Reporting: Always report prediction and data noise separately for transparent analysis, as this informs experiment design (e.g., whether more questions or more samples per question will be more effective in reducing standard error).
- Exact inference for small $N$: with few questions, nonparametric methods (bootstrap, sign test) may be preferable to Normal approximations.
A rule-of-thumb for binary metrics with mean score $p$ is $\Var_{x, s}[A - B] \approx p(1-p)$, justifying quick error-bar estimation when custom paired analysis is unavailable. These practices ensure that model comparisons and leaderboards report well-calibrated uncertainty estimates even in complex, stochastic settings.
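The rule of thumb reduces to a one-liner; the function name is illustrative:

```python
from math import sqrt

def rule_of_thumb_se(p, n):
    """Quick error bar for a binary metric with mean score p: treat the
    paired difference variance as roughly p * (1 - p) and divide by N."""
    return sqrt(p * (1 - p) / n)

# e.g. accuracy around 0.7 on 1000 questions
se = rule_of_thumb_se(0.7, 1000)
```

For accuracy near 0.7 on 1000 questions this gives an error bar of roughly 1.4 percentage points, a reasonable fallback when per-question score matrices are not available.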
7. Applications and Significance in Model Evaluation
The All-Pairs Paired Method provides a principled, reproducible statistical protocol for large-scale LLM and model evaluation. By analyzing all model pairs, fully decomposing uncertainty, and providing exact formulas for $z$-tests and power calculations, it supports the design and assessment of more sensitive, reproducible benchmarks and controlled experiments. It enables the reliable detection of small effects, diagnostic analysis of noise sources, and experiment planning, all directly from per-question output data and with clear recipes for robust significance testing (Wang, 24 Dec 2025).