Pass@$k$ Inference Scaling

Updated 12 November 2025
  • Pass@$k$ is the probability that at least one of $k$ generated completions is correct; inference scaling studies how this improves as $k$ increases, revealing diminishing returns.
  • It employs statistical models like the Beta-Binomial framework to understand power-law decay and forecast performance improvements on reasoning benchmarks.
  • Practical strategies such as dynamic sampling, entropy-aware generation, and minimax-optimal selection are recommended to balance compute efficiency with reliability.

Pass@$k$ inference scaling refers to the empirical and theoretical study of how predictive accuracy improves as the number of independent model-generated solutions per problem ($k$) is increased at inference time. The pass@$k$ metric, closely associated with "coverage," measures the probability that at least one out of $k$ generated completions is correct for a given task instance. Pass@$k$ scaling is central to reasoning benchmarks, generative code and math evaluation, and analysis of reliability under computational constraints. A comprehensive understanding of this phenomenon integrates statistical theory, algorithmic strategies, limitations, and practical guidance, culminating in a sharp characterization of the trade-offs and pitfalls underlying inference scaling for LLMs and other generative systems.

1. Mathematical Formulation of Pass@$k$ and Scaling Behavior

The core definition for pass@$k$ on a given input $x$ is:

$$\mathrm{pass}@k(x) = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

where $n$ is the number of candidate generations and $c$ the number of correct samples among them (as per a reference solution or automated checker). For repeated sampling with replacement and independent per-sample accuracy $p(x)$, the more familiar form is:

$$\mathrm{pass}@k(x) = 1 - \bigl(1 - p(x)\bigr)^k$$

Averaged over a dataset $\mathcal{D}$, the empirical estimator is:

$$\widehat{\mathrm{pass}@k} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \left[ 1 - \frac{\binom{n - c(x)}{k}}{\binom{n}{k}} \right]$$

This functional form yields a monotonic, concave curve in $k$, exhibiting diminishing returns as $k$ increases; specifically, the marginal gain is determined by the tail of the per-problem accuracy distribution $p(x)$. As $k \to \infty$, for discrete answer spaces, $\mathrm{pass}@k$ converges to the fraction of tasks with nonzero single-trial accuracy, regardless of how small those probabilities are (Dragoi et al., 9 Oct 2025).
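The combinatorial estimator above is conventionally computed as a running product rather than a ratio of binomial coefficients, which stays numerically stable for large $n$. A minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: candidate generations, c: correct among them, k: inference budget.
    Expanding the ratio of binomials as a product of k-free factors
    avoids forming huge binomial coefficients.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any k-draw must hit a success
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

# Diminishing returns: pass@1 equals c/n, but marginal gains shrink with k.
for k in (1, 5, 10, 25):
    print(k, pass_at_k(100, 10, k))
```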

To capture the scaling law more generally, models such as the Beta prior framework posit a distribution $p \sim \mathrm{Beta}(\alpha, \beta)$ over per-task accuracies. The expected failure rate after $k$ samples then decays as a power law:

$$\mathbb{E}\left[(1 - p)^k\right] = \frac{B(\alpha, \beta + k)}{B(\alpha, \beta)} \sim k^{-\alpha}$$

where the exponent $\alpha$ is fitted empirically and controls the asymptotic decay rate (Levi, 2024).
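The Beta expectation has a closed form that can be evaluated with log-gamma functions; the sketch below (with illustrative $\alpha, \beta$ values, not fitted to any real model) shows the $k^{-\alpha}$ decay numerically:

```python
import math

def log_beta(a: float, b: float) -> float:
    """log B(a, b) computed via the log-gamma function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def expected_failure(k: int, alpha: float, beta: float) -> float:
    """E_{p ~ Beta(alpha, beta)}[(1 - p)^k] = B(alpha, beta + k) / B(alpha, beta)."""
    return math.exp(log_beta(alpha, beta + k) - log_beta(alpha, beta))

# For large k this behaves like k^(-alpha): multiplying k by 100
# divides the failure rate by roughly 100^alpha (here, by ~10).
for k in (100, 10_000):
    print(k, expected_failure(k, alpha=0.5, beta=2.0))
```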

2. Statistical Theory and Robust Prediction of Pass@$k$ at Scale

Forecasting pass@$k$ at large $k$ from limited samples presents significant statistical challenges. Naive power-law fits or direct extrapolation are biased due to deterministic sample dependence and heteroskedasticity. The recommended approach is to fit a Beta-Binomial model to the observed per-task counts $(c_i, n_i)$ of successes and trials:

$$c_i \sim \mathrm{BetaBinomial}(n_i, \alpha, \beta)$$

with maximum-likelihood estimation for $(\alpha, \beta)$. The predictive pass@$k$ is then computed as:

$$\widehat{\mathrm{pass}@k} = 1 - \frac{B(\alpha, \beta + k)}{B(\alpha, \beta)}$$

Bootstrap resampling yields finite-sample confidence intervals. Dynamic sampling (allocating more attempts to the hardest problems) further reduces estimation variance (Kazdan et al., 6 Oct 2025).
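A minimal stdlib sketch of this fit, using a crude grid search in place of a proper MLE optimizer; the grid resolution and the example data are illustrative, not from the cited work:

```python
import itertools
import math

def log_beta(a: float, b: float) -> float:
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def loglik(a: float, b: float, data) -> float:
    """Beta-Binomial log-likelihood over (successes c, trials n) pairs,
    dropping the constant log C(n, c) term (it does not affect the argmax)."""
    return sum(log_beta(a + c, b + n - c) - log_beta(a, b) for c, n in data)

def fit_and_predict(data, k: int) -> float:
    """Grid-search MLE for (alpha, beta), then the predictive pass@k."""
    grid = [2.0 ** e for e in range(-6, 7)]
    a, b = max(itertools.product(grid, grid), key=lambda ab: loglik(*ab, data))
    return 1.0 - math.exp(log_beta(a, b + k) - log_beta(a, b))

# Four tasks, 10 attempts each; extrapolate to a budget of k = 100.
data = [(0, 10), (1, 10), (4, 10), (9, 10)]
print(fit_and_predict(data, 100))
```

In practice one would replace the grid with a continuous optimizer and wrap the whole fit in a bootstrap loop over tasks to get the confidence intervals mentioned above.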

It follows that for large $k$, pass@$k$ for models with even minuscule per-sample accuracy $p(x) > 0$ on many problems will approach 1, a degeneracy that reveals fundamental limitations of pass@$k$ as a metric of robust reasoning.
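A quick numerical illustration of this degeneracy: even a task solved with probability 0.001 per sample is "solved" almost surely once $k$ is large enough.

```python
p = 0.001  # minuscule single-trial accuracy on some task
for k in (10, 1_000, 1_000_000):
    # probability that at least one of k independent samples is correct
    print(k, 1 - (1 - p) ** k)
```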

3. Algorithmic Strategies for Inference Scaling

3.1 Reinforcement Learning with Verifiable Rewards and Pass@$k$ Optimization

Conventional RL with verifiable rewards (RLVR) and PPO-style updates incentivize high probability on the top-1 response (over-concentration), leading to improved pass@1 but degraded pass@$k$ for $k > 1$ (Peng et al., 16 Oct 2025). SimKO (Simple Pass@K Optimization) modifies the importance ratio in RLVR by applying (i) top-$K$ probability boosts to verified-correct responses at high-entropy tokens and (ii) steeper penalties to overconfident top-1 tokens in incorrect responses.

This asymmetric smoothing lifts the entire pass@$k$ curve, with especially pronounced gains at high $k$ (+4.4 percentage points on math benchmarks), and does so without sacrificing output fluency.

Pass@$k$ policy optimization (PKPO) directly targets the gradient of pass@$k$ using efficient unbiased estimators for both binary and continuous rewards, further reducing variance through refined baselines. Empirically, annealing $k$ during training enables concurrent optimization of pass@1 and pass@$k$ (Walder et al., 21 May 2025).

3.2 Efficient Inference-Time Scaling Methods

Sampling-based scaling can be optimized through several methods:

  • MatryoshkaThinking recursively combines generation, self-verification, and summarization, efficiently retaining correct subtraces while reducing token costs by over 20× relative to deep ensemble methods, with nearly identical pass@$k$ outcomes (Chen et al., 11 Oct 2025).
  • Entropy-Aware Generation (EAGer) adaptively branches at high-entropy decision points in the output sequence, allocating the sample budget dynamically, yielding equivalent or superior pass@$k$ with up to 65% fewer tokens used versus full parallel sampling (Scalena et al., 13 Oct 2025).
  • Diversified Sampling (DivSampling) draws completions from perturbed prompts or problem augmentations, increasing sample diversity; under mild conditions, the failure probability decreases strictly faster than for repeated sampling on a static prompt, especially benefiting harder tasks and higher $k$ (Wang et al., 16 Feb 2025).
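The intuition behind diversification can be seen in a toy calculation: by Jensen's inequality (the logarithm is concave), spreading a fixed budget over prompts with varying per-sample accuracy but the same mean never yields a higher failure probability than repeated sampling at that mean. The accuracies below are made up for illustration:

```python
import math

def failure_prob(ps) -> float:
    """P(all samples fail), one independent sample per success probability."""
    return math.prod(1.0 - p for p in ps)

static      = [0.25, 0.25, 0.25, 0.25]  # repeated sampling, one prompt
diversified = [0.10, 0.20, 0.30, 0.40]  # perturbed prompts, same mean 0.25

print(failure_prob(static))       # 0.75^4, about 0.316
print(failure_prob(diversified))  # 0.9 * 0.8 * 0.7 * 0.6, about 0.302
```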

Efficient algorithmic strategies also exist for batch completion: superposed decoding achieves $k$ completions in approximately constant time per token, as opposed to the baseline cost that scales linearly in $k$ (Shen et al., 2024).

3.3 Minimax-Optimal Selection

Neither majority voting (selecting the most frequent answer) nor Best-of-$k$ (selecting the top-reward answer) is scaling-monotonic or minimax-optimal for pass@$k$. Best-of-Majority (BoM), which restricts selection to high-frequency outputs before applying the reward model, is minimax-optimal and ensures that regret decays as $k$ grows while avoiding reward hacking at large $k$ (Di et al., 3 Oct 2025).
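A schematic sketch of the BoM idea (not the paper's exact algorithm; the frequency threshold `min_frac` is a hypothetical illustrative knob): frequency filtering removes rare but high-reward outliers, which are exactly the samples most likely to be reward-hacked.

```python
from collections import Counter

def best_of_majority(answers, rewards, min_frac=0.2):
    """Schematic BoM: discard low-frequency answers, then pick the
    highest-reward survivor among the remaining distinct answers."""
    counts = Counter(answers)
    keep = [a for a in counts if counts[a] >= min_frac * len(answers)]
    if not keep:  # degenerate case: nothing is frequent enough, keep all
        keep = list(counts)
    return max(keep, key=lambda a: max(
        r for ans, r in zip(answers, rewards) if ans == a))

answers = ["42", "42", "42", "7", "7", "999"]
rewards = [0.60, 0.50, 0.70, 0.40, 0.30, 0.99]
print(best_of_majority(answers, rewards))  # → "42" (outlier "999" filtered)
print(max(zip(rewards, answers))[1])       # plain best-of-k picks "999"
```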

4. Reliability, Breadth-Depth Trade-offs, and Metric Limitations

Pass@$k$ at large $k$ tends to overstate model capability by capturing "random guessing" rather than genuine reliability: for discrete answer spaces, $\mathrm{pass}@k \to 1$ as soon as $p(x) > 0$ for a given problem. To explicitly distinguish breadth (the number of reachably solvable tasks) from depth (reliability/consistency), Cover@$\tau$ is introduced:

$$\mathrm{Cover}@\tau = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathbb{1}\left[p(x) \geq \tau\right]$$

Cover@$\tau$ tracks the fraction of tasks solved at per-sample reliability at least $\tau$. Pass@$k$ is a Beta-weighted average of Cover@$\tau$ with mass concentrated at low $\tau$ as $k$ increases, making it a poor proxy for depth. Leaderboard rankings fundamentally shift when Cover metrics are used: exploration-promoting methods can achieve broader and more reliable coverage than pure exploiters (Dragoi et al., 9 Oct 2025).
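The relationship between the two metrics can be checked numerically: pass@$k$ equals Cover@$\tau$ averaged under the density $k(1-\tau)^{k-1}$ (a Beta(1, $k$) density), which piles its mass near $\tau = 0$ as $k$ grows. A small sketch with hypothetical per-task accuracies:

```python
def cover(ps, tau: float) -> float:
    """Cover@tau: fraction of tasks with per-sample accuracy above tau."""
    return sum(p > tau for p in ps) / len(ps)

def pass_at_k(ps, k: int) -> float:
    return sum(1 - (1 - p) ** k for p in ps) / len(ps)

ps = [0.0, 0.02, 0.1, 0.6, 0.9]  # hypothetical per-task accuracies
k = 50

# Riemann sum of  k * (1 - tau)^(k-1) * Cover@tau  over tau in [0, 1]
dt = 1e-4
integral = sum(k * (1 - i * dt) ** (k - 1) * cover(ps, i * dt) * dt
               for i in range(10_000))
print(pass_at_k(ps, k), integral)  # the two closely agree
```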

5. Inference Scaling in Continuous Space Reasoning

Dropout-based sampling with continuous latent trajectories (as in COCONUT) enables diverse sample generation and rising pass@$k$ with $k$. However, standard PRM and ORM reward models fail to effectively discriminate correct from incorrect continuous "thought" trajectories due to their geometric and dynamic homogeneity. Key metrics (IsoScore*, Hoyer sparsity, trajectory curvature) show minimal separation between correct and incorrect trajectories, and even small Gaussian perturbations yield only slight accuracy drops. This suggests the need for training-time inductive biases (e.g., isotropy regularization, contrastive trajectory objectives) specifically tailored for continuous latent spaces to enable effective pass@$k$ scaling and reranking (Wang et al., 14 Oct 2025).

6. False Positives and Ultimate Scaling Limits

Inference scaling methods (Best-of-$k$, self-consistency, tree search) can only increase the chance of obtaining an answer that passes automated checks; they do not mitigate the prevalence of "false positives": outputs with correct final answers but flawed reasoning. Empirically, even with large sample budgets, the gap between automated pass@$k$ and human-validated accuracy remains ≈20–30 percentage points, and the scaling exponent halves for manual accuracy versus automated. This ceiling is due to the persistence of structural reasoning errors across samples, which raw sampling cannot overcome. Thus, inference scaling is fundamentally limited by model deductive weaknesses, not just search coverage (Wang et al., 10 Feb 2025).
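This ceiling can be reproduced in a toy model: give each task one rate at which a sample passes the automated checker and a (smaller) rate at which it is genuinely correct. Tasks whose passing samples all contain flawed reasoning cap human-validated accuracy no matter how large $k$ gets. All rates below are made up for illustration:

```python
def pass_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

# (prob a sample passes the checker, prob a sample is genuinely correct)
tasks = [(0.30, 0.30), (0.20, 0.05), (0.15, 0.00)]  # last: false positives only

for k in (1, 16, 256):
    auto  = sum(pass_k(pa, k) for pa, _ in tasks) / len(tasks)
    human = sum(pass_k(pt, k) for _, pt in tasks) / len(tasks)
    print(k, round(auto, 3), round(human, 3))  # the gap persists as k grows
```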

7. Practical Recommendations and Cost Trade-offs

  • To maximize pass@$k$ per computational cost, employ dynamic or entropy-aware sampling, sample diversity methods, and minimax-optimal selection schemes (BoM).
  • For applications with reliability requirements, evaluate models on Cover@$\tau$ at application-relevant thresholds.
  • In large-scale systems, robust estimation of rare-event scaling (e.g., exploit/jailbreak rates) must use Beta-Binomial fitting and dynamic sampling to minimize risk-estimation error at large $k$, rather than pure curve extrapolation (Kazdan et al., 6 Oct 2025).
  • Beyond moderate values of $k$, sampling budgets see sharply diminishing returns, and further increases often overstate practical capability due to false positives and unreliable outputs.
  • Hardware-efficient architectures (e.g., Mamba distilled reasoners) combined with optimized selection can surpass more accurate but slower models under fixed compute or latency budgets (Paliotta et al., 27 Feb 2025).

8. Theoretical and Empirical Integration with Neural Scaling Laws

The power-law form $1 - \mathrm{pass}@k \sim k^{-\alpha}$ for inference loss mirrors neural scaling laws for model/data scaling, and can be connected directly to inference-phase resource allocation via

$$C \propto k \,(n_p + n_d)$$

where $C$ is total inference compute, $n_p$ the number of prompt tokens, and $n_d$ the number of decoded tokens per sample. This framework enables global optimization over training and inference compute investments for desired accuracy targets (Levi, 2024).
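Under a fitted power law, the sample budget needed for a target failure rate, and its compute cost, follow directly. The constants below ($B$, $\alpha$, and the standard 2-FLOPs-per-parameter-per-token approximation for a forward pass) are illustrative assumptions, not fitted values:

```python
import math

def samples_for_target(err: float, B: float = 1.0, alpha: float = 0.5) -> int:
    """Smallest k with predicted failure B * k^(-alpha) <= err."""
    return math.ceil((B / err) ** (1.0 / alpha))

def inference_flops(k: int, n_p: int, n_d: int, n_params: float) -> float:
    """Rough cost model: k independent samples, ~2 * n_params FLOPs per token,
    each sample processing n_p prompt tokens and emitting n_d decode tokens."""
    return 2.0 * n_params * k * (n_p + n_d)

k = samples_for_target(0.01, B=1.0, alpha=0.5)   # (1 / 0.01)^2 = 10000 samples
print(k, inference_flops(k, n_p=500, n_d=1500, n_params=7e9))
```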


This synthesis establishes pass@$k$ inference scaling as a multidimensional phenomenon governed by model statistics, sampling and selection strategies, metric limitations, and application-specific trade-offs, with substantial implications for both capability evaluation and responsible deployment of LLMs.
