
Pass@k Metrics for LLM Evaluation

Updated 1 November 2025
  • Pass@k is defined as the probability that at least one out of k independent samples is correct, making it essential for evaluating LLMs in verifiable tasks.
  • It enables practical assessment by validating multiple candidate outputs in scenarios such as code synthesis, math problem solving, and cybersecurity.
  • Recent studies optimize Pass@k using reinforcement learning and policy gradient methods while addressing challenges like sample inefficiency and model diversity.

Pass@$k$ is a performance metric widely used to evaluate LLMs in domains where solution correctness is automatically verifiable, such as code synthesis, math problem solving, and cybersecurity. It measures the probability that at least one of $k$ sampled responses from a model is correct for a given task. The metric is especially valuable where users or downstream systems can validate multiple candidate outputs, providing a more realistic gauge of practical model utility in iterative or best-of-$k$ selection regimes.

1. Formal Definition and Mathematical Properties

Pass@$k$ is defined, for a per-sample success rate $p$, as the probability that at least one of $k$ independent samples is successful:

$$\text{Pass@}k = 1 - (1-p)^k$$

For $N$ problems and $k$ solutions per problem, the empirical estimator (as used for code synthesis benchmarks) is:

$$\text{Pass@}k = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\left[\bigcup_{j=1}^k \text{solution}_{ij}\text{ passes all test cases}\right]$$

A combinatorial unbiased variant is:

$$\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

where $n$ is the number of sampled solutions for the problem and $c$ is the count of correct solutions (Dalal et al., 19 May 2025).
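A minimal sketch of the combinatorial estimator follows. It assumes that n samples were drawn for a problem, that c of them were correct, and that k is at most n. The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from any cited codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample.
    # It is 0 when fewer than k samples are incorrect, which gives an estimate of 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n, the observed per-sample success rate for that problem.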

Pass@$k$ amplifies the probability of success particularly for low $p$, making it a sensitive measure of improvements on hard problems where individual attempts are unlikely to succeed.
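A short worked expansion makes the amplification concrete. It uses only the definition $1-(1-p)^k$ with per-sample success rate $p$; the numerical values of $p$ and $k$ are illustrative choices, not figures from the cited work.

```latex
1 - (1-p)^k = kp - \binom{k}{2}p^2 + \cdots \approx kp \quad \text{when } kp \ll 1

p = 0.01,\ k = 10:\quad 1 - 0.99^{10} \approx 0.0956 \quad (\text{about } 9.6 \times p)

p = 0.9,\ k = 10:\quad 1 - 0.1^{10} \approx 1 \quad (\text{about } 1.1 \times p)
```

The relative gain from extra samples is therefore largest on problems where a single attempt rarely succeeds.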

2. Evaluation and Estimation Strategies

In practical settings, Pass@$k$ can be evaluated by sampling $k$ outputs per task and measuring the fraction of tasks for which at least one is correct. In code synthesis and other verification-heavy domains, each output is typically validated against test cases. For tasks without explicit tests, surrogate metrics (e.g., CodeScore-R (Yang et al., 2024)) have been developed to approximate Pass@$k$ alignment using embedding-based contrastive representations.
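The sampling procedure described above can be sketched as a small evaluation loop. The callables `sample` and `verify` are hypothetical stand-ins for a model call and a test harness; neither name comes from the source.

```python
from typing import Callable, Sequence


def empirical_pass_at_k(
    tasks: Sequence[str],
    sample: Callable[[str], str],
    verify: Callable[[str, str], bool],
    k: int,
) -> float:
    """Fraction of tasks for which at least one of up to k sampled outputs verifies."""
    solved = 0
    for task in tasks:
        # any() stops at the first passing candidate, so fewer than k
        # samples are drawn whenever an early one already succeeds.
        if any(verify(task, sample(task)) for _ in range(k)):
            solved += 1
    return solved / len(tasks)
```

Because each task contributes a single 0/1 outcome, this direct estimate has higher variance than one that draws more than $k$ samples per task and averages over subsets.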

Efficient estimator construction for Pass@$k$ and its gradients is critical for reliable measurement and optimization in reinforcement learning setups. Unbiased and low-variance estimators, such as those derivable from combinatorial statistics or leave-one-out baselines, have been established to ensure robust optimization and accurate tracking for arbitrary $k$ (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).

3. Pass@$k$ in Reinforcement Learning and Policy Optimization

Pass@$k$ is not only a benchmark but has also become a directly optimized objective in training highly capable LLMs. Standard RL methods, which maximize expected reward independently per sample, tend to favor “exploitation”: they improve Pass@1 at the cost of model entropy and diversity, limiting Pass@$k$ for $k > 1$ (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025).

Recent work introduces explicit Pass@$k$ policy optimization (PKPO) in reinforcement learning with verifiable rewards (RLVR), which transforms the reward structure so that the optimization target is the set-level Pass@$k$ rather than isolated sample accuracy. This allows direct maximization of the probability that at least one of $k$ attempts is correct, encouraging diversity and exploration (Walder et al., 21 May 2025). Theoretical guarantees for unbiased gradient estimation and efficient computation support robust Pass@$k$ optimization at arbitrary $k$.

In practice, annealing $k$, starting with a large $k$ (promoting exploration and diversity) and then reducing to $k = 1$ (focusing on exploitation), allows simultaneous improvement of Pass@1 and Pass@$k$ (Walder et al., 21 May 2025), with empirical results indicating superior performance on hard analytical benchmarks.
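One way to picture such an annealing scheme is a schedule that maps the training step to the value of $k$ used in the objective. The sketch below is purely illustrative: the linear shape, the default `k_start` of 8, and the function name `annealed_k` are assumptions, not the schedule used in the cited work.

```python
def annealed_k(step: int, total_steps: int, k_start: int = 8, k_end: int = 1) -> int:
    """Hypothetical linear schedule from k_start (early) down to k_end (late).

    Assumes k_start >= k_end; steps outside [0, total_steps - 1] are clamped.
    """
    if total_steps <= 1:
        return k_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    # Interpolate between the two endpoints and round to an integer k.
    return max(k_end, round(k_start + frac * (k_end - k_start)))
```

A step-wise or exponential decay would serve the same purpose; the only property the text relies on is that $k$ starts large and ends at 1.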

4. Empirical Findings and Impact on Model Diversity

Pass@$k$ is sensitive to the diversity of solutions: a model that produces a variety of plausible outputs will typically exhibit higher Pass@$k$, all else being equal. RLVR methods focused only on Pass@1 often drive models into low-entropy, repetitive regimes with high confidence in a narrow solution set, resulting in entropy collapse and stagnation of Pass@$k$ at larger $k$ (Liang et al., 19 Aug 2025).
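A toy calculation illustrates why two models with the same Pass@1 can diverge at larger $k$. Both models below are hypothetical, and samples are assumed independent with a fixed per-problem success rate; the numbers are chosen for illustration only.

```python
def pass_at_k_from_rates(rates: list[float], k: int) -> float:
    """Mean of 1 - (1 - p)^k over per-problem success rates p."""
    return sum(1.0 - (1.0 - p) ** k for p in rates) / len(rates)


# Both hypothetical models have Pass@1 = 0.3 on ten problems.
collapsed = [1.0] * 3 + [0.0] * 7  # always right on 3 problems, never on the rest
diverse = [0.3] * 10               # a 30% chance on every problem

for k in (1, 4, 16):
    print(k, round(pass_at_k_from_rates(collapsed, k), 3),
          round(pass_at_k_from_rates(diverse, k), 3))
```

The collapsed model stays at 0.3 for every $k$, while the diverse model's Pass@$k$ keeps rising, which is the stagnation pattern attributed to entropy collapse.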

To address entropy collapse, approaches such as Self-play with Variational Synthesis (SvS) augment training data with diverse, semantically equivalent problem variants, sustaining diversity and yielding substantial Pass@$k$ improvements for large $k$ (absolute gains of 18–23% on difficult math benchmarks) (Liang et al., 19 Aug 2025). Analogous methods, such as SimKO (Peng et al., 16 Oct 2025), asymmetrically smooth token distributions and apply targeted penalization to prevent over-concentration, further improving Pass@$k$ without sacrificing Pass@1.

5. Inference, Ranking, and Coverage Considerations

During model inference, Pass@$k$ interacts with the design of candidate selection strategies. Standard practices such as majority voting and Best-of-N are theoretically suboptimal for scaling accuracy with $k$ and computational budget $N$, particularly under reward model noise or low reference policy coverage. The Best-of-Majority (BoM) algorithm achieves minimax-optimal regret scaling in Pass@$k$ inference with provable guarantees, combining frequency-based filtering with reward model ranking (Di et al., 3 Oct 2025).

The coverage coefficient, defined in terms of the reference policy's probability of producing the optimal response, determines inference difficulty: models with poor coverage require more samples to realize gains in Pass@$k$.

For usability, direct Pass@$k$-maximizing ranking objectives such as Top Pass have shown substantial gains over binary classification rankers (e.g., CodeRanker), particularly for top-1 and top-5 code selection in code generation tasks (Lyu et al., 2024).

6. Limitations, Extensions, and Statistical Pitfalls

Pass@$k$ faces several practical and theoretical limitations. In discrete answer spaces (e.g., math domains with small output sets), Pass@$k$ degenerates for large $k$: eventually all tasks are “solved” by brute-force sampling, obscuring genuine reasoning performance and inflating reasoning boundaries (Dragoi et al., 9 Oct 2025). The metric is also sample-inefficient for model comparison and ranking, with high variance and instability for limited sample counts (Hariri et al., 5 Oct 2025).
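To see how quickly this degeneration sets in, consider a hypothetical guesser that picks uniformly at random among a fixed number of candidate answers, exactly one of which is correct. The answer-set size of 10 and the function name `guess_pass_at_k` are assumptions made for illustration.

```python
def guess_pass_at_k(num_answers: int, k: int) -> float:
    """Pass@k of a uniform random guesser over num_answers candidate answers."""
    # Each independent guess is correct with probability 1 / num_answers.
    return 1.0 - (1.0 - 1.0 / num_answers) ** k


# With 10 possible answers, blind guessing approaches a perfect score as k grows.
for k in (1, 8, 64, 256):
    print(k, round(guess_pass_at_k(10, k), 4))
```

A model with no reasoning ability at all would thus appear to solve nearly every such problem once $k$ is a few multiples of the answer-set size.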

Bayesian evaluation frameworks, which report a posterior mean and credible intervals under a Dirichlet prior, yield more stable, transparent, and statistically interpretable comparisons, and are recommended as alternatives for both binary and graded rubrics (Hariri et al., 5 Oct 2025). For reliability assessment, Cover@$\tau$, the fraction of problems solved reliably at a threshold $\tau$, provides a finer-grained, reliability-aware perspective than Pass@$k$, disambiguating lucky guesses from robust reasoning (Dragoi et al., 9 Oct 2025).
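For binary correct/incorrect outcomes, a Dirichlet prior reduces to a Beta prior, which allows a compact illustration of the posterior-mean-plus-interval idea. The sketch below is a simplification under stated assumptions and not the cited framework: it uses a uniform Beta(1, 1) prior by default, estimates the interval by Monte Carlo sampling, and the name `beta_posterior_summary` is hypothetical.

```python
import random


def beta_posterior_summary(c, n, alpha=1.0, beta=1.0, level=0.95, draws=20000, seed=0):
    """Posterior mean and equal-tailed credible interval for a success rate.

    c successes out of n trials, with a Beta(alpha, beta) prior (assumed uniform here).
    """
    a_post, b_post = alpha + c, beta + (n - c)
    mean = a_post / (a_post + b_post)  # closed-form posterior mean
    # Approximate the interval from sorted posterior draws (seeded for repeatability).
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a_post, b_post) for _ in range(draws))
    lo_idx = int((1.0 - level) / 2.0 * draws)
    hi_idx = int((1.0 + level) / 2.0 * draws) - 1
    return mean, (samples[lo_idx], samples[hi_idx])
```

With 3 successes in 10 trials and the uniform prior, the posterior mean is 4/12, and the interval width makes explicit how little 10 samples pin down the underlying rate.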

Statistical estimation of Pass@$k$ scaling under computational constraints is nontrivial. Standard regression models and discretized beta-binomial methods exhibit bias, especially when predicting rare-event scaling. Likelihood-based beta-binomial estimation frameworks with dynamic problem-level sampling enable accurate extrapolation of Pass@$k$ for large $k$ or rare risks with minimal overhead (Kazdan et al., 6 Oct 2025).
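If per-problem success rates are modeled as draws from a Beta(a, b) distribution, as beta-binomial approaches assume, the expected Pass@$k$ has a closed form: $1 - B(a, b+k)/B(a, b)$, where $B$ is the beta function. The sketch below evaluates that expression for given parameters; the fitting step itself (the likelihood-based estimation in the cited work) is not shown, and the parameter values in any call are hypothetical.

```python
from math import exp, lgamma


def expected_pass_at_k(a: float, b: float, k: int) -> float:
    """Expected Pass@k when per-problem success rates follow Beta(a, b)."""
    # E[(1 - p)^k] = B(a, b + k) / B(a, b), evaluated in log space for stability.
    log_fail = lgamma(b + k) + lgamma(a + b) - lgamma(b) - lgamma(a + b + k)
    return 1.0 - exp(log_fail)
```

For $k = 1$ the expression reduces to $a / (a + b)$, the mean success rate, and under a uniform Beta(1, 1) it gives $k / (k + 1)$, which is a convenient sanity check before extrapolating to large $k$.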

7. Future Directions and Theoretical Generalization

Advantage shaping and surrogate reward maximization approaches unify Pass@$k$ policy optimization: both direct REINFORCE-style and implicit advantage-reweighting methods fundamentally correspond to maximizing smooth nonlinear functions of the Pass@$k$ reward, potentially regularized to emphasize hard examples or support exploration (Thrampoulidis et al., 27 Oct 2025). Theoretical recipes for constructing new policy gradient schemes follow from the choice of surrogate and regularization function.

Pass@$k$ remains central for both evaluation and optimization in domains with verifiable correctness, but must be complemented by reliability-controlled, robust, and variance-aware methodologies as models and application settings scale. Methodological innovations, such as those described above, are likely to remain relevant as model capabilities advance and the diversity of outputs becomes both a core capability and a challenge for real-world deployment scenarios.
