Pass@k Metrics for LLM Evaluation

Updated 1 November 2025

Pass@k is defined as the probability that at least one out of k independent samples is correct, making it essential for evaluating LLMs in verifiable tasks.
It enables practical assessment by validating multiple candidate outputs in scenarios such as code synthesis, math problem solving, and cybersecurity.
Recent studies optimize Pass@k using reinforcement learning and policy gradient methods while addressing challenges like sample inefficiency and model diversity.

Pass@ $k$ is a performance metric extensively used for evaluating LLMs in domains where solution correctness is automatically verifiable, such as code synthesis, math problem solving, and cybersecurity. It measures the probability that at least one out of $k$ sampled responses provided by a model is correct for a given task. The metric is especially valuable for scenarios where users or downstream systems can validate multiple candidate outputs, providing a more realistic gauge of practical model utility in iterative or best-of- $k$ selection regimes.

1. Formal Definition and Mathematical Properties

Pass@ $k$ is defined, for a per-sample success rate $p$, as the probability that at least one of $k$ independent samples is successful:

$\text{Pass@}k = 1 - (1-p)^k$

For $N$ problems and $k$ solutions per problem, the empirical estimator (as used for code synthesis benchmarks) is:

$\text{Pass@}k = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\left[\bigcup_{j=1}^k \text{solution}_{ij}\text{ passes all test cases}\right]$

A combinatorial unbiased variant is:

$\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$

where $n$ is the number of sampled solutions for the problem (with $n \ge k$) and $c$ is the count of correct solutions (Dalal et al., 19 May 2025).
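A minimal sketch of the combinatorial estimator, commonly written as $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The function name `pass_at_k` and the input validation are illustrative choices, not taken from any cited work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct (needs k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Drawing more samples than $k$ (for example $n = 10$ for Pass@5) lets a single set of samples serve several values of $k$ while keeping the estimate unbiased.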

Pass@ $k$ amplifies the probability of success particularly for low $p$, making it a sensitive measure for improvements on hard problems where individual attempts are unlikely to succeed.
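The amplification effect can be checked numerically with the closed form $1 - (1-p)^k$. The sketch below uses illustrative helper names and compares the ratio of Pass@ $k$ to $p$ for a hard problem and an easy one.

```python
def pass_at_k_from_p(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def amplification(p: float, k: int) -> float:
    """Ratio of Pass@k to the single-sample success rate p (p > 0)."""
    return pass_at_k_from_p(p, k) / p


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(f"p={p:<5} Pass@10={pass_at_k_from_p(p, 10):.4f} "
              f"gain={amplification(p, 10):.2f}x")
```

For small $p$ the gain is close to $k$, while for large $p$ it saturates because Pass@ $k$ cannot exceed 1.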

2. Evaluation and Estimation Strategies

In practical settings, Pass@ $k$ can be evaluated by sampling $k$ outputs per task and measuring the fraction of tasks for which at least one is correct. In code synthesis and other verification-heavy domains, each output is typically validated against test cases. For tasks without explicit tests, surrogate metrics (e.g., CodeScore-R (Yang et al., 2024)) have been developed to approximate Pass@ $k$ alignment using embedding-based contrastive representations.
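The sampling procedure described here can be sketched as follows, assuming verification has already produced one boolean per sampled output. The data layout (a list of per-task outcome lists) and the function name are hypothetical.

```python
from typing import Sequence


def empirical_pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks for which at least one of the first k outputs passes."""
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for outcomes in results:
        if len(outcomes) < k:
            raise ValueError("every task needs at least k sampled outputs")
        # any() is True as soon as one of the k verified outputs is correct.
        solved += any(outcomes[:k])
    return solved / len(results)
```

With exactly $k$ samples per task this is the plain empirical average; when more than $k$ samples are available, the combinatorial estimator uses them more efficiently.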

Efficient estimator construction for Pass@ $k$ and its gradients is critical for reliable measurement and optimization in reinforcement learning setups. Unbiased and low-variance estimators, such as those derivable from combinatorial statistics or leave-one-out baselines, have been established to ensure robust optimization and accurate tracking for arbitrary $k$ (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
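To illustrate what unbiasedness means here, the sketch below computes the exact expectation of two estimators when the number of correct samples follows a Binomial($n$, $p$) distribution. One is the combinatorial estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, the other a naive plug-in $1 - (1 - c/n)^k$. It covers only the metric itself, not the gradient estimators of the cited works, and all function names are illustrative.

```python
from math import comb


def combinatorial_estimate(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k), the combinatorial Pass@k estimator."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def plugin_estimate(n: int, c: int, k: int) -> float:
    """Naive plug-in estimate that substitutes c/n for p."""
    return 1.0 - (1.0 - c / n) ** k


def expectation(estimator, n: int, k: int, p: float) -> float:
    """Exact expected value of an estimator when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )


if __name__ == "__main__":
    n, k, p = 20, 5, 0.1
    print("true Pass@k   :", 1 - (1 - p) ** k)
    print("combinatorial :", expectation(combinatorial_estimate, n, k, p))
    print("plug-in       :", expectation(plugin_estimate, n, k, p))
```

The combinatorial estimator matches the true value in expectation, while the plug-in version underestimates it because $(1-x)^k$ is convex in $x$.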

3. Pass@ $k$ in Reinforcement Learning and Policy Optimization

Pass@ $k$ is not only a benchmark but has become a directly optimized objective in training highly capable LLMs. Standard RL methods, when maximizing expected reward independently per sample, tend to favor “exploitation”: they improve Pass@1 at the cost of model entropy and diversity, limiting Pass@ $k$ for $k > 1$ (Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025).

Recent work introduces explicit Pass@ $k$ policy optimization (PKPO) in RLVR, which transforms the reward structure so that the optimization target is the set-level Pass@ $k$ rather than isolated sample accuracy. This allows for direct maximization of the probability that at least one candidate in $k$ attempts is correct, encouraging diversity and exploration (Walder et al., 21 May 2025). Theoretical guarantees for unbiased gradient estimation and efficient computation support robust Pass@ $k$ optimization at arbitrary $k$.
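As a toy illustration of a set-level objective, the sketch below applies the plain score-function (REINFORCE) estimator to a one-parameter Bernoulli "policy" with success rate $p = \sigma(\theta)$. Each group of $k$ samples receives the reward $\max_j r_j$. This is not the PKPO estimator, which uses lower-variance reward transformations; the policy, names, and sample sizes are assumptions made for the example.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def exact_grad(theta: float, k: int) -> float:
    """d/dtheta of 1 - (1 - p)^k with p = sigmoid(theta), dp/dtheta = p(1 - p)."""
    p = sigmoid(theta)
    return k * (1.0 - p) ** (k - 1) * p * (1.0 - p)


def reinforce_grad(theta: float, k: int, num_groups: int, seed: int = 0) -> float:
    """Monte Carlo gradient of Pass@k using a set-level reward per group."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_groups):
        rewards = [1 if rng.random() < p else 0 for _ in range(k)]
        set_reward = max(rewards)  # 1 if any sample in the group is correct
        # Score of the whole group: sum of d/dtheta log-probabilities,
        # which is (r - p) for each Bernoulli sample.
        score = sum(r - p for r in rewards)
        total += set_reward * score
    return total / num_groups
```

The Monte Carlo average approaches the analytic gradient as the number of groups grows, but the estimator is noisy, which is one motivation for the variance-reduced estimators discussed in the cited work.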

In practice, annealing $k$ (starting with large $k$ to promote exploration and diversity, then reducing to $k = 1$ to focus on exploitation) allows simultaneous improvement of Pass@1 and Pass@ $k$ (Walder et al., 21 May 2025), with empirical results indicating superior performance on hard analytical benchmarks.
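One way such an annealing schedule could look is sketched below. The linear shape, the starting value `k_start=8`, and the function name are assumptions chosen for illustration. The source only states that $k$ starts large and is reduced toward 1.

```python
def annealed_k(step: int, total_steps: int, k_start: int = 8, k_end: int = 1) -> int:
    """Linearly interpolate the optimization target k from k_start to k_end."""
    if total_steps <= 1:
        return k_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max(k_end, round(k_start + frac * (k_end - k_start)))
```

A training loop would call `annealed_k(step, total_steps)` once per update and use the returned value as the $k$ in the set-level objective.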

4. Empirical Findings and Impact on Model Diversity

Pass@ $k$ is sensitive to the diversity of solutions: a model that produces a variety of plausible outputs will typically exhibit higher Pass@ $k$, all else being equal. RLVR methods focused only on Pass@1 often drive models into low-entropy, repetitive regimes: high confidence in a narrow solution set, resulting in entropy collapse and stagnation of Pass@ $k$ at larger $k$ (Liang et al., 19 Aug 2025).
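A stylized comparison makes the stagnation concrete. It assumes two hypothetical models with the same Pass@1: a "collapsed" one that is deterministic on every problem (always right or always wrong) and a "diverse" one that succeeds half the time on each problem. The per-problem success rates are invented for illustration.

```python
from typing import Sequence


def mean_pass_at_k(success_rates: Sequence[float], k: int) -> float:
    """Average of 1 - (1 - p_i)^k over per-problem success rates p_i."""
    return sum(1.0 - (1.0 - p) ** k for p in success_rates) / len(success_rates)


# Hypothetical per-problem success rates for ten problems.
collapsed = [1.0] * 5 + [0.0] * 5  # deterministic: resampling never helps
diverse = [0.5] * 10               # stochastic: resampling can find a solution

if __name__ == "__main__":
    for k in (1, 2, 8):
        print(k, mean_pass_at_k(collapsed, k), mean_pass_at_k(diverse, k))
```

Both models score 0.5 at $k = 1$, but only the diverse one improves as $k$ grows; the collapsed model stays at 0.5 for every $k$.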

To address entropy collapse, approaches such as Self-play with Variational Synthesis (SvS) augment training data with diverse, semantically equivalent problem variants, sustaining diversity and yielding substantial Pass@ $k$ improvements for large $k$ (absolute gains of 18–23% on difficult math benchmarks) (Liang et al., 19 Aug 2025). Analogous methods, like SimKO (Peng et al., 16 Oct 2025), asymmetrically smooth token distributions and apply targeted penalization to prevent over-concentration, further improving Pass@ $k$ without sacrificing Pass@1.

5. Inference, Ranking, and Coverage Considerations

During model inference, Pass@ $k$ interacts with the design of candidate selection strategies. Standard practices such as majority voting and Best-of-N are theoretically suboptimal for scaling accuracy with $k$ and computational budget $N$, particularly under reward model noise or low reference policy coverage. The Best-of-Majority (BoM) algorithm achieves minimax-optimal regret scaling in Pass@ $k$ inference with provable guarantees, combining frequency-based filtering with reward model ranking (Di et al., 3 Oct 2025).
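The combination of frequency-based filtering and reward ranking can be sketched as below. This is a loose illustration of the idea, not the published BoM algorithm: the threshold `min_frequency`, the fallback when nothing passes the filter, and the function signature are all assumptions.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def best_of_majority_sketch(
    samples: Sequence[Hashable],
    reward_fn: Callable[[Hashable], float],
    k: int,
    min_frequency: float,
) -> List[Hashable]:
    """Keep frequently sampled answers, then return the top-k by reward."""
    counts = Counter(samples)
    n = len(samples)
    # Frequency filter: discard answers that appear too rarely among N samples.
    frequent = [y for y, cnt in counts.items() if cnt / n >= min_frequency]
    if not frequent:
        # Assumed fallback: if nothing is frequent enough, rank all answers.
        frequent = list(counts)
    # Rank surviving candidates by the (possibly noisy) reward model.
    frequent.sort(key=reward_fn, reverse=True)
    return frequent[:k]
```

The filter guards against a noisy reward model that overrates a rarely sampled answer, which pure Best-of-N selection would submit.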

The coverage coefficient, defined in terms of the reference-policy probability of the optimal response, determines inference difficulty: models with poor coverage require more samples to realize gains in Pass@ $k$.

For usability, direct Pass@ $k$-maximizing ranking objectives such as Top Pass have shown substantial gains over binary classification rankers (e.g., CodeRanker), particularly for top-1 and top-5 code selection in code generation tasks (Lyu et al., 2024).

6. Limitations, Extensions, and Statistical Pitfalls

Pass@ $k$ faces several practical and theoretical limitations. In discrete answer spaces (e.g., math domains with small output sets), Pass@ $k$ for large $k$ degenerates: eventually all tasks are “solved” by brute-force sampling, obscuring genuine reasoning performance and inflating reasoning boundaries (Dragoi et al., 9 Oct 2025). The metric is also sample-inefficient for model comparison and ranking, with high variance and instability for limited sample counts (Hariri et al., 5 Oct 2025).
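The degeneration can be seen with a guesser that does no reasoning at all. The sketch assumes answers are drawn uniformly and independently from a finite set of `num_answers` candidates, exactly one of which is correct. Both the setup and the function name are illustrative.

```python
def guess_pass_at_k(num_answers: int, k: int) -> float:
    """Pass@k of uniform random guessing (with replacement) over num_answers."""
    if num_answers < 1 or k < 1:
        raise ValueError("num_answers and k must be positive")
    return 1.0 - (1.0 - 1.0 / num_answers) ** k


if __name__ == "__main__":
    for k in (1, 8, 64, 256):
        print(k, round(guess_pass_at_k(10, k), 4))
```

With ten possible answers, random guessing already exceeds 99% Pass@64, so a high Pass@ $k$ at large $k$ says little about reasoning ability in such domains.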

Bayesian evaluation frameworks, employing posterior mean and credible intervals under a Dirichlet prior, yield more stable, transparent, and statistically interpretable comparisons, recommended as alternatives for both binary and graded rubrics (Hariri et al., 5 Oct 2025). For reliability assessment, Cover@ $\tau$, the fraction of problems solved reliably (at least a fraction $\tau$ of sampled completions correct), provides a finer-grained, reliability-aware perspective than Pass@ $k$, disambiguating lucky guesses from robust reasoning (Dragoi et al., 9 Oct 2025).
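For binary correctness, a Dirichlet prior over two outcomes reduces to a Beta prior, and the posterior has a closed form. The sketch below covers only that special case. The uniform Beta(1, 1) prior, the Monte Carlo credible interval, and the function name are assumptions, and the cited framework may differ in its details.

```python
import random
from typing import Tuple


def posterior_summary(
    n: int,
    c: int,
    alpha: float = 1.0,
    beta: float = 1.0,
    level: float = 0.95,
    draws: int = 20000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Posterior mean and equal-tailed credible interval for the success rate."""
    a, b = alpha + c, beta + (n - c)  # conjugate Beta posterior parameters
    mean = a / (a + b)
    rng = random.Random(seed)
    # Approximate the interval from sorted posterior draws (stdlib only).
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1.0 - level) / 2.0 * draws)]
    hi = samples[int((1.0 + level) / 2.0 * draws) - 1]
    return mean, lo, hi
```

Reporting the interval alongside the mean makes it visible when two models evaluated on few samples cannot actually be distinguished.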

Statistical estimation of Pass@ $k$ scaling under computational constraints is nontrivial. Standard regression models and discretized beta-binomial methods exhibit bias, especially in the prediction of rare-event scaling. Likelihood-based beta-binomial estimation frameworks with dynamic problem-level sampling enable accurate extrapolation of Pass@ $k$ for large $k$ or rare risks with minimal overhead (Kazdan et al., 6 Oct 2025).
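Once a Beta distribution over per-problem success rates has been fitted, extrapolation to large $k$ has a closed form: if $p \sim \text{Beta}(a, b)$ then $\mathbb{E}[(1-p)^k] = B(a, b+k)/B(a, b)$. The sketch below shows only this extrapolation step, with the parameters $a$ and $b$ taken as given. The fitting procedure and dynamic sampling of the cited work are not reproduced, and the function names are illustrative.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b), via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def expected_pass_at_k(a: float, b: float, k: int) -> float:
    """E[1 - (1 - p)^k] when per-problem success rates follow Beta(a, b)."""
    return 1.0 - exp(log_beta(a, b + k) - log_beta(a, b))
```

For a uniform prior ($a = b = 1$) this gives $k/(k+1)$, which serves as a quick sanity check for the formula.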

7. Future Directions and Theoretical Generalization

Advantage shaping and surrogate reward maximization approaches unify Pass@ $k$ policy optimization: both direct REINFORCE-style and implicit advantage-reweighting methods fundamentally correspond to the maximization of smooth nonlinear functions of the Pass@ $k$ reward, potentially regularized to emphasize hard examples or support exploration (Thrampoulidis et al., 27 Oct 2025). Theoretical recipes for constructing new policy gradient schemes follow from the choice of surrogate and regularization function.
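A short worked derivation shows why optimizing the Pass@ $k$ surrogate emphasizes hard examples. Writing $p_\theta$ for the per-sample success rate under a policy with parameters $\theta$ (notation introduced here for illustration), the chain rule applied to $1 - (1-p_\theta)^k$ gives:

$\nabla_\theta \left[ 1 - (1 - p_\theta)^k \right] = k \, (1 - p_\theta)^{k-1} \, \nabla_\theta p_\theta$

The factor $k (1 - p_\theta)^{k-1}$ equals 1 when $k = 1$, recovering the usual expected-reward gradient. For $k > 1$ it is largest when $p_\theta$ is small and vanishes as $p_\theta \to 1$, so problems the model already solves reliably contribute little to the update.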

Pass@ $k$ remains central for both evaluation and optimization in domains with verifiable correctness, but must be complemented by reliability-controlled, robust, and variance-aware methodologies as models and application settings scale. Methodological innovations, such as those described above, are likely to remain relevant as model capabilities advance and the diversity of outputs becomes both a core capability and a challenge for real-world deployment scenarios.