Pass@$k$ Failure Rate in LLM Evaluation
- Pass@$k$ failure rate is a key metric that quantifies the probability that all $k$ model attempts fail, reflecting both model coverage and reliability.
- The metric is mathematically defined to show exponential decay with increased sampling and serves as a diagnostic for exploration collapse in reinforcement learning.
- Mitigation methods like SimKO empirically reduce failure rates across $k$, with the largest relative gains at large $k$, enhancing output diversity and reliability in tasks such as code generation.
The Pass@$k$ failure rate is a key metric for evaluating the probabilistic success or failure of machine learning models, especially LLMs, in tasks such as code generation, mathematical reasoning, and logic where the outcome is verifiable. The metric quantifies the likelihood that none of $k$ independent model-generated solutions to a given task are correct, encoding both the coverage and reliability of a model under a finite sampling budget.
1. Formal Definition and Mathematical Formulation
Given a model with an (unknown) per-trial success probability $p_i$ for each of $N$ tasks, the Pass@$k$ metric is defined as the expected fraction of problems solved by obtaining at least one correct solution in $k$ independent samples:

$$\text{Pass@}k = \frac{1}{N}\sum_{i=1}^{N}\left[1 - (1 - p_i)^k\right]$$

The Pass@$k$ failure rate, denoted Fail@$k$, is the complementary probability that all $k$ samples fail:

$$\text{Fail@}k = 1 - \text{Pass@}k = \frac{1}{N}\sum_{i=1}^{N}(1 - p_i)^k$$

In code/NLG settings with $n$ sampled candidates of which $c$ are correct, for any $k \le n$,

$$\text{Fail@}k = \binom{n-c}{k}\bigg/\binom{n}{k}$$

representing the probability that none of the $k$ sampled programs are correct (Lyu et al., 2024).
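The binomial ratio above is the unbiased estimator familiar from HumanEval-style evaluation code, computed in its numerically stable product form. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def fail_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Fail@k = C(n-c, k) / C(n, k).

    n: total candidates sampled per task, c: number correct, k: budget.
    Uses the numerically stable product form of the HumanEval estimator.
    """
    if n - c < k:
        # Fewer than k incorrect candidates: every size-k draw hits a correct one.
        return 0.0
    return float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 candidates, 12 correct, budget k = 10.
print(fail_at_k(200, 12, 10))  # ~0.53, i.e., pass@10 ~ 0.47
```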
2. Statistical Properties and Limiting Behavior
For $p_i > 0$, the per-task failure probability $(1 - p_i)^k$ decays exponentially as $k$ increases, i.e., $\text{Fail@}k \to 0$ as $k \to \infty$. Consequently, for any task where the model has a nonzero probability of success, $\text{Pass@}k \to 1$ and $\text{Fail@}k \to 0$ in the large-sample limit, regardless of how small $p_i$ is. On tasks with discrete or small-support output spaces, this implies that performance metrics at large $k$ may reflect only the existence of nonzero probability mass on the correct answer rather than substantive reasoning ability. This effect can inflate apparent reasoning boundaries and mask brittle model behavior (Dragoi et al., 9 Oct 2025).
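To make the decay concrete, even a near-guessing per-trial rate of $p = 0.01$ is eventually "solved" at large $k$, as this two-line check shows:

```python
# Fail@k = (1 - p)^k for a single task with per-trial success rate p.
p = 0.01
for k in (1, 10, 100, 1000):
    print(k, (1.0 - p) ** k)
# 1     0.99
# 10    0.904...
# 100   0.366...
# 1000  0.0000432...
```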
3. Diagnostic Role and Exploration Collapse
The Pass@$k$ failure rate serves as a diagnostic tool for latent coverage and diversity in the model's output distribution. However, analysis shows that optimizing Pass@$k$ directly as a reinforcement learning (RL) objective provides no additional exploration signal over Pass@$1$, because its gradient is collinear with that of Pass@$1$:

$$\nabla_\theta\,\text{Pass@}k = w_k\,\nabla_\theta\,\text{Pass@}1$$

where $w_k = k(1 - p_\theta)^{k-1} \ge 0$ (Yu, 20 Nov 2025). In low-success regimes ($p_\theta \approx 0$), the sampled gradient signal vanishes because successes are almost never observed, while in high-success regimes ($p_\theta \to 1$), $w_k \to 0$. Also, RL policies tend to concentrate probability mass on a single mode, causing multi-sample gains to collapse ($\text{Pass@}k \approx \text{Pass@}1$ up to a small residual error $\epsilon$), so that further draws add little coverage. Thus, multi-sample utility diminishes under exploitation-biased training and exploration collapse.
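The collinearity claim is easy to verify numerically. The sketch below is a toy check assuming a hypothetical one-parameter policy $p_\theta = \sigma(\theta)$; it confirms $\nabla_\theta\,\text{Pass@}k = w_k\,\nabla_\theta\,\text{Pass@}1$ by finite differences and shows $w_k$ collapsing as $p \to 1$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta, k, eps = 0.3, 8, 1e-6
p = sigmoid(theta)

pass_k = lambda t: 1.0 - (1.0 - sigmoid(t)) ** k
grad_pass_k = (pass_k(theta + eps) - pass_k(theta - eps)) / (2 * eps)
grad_pass_1 = (sigmoid(theta + eps) - sigmoid(theta - eps)) / (2 * eps)

w_k = k * (1.0 - p) ** (k - 1)
print(grad_pass_k, w_k * grad_pass_1)   # agree up to finite-difference error

# w_k -> 0 as p -> 1: Pass@k supplies only a rescaled Pass@1 gradient.
for p_ in (0.1, 0.5, 0.9, 0.99):
    print(p_, k * (1.0 - p_) ** (k - 1))
```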
4. Impact of Training Dynamics and Mitigation Methods
Standard RLVR methods such as GRPO and PPO increase Pass@$1$ but often do so at the expense of higher failure rates at large $k$, owing to over-concentration of probability on the top-1 candidate and suppression of alternative modes:
- Token-level metrics show the top-1 token probability being driven toward $1$ (dominant token) while the remaining candidates' probabilities are driven toward $0$ (all alternative probability mass squeezed out) (Peng et al., 16 Oct 2025).
- Stronger concentration yields higher Fail@$k$ at large $k$, as illustrated in the toy sketch below.
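A toy calculation (the numbers are illustrative, not drawn from the cited papers) makes the trade-off visible: a concentrated policy wins at $k = 1$ but leaves a hard core of unsolved tasks at large $k$, while a more diverse policy loses at $k = 1$ and dominates as $k$ grows:

```python
import numpy as np

def mean_fail_at_k(per_task_p, ks):
    # Fail@k averaged over tasks, given a per-trial success probability per task.
    return [float(np.mean((1.0 - per_task_p) ** k)) for k in ks]

# 100 tasks: the policy's dominant mode is correct on 60 and wrong on 40.
concentrated = np.array([0.97] * 60 + [0.01] * 40)  # top-1 hoovers up the mass
diverse      = np.array([0.40] * 60 + [0.25] * 40)  # mass spread over modes

ks = (1, 8, 64, 256)
print(mean_fail_at_k(concentrated, ks))  # fail@1 ~ 0.41, fail@256 ~ 0.03
print(mean_fail_at_k(diverse, ks))       # fail@1 ~ 0.66, fail@256 ~ 0.00
```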
Mitigation strategies such as SimKO introduce asymmetric updates: for correct responses, the positive gradient is shared among the top-$K$ candidate tokens; for incorrect responses, a stronger penalty is applied to the top-1 token, but only at high-entropy ("forking") steps. This reduces fail@$k$ by maintaining greater posterior support over multiple valid continuations (a toy sketch of this mechanism follows the table below). Empirical data (Qwen2.5-Math-7B) show, for $k = 8$:
- GRPO: fail@8 = 41.9%
- SimKO: fail@8 = 38.6% (a reduction of 3.3 percentage points, or 7.9% relative)

Similar benefits are reported at larger $k$ and on other benchmarks (Peng et al., 16 Oct 2025):
| $k$ | GRPO fail@$k$ | SimKO fail@$k$ | Reduction (pp / relative) |
|---|---|---|---|
| 1 | 58.3% | 56.6% | 1.7 pp / 2.9% |
| 8 | 41.9% | 38.6% | 3.3 pp / 7.9% |
| 256 | 23.9% | 19.5% | 4.4 pp / 18.4% |
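A toy NumPy sketch of the asymmetric update rule as described above (an illustration of the stated mechanism only, not the released SimKO implementation; all hyperparameters are placeholders):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def simko_style_update(logits, correct, K=2, lr=0.1, ent_thresh=1.5):
    """Asymmetric token-level update (toy illustration).

    correct response:   positive credit is shared among the top-K tokens;
    incorrect response: the top-1 token is penalized, but only at
                        high-entropy ("forking") steps.
    """
    probs = softmax(logits)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    top = np.argsort(logits)[::-1]           # token indices, highest logit first
    if correct:
        logits[top[:K]] += lr / K            # spread credit over top-K candidates
    elif entropy > ent_thresh:               # penalize only at forking steps
        logits[top[0]] -= lr                 # stronger penalty on the top-1 token
    return logits

print(simko_style_update(np.array([2.0, 1.5, 0.3, 0.1]), correct=True))
```

The asymmetry is the point: crediting the top-$K$ candidates rather than only the argmax keeps probability mass on alternative valid continuations, which is what lowers fail@$k$ at large $k$.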
5. Alternative Breadth-Depth Metrics: Cover@$\tau$ and Failure@$\tau$
Pass@$k$ and its failure rate can be interpreted as a Beta$(1, k)$-weighted average over the "breadth-depth" curve Cover@$\tau$, which measures the fraction of tasks for which the per-trial success rate meets or exceeds a reliability threshold $\tau$:

$$\text{Cover@}\tau = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[p_i \ge \tau\right]$$

Accordingly,

$$\text{Pass@}k = \int_0^1 k(1 - \tau)^{k-1}\,\text{Cover@}\tau\;d\tau$$
Large-$k$ Pass@$k$ places nearly all of this weight near $\tau = 0$, so the resulting curves overestimate model reasoning boundaries by counting essentially any nonzero chance of success. By contrast, plotting the complementary Failure@$\tau = 1 - \text{Cover@}\tau$ at high $\tau$ reveals tasks on which model coverage collapses as the threshold tightens. This distinguishes genuine reasoning skill (consistent reliability) from random guessing. Empirical studies (OMEGA, Reasoning Gym) show that base models may outperform exploration-regularized models only at large $k$ (the random-guessing regime), while the latter maintain superior coverage at high-reliability thresholds (Dragoi et al., 9 Oct 2025).
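The integral identity can be checked numerically. The sketch below uses synthetic per-task success rates (the Beta(0.5, 3) draws are arbitrary test data) and compares the direct Pass@$k$ computation with the Beta$(1, k)$-weighted average of Cover@$\tau$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.5, 3.0, size=1000)    # synthetic per-task success rates
k = 64

# Direct computation.
pass_k = np.mean(1.0 - (1.0 - p) ** k)

# Pass@k = integral over tau of k (1 - tau)^(k-1) * Cover@tau.
taus = np.linspace(0.0, 1.0, 20001)
cover = (p[None, :] >= taus[:, None]).mean(axis=1)   # Cover@tau on a grid
weight = k * (1.0 - taus) ** (k - 1)                 # Beta(1, k) density
f = weight * cover
pass_k_via_cover = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(taus))  # trapezoid rule

print(pass_k, pass_k_via_cover)   # agree up to integration error
```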
6. Applications in Code Generation and Ranking
Fail@$k$ has practical implications in scenarios such as code generation, where the objective is to maximize the likelihood that a user identifies a correct solution within their top-$k$ reviewed candidates. Methods like Top Pass optimize a pass@$k$-based loss directly through neural ranking, systematically minimizing Fail@$k$ at small $k$ (e.g., $k = 1$–$10$), a regime aligned with real-world user tolerance for reviewing multiple code suggestions. For instance, on CodeContests ($n = 200$, DeepSeek-Coder), Top Pass reduces the one-shot failure rate from 94.8% (random candidate order) to 90.3%, and consistently delivers higher coverage across $k$ (Lyu et al., 2024), as shown in the table and the evaluation sketch below.
| Method | pass@1 | fail@1 |
|---|---|---|
| Random order (DeepSeek-Coder) | 5.2% | 94.8% |
| Top Pass | 9.7% | 90.3% |
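For evaluation (as opposed to the Top Pass training objective itself, which is not reproduced here), computing fail@$k$ for a ranker reduces to asking whether any of the top-$k$ ranked candidates is correct. A minimal sketch with hypothetical labels:

```python
import numpy as np

def fail_at_k_ranked(correct_by_rank: np.ndarray, k: int) -> float:
    """Fraction of problems whose top-k ranked candidates are all incorrect.

    correct_by_rank: (num_problems, num_candidates) boolean array;
    column j holds the correctness of the rank-j candidate.
    """
    return float(np.mean(~correct_by_rank[:, :k].any(axis=1)))

# Hypothetical example: 4 problems, 5 ranked candidates each.
labels = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],   # unsolvable within the candidate pool
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
], dtype=bool)
for k in (1, 3, 5):
    print(k, fail_at_k_ranked(labels, k))   # 0.75, 0.25, 0.25
```

A better ranker pushes correct candidates toward low ranks, which lowers fail@$k$ precisely in the small-$k$ regime users care about.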
7. Best Practices and Interpretive Guidance
Empirical and theoretical analyses warn against reporting Pass@$k$ or Fail@$k$ in isolation, especially at large sampling budgets where degeneracy and random guessing can distort true model capabilities. Recommended best practices include:
- Reporting Pass@$k$ and Fail@$k$ together with the full breadth-depth curve via Cover@$\tau$/Failure@$\tau$.
- Aggregating coverage at multiple $\tau$ values (e.g., $\tau = 0.2$ and $\tau = 0.8$) or via the area under the Cover@$\tau$ curve (AUC) to robustly assess reasoning reliability (Dragoi et al., 9 Oct 2025).
- Avoiding the use of Pass@$k$ as the sole RL optimization target; instead, employing objectives or regularizers that explicitly promote solution diversity and healthy exploration (e.g., SimKO, PKPO, KL-regularization) (Yu, 20 Nov 2025; Peng et al., 16 Oct 2025).
Taken together, these practices yield a more rigorous and nuanced characterization of both "breadth" (fraction of problems ever solved) and "depth" (fraction consistently solved) in model evaluation, and mitigate overestimation due to chance success.
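As a closing illustration, a minimal reporting sketch that follows these practices, assuming per-task success-rate estimates `p` are available (function name and thresholds are illustrative):

```python
import numpy as np

def evaluation_report(p, ks=(1, 8, 64), taus=(0.2, 0.8)):
    """Report Pass@k/Fail@k alongside the breadth-depth (Cover@tau) view."""
    p = np.asarray(p)
    report = {f"pass@{k}": float(np.mean(1 - (1 - p) ** k)) for k in ks}
    report.update({f"fail@{k}": 1.0 - report[f"pass@{k}"] for k in ks})
    report.update({f"cover@{t}": float(np.mean(p >= t)) for t in taus})
    # Area under the Cover@tau curve; this equals the mean per-task success rate.
    grid = np.linspace(0.0, 1.0, 1001)
    cover = (p[None, :] >= grid[:, None]).mean(axis=1)
    report["cover_auc"] = float(np.sum(0.5 * (cover[1:] + cover[:-1]) * np.diff(grid)))
    return report

print(evaluation_report(np.random.default_rng(1).beta(0.5, 3.0, size=500)))
```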