
Pass@$k$ Failure Rate in LLM Evaluation

Updated 14 January 2026
  • Pass@$k$ failure rate is a key metric that quantifies the probability that all k model attempts fail, reflecting model coverage and reliability.
  • Mathematically, the failure rate decays exponentially as the sampling budget grows, and it serves as a diagnostic for exploration collapse in reinforcement learning.
  • Mitigation methods like SimKO empirically reduce failure rates for small k values, enhancing output diversity and reliability in tasks such as code generation.

The Pass@$k$ failure rate is a key metric for evaluating the probabilistic success or failure of machine learning models, especially LLMs, on tasks such as code generation, mathematical reasoning, and logic, where outcomes are verifiable. The metric quantifies the likelihood that none of $k$ independent model-generated solutions to a given task is correct, encoding both the coverage and the reliability of a model under a finite sampling budget.

1. Formal Definition and Mathematical Formulation

Given a model with an (unknown) per-trial success probability $p_i \in [0,1]$ for each of $T$ tasks, the Pass@$k$ metric is defined as the expected fraction of problems solved by obtaining at least one correct solution in $k$ independent samples:

\mathrm{Pass}@k = \frac{1}{T} \sum_{i=1}^T \left[ 1 - (1 - p_i)^k \right].

The Pass@$k$ failure rate, denoted $\mathrm{Failure}@k$, is the complementary probability that all $k$ samples fail:

\mathrm{Failure}@k = 1 - \mathrm{Pass}@k = \frac{1}{T} \sum_{i=1}^T (1 - p_i)^k.

In code/NLG settings with $n$ sampled candidates of which $c$ are correct, for any $k \le n$,

\mathrm{fail}@k = \frac{\binom{n-c}{k}}{\binom{n}{k}},

representing the probability that none of the $k$ programs sampled (without replacement) from the $n$ candidates are correct (Lyu et al., 2024).
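This hypergeometric failure estimator can be sketched in a few lines of Python, with $n$ generated candidates of which $c$ pass the tests (an illustrative sketch, not any paper's reference implementation); the running-product form avoids huge intermediate binomial coefficients:

```python
from math import comb  # used only to cross-check the product form below

def fail_at_k(n: int, c: int, k: int) -> float:
    """fail@k = C(n - c, k) / C(n, k): probability that k candidates drawn
    without replacement from n (of which c are correct) are all incorrect."""
    if k > n - c:  # more draws than incorrect candidates: a correct one must appear
        return 0.0
    prob = 1.0
    for i in range(k):  # product form avoids huge binomial coefficients
        prob *= (n - c - i) / (n - i)
    return prob

def pass_at_k(n: int, c: int, k: int) -> float:
    return 1.0 - fail_at_k(n, c, k)

# Cross-check the product form against the direct binomial expression.
assert abs(fail_at_k(10, 2, 3) - comb(8, 3) / comb(10, 3)) < 1e-12
```

The same numerically stable product form underlies the widely used unbiased pass@$k$ estimator for code benchmarks.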

2. Statistical Properties and Limiting Behavior

For $p_i > 0$, $\mathrm{Failure}@k$ decays exponentially as $k$ increases, i.e., $(1 - p_i)^k \to 0$ as $k \to \infty$. Consequently, for any task where the model has a nonzero probability of success, $\mathrm{Pass}@k \to 1$ and $\mathrm{Failure}@k \to 0$ in the large-sample limit, regardless of how small $p_i$ is. On tasks with discrete or small-support output spaces, this implies that performance metrics at large $k$ may reflect only the existence of nonzero probability mass on the correct answer rather than substantive reasoning ability. This effect can inflate apparent reasoning boundaries and mask brittle model behavior (Dragoi et al., 9 Oct 2025).
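The geometric decay is easy to see numerically; a toy sketch for a single task with a small but nonzero per-trial success rate (the value $p = 0.02$ is chosen arbitrarily):

```python
# Illustrative decay of Failure@k = (1 - p)^k for one task with a small
# but nonzero per-trial success rate p (arbitrary value for illustration).
p = 0.02
for k in (1, 8, 64, 512):
    print(f"k={k:4d}  Failure@k={(1 - p) ** k:.6f}")
```

Even a model that is almost always wrong per trial drives its failure rate toward zero at large sampling budgets, which is exactly why large-$k$ scores can overstate capability.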

3. Diagnostic Role and Exploration Collapse

The Pass@$k$ failure rate serves as a diagnostic tool for latent coverage and diversity in the model’s output distribution. However, analysis shows that optimizing Pass@$k$ directly as a reinforcement learning (RL) objective provides no additional exploration signal over Pass@$1$, because its gradient is collinear with that of Pass@$1$:

\nabla_\theta \, \mathrm{Pass}@k = k (1 - p)^{k-1} \, \nabla_\theta \, p,

where $p$ denotes the per-trial success probability (Yu, 20 Nov 2025). In low-success regimes ($p \to 0$), the sampled gradient vanishes because correct trajectories are almost never drawn; in high-success regimes ($p \to 1$), the scaling factor $k(1-p)^{k-1}$ itself goes to 0. Moreover, RL policies tend to concentrate probability mass on a single mode, causing multi-sample gains to collapse ($\mathrm{Pass}@k \approx \mathrm{Pass}@1$ up to a small residual error), so further draws add little coverage. Thus, multi-sample utility diminishes under exploitation-biased training and exploration collapse.
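The collinearity follows from the chain rule on the per-task objective $1 - (1-p)^k$: the derivative with respect to $p$ is the nonnegative scalar $k(1-p)^{k-1}$, which multiplies the Pass@$1$ gradient $\nabla_\theta p$. A toy finite-difference check (arbitrary values of $p$ and $k$):

```python
def pass_at_k_of_p(p: float, k: int) -> float:
    """Per-task Pass@k as a function of the per-trial success probability p."""
    return 1.0 - (1.0 - p) ** k

# Chain rule: d(Pass@k)/dp = k (1 - p)^(k - 1), so any parameter gradient is
# this nonnegative scalar times the Pass@1 gradient dp/dtheta -> collinear.
p, k, eps = 0.3, 8, 1e-6
numeric = (pass_at_k_of_p(p + eps, k) - pass_at_k_of_p(p - eps, k)) / (2 * eps)
analytic = k * (1 - p) ** (k - 1)
print(numeric, analytic)
```

Because a positive scalar cannot rotate a gradient direction, the Pass@$k$ objective reweights but never redirects the update, which is the formal sense in which it adds no exploration signal.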

4. Impact of Training Dynamics and Mitigation Methods

Standard RLVR methods such as GRPO and PPO increase Pass@$1$ but often do so at the expense of higher failure rates for $k > 1$, owing to over-concentration of probability on the top-1 candidate and suppression of alternative modes:

  • Token-level metrics show the dominant token's probability approaching 1 while the probability mass on alternative tokens is squeezed toward 0 (Peng et al., 16 Oct 2025).
  • Stronger concentration yields higher $\mathrm{Failure}@k$ for $k > 1$.

Mitigation strategies such as SimKO introduce asymmetric updates: for correct responses, the positive gradient is shared among the top-$K$ candidate tokens; for incorrect responses, a stronger penalty is applied to the top-1 token, but only at high-entropy ("forking") steps. This reduces $\mathrm{Failure}@k$ by maintaining greater posterior support over multiple valid continuations. Empirical data (Qwen2.5-Math-7B) show, for $k = 1$:

  • GRPO: fail@$1$ = 58.3%
  • SimKO: fail@$1$ = 56.6% (a reduction of 1.7 percentage points, or 2.9% relative)

Similar benefits are reported for larger $k$ and on other benchmarks (Peng et al., 16 Oct 2025).

  K     GRPO fail@K    SimKO fail@K    Reduction (pp, rel.)
  1     58.3%          56.6%           1.7 pp, 2.9%
  8     41.9%          38.6%           3.3 pp, 7.9%
  256   23.9%          19.5%           4.4 pp, 18.4%
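The mechanism behind these gaps can be illustrated with a toy comparison of two hypothetical policies that share the same Pass@$1 = q$ but differ in diversity (the value of $q$ is arbitrary): one collapses onto a single candidate per task, the other samples independently with per-trial success rate $q$ on every task.

```python
q = 0.4  # shared Pass@1 for both toy policies (arbitrary choice)

def failure_concentrated(k: int) -> float:
    # All mass on one candidate per task: the (1 - q) fraction of tasks
    # whose single mode is wrong stays failed no matter how many draws.
    return 1 - q

def failure_diverse(k: int) -> float:
    # Independent draws with per-trial success q: failure decays geometrically.
    return (1 - q) ** k

for k in (1, 8, 256):
    print(k, failure_concentrated(k), failure_diverse(k))
```

The concentrated policy's failure rate is flat in $k$, while the diverse policy's failure rate vanishes; exploitation-biased training pushes models toward the first regime.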

5. Alternative Breadth-Depth Metrics: Cover@$\tau$ and Failure@$\tau$

Pass@$k$ and its failure rate can be interpreted as a Beta$(1, k)$-weighted average over the "breadth-depth" curve Cover@$\tau$, which measures the fraction of tasks for which the per-trial success rate meets or exceeds a reliability threshold $\tau$:

\mathrm{Cover}@\tau = \frac{1}{T} \sum_{i=1}^T \mathbf{1}\left[ p_i \ge \tau \right].

Accordingly,

\mathrm{Pass}@k = \int_0^1 \mathrm{Cover}@\tau \cdot k (1 - \tau)^{k-1} \, d\tau.

Large-$k$ Pass@$k$ places all weight near $\tau = 0$, so the resulting curves overestimate model reasoning boundaries by counting essentially any nonzero chance of success. By contrast, plotting Failure@$\tau$ for high $\tau$ reveals tasks on which model coverage collapses as the threshold tightens. This distinguishes genuine reasoning skill (consistent reliability) from random guessing. Empirical studies (OMEGA, Reasoning Gym) show that base models may outperform exploration-regularized models only for large $k$ (the random-guessing regime), while the latter maintain superior coverage at high-reliability thresholds (Dragoi et al., 9 Oct 2025).
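The identity that Pass@$k$ is the $k(1-\tau)^{k-1}$-weighted integral of Cover@$\tau$ over $\tau \in [0,1]$ can be checked numerically on synthetic per-task success rates (a sketch; the rates are drawn uniformly at random, and the integral is approximated with a midpoint Riemann sum):

```python
import bisect
import random

random.seed(0)
T = 1000
ps = sorted(random.random() for _ in range(T))  # synthetic per-task success rates
k = 8

# Direct computation of Pass@k from the definition.
pass_k = sum(1 - (1 - p) ** k for p in ps) / T

def cover(tau: float) -> float:
    """Cover@tau: fraction of tasks with per-trial success rate >= tau."""
    return (T - bisect.bisect_left(ps, tau)) / T

# Midpoint Riemann sum of the Beta(1, k)-weighted integral of Cover@tau.
n = 20000
total = 0.0
for j in range(n):
    tau = (j + 0.5) / n
    total += cover(tau) * k * (1 - tau) ** (k - 1)
integral = total / n

print(pass_k, integral)
```

Sorting the rates once lets `cover` run in logarithmic time via binary search, so refining the quadrature grid stays cheap.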

6. Applications in Code Generation and Ranking

Failure@k has practical implications in scenarios such as code generation, where the objective is to maximize the likelihood that a user identifies a correct solution within their top-$k$ reviewed candidates. Methods like Top Pass optimize the Pass@$k$ loss directly through neural ranking, systematically minimizing Failure@$k$ at small $k$ (e.g., 1–10), a regime aligned with real-world user tolerance for reviewing multiple code suggestions. For instance, on CodeContests (n=200, DeepSeek-Coder), Top Pass reduces the one-shot failure rate from 94.8% (random order) to 90.3% and consistently delivers higher coverage across $k$ (Lyu et al., 2024).

  Method               pass@1    fail@1
  Random (DeepSeek)    5.2%      94.8%
  Top Pass             9.7%      90.3%
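In this ranking setting, fail@$k$ reduces to whether any of the $k$ highest-scored candidates is correct. A toy helper makes this concrete (hypothetical scores and labels for illustration; this is not the Top Pass model itself):

```python
def fail_at_top_k(scores, correct, k):
    """1.0 if none of the k highest-scored candidates is correct, else 0.0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return 0.0 if any(correct[i] for i in top) else 1.0

# Hypothetical example: one correct candidate sits at rank 3 under the
# original sampling order, but a ranker surfaces it to rank 1.
correct = [False, False, True, False]
scores_random = [0.9, 0.8, 0.7, 0.6]   # scores preserve the original order
scores_ranked = [0.2, 0.1, 0.95, 0.3]  # ranker puts the correct one on top

print(fail_at_top_k(scores_random, correct, 1))  # fails at k=1
print(fail_at_top_k(scores_ranked, correct, 1))  # succeeds at k=1
```

The candidate set is identical in both cases; only the ordering changes, which is why a ranking loss can improve fail@$k$ without touching the generator.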

7. Best Practices and Interpretive Guidance

Empirical and theoretical analyses warn against reporting Pass@$k$ or Failure@$k$ in isolation, especially at large sampling budgets, where degeneracy and random guessing can distort true model capabilities. Recommended best practices include:

  • Reporting both Pass@$k$/Failure@$k$ and the full breadth-depth curve via Cover@$\tau$/Failure@$\tau$.
  • Aggregating coverage at multiple $\tau$ values (e.g., 0.2, 0.8) or via the area under the Cover@$\tau$ curve (AUC) to robustly assess reasoning reliability (Dragoi et al., 9 Oct 2025).
  • Avoiding the use of Pass@$k$ as the sole RL optimization target; instead, employing objectives or regularizers that explicitly promote solution diversity and healthy exploration (e.g., SimKO, PKPO, KL-regularization) (Yu, 20 Nov 2025; Peng et al., 16 Oct 2025).
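The area-under-curve aggregation in the second bullet can be sketched as follows (an illustrative trapezoidal AUC over Cover@$\tau$; the uniform threshold grid is an arbitrary choice):

```python
def cover_at(ps, tau):
    """Fraction of tasks whose per-trial success rate meets threshold tau."""
    return sum(p >= tau for p in ps) / len(ps)

def auc_cover(ps, n_grid=100):
    """Trapezoidal area under Cover@tau for tau in [0, 1]."""
    taus = [j / n_grid for j in range(n_grid + 1)]
    covers = [cover_at(ps, t) for t in taus]
    return sum((covers[j] + covers[j + 1]) / 2 * (taus[j + 1] - taus[j])
               for j in range(n_grid))
```

As the grid refines, this AUC converges to the mean per-task success rate, so it rewards consistent reliability rather than one-off lucky samples.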

This approach yields a more rigorous and nuanced characterization of both "breadth" (the fraction of problems ever solved) and "depth" (the fraction consistently solved) in model evaluation, and mitigates overestimation due to chance success.
