
Pass@1 Metric Overview

Updated 23 January 2026
  • Pass@1 is defined as the probability that a single model output is correct, serving as a fundamental one-shot accuracy metric in various AI tasks.
  • It is estimated using independent Bernoulli trials with unbiased sampling methods and can incorporate Bayesian calibration to manage statistical variance.
  • In practice, Pass@1 is applied in reinforcement learning and code generation benchmarks to optimize performance while highlighting potential limitations in exploration diversity.

The Pass@1 metric is a standard quantitative measure assessing a model’s probability of producing a correct output in a single independent sampling attempt on a task. Primarily used for LLMs, syntactic program synthesis, and reinforcement learning with verifiable rewards (RLVR) in reasoning domains, Pass@1 captures the expected one-shot accuracy: the fraction of problems solved by a single sample per instance. It serves as both a fundamental evaluation criterion and a baseline against which multi-sample methods such as Pass@k are compared. As the simplest member of the Pass@k family, Pass@1 reveals core statistical properties about model success, learning signal, exploration, and limitations in both optimization and practical benchmarking contexts.

1. Formal Definition and Estimation

Let $x$ denote a prompt or problem, $a$ the reference answer, and $\pi_\theta(y \mid x)$ the conditional distribution given by the model with parameters $\theta$. Consider a binary verifier $R(a, y) \in \{0, 1\}$ indicating correctness.

  • Definition:

$$\mathrm{Pass@1} = \mathbb{E}_{(x,a)\sim D,\; y\sim\pi_\theta(\cdot \mid x)}\left[R(a, y)\right]$$

That is, the probability that a single sample $y$ is correct for a random $(x, a) \sim D$.

In code synthesis or analogous tasks, let $n$ independent model outputs yield $c$ correct outputs. The unbiased estimator is:

$$\widehat{\mathrm{Pass@1}} = \frac{c}{n}$$

This is the sample mean of the success indicators and is the form used in leaderboard reporting (Dalal et al., 19 May 2025, Dragoi et al., 9 Oct 2025, Lyu et al., 2024, Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
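As a concrete sketch, the estimator $\widehat{\mathrm{Pass@1}} = c/n$ can be computed directly from binary verifier outcomes; the function name below is illustrative, not from any specific benchmark harness.

```python
# Sketch of the unbiased Pass@1 estimator c/n over n independent samples.

def pass_at_1(outcomes):
    """Empirical Pass@1: fraction of correct outputs among n samples.

    `outcomes` is a list of 0/1 verifier results R(a, y), one per
    independent sample y ~ pi_theta(. | x).
    """
    if not outcomes:
        raise ValueError("need at least one sample")
    return sum(outcomes) / len(outcomes)

# Example: 3 correct outputs out of 8 samples.
estimate = pass_at_1([1, 0, 0, 1, 0, 1, 0, 0])  # -> 0.375
```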

In the general Pass@k family, with per-attempt success probability $p$, the metric for $k$ independent samples is:

$$\mathrm{Pass@}k = 1 - (1 - p)^k$$

The $k = 1$ specialization reduces to $\mathrm{Pass@1} = p$, the per-attempt success probability (Dalal et al., 19 May 2025, Yu, 20 Nov 2025).
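A minimal numeric sketch of this relationship, assuming independent attempts with a common per-attempt success probability $p$:

```python
# Pass@k = 1 - (1 - p)^k, the probability that at least one of
# k independent samples is correct.

def pass_at_k(p, k):
    """Idealized Pass@k for per-attempt success probability p."""
    return 1.0 - (1.0 - p) ** k

p = 0.5
one_shot = pass_at_k(p, 1)   # k = 1 reduces to Pass@1 = p
two_shot = pass_at_k(p, 2)   # 1 - 0.25 = 0.75
```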

2. Statistical Properties, Model Coverage, and Interpretation

Pass@1 treats each sampling attempt as an independent Bernoulli trial. For a test set of $T$ problems with per-problem success probabilities $\{p_i\}$, it computes:

$$\mathrm{Pass@1} = \frac{1}{T}\sum_{i=1}^{T} p_i$$

which estimates the average probability that a single model sample is correct for a random problem (Dragoi et al., 9 Oct 2025).

  • Interpretation: Pass@1 represents the “reasoning boundary” at $k = 1$, i.e., the fraction of the dataset the model can solve in a single attempt, with no reliability guarantee beyond that one shot (Dragoi et al., 9 Oct 2025).
  • Maximum-likelihood and Bayesian approaches to estimation both yield Pass@1 as the point estimate. For Bayesian calibration under a uniform Beta prior, the posterior mean is $(s+1)/(N+2)$ for $s$ successes in $N$ draws, and is order-equivalent to the empirical Pass@1 (Hariri et al., 5 Oct 2025).
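A small sketch of the Bayesian point estimate, assuming a uniform Beta(1, 1) prior as described above; the function names are illustrative.

```python
# Posterior mean of the success probability under a Beta(1, 1) prior
# (Laplace's rule of succession): (s + 1) / (N + 2).

def posterior_mean_pass_at_1(s, N):
    """Bayesian Pass@1 point estimate after s successes in N draws."""
    return (s + 1) / (N + 2)

def empirical_pass_at_1(s, N):
    """Maximum-likelihood estimate s / N."""
    return s / N

# With few draws the posterior mean is shrunk toward 1/2, but model
# rankings match those of the empirical estimate (order-equivalence).
```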

3. Methodological Usage in RLVR and Code Generation

In RLVR, Pass@1 is the canonical verifiable-reward objective:

  • The reward signal for policy gradient is $A(y) = R(a, y)$, allowing direct maximization of expected correctness via REINFORCE or GRPO (Group-Relative Policy Optimization) (Thrampoulidis et al., 27 Oct 2025).
  • Gradient estimators typically use the per-sample log-probability, with variance reduction via baselines or normalization (e.g., leave-one-out or mean normalization).
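As an illustration of the variance-reduction step, here is a minimal sketch of a group-relative (mean-normalized) advantage over a group of samples for one prompt; the helper name is hypothetical, not from any specific library.

```python
# GRPO-style mean-normalized advantage: subtract the group mean reward
# from each sample's binary reward R(a, y) in {0, 1}.

def group_relative_advantages(rewards):
    """Center a group's rewards at zero to reduce gradient variance.

    A correct sample in a mostly-failing group receives a large
    positive advantage; advantages always sum to zero per group.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

advs = group_relative_advantages([1, 0, 0, 0])  # group mean 0.25
```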

In benchmarking synthesizers (such as code generation), Pass@1 is the empirical fraction of problems for which the top-1 candidate passes all verification checks (Lyu et al., 2024). For code generation, a higher Pass@1 means a user is more likely to obtain a correct solution with minimal review effort.

4. Limitations, Shortcomings, and Alternative Metrics

Statistical and Practical Issues

  • Sampling noise: With small sample or test set size, Pass@1 is a high-variance estimator. Small differences do not imply meaningful differences in underlying model performance (Hariri et al., 5 Oct 2025).
  • Degeneracy at large $k$: Pass@k with $k \gg 1$ approaches $1$ even for weak models when the answer space has finite support, undermining its utility as a true measure of model reasoning (Dragoi et al., 9 Oct 2025).
  • No reliability profile: Pass@1 does not capture the distribution of per-instance reliability; it represents the mean, not the wisdom-of-the-crowd threshold (Dragoi et al., 9 Oct 2025).
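As a back-of-the-envelope check on the sampling-noise point above, the Bernoulli standard error of the empirical estimate can be sketched as follows; the function name is illustrative.

```python
import math

# Each attempt is a Bernoulli trial, so the standard error of the
# empirical Pass@1 estimate c/n is sqrt(p(1 - p)/n).

def pass_at_1_standard_error(p_hat, n):
    """Approximate standard error of an empirical Pass@1 estimate."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)

# With 100 samples at p_hat = 0.5 the standard error is about 0.05,
# so a two-point gap between models is well within noise.
se = pass_at_1_standard_error(0.5, 100)
```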

Exploration Collapse and Over-Conservatism

Empirical results show that using Pass@1 as a training reward leads to rapid “exploration collapse”: policies become over-conservative, focusing on a narrow set of easy or high-probability outputs, thus reducing sample entropy and overall coverage (Chen et al., 14 Aug 2025, Walder et al., 21 May 2025, Yu, 20 Nov 2025).

Table: Metric Comparisons

| Metric | Measures | Reliable at large $k$? |
| --- | --- | --- |
| Pass@1 | One-shot success rate | Yes |
| Pass@k | ≥ 1 correct in $k$ draws | No (degenerates as $k$ grows) |
| Cover@$\tau$ | Fraction of problems solved with probability ≥ $\tau$ | Yes |

Cover@$\tau$ profiles, as introduced in (Dragoi et al., 9 Oct 2025), offer a full distributional view of per-problem reliability, maintaining informativeness at all coverage thresholds, unlike Pass@k.
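A minimal sketch of the Cover@$\tau$ idea, assuming the per-problem success probabilities $p_i$ are known (in practice they would be estimated from repeated samples); the function name is illustrative.

```python
# Cover@tau: fraction of problems whose per-problem success
# probability p_i is at least tau. Pass@1 is the mean of the p_i;
# Cover@tau exposes their full distribution.

def cover_at_tau(per_problem_probs, tau):
    """Fraction of problems solved with probability >= tau."""
    return sum(p >= tau for p in per_problem_probs) / len(per_problem_probs)

probs = [0.9, 0.8, 0.1, 0.0]            # per-problem success rates p_i
pass1 = sum(probs) / len(probs)         # the mean, i.e., Pass@1
high_reliability = cover_at_tau(probs, 0.8)
```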

5. Gradient Dynamics and Optimization Analysis

Gradient expressions for Pass@1 and Pass@k directly illustrate their learning dynamics:

  • The Pass@1 policy gradient for RLVR is

$$\nabla_\theta\,\mathrm{Pass@1} = \mathbb{E}_{x,a,y}\left[R(a, y)\,\nabla_\theta \log \pi_\theta(y \mid x)\right]$$

This yields a stable, non-reweighted signal: whenever a correct sample is found, the learning step is of constant magnitude (Yu, 20 Nov 2025).

  • In contrast, Pass@k gradients reweight the Pass@1 gradient:

$$\nabla_\theta\,\mathrm{Pass@}k = k\,(1 - \mathrm{Pass@1})^{k-1}\,\nabla_\theta\,\mathrm{Pass@1}$$

The direction is identical; only scalar magnitude differs (Yu, 20 Nov 2025).

  • As training progresses and the policy concentrates (exploration collapse), the marginal value of additional samples vanishes and $\mathrm{Pass@}k \to \mathrm{Pass@1}$. Thus, Pass@1 is the irreducible core dimension along which all pass-based objectives move (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025).
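The scalar reweighting $k(1 - \mathrm{Pass@1})^{k-1}$ can be checked numerically; a minimal sketch with an illustrative function name:

```python
# Scalar multiplying the Pass@1 gradient inside the Pass@k gradient:
# k * (1 - Pass@1)^(k - 1). Direction is unchanged; only magnitude varies.

def pass_at_k_gradient_scale(pass1, k):
    """Magnitude factor relating the Pass@k gradient to the Pass@1 gradient."""
    return k * (1.0 - pass1) ** (k - 1)

# Early in training (low Pass@1), larger k amplifies the gradient;
# as Pass@1 approaches 1, the multi-sample weight collapses toward zero.
early = pass_at_k_gradient_scale(0.1, 8)
late = pass_at_k_gradient_scale(0.9, 8)
```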

6. Practical Considerations, Evaluation Protocols, and Recommendations

  • Pass@1 remains the canonical, minimal-overhead metric for both RLVR training and model evaluation, and is widely reported in code generation challenges (MBPP, HumanEval, etc.) (Lyu et al., 2024).
  • Bayesian analysis (posterior mean and credible intervals) is recommended for robust evaluation and model ranking at small sample sizes, clarifying statistical significance and uncertainty—not merely point estimates (Hariri et al., 5 Oct 2025).
  • Optimal training protocols can use Pass@1 at the end of annealed multi-sample exploration (i.e., begin training for exploration with Pass@k, $k > 1$, then anneal to $k = 1$ for final single-sample precision) (Walder et al., 21 May 2025).
  • Pass@k and Pass@1 are best used together: Pass@1 for faithful one-shot evaluation, Pass@k for diagnostic insight into model diversity and latent coverage.
  • Direct optimization of surrogate objectives or using advantage-shaping regularizations can alleviate overfitting to easy cases and encourage coverage in Pass@1 training (Thrampoulidis et al., 27 Oct 2025).
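A hypothetical sketch of the annealed-exploration protocol mentioned above; the linear schedule and constants are illustrative, not taken from the cited work.

```python
# Illustrative annealing of the sample count k from k_start down to 1
# over a training run: explore with Pass@k early, finish on Pass@1.

def annealed_k(step, total_steps, k_start=8):
    """Linearly anneal k from k_start to 1 as training progresses."""
    frac = min(step / total_steps, 1.0)   # clamp past end of schedule
    return max(1, round(k_start - frac * (k_start - 1)))

schedule = [annealed_k(s, 100) for s in (0, 50, 100)]  # starts at 8, ends at 1
```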

7. Extensions and Theoretical Role in Model Assessment

Recent work proposes alternatives that resolve known limitations of Pass@1:

  • Cover@$\tau$ provides reliability-aware coverage, plotting the breadth–depth curve of model accuracy at all confidence thresholds. Pass@1 emerges as the area under this curve, but high-coverage reliability requires more than maximizing the mean (Dragoi et al., 9 Oct 2025).
  • Bayesian model evaluation unifies single-sample and multi-sample reporting, enables uncertainty quantification, and prevents over-interpretation of minor Pass@1 differences (Hariri et al., 5 Oct 2025).

The role of Pass@1 is thus twofold: it is (1) the foundational metric for one-shot correctness, and (2) the anchor point for all richer exploration- and diversity-based metrics. Theoretical analysis shows that, when used as a standalone optimization objective, it provides a stable but possibly overly exploitative gradient, suggesting that robust training and model assessment should combine Pass@1 with auxiliary metrics or reward-shaping regimes to ensure statistical reliability and exploration capability (Chen et al., 14 Aug 2025, Walder et al., 21 May 2025, Yu, 20 Nov 2025, Thrampoulidis et al., 27 Oct 2025).
