
Pass-at-k Policy Optimization (PKPO)

Updated 9 February 2026
  • PKPO is a reinforcement learning framework that directly optimizes the pass@k metric, the probability of achieving at least one correct outcome over k attempts.
  • It employs policy gradient estimators and surrogate reward transformations to encourage diverse exploration while balancing exploitation in complex tasks.
  • By annealing the value of k and using techniques like SimKO, PKPO improves success rates in tasks such as code generation and mathematical reasoning.

Pass-at-k Policy Optimization (PKPO) refers to a family of reinforcement learning (RL) algorithms and objective transformations designed to directly optimize the pass@k metric: the probability that at least one of k independent sampled trajectories from a stochastic policy yields a correct (verifiable) solution on complex tasks such as code generation and mathematical reasoning. PKPO encompasses both direct policy-gradient estimators for the true pass@k objective and surrogate and advantage-shaping algorithms that encourage sample-wise diversity and effective exploration. It is distinguished from standard RLVR (reinforcement learning with verifiable rewards), which traditionally targets pass@1 (success probability under a single sample), potentially biasing learning toward exploitative, mode-concentrated policies that stall in hard-exploration regimes.

1. The Pass@k Metric and Objective Formulation

The pass@k metric, widely adopted as a benchmark for LLM-based reasoning and code generation, measures the likelihood of success when drawing k independent samples from a policy. For a policy $\pi_\theta$ and a binary reward function $f(x) \in \{0,1\}$, pass@k is defined as

$$\mathrm{pass@}k(\theta) = \mathbb{E}_{x_1,\dots,x_k \sim \pi_\theta}\Bigl[1 - \prod_{i=1}^{k}\bigl(1 - f(x_i)\bigr)\Bigr].$$

A direct RL objective, integrating over a dataset $\mathcal{D}$ (or prompt distribution $P$), is

$$J_k(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\bigl[1 - (1 - J_1(x;\theta))^k\bigr],$$

where $J_1(x;\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\bigl[V(x,y)\bigr]$ is the single-sample (pass@1) success rate (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025, Yu, 20 Nov 2025).

PKPO seeks to maximize $J_k(\theta)$ for a user-specified $k$, either throughout training or under a scheduled annealing regime.
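The relationship between the pass@k objective and the single-sample success rate can be inspected numerically (a minimal sketch transcribing $J_k = 1 - (1 - J_1)^k$; the values are illustrative only):

```python
# Pass@k objective as a function of the per-sample success rate J_1,
# a direct transcription of J_k = 1 - (1 - J_1)^k.

def j_k(j1: float, k: int) -> float:
    """Expected pass@k given single-sample success probability j1."""
    return 1.0 - (1.0 - j1) ** k

# Even a weak policy (J_1 = 0.1) succeeds far more often given many attempts.
for k in (1, 4, 16):
    print(k, round(j_k(0.1, k), 3))
```

This makes the exploration incentive concrete: a policy that keeps several distinct, individually unlikely solution modes alive can score well on pass@k even when its pass@1 is modest.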

2. Policy Gradient Estimation and Reward Transformation

Direct Gradient Computation

The gradient of the expected pass@k objective under population sampling is

$$\nabla_\theta J_k(x;\theta) = \alpha_k(x,\theta)\,\nabla_\theta J_1(x;\theta),$$

where $\alpha_k(x,\theta) \equiv k\,(1 - J_1(x;\theta))^{k-1}$. Applying REINFORCE,

$$\nabla_\theta J_1(x;\theta) = \mathbb{E}_{y \sim \pi_\theta}\bigl[V(x,y)\,\nabla_\theta \log \pi_\theta(y\mid x)\bigr].$$

Thus, optimizing $J_k$ through standard policy-gradient methods amounts to a per-example, strictly positive reweighting of the pass@1 update, with the reweighting diminishing as the policy becomes more accurate (Yu, 20 Nov 2025, Thrampoulidis et al., 27 Oct 2025).
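The behavior of this reweighting is easy to check (a small numeric sketch of $\alpha_k = k(1 - J_1)^{k-1}$; no paper-specific code is implied):

```python
# Per-example weight alpha_k = k * (1 - J_1)^(k-1) multiplying the
# pass@1 gradient when optimizing pass@k.

def alpha_k(j1: float, k: int) -> float:
    return k * (1.0 - j1) ** (k - 1)

# Hard prompts (low J_1) are up-weighted by roughly a factor of k,
# while near-solved prompts (J_1 -> 1) contribute vanishing signal.
print(alpha_k(0.05, 8))   # large weight on a hard example
print(alpha_k(0.95, 8))   # near-zero weight on an easy example
```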

Unbiased Monte Carlo Estimation

Given $n \geq k$ samples with $c$ observed successes, an unbiased estimator of pass@k is

$$\rho(n, c, k) = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},$$

with the corresponding gradient estimator expressible as a sum of per-sample weights, efficiently computable for both binary and continuous-reward settings. All standard policy-gradient or actor–critic optimizers can use these transformed weights in place of raw rewards (Walder et al., 21 May 2025). In the continuous case, analogous estimators based on order statistics and combinatorial coefficients are realizable in $O(k + n\log n)$ time.
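The binary-reward estimator can be computed exactly with integer binomial coefficients (a sketch using Python's `math.comb`; the continuous-reward variant based on order statistics is not shown):

```python
from math import comb

def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate rho(n, c, k) from n samples with c successes."""
    if n - c < k:  # fewer than k failures: every size-k subset contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 rollouts, 2 correct -> estimate of pass@4
print(pass_at_k_estimate(8, 2, 4))
```

Exact integer arithmetic via `comb` avoids the overflow and cancellation issues of naive factorial-based implementations for moderate $n$.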

Surrogate and Advantage-Shaping Algorithms

Alternative (sometimes biased but lower-variance) approaches shape the per-sample advantage to track a surrogate function $F(\rho_K)$ of the empirical pass@k, e.g., the variance-stabilizing arcsin transformation $F(\rho_K) = \frac{2}{K}\arcsin(\sqrt{\rho_K})$. These approaches encompass a range of practical updates, including reward-level regularization and hard-example up-weighting, e.g., augmenting the objective with $\Omega(\rho) = \sqrt{\rho(1-\rho)}$ (Thrampoulidis et al., 27 Oct 2025).
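The arcsin surrogate and the regularizer $\Omega$ above are simple scalar transforms (a sketch transcribing the two formulas; the function names are ours, not the papers'):

```python
from math import asin, sqrt

def arcsin_surrogate(rho: float, K: int) -> float:
    """Surrogate F(rho_K) = (2/K) * arcsin(sqrt(rho_K)) of empirical pass@k."""
    return (2.0 / K) * asin(sqrt(rho))

def hard_example_weight(rho: float) -> float:
    """Regularizer Omega(rho) = sqrt(rho * (1 - rho)); it peaks at rho = 0.5,
    i.e., on examples of intermediate difficulty."""
    return sqrt(rho * (1.0 - rho))
```

Note that $\Omega$ vanishes on both fully solved and fully unsolved examples, concentrating the extra exploration signal exactly where the direct pass@k gradient is informative.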

PKPO is thus not a single algorithm but a general method for embedding pass@k optimization (either unbiased or via controlled surrogates) within the policy-gradient RL framework.

3. Exploration, Exploitation, and the Dynamics of PKPO

Increasing $k$ in the optimization target systematically encourages the stochastic policy to cover multiple solution modes, since maximizing the probability that any of $k$ attempts is correct biases against mode collapse.

Analytic derivations show that, on hard tasks, PKPO's advantage function peaks at intermediate reward rates, focusing gradient strength on difficult examples and supporting exploration via increased entropy (Chen et al., 14 Aug 2025). As $k$ grows, the per-sample advantage magnitude decreases, slowing learning but enhancing diversity among sampled outputs. Annealing $k$, starting high (for exploration) and then lowering to $k=1$ (for exploitation), can achieve strong asymptotic pass@1 and pass@k concurrently (Walder et al., 21 May 2025).
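The cited work does not prescribe a specific schedule; one simple possibility, purely for illustration, is a geometric decay of $k$ over training:

```python
def annealed_k(step: int, total_steps: int, k_max: int = 16) -> int:
    """Geometrically decay the optimization target k from k_max down to 1.
    The schedule shape (and k_max = 16) is a hypothetical choice."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(1, round(k_max ** (1.0 - frac)))

# Early steps optimize pass@16 (exploration); late steps pass@1 (exploitation).
print([annealed_k(s, 100) for s in (0, 50, 100)])
```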

However, PKPO based solely on direct pass@k gradients has limitations: as $J_1(x;\theta) \to 1$ (high success) or $J_1(x;\theta) \to 0$ (low success), the learning signal vanishes, and the gradient direction remains collinear with that of pass@1, so no new search directions are introduced (Yu, 20 Nov 2025). Iterative RLVR commonly collapses the policy onto the dominant solution mode, driving pass@k toward pass@1 and nullifying diversity gains.

4. Algorithmic Enhancements and Variants

SimKO: Simple Pass@K Optimization

SimKO introduces asymmetric update rules at the token level to directly counter probability collapse and encourage persistent diffusion of probability mass (Peng et al., 16 Oct 2025). For correct trajectories, SimKO spreads the positive update across the top-$K$ most probable tokens (label smoothing); for incorrect ones, it applies an amplified penalty to the top-1 token, crucially at high-entropy ("forking") decoding steps. This design empirically increases both pass@1 and pass@$K$ across deep math and reasoning benchmarks, outperforming vanilla group-based policy-gradient baselines, which typically favor exploitation alone.
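The asymmetric rule can be schematized as follows (a sketch with hypothetical knobs `top_k`, `entropy_thresh`, and `penalty`; SimKO's exact thresholds and update magnitudes are not reproduced here):

```python
from math import log

def simko_step_weights(probs, correct, top_k=3, entropy_thresh=1.0, penalty=2.0):
    """Sketch of a SimKO-style asymmetric per-token update at one decoding step.
    probs: next-token distribution (list of floats summing to 1).
    Returns per-token multipliers for the policy-gradient update."""
    w = [0.0] * len(probs)
    entropy = -sum(p * log(p) for p in probs if p > 0)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if correct:
        # correct trajectory: spread the positive update over the
        # top-K most probable tokens (label smoothing)
        for i in order[:top_k]:
            w[i] = 1.0 / top_k
    elif entropy > entropy_thresh:
        # incorrect trajectory at a high-entropy ("forking") step:
        # amplified penalty concentrated on the top-1 token
        w[order[0]] = -penalty
    return w
```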

Unification and Modular Recipe

Recent analysis unifies direct REINFORCE-style, reward-transformed, and advantage-shaped PKPO updates as instances of surrogate reward maximization, in which the shaping and regularization terms inject an appropriate exploration signal while controlling estimator variance (Thrampoulidis et al., 27 Oct 2025). Practical implementation proceeds by selecting a surrogate $F(\rho_K)$ and optionally an exploration regularizer $\Omega(\rho_K)$, then computing the appropriately weighted policy gradient.

| Approach | Estimator Type | Exploration Mechanism |
|---|---|---|
| Direct PKPO | Unbiased, REINFORCE-based | Joint utility of $k$ samples |
| Advantage-shaped | Biased, low-variance | Surrogates encourage diversity |
| SimKO | Asymmetric importance/advantage | Top-$K$ smoothing and anti-collapse |

5. Empirical Outcomes and Practical Implementation

In empirical evaluation across a wide spectrum of RLVR tasks (math, code, logic), PKPO-based learning substantially increases cumulative pass@k metrics over pass@1-optimized baselines, particularly on hard benchmarks where mode-finding alone fails (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025, Peng et al., 16 Oct 2025). Notably:

  • PKPO typically improves pass@k by 10–50 points and pass@1 by 2–8 points.
  • Analytical advantage computation removes sampling variance and supports stable, cost-efficient training.
  • Integration is seamless in standard PPO/DAPO or actor–critic frameworks, requiring only the replacement of scalar rewards with pass@k-shaped surrogates, and supports practical scheduling (annealing $k$).
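As a concrete instance of this drop-in replacement, the direct-gradient reweighting from Section 2 can be applied to a group of rollouts (a sketch under that result alone; the advantage normalization and clipping details of PPO/DAPO are omitted):

```python
def pass_at_k_shaped_rewards(rewards, k):
    """Replace a group's scalar rewards with pass@k-shaped ones.
    Each sample's reward is scaled by alpha_k = k * (1 - p_hat)^(k-1),
    where p_hat is the group's empirical pass@1 rate."""
    p_hat = sum(rewards) / len(rewards)
    alpha = k * (1.0 - p_hat) ** (k - 1)
    return [alpha * r for r in rewards]

# 4 rollouts, one success, target pass@4: the lone success is up-weighted,
# while a fully solved group would contribute no gradient at all.
print(pass_at_k_shaped_rewards([0.0, 1.0, 0.0, 0.0], k=4))
```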

Table: Performance Impact of PKPO and SimKO

| Method | Pass@k Gain (typ.) | Pass@1 Gain (typ.) | Complexity Overhead |
|---|---|---|---|
| PKPO | +10–50 pts | +2–8 pts | $O(n)$ to $O(k + n\log n)$ |
| SimKO | +1–5 pts ($k=128$+) | +0.5–2 pts | $O(n)$, only at fork tokens |

Across the reported large-scale studies, direct PKPO and/or SimKO outperformed classic RLVR and pass@1-optimized pipelines, demonstrating effectiveness at balancing exploration and exploitation, with no detrimental impact when annealed to $k=1$ for final exploitation.

6. Critical Insights, Limitations, and Recommendations

PKPO provides rigorous estimators and a theoretically-grounded framework for exploring the trade-off between diversity and exploitation in RLVR policy optimization (Yu, 20 Nov 2025, Thrampoulidis et al., 27 Oct 2025). Nevertheless:

  • The gradient of pass@k is always a reweighted version of the pass@1 gradient with no new search direction—thus, true exploration must be induced either through surrogate-based reward shaping, advantage scaling, or algorithmic variants foregrounding diversity.
  • In both low-success and high-success regimes, learning signal vanishes, and repeated RLVR can collapse exploration to top-1 modes, nullifying the potential benefit of pass@k training if not controlled by explicit diversity mechanisms.
  • Empirical evidence suggests that surrogate transformations, advantage shaping, and asymmetric update rules such as in SimKO are essential to actualize the potential of pass@k objectives in practical large-model optimization.

Recommendation: Pass@k should typically serve as an evaluation or diagnostic tool and a component of a more sophisticated RL training objective that incorporates explicit mechanisms for diversity promotion, such as entropy regularization, joint-sample estimators, or reward-level regularization. Where optimization is necessary, PKPO and its surrogates provide efficient, theoretically justified approaches for scalable exploration/exploitation balancing.
