Effect of GRPO Weighting Inversion on Distribution Sharpening

Determine whether the inversion of the population-level weighting function in Group Relative Policy Optimization (GRPO), whose weight grows as the pass rate approaches one, contributes to distribution sharpening when training datasets contain a substantial fraction of very easy inputs.

Background

The paper introduces a population-level weight-function view that expresses gradients for various objectives, including reinforcement learning (RL), GRPO, MaxRL, and maximum likelihood (ML), in the form E_x[w(p_theta(x)) ∇_theta p_theta(x)]. In this view, the GRPO weighting function is w(p) = 1/√(p(1−p)), which increases as p → 1, implying relatively greater emphasis on very easy inputs compared to RL and likelihood-based objectives.
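To make the inversion concrete, the following minimal sketch evaluates candidate weight functions at a few pass rates. Only the GRPO weight w(p) = 1/√(p(1−p)) is taken from the paper; the RL weight w(p) = 1 (from ∇E[p] = E[∇p]) and the ML weight w(p) = 1/p (from ∇ log p = ∇p/p) are standard derivations assumed here for comparison, and MaxRL is omitted because its weight function is not reproduced above.

```python
import math

# Population-level weights w(p) for the gradient form
# E_x[w(p_theta(x)) ∇_theta p_theta(x)].
# GRPO's weight is the one stated in the source; the RL and ML weights
# are standard derivations assumed for this sketch, not quotes from the paper.

def w_grpo(p: float) -> float:
    return 1.0 / math.sqrt(p * (1.0 - p))

def w_rl(p: float) -> float:
    # Assumed: expected-pass-rate objective, ∇E[p] = E[∇p], so constant weight.
    return 1.0

def w_ml(p: float) -> float:
    # Assumed: maximum-likelihood objective, ∇ log p = ∇p / p.
    return 1.0 / p

for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(f"p={p:4.2f}  GRPO={w_grpo(p):7.2f}  RL={w_rl(p):4.2f}  ML={w_ml(p):7.2f}")
```

At p = 0.99 the GRPO weight is roughly 10 while the assumed ML weight is about 1.01, which is the inversion the conjecture turns on. Note that w(p) = 1/√(p(1−p)) is symmetric about p = 0.5, so GRPO also upweights very hard inputs; it is the growth as p → 1 that distinguishes it from the likelihood-based weights.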

The authors note that this inversion is qualitatively different from ML and MaxRL, both of which place more weight on low-pass-rate inputs. They conjecture that the inversion may be linked to empirically observed distribution sharpening (e.g., pass@k degradation) in RL with verifiable rewards when datasets include many easy examples, and they identify a detailed analysis of this effect as future work.
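A toy calculation (a hypothetical illustration, not an experiment from the paper; the pass rates and dataset mix below are invented) shows how, under the population-level view, GRPO's weighting keeps most of the gradient mass on easy inputs in an easy-heavy dataset, whereas the assumed ML weight shifts mass toward hard inputs. The pass@k line uses the standard independent-sampling formula pass@k = 1 − (1−p)^k.

```python
import math

# Hypothetical dataset: 80% easy inputs (p = 0.98), 20% hard inputs (p = 0.10).
# These values are invented for illustration only.
easy = [0.98] * 8
hard = [0.10] * 2
dataset = easy + hard

def w_grpo(p): return 1.0 / math.sqrt(p * (1.0 - p))
def w_ml(p):   return 1.0 / p  # assumed ML weight, as in the previous sketch

for name, w in (("GRPO", w_grpo), ("ML", w_ml)):
    total = sum(w(p) for p in dataset)
    easy_share = sum(w(p) for p in easy) / total
    print(f"{name}: fraction of weight on easy inputs = {easy_share:.2f}")

# Standard pass@k under independent samples. On an input with p = 0.98,
# pass@8 is already ~1, so further sharpening there cannot improve pass@k,
# while potential gains on hard inputs go unrealized.
def pass_at_k(p, k): return 1.0 - (1.0 - p) ** k
print(f"pass@8 at p=0.98: {pass_at_k(0.98, 8):.4f}  at p=0.10: {pass_at_k(0.10, 8):.4f}")
```

Under these invented numbers, GRPO assigns about 90% of the total weight to the easy inputs versus roughly 29% for the assumed ML weight. This is only a plausibility sketch of the conjectured mechanism; whether GRPO training dynamics actually behave this way is exactly the open question posed above.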

References

We conjecture that this inversion may contribute to distribution sharpening~\citep{yue2025doesreinforcementlearningreally,wu2026invisibleleashrlvrescape} when datasets contain a substantial fraction of overly easy inputs, and leave a detailed analysis to future work.

Maximum Likelihood Reinforcement Learning (2602.02710, Tajwar et al., 2 Feb 2026), Section: A Unifying Weight-Function View (footnote in the paragraph discussing GRPO weighting).