Generalization of GRPO-specific off-policy heuristics beyond the GRPO loss
Determine how the off-policy stability heuristics used with Group Relative Policy Optimization (GRPO), namely clipping of the importance sampling ratio, deletion of tokens with extreme importance ratios, and discarding of entire rollouts that are too off-policy, generalize to reinforcement learning post-training objectives for large language models other than the GRPO loss. A sketch of the three heuristics follows below.
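For concreteness, here is a minimal sketch of how the three heuristics compose around a generic clipped-surrogate policy loss. The function name, parameter names, and threshold values (`clip_eps`, `token_ratio_cap`, `rollout_kl_cap`) are illustrative assumptions, not details taken from Ritter et al.

```python
import torch

def apply_offpolicy_heuristics(
    logp_new: torch.Tensor,    # (B, T) token log-probs under the current policy
    logp_old: torch.Tensor,    # (B, T) token log-probs under the behavior policy
    advantages: torch.Tensor,  # (B, T) per-token advantages (group-relative in GRPO)
    mask: torch.Tensor,        # (B, T) 1.0 for real tokens, 0.0 for padding
    clip_eps: float = 0.2,          # heuristic 1: ratio clipping range (assumed value)
    token_ratio_cap: float = 4.0,   # heuristic 2: drop tokens with ratio outside [1/cap, cap] (assumed)
    rollout_kl_cap: float = 0.5,    # heuristic 3: drop rollouts whose mean |log-ratio| exceeds this (assumed)
) -> torch.Tensor:
    """Clipped-surrogate policy loss with the three off-policy heuristics applied."""
    log_ratio = logp_new - logp_old.detach()
    ratio = log_ratio.exp()

    # Heuristic 2: delete (mask out) tokens whose importance ratio is extreme.
    token_ok = (ratio < token_ratio_cap) & (ratio > 1.0 / token_ratio_cap)
    mask = mask * token_ok.float()

    # Heuristic 3: discard whole rollouts that are too off-policy, measured here
    # by the mean absolute log-ratio over each rollout's surviving tokens.
    denom = mask.sum(dim=1).clamp(min=1.0)
    rollout_drift = (log_ratio.abs() * mask).sum(dim=1) / denom
    rollout_ok = (rollout_drift < rollout_kl_cap).float().unsqueeze(1)  # (B, 1)
    mask = mask * rollout_ok

    # Heuristic 1: PPO/GRPO-style clipping of the importance sampling ratio.
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)

    # Average the surrogate loss over surviving tokens only.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Note the asymmetry this sketch makes visible: heuristics 2 and 3 only edit the token mask, so they could in principle be bolted onto any per-token objective, whereas heuristic 1 is defined in terms of the clipped-surrogate form itself. Whether any of the three remain beneficial under other post-training losses is exactly the open question posed here.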
References
"Since these heuristics are specifically designed for and tested under the GRPO loss, it is unclear how they generalize beyond the very specific GRPO loss function."
— LLMs Can Learn to Reason Via Off-Policy RL
(2602.19362, Ritter et al., 22 Feb 2026), Section 2: Background, paragraph on importance sampling heuristics