Generalization of GRPO-specific off-policy heuristics beyond the GRPO loss

Determine how the off-policy stability heuristics used with Group Relative Policy Optimization (GRPO), including importance-sampling ratio clipping, deletion of tokens with extreme importance ratios, and discarding entire rollouts that are too off-policy, generalize to reinforcement learning post-training objectives for large language models other than the GRPO loss function.

Background

The paper discusses how practical RL post-training for LLMs is often off-policy due to mismatches between the trainer and the inference engine. Many prior works attempt to mitigate this mismatch by augmenting GRPO with importance sampling and a variety of heuristics, such as clipping importance ratios, deleting tokens with extreme ratios, or discarding entire rollouts deemed too off-policy.
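To make the three heuristics concrete, here is a minimal sketch of how they might be applied to a single rollout's per-token log-probabilities. The function name, argument names, and all threshold values are hypothetical illustrations, not taken from the paper or from any specific codebase:

```python
import numpy as np

def apply_offpolicy_heuristics(logp_train, logp_infer, clip_eps=0.2,
                               token_ratio_max=4.0, rollout_kl_max=0.5):
    """Illustrative sketch (hypothetical names and thresholds).

    logp_train / logp_infer: per-token log-probabilities of the sampled
    tokens under the trainer policy and the inference engine, respectively.
    Returns (clipped_ratios, token_mask, keep_rollout).
    """
    # Per-token importance ratios between trainer and inference policies
    ratios = np.exp(logp_train - logp_infer)
    # (1) Clip ratios into [1 - eps, 1 + eps], PPO/GRPO-style
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    # (2) Mask out ("delete") tokens whose raw ratio is extreme
    token_mask = ratios <= token_ratio_max
    # (3) Discard the whole rollout if it looks too off-policy,
    # judged here by a simple mean log-ratio (KL-like) proxy
    kl_proxy = np.mean(logp_infer - logp_train)
    keep_rollout = abs(kl_proxy) <= rollout_kl_max
    return clipped, token_mask, keep_rollout
```

The open question is precisely how these three operations, designed around GRPO's clipped-ratio surrogate, should be adapted when the surrounding loss is not GRPO at all.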

The authors note that these heuristics were designed and evaluated specifically within the GRPO objective and explicitly state uncertainty about their applicability beyond GRPO. This raises a concrete open question about whether and how such heuristics would function with other RL objectives for LLM post-training, such as off-policy squared-regression objectives like OAPL.

References

Since these heuristics are specifically designed for and tested under the GRPO loss, it is unclear how they generalize beyond the very specific GRPO loss function.

LLMs Can Learn to Reason Via Off-Policy RL (2602.19362 - Ritter et al., 22 Feb 2026) in Section 2: Background (paragraph on importance sampling heuristics)