Does extended RL fine-tuning (PPO/GRPO) cause model collapse?

Determine whether extended reinforcement-learning fine-tuning with techniques such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) leads to model collapse in large language models.

Background

The authors frame iterative deployment as a form of reinforcement learning and discuss the safety implications of the implicit reward signal that arises from user curation. In this context, they raise uncertainty about the stability of models under the extended RL fine-tuning regimes commonly used to improve reasoning skills.

Specifically, they note that it is unclear whether prolonged RL training with methods like PPO or GRPO induces model collapse, which would manifest as degraded capabilities caused by distributional shrinkage, i.e., the model's output distribution contracting onto a narrow set of modes.
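
Neither the excerpt nor the question defines these methods or how shrinkage would be measured, so the following sketch is illustrative rather than taken from the paper. `grpo_advantages` shows the group-relative advantage normalization that distinguishes GRPO from PPO (a group mean/std baseline in place of a learned value function), and `distinct_n` is one hypothetical diagnostic that could be logged across RL checkpoints to probe for distributional shrinkage; both function names are our own.

```python
import math
from collections import Counter
from typing import Iterable, Sequence

def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: each sampled completion's reward is
    normalized against the mean and std of its group, the baseline GRPO
    uses in place of PPO's learned value function."""
    mu = sum(rewards) / len(rewards)
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sd + eps) for r in rewards]

def distinct_n(samples: Iterable[str], n: int = 2) -> float:
    """Unique-n-gram ratio over sampled completions (0 to 1). A sustained
    decline across RL checkpoints is one operational signal of
    distributional shrinkage."""
    counts, total = Counter(), 0
    for text in samples:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
            total += 1
    return len(counts) / total if total else 0.0

# Hypothetical usage: at each checkpoint, sample several completions per
# prompt and track distinct_n; if it keeps falling while benchmark scores
# degrade, that would be evidence for the collapse the authors flag.
if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))
    print(distinct_n(["the plan is valid", "the plan is valid", "a new plan"]))
```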

References

"Note, however, that it is also unclear whether longer training periods using RL techniques, such as PPO or GRPO, lead to model collapse or not."

Corrêa et al., "Iterative Deployment Improves Planning Skills in LLMs," arXiv:2512.24940, 31 Dec 2025, Subsection: Implications to AI Safety.