Does extended RL fine-tuning (PPO/GRPO) cause model collapse?
Determine whether longer training periods using reinforcement learning techniques such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) lead to model collapse in large language models.
References
Note, however, that it is also unclear whether longer training periods using RL techniques, such as PPO or GRPO, lead to model collapse or not.
— Iterative Deployment Improves Planning Skills in LLMs
(2512.24940 - Corrêa et al., 31 Dec 2025) in Subsection: Implications to AI Safety