Avoiding convergence and diversity collapse in RL with execution rewards
Develop reinforcement learning algorithms beyond standard GRPO for finetuning large language models to generate research ideas from execution rewards, such that training does not converge on a small set of easy-to-implement ideas or suffer the associated collapse in thinking length and idea diversity, thereby raising the upper bound of idea quality in open-ended AI research environments.
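To make the failure mode concrete, below is a minimal sketch of the standard GRPO group-relative advantage computation over execution rewards, plus one hypothetical reward-shaping intervention (a within-group similarity penalty) of the kind this problem calls for. All function names, the Jaccard similarity measure, and the penalty form are illustrative assumptions for this sketch, not the paper's method or a known solution; standard GRPO uses only the raw group-normalized reward.

```python
# Sketch only: standard GRPO advantages and a hypothetical diversity-shaped variant.
from typing import List
import math


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO: normalize execution rewards within a group of G sampled ideas."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def jaccard_similarity(a: str, b: str) -> float:
    """Crude token-overlap similarity between two idea strings (illustrative only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def diversity_shaped_rewards(ideas: List[str], exec_rewards: List[float],
                             penalty_weight: float = 0.5) -> List[float]:
    """Hypothetical shaping: dock each idea's execution reward by its mean similarity
    to the other ideas in the group, discouraging collapse onto one easy idea."""
    shaped = []
    for i, (idea, r) in enumerate(zip(ideas, exec_rewards)):
        sims = [jaccard_similarity(idea, other)
                for j, other in enumerate(ideas) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        shaped.append(r - penalty_weight * mean_sim)
    return shaped


if __name__ == "__main__":
    ideas = ["tune the learning rate schedule",
             "tune the learning rate schedule slightly",
             "design a new retrieval-augmented training objective"]
    exec_rewards = [0.8, 0.8, 0.5]  # easy, near-duplicate ideas execute well and score high
    print(grpo_advantages(exec_rewards))                                   # standard GRPO
    print(grpo_advantages(diversity_shaped_rewards(ideas, exec_rewards)))  # shaped variant
```

Under standard GRPO, the two near-duplicate easy ideas receive the highest advantages, which is the convergence pressure described above; the shaped variant merely illustrates one possible counter-pressure, not a validated remedy.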
References
Avoiding such convergence and collapse is an open problem and likely requires new algorithmic interventions beyond standard GRPO, which is beyond the scope of this work.
— Towards Execution-Grounded Automated AI Research
(2601.14525 - Si et al., 20 Jan 2026) in Section 5: Reinforcement Learning from Execution Reward, Analysis of Training Dynamics