Avoiding convergence and diversity collapse in RL with execution rewards

Develop reinforcement learning algorithms, beyond standard GRPO, for finetuning large language models to generate research ideas from execution rewards. These algorithms should avoid convergence on a small set of easy-to-implement ideas and prevent the associated collapse in thinking length and idea diversity, so as to raise the upper bound of idea quality in open-ended AI research environments.

Background

The paper applies GRPO to finetune Qwen3-30B using execution rewards in two open-ended research environments (LLM post-training on MATH with GRPO and LLM pre-training with nanoGPT). While reinforcement learning increased the average reward of sampled ideas, it failed to improve the maximum reward per epoch, which is more critical for scientific discovery.
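The core GRPO mechanic referenced here is group-relative advantage estimation: each sampled idea's execution reward is normalized against the statistics of its group. A minimal sketch, assuming standardization by group mean and standard deviation (the function name and example rewards are illustrative, not taken from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: standardize each
    sampled idea's execution reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of 4 ideas sampled for one prompt, scored by
# executing each idea; only the one idea that beats the group mean
# receives a positive advantage.
group_rewards = [0.10, 0.10, 0.10, 0.40]
adv = grpo_advantages(group_rewards)
```

Because the update pushes probability mass toward whatever reliably beats the group mean, simple ideas that execute successfully every time can dominate, which is one intuition for the convergence described below.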

A detailed analysis reveals that RL training causes the model to converge on a few simple, easy-to-implement ideas, alongside a marked decrease in thinking-trace length and a collapse in idea diversity. Longer thinking traces correlate with lower execution success rates, incentivizing shorter outputs. Attempts to mitigate this via dynamic prompts, length rewards, and diversity penalties did not yield clear gains, motivating the need for new algorithms that avoid this convergence and collapse.
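The failure mode described above (mean reward rising while the per-epoch maximum stagnates and diversity collapses) can be made concrete with a few per-epoch statistics. A minimal sketch; `epoch_metrics` and its distinct-string diversity measure are illustrative assumptions, not the paper's instrumentation:

```python
def epoch_metrics(ideas, rewards):
    """Per-epoch statistics contrasted in the analysis: mean reward
    (which RL improved), max reward (which it did not), and a simple
    diversity measure (fraction of distinct sampled ideas)."""
    return {
        "mean": sum(rewards) / len(rewards),
        "max": max(rewards),
        "diversity": len(set(ideas)) / len(ideas),
    }

# Hypothetical early vs. late epochs exhibiting the reported pattern:
early = epoch_metrics(["idea-a", "idea-b", "idea-c", "idea-d"],
                      [0.10, 0.20, 0.30, 0.60])
late = epoch_metrics(["idea-e"] * 4,
                     [0.35, 0.35, 0.35, 0.35])
# Mean reward rises, but the maximum falls and diversity collapses
# as sampling concentrates on one easy-to-implement idea.
```

Monitoring the maximum and diversity alongside the mean is what exposes the collapse; optimizing the mean alone hides it.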

References

Avoiding such convergence and collapse is an open problem and likely requires new algorithmic interventions beyond standard GRPO, which is beyond the scope of this work.

Towards Execution-Grounded Automated AI Research (2601.14525 - Si et al., 20 Jan 2026) in Section 5: Reinforcement Learning from Execution Reward, Analysis of Training Dynamics