Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. However, it is constrained by a fundamental asymmetry in computation and memory requirements: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling). PODS generates numerous rollouts in parallel, then trains on only an informative subset, preserving the learning signal while substantially reducing update cost. We instantiate PODS with max-variance down-sampling, a principled criterion that maximizes reward diversity, and show that it admits an $O(n \log n)$ solution. Empirically, coupling PODS with Group Relative Policy Optimization (GRPO) achieves superior performance over standard GRPO across reasoning benchmarks and hardware configurations.
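The abstract does not spell out the selection procedure, but the stated $O(n \log n)$ bound suggests a sort-then-scan scheme. Below is a minimal Python sketch, assuming the structural fact that a variance-maximizing subset of size $m$ can always be formed from the $k$ highest- and $(m-k)$ lowest-reward rollouts for some $k$; the function name `max_variance_downsample` is hypothetical, not from the paper.

```python
import numpy as np

def max_variance_downsample(rewards: np.ndarray, m: int) -> np.ndarray:
    """Pick m of n rollout indices whose rewards have maximal empirical variance.

    Hypothetical sketch of a max-variance down-sampling criterion. It assumes
    an optimal subset is a union of the k largest and (m - k) smallest
    rewards; sorting once (O(n log n)) and scanning k with prefix sums
    (O(m) per scan overall) matches the stated O(n log n) bound.
    """
    n = len(rewards)
    assert 0 < m <= n, "subset size must be between 1 and n"

    order = np.argsort(rewards)                   # ascending sort: O(n log n)
    sorted_r = rewards[order]
    pre = np.concatenate([[0.0], np.cumsum(sorted_r)])        # prefix sums
    pre2 = np.concatenate([[0.0], np.cumsum(sorted_r ** 2)])  # prefix sums of squares

    best_var, best_k = -1.0, 0
    for k in range(m + 1):                        # k from the top, m - k from the bottom
        lo = m - k                                # count of lowest-reward rollouts kept
        s = pre[lo] + (pre[n] - pre[n - k])       # reward sum over the chosen subset
        s2 = pre2[lo] + (pre2[n] - pre2[n - k])   # sum of squared rewards
        var = s2 / m - (s / m) ** 2               # population variance of the subset
        if var > best_var:
            best_var, best_k = var, k

    lo = m - best_k
    return np.concatenate([order[:lo], order[n - best_k:]])

# Usage: keep the 4 most reward-diverse rollouts out of 6.
rewards = np.array([0.1, 0.9, 0.4, 0.95, 0.0, 0.5])
keep = max_variance_downsample(rewards, m=4)      # favors extremes over middling rewards
```

The intuition fits GRPO's group-relative baseline: rollouts with middling rewards yield near-zero advantages and weak gradients, so retaining the extremes preserves most of the learning signal at a fraction of the update cost.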