Efficient RL Training for LLMs with Experience Replay
This lightning talk challenges the prevailing belief that reinforcement learning fine-tuning of large language models must discard training samples immediately after use. The presentation demonstrates how experience replay—reusing generated trajectories from a buffer—can achieve up to 40% compute savings without sacrificing accuracy, and in many cases improves training stability and generalization. Grounded in rigorous theoretical analysis and empirical validation on mathematical reasoning tasks, this work reframes sample efficiency in LLM alignment as a tractable optimization problem with significant practical implications for cost-sensitive deployment.

Script
The standard approach to reinforcement learning fine-tuning of language models throws away training data immediately after using it once. This paper demonstrates that this assumption costs you 40% more compute than necessary.
The prevailing wisdom says that reusing old samples introduces fatal off-policy drift. When your model updates, previously generated trajectories become stale, potentially biasing gradient estimates and destabilizing training. So the entire research community has accepted the computational cost of perpetual regeneration.
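In practice, this drift is exactly what importance weighting is meant to control: each stale sample's gradient is reweighted by the probability ratio between the current policy and the policy that generated it, with clipping to bound the correction. A minimal PPO-style sketch (the `eps` threshold and function names are illustrative choices, not details from the paper):

```python
import math

def clipped_ratio(logp_new: float, logp_old: float, eps: float = 0.2) -> float:
    """Importance-sampling ratio between the current policy and the
    (possibly stale) behavior policy, clipped to [1 - eps, 1 + eps]
    so that off-policy drift cannot blow up the gradient estimate."""
    ratio = math.exp(logp_new - logp_old)
    return max(1.0 - eps, min(1.0 + eps, ratio))

# When the policies agree, the weight is 1; large disagreements are clamped.
print(clipped_ratio(-1.0, -1.0))  # policies agree: weight 1.0
print(clipped_ratio(-0.1, -2.0))  # model drifted up: clamped at 1 + eps
print(clipped_ratio(-3.0, -0.5))  # model drifted down: clamped at 1 - eps
```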
But what if the bias-variance mathematics tell a different story?
The authors formalize this as a three-way optimization problem. Their analysis reveals that as the cost of generating rollouts increases relative to backpropagation, the optimal buffer size and sample reuse both grow. Off-policy noise, when controlled, doesn't destroy performance—it regularizes it.
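The shape of that result can be seen with a toy cost model (invented here for illustration, not the paper's formulation): charge `c_gen` for each fresh rollout, `c_bp` for each gradient step, and discount each reuse of a sample by a hypothetical staleness decay. Minimizing compute per unit of learning value then pushes the optimal reuse count up as generation gets relatively more expensive.

```python
def effective_value(r: int, staleness: float = 0.1) -> float:
    """Total learning value of one rollout reused r times, with each
    reuse worth a bit less as the policy drifts (toy decay model)."""
    return sum((1.0 - staleness) ** k for k in range(r))

def compute_per_value(r: int, c_gen: float, c_bp: float) -> float:
    """Compute spent per unit of learning value at reuse count r."""
    return (c_gen + r * c_bp) / effective_value(r)

def best_reuse(c_gen: float, c_bp: float, max_r: int = 50) -> int:
    """Reuse count minimizing compute per unit of value."""
    return min(range(1, max_r + 1),
               key=lambda r: compute_per_value(r, c_gen, c_bp))

# As rollout generation gets pricier relative to backprop,
# the optimal reuse count grows monotonically:
for ratio in (1.0, 4.0, 16.0):
    print(ratio, best_reuse(c_gen=ratio, c_bp=1.0))
```

The monotone trend, not the specific numbers, is the point: the qualitative prediction that higher generation cost justifies larger buffers and more reuse.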
When tested on challenging mathematical reasoning benchmarks, carefully tuned replay buffers delivered exactly what the theory predicted. Training reached the same final accuracy using 40% less compute, with the added benefit of dramatically more stable optimization dynamics and improved pass@k performance for k greater than 1.
The practical recipe involves three components. First, tune buffer size to the Pareto frontier where marginal staleness cost equals marginal compute savings. Second, preferentially retain trajectories that solved the task correctly. Third, architect your pipeline so inference workers and trainers operate asynchronously, each saturating their respective hardware.
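The first two components above can be sketched as a bounded buffer that preferentially retains solved trajectories (the class names, capacities, and eviction rule here are illustrative assumptions, not the paper's implementation):

```python
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    solved: bool  # whether this trajectory solved the task

class ReplayBuffer:
    def __init__(self, capacity: int = 512):
        # capacity is the knob to tune toward the staleness/compute
        # Pareto frontier; solved trajectories get the full budget,
        # unsolved ones a smaller deque so they are evicted sooner.
        self.solved = deque(maxlen=capacity)
        self.unsolved = deque(maxlen=max(1, capacity // 4))

    def add(self, traj: Trajectory) -> None:
        (self.solved if traj.solved else self.unsolved).append(traj)

    def sample(self, batch_size: int) -> list:
        pool = list(self.solved) + list(self.unsolved)
        return random.sample(pool, min(batch_size, len(pool)))
```

The third component is architectural rather than algorithmic: inference workers push into a buffer like this while a separate trainer process samples from it, so each side saturates its own hardware instead of blocking on the other.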
This work dismantles the on-policy dogma in language model reinforcement learning, proving that the samples you discard today could have trained a better model tomorrow. Visit EmergentMind.com to explore this paper further and create your own research video.