- The paper shows that replacing fixed policy anchors with an EMA anchor significantly stabilizes reinforcement learning in LLMs.
- It introduces a Top-k KL estimator that offers unbiased, memory-efficient token-level divergence estimation for high-dimensional autoregressive models.
- Empirical results confirm notable improvements in math reasoning and agentic QA benchmarks without requiring architectural changes.
EMA Policy Gradient for LLM RL: Stabilization via EMA Anchor and Top-k KL
Overview
This paper addresses two major algorithmic challenges in scaling reinforcement learning (RL) for LLMs: instability induced by fixed policy anchors during RL updates, and suboptimal KL divergence regularization due to memory or estimator issues in high-dimensional autoregressive spaces. The authors introduce EMA Policy Gradient (EMA-PG), a generic modification applicable to policy gradient RL methods for LLMs. EMA-PG consists of (1) replacing the fixed anchor policy in KL regularization with an exponential moving average (EMA) anchor, and (2) using a Top-k KL estimator that interpolates between exact and sampled KL, enabling unbiased, memory-efficient token-level KL regularization. Together, these design choices yield significant improvements on both reasoning and agentic LLM RL benchmarks, require no architectural changes, and are not restricted to specific RL algorithms.
EMA Anchor Policy for KL Regularization
The standard approach in RL for LLMs uses KL divergence against a fixed anchor (usually the original or reference model) to encourage the policy to remain close to its initial behavior, improving stability and preventing collapse. The paper proposes updating this anchor using an EMA of model weights, akin to the target network concept from Q-learning. The anchor πema is recursively defined as:
θ^ema_{t+1} = η · θ^ema_t + (1 − η) · θ_t

where η ∈ (0, 1) is the EMA coefficient. This frames regularization against a dynamically changing anchor, capturing policy improvements during RL without sharp policy drifts.
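As a minimal sketch of the update rule above, the EMA anchor step can be written over a dict of named parameter arrays (in practice this would run over the model's weight tensors; the dict representation is illustrative):

```python
import numpy as np

def ema_anchor_update(anchor, policy, eta=0.95):
    """One EMA step: theta_ema <- eta * theta_ema + (1 - eta) * theta.
    `anchor` and `policy` map parameter names to numpy arrays."""
    return {name: eta * anchor[name] + (1.0 - eta) * policy[name]
            for name in anchor}
```

With η close to 1 the anchor trails the policy slowly, which is exactly what damps sharp policy drift while still tracking genuine improvements.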
A thorough dynamical analysis is presented, leveraging local-quadratic approximations of KL and Fisher information eigenmodes, to derive stability and oscillation regimes for the EMA anchor update. The closed-form stability condition is:
α · β · λ_max < 1 + η

where α is the learning rate, β the KL coefficient, and λ_max the maximum eigenvalue of the Fisher matrix. The analysis clarifies when training with an EMA anchor is stable, oscillatory, or divergent.
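The closed-form condition translates directly into a one-line hyperparameter sanity check (a sketch; λ_max would in practice be estimated, e.g. from curvature probes):

```python
def ema_anchor_stable(alpha, beta, lam_max, eta):
    """True when the paper's closed-form condition
    alpha * beta * lam_max < 1 + eta holds, i.e. the
    EMA-anchor dynamics are in the stable regime."""
    return alpha * beta * lam_max < 1.0 + eta
```

Note that raising η toward 1 loosens the bound, so a slower-moving anchor tolerates a larger learning rate or KL coefficient.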
Top-k KL: Unbiased, Memory-Efficient Token-Level KL Estimation
In high-dimensional autoregressive LLMs, computing exact token-level KL divergence is computationally and memory intensive. The prevalent practice approximates token-wise KL with sampled estimators (K1, K2, K3), which introduces bias into either the value or the gradient. The authors analyze the limitations of these popular sampled KL estimators in detail: K1, K3, and K4 yield biased gradients, while K2 yields biased KL values.
To address this, a Top-k KL estimator is proposed:
- Compute exact KL on the logit indices corresponding to the k most probable tokens ("head").
- Use a masked sampled KL estimator for the "tail".
This Top-k approach is memory-efficient (O(k) per token), unbiased for both KL value and gradient regardless of k, and seamlessly interpolates between fully sampled (small k) and exact (k = V, the full vocabulary) KL. The method generalizes to f-divergence objectives, not just KL.
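The head/tail split described above can be sketched for a single token position as follows (a small-vocabulary illustration that materializes full distributions for clarity; the paper's implementation only needs the k head logits per token):

```python
import numpy as np

def topk_kl(p, q, k, rng, n_tail=1):
    """Top-k KL(p || q) for one token position: exact KL over the k
    most probable tokens under p (the "head"), plus an unbiased
    sampled estimate of the remaining mass (the "tail")."""
    head = np.argsort(p)[-k:]                              # k largest-p indices
    head_kl = np.sum(p[head] * np.log(p[head] / q[head]))  # exact head term
    tail_mass = 1.0 - p[head].sum()
    if tail_mass <= 1e-12:                                 # k covers the vocab
        return head_kl
    mask = np.ones_like(p, dtype=bool)
    mask[head] = False
    p_tail = np.where(mask, p, 0.0) / tail_mass            # renormalized tail
    idx = rng.choice(len(p), size=n_tail, p=p_tail)
    # unbiased tail estimate: tail_mass * E_{v ~ p_tail}[log p(v)/q(v)]
    return head_kl + tail_mass * np.mean(np.log(p[idx] / q[idx]))
```

With k = V the tail vanishes and the estimator reduces to exact KL; with small k it behaves like a sampled estimator whose variance is reduced by computing the dominant head terms exactly.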
Empirical results on synthetic and real data confirm that Top-k KL not only reduces gradient variance but, past a modest critical sample size, outperforms truncated (biased) KL estimators. The practical prescription is k ∈ [16, 32] with tail correction, especially for RL at scale.
Experimental Results
The proposed EMA-PG, applied to standard RL algorithms (e.g., GRPO), is evaluated on both math reasoning (OlympiadBench, MATH500, Minerva, AIME) and agentic QA tasks (HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle) with state-of-the-art LLMs (e.g., Qwen-1.5B, Qwen-3B).
- Math Reasoning: EMA-PG delivers a notable Pass@1 increase on OlympiadBench: 50.8% (GRPO) → 53.9% (EMA-PG). Improvements extend to Pass@N metrics, indicating enhanced exploration and stability against mode collapse.
- Agentic RL for Search: On HotpotQA, 2WikiMultiHopQA, and Bamboogle, EMA-PG outperforms GRPO by 33.3% on average (e.g., 29.7% → 44.1% on HotpotQA). These gains are robust across EMA coefficients (η ∈ [0.9, 0.95]) and apply to both reverse and forward KL regularization.
Ablation studies verify:
- EMA anchor yields consistent improvements regardless of the base algorithm.
- Token-level KL regularization, not sequence-level, is essential; sequence-level KL underperforms in all reasoning settings.
- Top-k KL provides superior sample efficiency and asymptotic RL performance; correct tail and off-policy corrections are critical for unbiased learning dynamics.
Theoretical and Practical Implications
The findings have several implications:
- Theoretical: The analysis provides sharp necessary and sufficient stability/oscillation conditions for RL with EMA anchors, which aids in principled hyperparameter selection. The unbiased Top-k KL construction generalizes to arbitrary f-divergences, supporting future work in regularizing LLMs beyond KL (e.g., Jensen-Shannon, total variation, χ², α-divergences).
- Practical: EMA-PG can be applied to any policy-gradient framework for LLM fine-tuning or alignment with minimal code changes. Memory bottlenecks associated with large vocabularies are mitigated. By enabling faithful token-level KL regularization, LLMs trained with EMA-PG demonstrate superior reasoning and agentic behaviors on verifiable benchmarks.
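To make the "minimal code changes" point concrete, here is a hedged sketch of how the anchor-KL penalty slots into an existing policy-gradient loss (the function name, signature, and β default are illustrative, not the paper's API):

```python
import numpy as np

def pg_loss_with_anchor_kl(logp, advantages, kl_per_token, beta=0.05):
    """REINFORCE-style surrogate plus a token-level KL penalty toward
    the EMA anchor. `logp`: per-token log-probs of sampled tokens under
    the current policy; `kl_per_token`: per-token KL estimates, e.g.
    from a Top-k estimator against the EMA anchor."""
    pg_term = -(logp * advantages).mean()   # maximize advantage-weighted log-prob
    return pg_term + beta * kl_per_token.mean()
```

The only deltas relative to a fixed-anchor pipeline are the source of `kl_per_token` (Top-k against the EMA anchor rather than a sampled estimator against the frozen reference) and an EMA weight update after each optimizer step.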
Outlook
Given its generality and empirical efficacy, EMA-PG may become standard in LLM RL pipelines. The approach is orthogonal to on-policy/off-policy data generation and to the specific advantage estimation method. Future directions include further exploration of Top-k f-divergence estimators, extensions to offline/on-policy LLM distillation, and integration of unbiased sampled divergence estimators with alternative policy gradient operators.
Conclusion
The paper thoroughly demonstrates that replacing fixed policy anchors with an EMA anchor, and regularizing with Top-k KL estimators, resolves key stability and estimator bias issues in RL for LLMs. EMA-PG delivers consistent, statistically significant performance improvements in both math reasoning and agentic RL settings, without increasing computational overhead. These insights have both theoretical and practical traction for current and future LLM RL research and deployment (2602.04417).