Stabilizing entropy-based regularization in RLVR training

Develop an entropy-regularization or adaptive entropy-control strategy for reinforcement learning with verifiable rewards (RLVR), applied to training Logics-STEM, that maintains a stable entropy loss throughout optimization and yields consistent accuracy improvements across benchmarks.

Background

The authors experimented with adding entropy-based terms, including adaptive entropy control following prior work, to reinforcement learning with verifiable rewards (RLVR) during post-training. Entropy bonuses are commonly used in RL to encourage exploration and stabilize policies, but their integration into LLM RL training can be challenging.
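To make the setup concrete, the following is a minimal sketch of how an entropy bonus is typically folded into a token-level policy-gradient loss for LLM RL. The tensor shapes, masking scheme, and the coefficient name beta are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def entropy_regularized_pg_loss(logits, actions, advantages, mask, beta=1e-3):
    # logits:     [batch, seq, vocab] policy logits over the vocabulary
    # actions:    [batch, seq] sampled token ids (long)
    # advantages: [batch, seq] per-token advantages from verifiable rewards
    # mask:       [batch, seq] 1.0 for generated tokens, 0.0 for prompt/padding
    # beta:       entropy coefficient (fixed here; adaptive variants tune it)
    log_probs = F.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # REINFORCE-style objective over the generated tokens only.
    pg_loss = -(advantages * action_log_probs * mask).sum() / mask.sum()

    # Mean token-level entropy of the policy distribution.
    token_entropy = -(log_probs.exp() * log_probs).sum(-1)
    mean_entropy = (token_entropy * mask).sum() / mask.sum()

    # Subtracting beta * entropy rewards higher entropy, i.e. more exploration.
    loss = pg_loss - beta * mean_entropy
    return loss, mean_entropy.detach()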

In this work, attempts to incorporate entropy losses resulted in unstable entropy dynamics and a lack of consistent accuracy gains. The instability, characterized as entropy explosion under certain settings, led the authors to exclude entropy terms from their final training recipe. This leaves unresolved how to effectively leverage entropy regularization within RLVR to improve Logics-STEM without destabilizing training.
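As one illustration of what adaptive entropy control can look like, and emphatically not the authors' recipe nor a validated fix for the instability they report, the sketch below nudges the entropy coefficient toward a target mean token entropy and clamps it so the bonus cannot grow without bound. The target value, step size, and bounds are placeholder assumptions.

class AdaptiveEntropyCoef:
    # Simple proportional rule: raise the coefficient when entropy falls below
    # a target, lower it when entropy exceeds the target, with hard clamping.
    def __init__(self, target_entropy=0.3, lr=1e-2,
                 min_coef=0.0, max_coef=1e-2, init_coef=1e-3):
        self.target_entropy = target_entropy  # desired mean token entropy (nats)
        self.lr = lr                          # step size for coefficient updates
        self.min_coef = min_coef
        self.max_coef = max_coef
        self.coef = init_coef

    def update(self, measured_entropy: float) -> float:
        # Entropy below target -> increase the bonus; above target -> decrease it.
        error = self.target_entropy - measured_entropy
        self.coef = min(max(self.coef + self.lr * error, self.min_coef), self.max_coef)
        return self.coef

In use, one would measure the mean token entropy of each batch, call update(), and pass the returned value as beta to the loss above. Whether any such rule avoids the entropy explosion observed in RLVR training while also delivering consistent accuracy gains is precisely the open question posed here.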

References

We experimented with various entropy-based strategies but failed to achieve a stable entropy loss or consistent improvements in accuracy while training, as shown in \cref{fig: entr_contrast}.

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement  (2601.01562 - Xu et al., 4 Jan 2026) in Appendix: Supplementaries for Ablation Studies in RLVR, Subsubsection "Adaptive Entropy Control"