Stabilizing entropy-based regularization in RLVR training
Develop an entropy-regularization or adaptive entropy-control strategy for reinforcement learning with verifiable rewards (RLVR) applied to training Logics-STEM that maintains a stable entropy loss throughout optimization and yields consistent improvements in accuracy across benchmarks.
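One candidate family of strategies is SAC-style automatic entropy-coefficient tuning, where the entropy-bonus weight is adapted by dual gradient ascent so that the measured policy entropy tracks a fixed target. The sketch below is a minimal, hypothetical illustration of that idea; the class name, target-entropy parameterization, and update rule are assumptions for illustration, not the method used in Logics-STEM.

```python
import math


class AdaptiveEntropyController:
    """Adapt an entropy-bonus coefficient so measured policy entropy
    tracks a target value (SAC-style dual update). Hypothetical sketch,
    not the authors' method."""

    def __init__(self, target_entropy: float, lr: float = 1e-2,
                 init_log_coef: float = 0.0):
        self.target_entropy = target_entropy
        self.lr = lr
        # Parameterize the coefficient in log-space so it stays positive.
        self.log_coef = init_log_coef

    @property
    def coef(self) -> float:
        return math.exp(self.log_coef)

    def update(self, measured_entropy: float) -> float:
        # Raise the coefficient when entropy falls below target
        # (pushing the policy back toward exploration), lower it otherwise.
        self.log_coef += self.lr * (self.target_entropy - measured_entropy)
        return self.coef
```

In an RLVR loop, `update` would be called once per batch with the mean token-level entropy of the policy, and the returned coefficient would scale the entropy term added to the policy loss. Whether such a controller yields stable entropy and consistent accuracy gains here is exactly the open question posed above.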
References
We experimented with various entropy-based strategies but failed to achieve a stable entropy loss or consistent improvements in accuracy during training, as shown in \cref{fig: entr_contrast}.
— Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement
(2601.01562 - Xu et al., 4 Jan 2026) in Appendix: Supplementaries for Ablation Studies in RLVR, Subsubsection "Adaptive Entropy Control"