Scaling ETR to Large LLMs

Determine whether Elastic Trust Regions (ETR) retain their computational efficiency, convergence stability, and task performance when applied to large-scale language models (e.g., models with 10B or more parameters).

Background

The paper introduces Elastic Trust Regions (ETR), a dynamic clipping mechanism for Group Relative Policy Optimization (GRPO) in Reinforcement Learning with Verifiable Rewards (RLVR). ETR adapts the trust region based on token-level advantage magnitude and group-level pass-rate variance, aiming to better utilize high-quality signals and suppress noise. The experimental validation is conducted on medium-scale models such as Qwen3-8B-Base, Llama-3.1-8B-Instruct, and Qwen2.5-7B-Math-Base.
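To make the mechanism concrete, the following is a minimal sketch of how an elastic clipping width might interact with a GRPO-style clipped surrogate. The paper's exact ETR formula is not reproduced here; the scaling functions `elastic_epsilon` and its `alpha`/`beta` coefficients are illustrative assumptions, chosen only to show the qualitative behavior (wider trust region for high-magnitude advantages, narrower when group pass-rate variance signals noise).

```python
import numpy as np

def elastic_epsilon(advantages, pass_rates, base_eps=0.2, alpha=0.5, beta=0.5):
    """Hypothetical elastic trust-region width (NOT the paper's formula).

    Widens epsilon for tokens with above-average |advantage| and shrinks it
    when the variance of group-level pass rates (a noise proxy) is high.
    """
    adv_scale = np.abs(advantages) / (np.abs(advantages).mean() + 1e-8)
    noise_penalty = 1.0 / (1.0 + beta * np.var(pass_rates))
    return base_eps * (1.0 + alpha * (adv_scale - 1.0)) * noise_penalty

def grpo_clipped_objective(ratios, advantages, eps):
    """Standard PPO/GRPO clipped surrogate, here with a per-token epsilon."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: take the minimum of the two surrogates per token.
    return np.minimum(unclipped, clipped).mean()
```

Under this sketch, a token whose advantage magnitude is twice the batch mean receives a proportionally wider clipping range, while a noisy group (high pass-rate variance) uniformly tightens all ranges; the actual ETR design may combine these signals differently.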

The authors explicitly note that their evaluation does not include larger-scale LLMs (e.g., 10B+ parameters). Consequently, it remains unresolved whether ETR maintains computational efficiency, convergence stability, and performance when scaled to substantially larger model capacities, which is critical for assessing the method’s practicality in state-of-the-art systems.

References

"The computational efficiency, convergence stability, and performance of ETR's adaptive thresholding when scaled to such sizes remain unexamined."

ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization (2601.03723, Zhang et al., 7 Jan 2026), Section: Limitations