Persistence of TPO Gains at Larger Scale
Determine whether Target Policy Optimization (TPO) retains its relative performance gains when applied to larger language models (at least 7B parameters) and evaluated on harder reasoning benchmarks such as MATH and AIME.
References
Testing on larger models (7B+) and harder benchmarks (MATH, AIME) remains future work; the main open question is whether TPO's relative gains persist at larger scale.
— Target Policy Optimization
(2604.06159 - Kaddour, 7 Apr 2026), Section 6 (Limitations), "Scale of evaluation" paragraph