Persistence of TPO Gains at Larger Scale

Determine whether Target Policy Optimization (TPO) maintains its relative performance gains when applied to larger language models of at least 7B parameters and evaluated on harder benchmarks such as MATH and AIME.

Background

The paper introduces Target Policy Optimization (TPO), a target-matching method for group-based reinforcement learning. TPO forms a target distribution by exponentially tilting the behavior policy with standardized (z-scored) group rewards, then fits the policy to this target by minimizing cross-entropy. Across tabular bandits, transformer sequence tasks, and billion-parameter LLM reinforcement learning with verifiable rewards (RLVR), TPO matches or outperforms standard policy-gradient baselines, especially under sparse rewards.
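The target-matching step above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes that, for each prompt, a group of G responses is sampled from the behavior policy, their rewards are standardized within the group, the empirical behavior distribution over those samples is exponentially tilted by the standardized scores, and the policy is fit by weighted cross-entropy. The temperature `beta` and the `eps` guard are illustrative choices.

```python
import numpy as np

def tpo_loss(logprobs, rewards, beta=1.0, eps=1e-8):
    """Cross-entropy of the policy against an exponentially tilted target.

    logprobs[i]: log pi_theta(y_i | x) for the i-th of G responses
                 sampled from the behavior policy for one prompt.
    rewards[i]:  scalar reward for that response.
    """
    logprobs = np.asarray(logprobs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Standardize rewards within the sampled group.
    z = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Exponentially tilt the empirical behavior distribution and
    # normalize, giving target weights over the sampled group.
    t = beta * z
    w = np.exp(t - t.max())
    w = w / w.sum()
    # Weighted cross-entropy: fit the policy to the target.
    return -(w * logprobs).sum()
```

With equal rewards the target reduces to the uniform empirical distribution, so the loss is the ordinary negative mean log-likelihood; as `beta` grows, the target concentrates on the highest-reward responses in the group.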

While the authors demonstrate promising results on 1.5–1.7B-parameter models across several tasks, they explicitly identify evaluation at larger model scales and on more challenging benchmarks as future work. Establishing whether the observed relative gains persist at 7B+ parameters and on harder datasets such as MATH and AIME is highlighted as the main open question.

References

Testing on larger models (7B+) and harder benchmarks (MATH, AIME) remains future work; the main open question is whether TPO's relative gains persist at larger scale.

Target Policy Optimization (2604.06159 - Kaddour, 7 Apr 2026) in Limitations (Section 6), Scale of evaluation paragraph