Scaling behavior of OPSD beyond 8B-parameter models

Determine whether the improvements from On-Policy Self-Distillation (OPSD) that grow with model size persist beyond 8 billion parameters, including models of approximately 70 billion parameters and larger frontier models, when OPSD's on-policy self-distillation setup is applied to reasoning tasks.

Background

The paper introduces On-Policy Self-Distillation (OPSD), where a single LLM serves as both teacher and student by conditioning on privileged solutions for the teacher and only the problem for the student. OPSD provides dense token-level guidance over the student’s own rollouts without requiring an external teacher.
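The mechanism above can be sketched in a few lines. This is a hedged toy illustration, not the paper's implementation: `toy_logits` is a hypothetical stand-in for a single LLM forward pass, and reverse KL is assumed as the token-level divergence; the paper's exact loss and training loop may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (illustrative assumption)

def toy_logits(context):
    # Stand-in for one forward pass of the single shared LLM:
    # deterministic pseudo-logits derived from the context tokens.
    seed = sum((i + 1) * t for i, t in enumerate(context)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def opsd_step(problem, solution, num_tokens=5):
    """One OPSD signal: sample a rollout from the student (conditioned on
    the problem only) and score each token against the teacher, which is
    the SAME model additionally conditioned on the privileged solution.
    Returns the rollout and per-token KL(student || teacher)."""
    rollout, kls = [], []
    for _ in range(num_tokens):
        student_ctx = problem + rollout             # student: problem only
        teacher_ctx = problem + solution + rollout  # teacher: + privileged solution
        p_student = softmax(toy_logits(student_ctx))
        p_teacher = softmax(toy_logits(teacher_ctx))
        tok = int(rng.choice(VOCAB, p=p_student))   # on-policy: student's own sample
        rollout.append(tok)
        # Dense token-level guidance at every position of the rollout
        kls.append(float(np.sum(p_student * np.log(p_student / p_teacher))))
    return rollout, kls
```

Because teacher and student share weights, no external teacher model is needed; the per-token KL terms provide the dense guidance over the student's own rollout.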

Empirically, OPSD shows stronger gains as model size increases within the tested range up to 8B parameters, supporting the hypothesis that sufficient capacity is required for effective self-rationalization. However, due to computational constraints, experiments were limited to models ≤8B, leaving the scalability of OPSD to substantially larger models (e.g., ~70B and frontier scales) unresolved.

Understanding whether OPSD continues to yield increasing benefits at higher scales is important for assessing its viability as a post-training method for advanced reasoning models, informing resource allocation and the design of training regimes for large-capacity LLMs.

References

While we observe that larger models benefit more from OPSD, consistent with our hypothesis that self-rationalization requires sufficient model capacity, it remains an open question whether this trend continues at scales beyond 8B parameters, such as 70B or larger frontier models.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (2601.18734, Zhao et al., 26 Jan 2026), Section 7: Limitations and Future Directions