Intermediate behavior under reward interpolation in G-OPD
Establish whether, for Generalized On-Policy Distillation (G-OPD) with a reward scaling factor λ satisfying 0 < λ < 1, the student model trained under G-OPD exhibits behavior (in particular, performance metrics and response length) that lies between that of the chosen reference model and that of standard On-Policy Distillation, which corresponds to λ = 1.
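A minimal sketch of the interpolation in question, assuming (this is an illustration, not the paper's exact formulation) that the standard OPD per-token reward is the teacher–student log-likelihood ratio on sampled tokens and that G-OPD simply scales this reward by λ; the function name `gopd_token_reward` is hypothetical:

```python
import numpy as np

def gopd_token_reward(logp_teacher, logp_student, lam):
    """Hypothetical lambda-scaled per-token distillation reward.

    lam = 1.0 recovers the assumed standard OPD reward (full pull toward
    the teacher); lam = 0.0 zeroes the distillation signal, leaving the
    reference-model behavior; 0 < lam < 1 interpolates between the two.
    """
    return lam * (np.asarray(logp_teacher) - np.asarray(logp_student))

# Toy log-probabilities of three sampled tokens under teacher and student.
logp_t = np.array([-0.5, -1.2, -0.3])
logp_s = np.array([-1.0, -0.8, -0.9])

for lam in (0.0, 0.5, 1.0):
    print(lam, gopd_token_reward(logp_t, logp_s, lam))
```

The conjecture amounts to asking whether this linear scaling of the reward translates into correspondingly intermediate trained behavior, which need not follow, since training dynamics can respond nonlinearly to reward magnitude.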
References
We conjecture that, under this setting, the student trained with G-OPD may exhibit behavior (e.g., performance, response length, etc.) that lies between the reference model and the standard OPD with $\lambda=1$.
— Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
(2602.12125 - Yang et al., 12 Feb 2026) in Section 3.2, paragraph “Reward interpolation and extrapolation in G-OPD”