Intermediate behavior under reward interpolation in G-OPD

Establish whether a student model trained with Generalized On-Policy Distillation (G-OPD) under a reward scaling factor λ in the range 0 < λ < 1 exhibits behavior, specifically in performance metrics and response length, that lies between that of the chosen reference model and that of standard On-Policy Distillation, which corresponds to λ = 1.

Background

The paper introduces Generalized On-Policy Distillation (G-OPD), extending standard OPD by adding a flexible reference model and a reward scaling factor λ that adjusts the weight of the implicit reward relative to the KL regularization. This unifies OPD with dense KL-constrained RL while enabling new regimes of training behavior.
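One natural formalization, consistent with the description above but a reconstruction rather than the paper's exact notation (a unit KL coefficient is assumed), casts G-OPD as dense KL-constrained RL against the reference policy $\pi_{\text{ref}}$, with the implicit reward $\log \frac{\pi_{\text{teacher}}}{\pi_{\text{ref}}}$ scaled by $\lambda$:

$$\max_{\pi}\ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\lambda \log \frac{\pi_{\text{teacher}}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] - \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big),$$

whose maximizer is the geometric mixture

$$\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{1-\lambda}\,\pi_{\text{teacher}}(y \mid x)^{\lambda}.$$

Setting $\lambda = 1$ recovers the teacher (standard OPD's target), $\lambda = 0$ recovers the reference, and $0 < \lambda < 1$ interpolates between the two in log-probability space.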

Within this framework, the authors identify the case 0 < λ < 1 as “reward interpolation,” where the optimal policy interpolates the teacher and reference log-probabilities. They explicitly conjecture a corresponding empirical behavior: the performance and response length of the trained student will lie between those of the reference model and of the λ = 1 OPD solution.
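To make the training signal concrete, the sketch below computes the per-token advantage implied by the KL-regularized objective above. It is a minimal illustration under those assumptions, not the paper's implementation; the tensor arguments are hypothetical next-token log-probabilities evaluated on student-sampled responses.

```python
import torch

def g_opd_advantage(student_logprobs: torch.Tensor,
                    teacher_logprobs: torch.Tensor,
                    ref_logprobs: torch.Tensor,
                    lam: float) -> torch.Tensor:
    """Per-token G-OPD signal under the KL-regularized reading above.

    reward  = lam * (log pi_teacher - log pi_ref)  # scaled implicit reward
    kl_term = log pi_student - log pi_ref          # KL-to-reference penalty

    In expectation over student samples, reward - kl_term is maximized by
    pi* proportional to pi_ref^(1 - lam) * pi_teacher^lam, the interpolated
    policy.
    """
    reward = lam * (teacher_logprobs - ref_logprobs)
    kl_term = student_logprobs - ref_logprobs
    return reward - kl_term
```

With lam = 1.0 this reduces to the standard OPD signal (the teacher-student log-ratio); with lam = 0.0 it reduces to pure KL regularization toward the reference; intermediate values would, per the conjecture, yield intermediate performance and response length.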

References

We conjecture that, under this setting, the student trained with G-OPD may exhibit behavior (e.g., performance, response length, etc.) that lies between the reference model and the standard OPD with $\lambda=1$.

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation (2602.12125 - Yang et al., 12 Feb 2026), Section 3.2, paragraph “Reward interpolation and extrapolation in G-OPD”.