Selecting a Teacher Checkpoint for Sequential Distillation with Privileged Information
Determine a principled method for selecting which checkpoint of a privileged-information (PI)-conditioned teacher policy to distill into an unconditioned student policy within a sequential distillation pipeline for multi-turn language-model agents.
References
A naive solution is to first train a PI-conditioned policy and then distill its behavior into an unconditioned one. In practice, this sequential pipeline introduces several issues. It is unclear which checkpoint of the conditioned policy should be distilled, learning from its trajectories is off-policy and can be unstable, and training the two policies separately is computationally inefficient.
— Privileged Information Distillation for Language Models
(2602.04942 - Penaloza et al., 4 Feb 2026) in Section 3, Subsection 'Privileged Information distillation (π-Distill)', Motivation
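The checkpoint-selection problem in the naive sequential pipeline can be made concrete with a small sketch. This is an illustration only, not the paper's method: every name here (`Checkpoint`, `select_checkpoint`, the proxy scores) is hypothetical. It encodes one candidate criterion: pick the teacher checkpoint whose *distilled student* performs best on held-out tasks, since the teacher's own PI-conditioned peak need not coincide with the best teacher for an unconditioned student.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    teacher_score: float  # PI-conditioned validation score of the teacher itself

def select_checkpoint(checkpoints, distill_and_eval):
    """Select the teacher checkpoint by student outcome, not teacher score.

    `distill_and_eval` stands in for a (costly) short distillation run from
    the given checkpoint followed by evaluation of the unconditioned student.
    """
    best, best_score = None, float("-inf")
    for ckpt in checkpoints:
        student_score = distill_and_eval(ckpt)
        if student_score > best_score:
            best, best_score = ckpt, student_score
    return best

# Toy illustration: the teacher's own score rises monotonically, but the
# distilled student peaks earlier -- plausible when a late teacher leans
# heavily on privileged information the student never observes.
ckpts = [Checkpoint(s, t) for s, t in [(100, 0.4), (200, 0.6), (300, 0.8)]]
proxy_student_scores = {100: 0.35, 200: 0.55, 300: 0.45}  # made-up numbers
chosen = select_checkpoint(ckpts, lambda c: proxy_student_scores[c.step])
print(chosen.step)  # 200, even though the step-300 teacher scores highest
```

The sketch also makes the pipeline's inefficiency visible: this criterion requires a distillation-plus-evaluation pass per candidate checkpoint, which is exactly the computational cost the quoted motivation objects to.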