Selecting a Teacher Checkpoint for Sequential Distillation with Privileged Information

Determine a principled method for selecting which checkpoint of a privileged-information-conditioned teacher policy should be distilled into an unconditioned student policy within a sequential distillation pipeline for multi-turn language-model agents.

Background

The paper studies how to transfer capabilities learned with training-time privileged information (PI) to policies that must act without PI at inference. A naive sequential approach is to first train a PI-conditioned teacher policy and then distill its behavior into an unconditioned student. However, the authors note practical issues with this approach: uncertainty about which teacher checkpoint to distill, instability from off-policy learning on the teacher's trajectories, and the computational inefficiency of training two policies separately.

To address these issues, the paper proposes π-Distill, a joint objective that trains a PI-conditioned teacher and an unconditioned student with shared parameters, sidestepping sequential distillation altogether. The specific question of how to select an appropriate teacher checkpoint for a sequential distillation pipeline is therefore left unresolved and is explicitly marked as unclear in the text.
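To make the shared-parameter idea concrete, the following is a minimal toy sketch of a joint objective of this shape: a single weight matrix serves both a teacher (which sees observation plus PI features) and a student (whose PI slots are zeroed), with a teacher cross-entropy term plus a distillation KL pulling the student toward the teacher. The function name `joint_pi_distill_loss`, the zero-masking of PI, and the `alpha` mixing weight are illustrative assumptions, not the paper's actual formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def policy_logits(weights, features):
    # Shared linear head: one weight row per action. Both policies
    # reuse the same `weights`; they differ only in their input.
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def joint_pi_distill_loss(weights, obs, pi, action, alpha=0.5):
    """Hypothetical joint objective (illustrative, not the paper's exact
    formula): teacher cross-entropy on the taken action, plus a KL term
    distilling the teacher (with PI) into the student (PI zeroed out)."""
    teacher_in = obs + pi                 # teacher conditions on PI
    student_in = obs + [0.0] * len(pi)    # student acts without PI
    p_teacher = softmax(policy_logits(weights, teacher_in))
    p_student = softmax(policy_logits(weights, student_in))
    ce = -math.log(p_teacher[action])
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    return alpha * ce + (1 - alpha) * kl
```

Because the parameters are shared, every update improves teacher and student together, so there is no separate teacher training run and hence no teacher checkpoint to choose; in a sequential pipeline, by contrast, the checkpoint-selection question arises and is exactly what this card asks about.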

References

A naive solution is to first train a PI-conditioned policy and then distill its behavior into an unconditioned one. In practice, this sequential pipeline introduces several issues. It is unclear which checkpoint of the conditioned policy should be distilled, learning from its trajectories is off-policy and can be unstable, and training the two policies separately is computationally inefficient.

Privileged Information Distillation for Language Models  (2602.04942 - Penaloza et al., 4 Feb 2026) in Section 3, Subsection 'Privileged Information distillation (π-Distill)', Motivation