Theoretical validation of the demonstration-conditioned teacher as near-optimal and minimally deviating
Establish rigorous theoretical guarantees under which conditioning a foundation model on an expert demonstration yields a teacher policy with two properties: (i) its expected reward is comparable to that of the unknown optimal policy for the trust-region objective, and (ii) among all reward-maximizing policies, it is the closest in Kullback–Leibler divergence to the current policy. Such guarantees would justify the in-context assumption used by Self-Distillation Fine-Tuning, namely that the demonstration-conditioned policy approximates the optimal next policy.
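To make the target of the guarantee concrete: in the standard KL-regularized (trust-region) setting, the objective $\mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{cur}})$ has a known closed-form maximizer, the exponentially tilted policy $\pi^\*(a) \propto \pi_{\text{cur}}(a)\, e^{r(a)/\beta}$. The sketch below illustrates this in a one-step (bandit) setting with hypothetical rewards and a hypothetical current policy; it is not the paper's construction, only the textbook objective the teacher policy is claimed to approximate.

```python
import numpy as np

def tilted_policy(pi_cur, r, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_cur)
    in a one-step setting: pi*(a) proportional to pi_cur(a) * exp(r(a)/beta)."""
    logits = np.log(pi_cur) + r / beta
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def objective(pi, pi_cur, r, beta):
    """Trust-region objective: expected reward minus beta-weighted KL penalty."""
    kl = np.sum(pi * np.log(pi / pi_cur))
    return pi @ r - beta * kl

# Toy bandit with 4 actions (hypothetical numbers).
pi_cur = np.array([0.4, 0.3, 0.2, 0.1])
r = np.array([1.0, 0.0, 2.0, -1.0])
beta = 0.5

pi_star = tilted_policy(pi_cur, r, beta)
best = objective(pi_star, pi_cur, r, beta)

# Sanity check: the closed form should dominate random policies on the objective.
rng = np.random.default_rng(0)
for _ in range(100):
    p = rng.dirichlet(np.ones(4))
    assert objective(p, pi_cur, r, beta) <= best + 1e-9
```

A smaller $\beta$ lets the tilted policy deviate further from $\pi_{\text{cur}}$ toward pure reward maximization, which is exactly the trade-off the near-optimality and minimal-deviation claims would need to quantify.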
References
While we cannot verify these conditions theoretically, we evaluate each empirically.
— Self-Distillation Enables Continual Learning
(2601.19897 - Shenfeld et al., 27 Jan 2026) in Subsection “Validating the ICL Assumption” (Empirical Validation paragraph)