Theoretical validation of the demonstration-conditioned teacher as near-optimal and minimally deviating

Establish rigorous theoretical guarantees under which conditioning a foundation model on an expert demonstration produces a teacher policy that (i) achieves expected reward comparable to the unknown optimal policy for the trust-region objective and (ii) is, among reward-maximizing policies, closest in Kullback–Leibler (KL) divergence to the current policy. Such guarantees would justify the in-context assumption used by Self-Distillation Fine-Tuning (SDFT): that the demonstration-conditioned policy approximates the optimal next policy.

Background

The core hypothesis behind SDFT is an in-context assumption: a demonstration-conditioned policy approximates the unknown optimal next policy under a trust-region-regularized reinforcement learning objective. The authors identify two requirements for this approximation: near-optimality in expected reward, and minimal KL deviation from the current policy among reward-maximizing policies.
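
One way to write these two requirements in standard trust-region notation is sketched below; the symbols (reward r, regularization weight β, current policy π_t, demonstration d) are generic placeholders rather than the paper's own notation. The trust-region objective selects the next policy as

\[
\pi_{t+1} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\big\|\, \pi_t(\cdot \mid x)\big),
\]

and the in-context assumption is that the demonstration-conditioned policy \(\pi_t(\cdot \mid x, d)\) satisfies

\[
\mathbb{E}_{y \sim \pi_t(\cdot \mid x, d)}\big[r(x, y)\big]
\;\approx\;
\max_{\pi}\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
\qquad\text{and}\qquad
\mathrm{KL}\big(\pi_t(\cdot \mid x, d) \,\big\|\, \pi_t(\cdot \mid x)\big)
\;\le\;
\mathrm{KL}\big(\pi'(\cdot \mid x) \,\big\|\, \pi_t(\cdot \mid x)\big)
\]

for every reward-maximizing policy \(\pi'\), so that the demonstration-conditioned policy can stand in for \(\pi_{t+1}\) as the distillation teacher.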

While the paper provides empirical support for these conditions, it explicitly notes that they cannot be verified theoretically within the work, leaving a formal proof or characterization as an unresolved question.
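
As a concrete illustration of what such an empirical check involves, the sketch below Monte Carlo-estimates both quantities from completions sampled from the demonstration-conditioned teacher. This is a minimal sketch, not the paper's evaluation code: the data layout, field names, and reward_fn are assumptions.

    import numpy as np

    def estimate_conditions(samples, reward_fn):
        """Monte Carlo estimates of the two in-context conditions.

        `samples` holds completions y drawn from the demonstration-conditioned
        teacher pi_t(. | x, d); each entry records the total log-probability of
        y under the teacher and under the same model without the demonstration
        (the current policy pi_t(. | x)). All field names are hypothetical.
        """
        rewards = [reward_fn(s["x"], s["y"]) for s in samples]
        # Condition (i): expected reward of the teacher, to be compared against
        # a reward-maximizing baseline (e.g. a policy trained on the objective).
        mean_reward = float(np.mean(rewards))
        # Condition (ii): KL(teacher || current policy), estimated from teacher
        # samples as E_{y ~ teacher}[ log teacher(y|x,d) - log current(y|x) ].
        kl_estimate = float(np.mean(
            [s["teacher_logp"] - s["current_logp"] for s in samples]
        ))
        return mean_reward, kl_estimate

High mean_reward together with low kl_estimate, relative to policies obtained by directly optimizing the trust-region objective, would support the two conditions empirically; the open question is whether and when they can be guaranteed theoretically.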

References

"While we cannot verify these conditions theoretically, we evaluate each empirically."

Self-Distillation Enables Continual Learning (Shenfeld et al., arXiv:2601.19897, 27 Jan 2026), Subsection "Validating the ICL Assumption," Empirical Validation paragraph.