On-Policy Self-Distillation
- On-Policy Self-Distillation is a training strategy where a model self-aligns its outputs through on-policy rollouts using a self-referential teacher to mitigate exposure bias.
- It instantiates a student role and a teacher role with shared parameters and minimizes per-token divergences between them, stabilizing training across different architectures.
- OPSD has been applied to LLMs, reinforcement learning, and diffusion models, yielding significant improvements in sample efficiency and overall performance.
On-Policy Self-Distillation (OPSD) is a family of training algorithms in which a model self-aligns its policy or generation distribution via distillation losses applied on the distribution induced by its own outputs (“on-policy”), with a self-referential teacher policy defined through privileged context, conditioning, or historical outputs. Unlike conventional distillation that employs a stronger external teacher, OPSD achieves knowledge transfer and performance gains by leveraging the same model under altered conditioning contexts or temporal snapshots. This approach corrects distribution mismatch between training and inference (exposure bias), eliminates the need for external reward models in reasoning tasks, and enhances sample efficiency across domains including LLMs, diffusion models for video generation, reinforcement learning, cross-modal alignment, and small LLM fine-tuning (Zhao et al., 26 Jan 2026, Chern et al., 29 Dec 2025, Hu et al., 23 Jan 2026, Spigler, 2024, Jang et al., 12 Jan 2026, Fu et al., 2024).
1. Core Principles and Frameworks
OPSD fundamentally relies on defining two “roles” for the model: (i) a student policy, which acts autonomously or under partial information, and (ii) a teacher policy, which leverages privileged information or alternative context. Both roles share parameters but are instantiated on different context windows or states.
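The dual-role construction can be illustrated with a minimal sketch, assuming a toy linear bag-of-tokens model as a stand-in for an LLM (all names here are illustrative, not drawn from the cited papers): one shared parameter matrix, two distributions that differ only through their conditioning context.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def next_token_dist(params, context):
    # Bag-of-tokens features; one shared weight row per output token.
    feats = [context.count(tok) for tok in range(len(params[0]))]
    return softmax([sum(w * f for w, f in zip(row, feats)) for row in params])

VOCAB = 4
params = [[0.1 * (i + 1) * (j + 1) for j in range(VOCAB)] for i in range(VOCAB)]

problem = [0, 1]    # the student conditions on the problem alone
trace = [2, 3, 3]   # privileged solution trace, visible to the teacher only

student = next_token_dist(params, problem)           # student role
teacher = next_token_dist(params, problem + trace)   # teacher role, same params

# Forward KL(teacher || student): the per-token quantity OPSD minimizes.
kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))
```

Because both roles read the same parameters, minimizing this divergence transfers the teacher's privileged-context behavior into the student's unprivileged decoding.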
In mathematical reasoning (Zhao et al., 26 Jan 2026), the student policy conditions only on the problem x, while the teacher policy conditions on both x and a privileged solution trace z. Token-level divergences—e.g., KL or Jensen–Shannon—are minimized between the teacher distribution over (x, z) and the student distribution over x alone, evaluated over entire student rollouts.
In cross-modal alignment (Hu et al., 23 Jan 2026), the student follows audio-conditioned decoding, while the internal teacher is the text-conditioned version. Distillation is performed at both the token and sequence levels, utilizing a combination of reverse KL objectives and reinforcement signals.
For reinforcement learning, OPSD is instantiated in Proximal Policy Distillation (PPD) (Spigler, 2024), blending on-policy PPO with a distillation loss toward the fixed teacher, including the special case where student and teacher architectures are identical (self-distillation).
In diffusion-based generative modeling (Chern et al., 29 Dec 2025), a bidirectional “teacher” is distilled into a causal student by distribution-matching losses on student rollouts, with refinements to stabilize on-policy optimization under multimodal conditioning.
For small LLMs, dynamic self-distillation (DynSDPB) (Fu et al., 2024) uses previous mini-batch outputs as pseudo-targets for the current batch, with dynamic scaling of the distillation effect per-sample via uncertainty and discrimination measures, in a fully on-policy fashion.
2. Objectives, Losses, and Algorithmic Workflow
OPSD employs a range of objectives, unified by their on-policy nature:
- Per-token divergence minimization (e.g., KL divergence, Jensen–Shannon, cross-entropy) between teacher and student, evaluated on student-generated prefixes.
- Policy-gradient alternatives that reinforce sampled tokens with teacher-derived per-token advantages, as in (Zhao et al., 26 Jan 2026).
- Sequence-level reinforcement signals: OPSD is compatible with judge-based binary rewards and group relative policy optimization (GRPO), rewarding trajectories aligned with the internal teacher or reference answer (Hu et al., 23 Jan 2026).
- Distribution-matching in non-discrete domains: For diffusion models, the loss comprises L2 denoising (ODE-initialization) and score-matching gradients, jointly regularized through on-policy critic updates (Chern et al., 29 Dec 2025).
- Self-distillation across training steps: For DynSDPB, the KL divergence is between the previous and current model logits, scaled dynamically by uncertainty and discrimination of the predicted outputs (Fu et al., 2024).
- Adaptive target reformulation: The Veto method (Jang et al., 12 Jan 2026) creates a geometric bridge distribution interpolating between teacher and student, with an interpolation coefficient controlling the trade-off between reward-seeking and diversity in the gradient dynamics.
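The last item can be sketched concretely; this is a hedged illustration of a geometric bridge target—an unnormalized product teacher^alpha · student^(1−alpha), renormalized—where the symbol `alpha` and the toy distributions are assumptions for exposition, not the paper's exact notation.

```python
import math

def geometric_bridge(p_teacher, p_student, alpha):
    # Product-of-experts interpolation between the two distributions.
    unnorm = [t ** alpha * s ** (1.0 - alpha)
              for t, s in zip(p_teacher, p_student)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

teacher = [0.7, 0.2, 0.1]
student = [0.2, 0.5, 0.3]

# alpha = 1 recovers the teacher target; alpha = 0 leaves the student as-is;
# intermediate values damp gradients on tokens the student finds unlikely.
bridge = geometric_bridge(teacher, student, 0.5)
```

Distilling toward the bridge rather than the raw teacher keeps the target close to the student's own support, which is the mechanism behind the gradient damping discussed below.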
High-Level Algorithm (OPSD in Mathematical Reasoning)
- For each input x with privileged trace z, sample a student rollout y from the policy conditioned on x alone.
- For each step t, evaluate the student on the prefix (x, y_<t) and the teacher on (x, z, y_<t)—the same student-generated prefix.
- Compute the per-token divergence between the teacher and student next-token distributions.
- Average across time and batch; backpropagate only through student logits.
- Update shared model parameters (Zhao et al., 26 Jan 2026).
3. Mitigating Distribution Mismatch and Instabilities
A primary motivation for OPSD is overcoming “exposure bias” inherent to off-policy distillation, where the student is supervised on teacher prefixes it never encounters at inference. On-policy rollouts guarantee that supervision is applied on the actual state-visitation distribution of the student.
Gradient instability can arise in forward-KL objectives when the student and teacher diverge significantly, leading to pathological updates on rare tokens (Jang et al., 12 Jan 2026). The Veto objective introduces a tunable product-of-experts distribution, damping gradients for low-confidence tokens and controlling diversity explicitly; one limit of its interpolation coefficient recovers full entropy regularization, while the other yields pure reward-seeking.
For multimodal and autoregressive diffusion models, naive self-forcing can cause catastrophic collapse (e.g., flicker, black frames) (Chern et al., 29 Dec 2025). Stabilization is achieved by staged ODE-initialization, careful condition curation, and aggressive learning-rate/guidance scheduling during distillation.
In reinforcement learning, PPD employs PPO’s clipping and advantage normalization, and caps the impact of the distillation loss to retain exploration and robustness to imperfect teachers (Spigler, 2024).
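The shape of the PPD objective can be sketched as follows; this is a hedged illustration of a PPO clipped surrogate plus a capped distillation term, where the cap and weight constants are illustrative assumptions, not values from the paper.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate for one sample.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def ppd_loss(ratio, advantage, distill_kl, weight=1.0, cap=1.0):
    # Capping the distillation term keeps an imperfect teacher from
    # overwhelming the RL signal, preserving exploration.
    return -ppo_clip_term(ratio, advantage) + weight * min(distill_kl, cap)
```

For instance, once the teacher-student KL exceeds the cap, further divergence contributes no extra loss, so the clipped RL surrogate continues to dominate the update.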
4. Practical Implementations and Hyperparameter Choices
OPSD exhibits flexibility across architectures and tasks. Key hyperparameters and configurations include:
- Token-level divergences: KL or Jensen–Shannon at each generation step; in CORD, “importance-aware” weights amplify early steps or those with high divergence (Hu et al., 23 Jan 2026).
- Batching: One student rollout per sample suffices; in RL and some reasoning tasks, baselines query multiple rollouts per prompt (Zhao et al., 26 Jan 2026, Spigler, 2024).
- Teacher freezing: The teacher is often “frozen” to an initial parameter snapshot for stability, or realized as a context-augmented variant of the current model, without separate parameter updates (Zhao et al., 26 Jan 2026).
- Dynamic scaling: DynSDPB utilizes per-sample uncertainty and discrimination scores to rescale distillation weight and temperature, suppressing self-reinforcement of erroneous predictions in early training (Fu et al., 2024).
- Optimization: AdamW (LLM/SLM settings), learning-rate schedules (linear warmup, cosine decay), and, for RL, standard PPO hyperparameters as in SB3 (Spigler, 2024).
- Multimodal guidance: In diffusion, high-quality text/image/audio prompts and classifier-free guidance schedules directly impact stability and metric improvements (Chern et al., 29 Dec 2025).
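The dynamic-scaling item above admits a small sketch in the spirit of DynSDPB: the self-distillation weight shrinks when the previous-batch prediction is uncertain (high entropy) or lacks discrimination (wrong label). The combination rule and names here are illustrative assumptions, not the paper's exact formulas.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def dynamic_weight(prev_probs, prev_correct, base=1.0):
    # Normalize entropy by its maximum (uniform) value to land in [0, 1].
    uncertainty = entropy(prev_probs) / math.log(len(prev_probs))
    discrimination = 1.0 if prev_correct else 0.0
    return base * (1.0 - uncertainty) * discrimination
```

A confident, correct previous-batch prediction yields a large distillation weight; a wrong or near-uniform one drives the weight toward zero, suppressing self-reinforcement of errors early in training.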
5. Empirical Findings and Comparative Performance
OPSD methods consistently outperform both supervised fine-tuning and off-policy distillation baselines across diverse domains:
| Setting | OPSD Improvement | Reference |
|---|---|---|
| Math reasoning | 4–8x token efficiency over GRPO; +1–2 absolute over SFT | (Zhao et al., 26 Jan 2026) |
| RL (Atari) | PPD: 1.11× teacher (self), 0.97× (small student) | (Spigler, 2024) |
| LALMs (audio) | CORD bridges gap to text, improves with only 80k samples | (Hu et al., 23 Jan 2026) |
| Video diffusion | 20× lower latency, FID 27.10→11.67, matches bidirectional baseline | (Chern et al., 29 Dec 2025) |
| SLMs (GLUE) | DynSDPB: +7.7 (RTE), +3.9 (CoLA) vs finetune | (Fu et al., 2024) |
| Veto (reasoning, code) | GSM8K: +4.8 over on-policy KD, H-Eval: +0.7/5.8 | (Jang et al., 12 Jan 2026) |
Use of on-policy rollouts ensures performance improvements are realized on the actual deployed distribution, with sample efficiency advantages evident in both language and vision settings. OPSD is robust against teacher imperfections when the RL loss is maintained as in PPD (Spigler, 2024).
6. Extensibility, Variants, and Integration
OPSD is compatible with a wide spectrum of training regimes:
- Cross-modal policy alignment by conditioning the model differently per modality (as in CORD) (Hu et al., 23 Jan 2026).
- Distillation across time slices or SGD steps, e.g., DynSDPB’s previous-batch distillation (Fu et al., 2024).
- Integration with other self-training/self-correction frameworks, by treating the OPSD loss as a regularizer combined with direct preference optimization (DPO) or other consistency losses.
- Dynamic target reformulation for stability, such as in Veto (Jang et al., 12 Jan 2026).
Notably, OPSD methods are model-agnostic and task-agnostic, requiring minimal or no modifications to model architectures or pipeline logic, aside from distinctions in rollouts and context conditioning.
7. Limitations and Prospective Directions
While OPSD successfully addresses exposure bias and sample inefficiency, certain variants require careful hyperparameter tuning (e.g., clipping and distillation weights in RL, the bridge interpolation coefficient in Veto, dynamic scaling in SLMs). In multimodal settings, failure to curate conditioning signals results in instability and failure to match strong generation metrics (Chern et al., 29 Dec 2025). Teacher “freezing” is sometimes required for stability but may hinder knowledge adaptation if privileged traces evolve.
Prospective research avenues include curriculum learning for progressive problem difficulty (Zhao et al., 26 Jan 2026), further optimization of multimodal stabilization (Chern et al., 29 Dec 2025), and expanded integration with complex reinforcement learning environments and real-time systems.
References:
- "Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs" (Zhao et al., 26 Jan 2026)
- "LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation" (Chern et al., 29 Dec 2025)
- "CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation" (Hu et al., 23 Jan 2026)
- "Proximal Policy Distillation" (Spigler, 2024)
- "Stable On-Policy Distillation through Adaptive Target Reformulation" (Jang et al., 12 Jan 2026)
- "Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small LLMs" (Fu et al., 2024)