
On-Policy Distillation Framework

Updated 17 February 2026
  • On-policy distillation leverages student-driven sampling to align the student with teacher feedback via token- and trajectory-level objectives.
  • The methodology unifies techniques such as token-level KL divergence, reward shaping, and trajectory verbal scoring to improve policy optimization.
  • The framework offers practical benefits in token efficiency, performance scaling, and training stability over off-policy methods by reducing exposure bias.

On-policy distillation frameworks comprise a family of algorithms and recipes in which a student policy is optimized directly on trajectories sampled from its own distribution, using dense feedback supplied by a teacher. Unlike off-policy methods that rely on static datasets or expert rollouts, on-policy distillation maintains alignment between the distributions encountered during training and those experienced at inference, thus mitigating exposure bias and distribution shift. Recent advances unify and extend the theoretical and practical scope of on-policy distillation across LLMs, multimodal transformers, reinforcement learning agents, and diffusion models, enabling both logit-level and trajectory-level credit assignment. This article synthesizes the methodological foundations, canonical objectives, representative instantiations, and comparative evidence for on-policy distillation frameworks.

1. Formal Structure and Principal Objectives

On-policy distillation frameworks operate within a controlled sequence prediction or decision process, typically formalized as a Markov decision process (MDP) or an autoregressive generative model. Let $\pi_S$ denote the student policy parameterized by $\theta$, and $\pi_T$ the teacher policy. Trajectories $\tau = (s_1, a_1, \ldots, s_T, a_T)$ are always sampled under $\pi_S$, i.e., $\tau \sim \pi_S$; rollouts may be autoregressive in sequence models (the $s_t$ are prefixes and the $a_t$ tokens) or environment states and actions in RL.

The canonical objective is to minimize a divergence or maximize alignment between the student and teacher policies, evaluated on-policy, that is, on samples drawn from the student:

$$\mathbb{E}_{\tau \sim \pi_S} \left[ \sum_{t=1}^{T} D\!\left( \pi_T(\cdot \mid h_t) \,\|\, \pi_S(\cdot \mid h_t) \right) \right]$$

where $D$ may be the forward KL, reverse KL, Jensen–Shannon divergence, or another metric appropriate to the domain and informational constraints.
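As a concrete illustration of the objective above, the following sketch computes the summed per-step forward KL along one student-sampled trajectory. The toy distributions and function names are illustrative assumptions, not code from any cited paper:

```python
import math

# Toy illustration of the on-policy objective: summed per-step
# forward KL D(pi_T || pi_S) over a trajectory sampled from the
# student. Distributions are plain probability lists over a small
# vocabulary; a real model would produce them per decoding step.

def forward_kl(p_teacher, p_student, eps=1e-12):
    """Forward KL D(p_T || p_S) for one decoding step."""
    return sum((pt + eps) * math.log((pt + eps) / (ps + eps))
               for pt, ps in zip(p_teacher, p_student))

def on_policy_kl_loss(teacher_steps, student_steps):
    """Sum the per-step divergence along one student rollout."""
    return sum(forward_kl(pt, ps)
               for pt, ps in zip(teacher_steps, student_steps))
```

When the student already matches the teacher at every step, the loss vanishes; any mismatch on states the student actually visits contributes a positive term.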

Key variations include the choice of divergence (forward vs. reverse KL), the granularity of feedback (token-level vs. trajectory-level), and whether the distillation signal is interpolated or augmented with reward terms.

This on-policy construction allows feedback on the precise distributions and failure modes encountered by the student during generation or agent behavior.

2. Archetypal Frameworks and Instantiations

Recent literature demonstrates a diversity of architectural instantiations:

  • On-Policy Self-Distillation (OPSD): The LLM is partitioned into a shared-weights teacher and student, differentiated by context. Privileged ground-truth traces $y^*$ are provided in the teacher's context; the student receives only $x$. The loss is the expected per-token KL over student-generated rollouts (Zhao et al., 26 Jan 2026).
  • Generalized On-Policy Distillation (G-OPD): Extends standard OPD by introducing a reference policy $\pi_r$ and a reward-extrapolation factor $\alpha$. The objective interpolates between pure KL, dense RL, and reward-amplified regimes, allowing the student to surpass the teacher via $\alpha > 1$ (ExOPD). If available, rewards can be corrected by choosing $\pi_r$ as the pre-RL teacher base model (Yang et al., 12 Feb 2026).
  • On-Policy Context Distillation (OPCD): Contextual knowledge (such as solution traces or system prompts) is internalized by aligning a context-free student to the context-aware teacher's distributions via on-policy reverse KL over generated sequences. This framework underpins both experiential distillation and system prompt distillation, outperforming off-policy methods in reasoning, games, and safety classification (Ye et al., 12 Feb 2026).
  • On-Policy Verbal Distillation (OVD): Overcomes memory and exploration constraints in RL by using trajectory-level discrete teacher scores (0–9) rather than token-level alignment, supporting black-box teachers and semantically meaningful, memory-efficient reinforcement signals (Xiong et al., 29 Jan 2026).
  • Group Relative Policy Optimization (GRPO) with On-Policy Distillation: Combines value-free on-policy RL (GRPO) at the trajectory level with dense token- or step-wise reverse KL to a teacher, often requiring cold-start SFT to prevent high-variance gradients from off-support rollouts (as in VOLD for multimodal reasoning (Bousselham et al., 27 Oct 2025)).
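To make the G-OPD decomposition concrete, the sketch below computes a per-token signal with a decoupled reward term (teacher-to-reference log-ratio, scaled by $\alpha$) and a KL-style penalty toward the teacher. The function name, sign conventions, and exact form are illustrative assumptions, not the paper's implementation:

```python
import math

# Illustrative sketch of a G-OPD-style per-token signal: a reward term
# given by the log-ratio of teacher to reference probability, scaled by
# alpha, minus a reverse-KL-style penalty toward the teacher evaluated
# on the sampled token. alpha = 1 corresponds to standard OPD in this
# decoupled view; alpha > 1 extrapolates beyond the teacher (ExOPD).

def g_opd_token_signal(p_teacher, p_ref, p_student, token, alpha=1.0):
    reward = alpha * math.log(p_teacher[token] / p_ref[token])
    penalty = math.log(p_student[token] / p_teacher[token])
    return reward - penalty
```

With $\alpha = 1$ and $\pi_r = \pi_T$, the reward term vanishes and only the KL-style penalty remains; raising $\alpha$ amplifies directions in which the teacher already outperforms the reference.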

A summary of key on-policy distillation objectives is provided below.

| Framework | On-Policy Loss Type | Teacher Context | Domain(s) |
|---|---|---|---|
| OPSD | Token-level forward KL | Ground-truth trace | LLM, reasoning |
| OPCD | Token-level reverse KL | In-context prompt | LLM, system prompts |
| G-OPD/ExOPD | Dense reward + KL (tunable) | Any (flexible) | LLM, RL |
| OVD | Trajectory-level verbal score | Discrete score | RL, QA, math |
| VOLD | Trajectory RL + on-policy reverse KL | Text-only teacher | Multimodal RL |

3. Algorithmic Structure and Optimization

On-policy distillation is characterized by "student-driven" sampling: all trajectories, states, or action sequences used for computing losses and gradients are drawn on-policy from the student. Minimization typically employs stochastic gradient descent on the per-step or per-trajectory objective, with full gradients flowing only through student parameters.

Canonical pseudocode for an OPSD-style recipe (Zhao et al., 26 Jan 2026):

  1. For each (x, y*) in the batch:
    • Sample a student rollout $\tau \sim \pi_S(\cdot \mid x)$.
    • Compute $\pi_T(\cdot \mid h_{t-1}, x, y^*)$ and $\pi_S(\cdot \mid h_{t-1}, x)$ for each step $t$.
    • Accumulate the per-step KL.
  2. Average the loss and update $\theta$ via gradient descent.
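The loop can be made runnable with toy stand-ins for the two policies. Everything below (distribution shapes, vocabulary size, helper names) is an illustrative assumption; a real model would condition each distribution on the prefix $h_{t-1}$, with the teacher additionally seeing $(x, y^*)$:

```python
import math
import random

# Toy, runnable version of the OPSD loop: student-driven sampling with
# per-step forward KL against a (stand-in) privileged teacher.

VOCAB = 4

def random_dist(rng):
    """Stand-in categorical distribution over the toy vocabulary."""
    w = [rng.random() + 1e-3 for _ in range(VOCAB)]
    s = sum(w)
    return [x / s for x in w]

def sample_token(dist, rng):
    """Student-driven sampling: draw the next token from pi_S."""
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r <= acc:
            return tok
    return VOCAB - 1

def opsd_batch_loss(batch, length=3, seed=0):
    rng = random.Random(seed)
    total, steps = 0.0, 0
    for _x, _y_star in batch:
        for _t in range(length):
            p_s = random_dist(rng)   # stand-in pi_S(. | h, x)
            p_t = random_dist(rng)   # stand-in pi_T(. | h, x, y*)
            sample_token(p_s, rng)   # rollout token from the student
            total += sum(pt * math.log(pt / ps)
                         for pt, ps in zip(p_t, p_s))
            steps += 1
    return total / steps
```

The key structural point survives the simplification: tokens are always drawn from the student, and the teacher contributes only target distributions for the divergence.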

Variants replace the teacher with peers (as in multi-agent/group distillation (Yu et al., 2024)), swap teacher context for internal information, or utilize trajectory-level discrimination and adversarial signals (Ye et al., 13 Nov 2025). For frameworks involving reward-shaping or RL views, gradients follow policy gradient or REINFORCE forms, with reward terms structured to reflect KL, logit-ratio, or other dense credit assignment (Yang et al., 12 Feb 2026, Czarnecki et al., 2019).

Full discretization (as in OVD) admits scalable updates in memory-limited settings and enables non-differentiable feedback or black-box teacher compatibility (Xiong et al., 29 Jan 2026).
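A minimal sketch of how discrete trajectory-level scores could drive an update, assuming a REINFORCE-style weighting with a batch-mean baseline (the normalization and baseline choices here are illustrative assumptions, not the OVD update rule verbatim):

```python
# Minimal sketch of trajectory-level verbal feedback in the OVD spirit.
# A black-box teacher returns only a discrete score in 0-9 per whole
# trajectory; the student can weight each trajectory's log-probability
# gradient by the normalized, baselined score, so no teacher logits
# (and no per-token alignment) are required.

def verbal_reward(score):
    """Map a discrete teacher score (0-9) to a reward in [0, 1]."""
    if not 0 <= score <= 9:
        raise ValueError("verbal scores are integers in 0-9")
    return score / 9.0

def advantage(score, batch_scores):
    """Score minus the batch-mean baseline (variance reduction)."""
    baseline = sum(verbal_reward(s) for s in batch_scores) / len(batch_scores)
    return verbal_reward(score) - baseline
```

Trajectories scored above the batch baseline receive positive weight on their log-probability gradient; below-baseline trajectories are pushed down, all without differentiating through the teacher.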

4. Comparative Empirical Evidence and Performance

On-policy distillation frameworks consistently demonstrate significant gains over off-policy supervised fine-tuning or off-policy distillation in a range of tasks:

  • Token Efficiency: OPSD achieves 4–8× better token efficiency compared to GRPO RL and outperforms off-policy distillation on reasoning benchmarks (AIME’24/’25, HMMT’25, AMO-Bench) (Zhao et al., 26 Jan 2026).
  • Performance Scaling: ExOPD (with reward extrapolation α>1\alpha>1) surpasses standard OPD and sometimes outperforms the teacher, both in single-domain and multi-domain (merged experts) settings (Yang et al., 12 Feb 2026).
  • Knowledge Internalization and OOD Robustness: OPCD enables students to match or exceed in-context performance without context injection and avoids OOD degradation, outperforming off-policy context distillation in tasks such as MedMCQA and Frozen Lake (Ye et al., 12 Feb 2026).
  • Memory and Compute Efficiency: OVD provides orders-of-magnitude lower memory usage for long-context RL, supporting larger batch and context sizes, and achieves up to +25.7% gain on mathematical reasoning tasks compared to token-level methods (Xiong et al., 29 Jan 2026).
  • Stability: Adaptive geometric bridging (e.g., Veto (Jang et al., 12 Jan 2026)) stabilizes forward KL training in student–teacher pairs with large initial divergence.
  • Multimodal and Black-Box Extensions: On-policy distillation via trajectory-level feedback applies to video diffusion (Chern et al., 29 Dec 2025), audio-text alignment (Hu et al., 23 Jan 2026), and closed-source teacher black-box scenarios (GAD (Ye et al., 13 Nov 2025)).

Experimental ablations consistently show that strict on-policy alignment (matching teacher, peer, or context-aware reference distributions on states actually encountered by the student) improves both final task accuracy and training stability, especially in settings prone to exposure bias.

5. Domain-Extended Variants and Theoretical Analysis

On-policy distillation has been extended to settings beyond straightforward teacher–student pairs:

  • Peer and Cohort Distillation: In multi-agent or population settings (Online Policy Distillation with Decision-Attention (Yu et al., 2024), CTEDD (Chen, 2019)), each agent aligns to a weighted mixture of peer outputs, with attention modules mediating information flow without relying on a fixed (single) teacher.
  • Self-Distillation and Oracle Distillation: A single model can self-distill by alternating between privileged and restricted contexts (as in OPSD (Zhao et al., 26 Jan 2026), CORD (Hu et al., 23 Jan 2026)), or align to an oracle teacher with privileged information in a universal trading setting (Fang et al., 2021).
  • Generalized Objectives: Theoretical analysis establishes that standard OPD is equivalent to a dense RL problem whose reward is the token-wise log-ratio to a reference policy combined with an equally weighted KL penalty. Decoupling these two terms lets G-OPD extrapolate or interpolate the reward and correct it via the choice of reference (Yang et al., 12 Feb 2026).
  • Stability Guarantees: Adaptive logit-space bridging (Veto) regularizes the per-step loss to suppress gradients on out-of-support or low-confidence tokens, theoretically ensuring bounded gradients and convergence to temperature-sharpened optima (Jang et al., 12 Jan 2026).
  • Convergence and Variance Reduction: Expected entropy-regularized objectives (Czarnecki et al., 2019) comprise proper gradient fields, guaranteeing stable convergence, while mixture verbal distillation approaches (OVD) provably reduce gradient variance and maintain unbiased updates even under black-box feedback (Xiong et al., 29 Jan 2026).

6. Implementation Considerations and Practical Guidance

Implementation success depends critically on prompt and context design, warm-start initialization, and objective selection:

  • Prompt Construction: For OPSD, privileged traces are incorporated in the teacher’s prompt only at training time; students are prompted solely with the bare question (Zhao et al., 26 Jan 2026).
  • Sample Efficiency: Token-wise and trajectory-wise supervision enables orders-of-magnitude savings relative to RL with sparse reward or multiple rollouts (e.g., GRPO baselines require 8× more samples per problem than OPSD).
  • Objective Selection: Full-vocabulary KL (“logit distillation”) consistently outperforms sampled-token policy-gradient objectives; forward-vs-reverse KL tradeoffs should be tuned to balance performance and stability (Zhao et al., 26 Jan 2026, Jang et al., 12 Jan 2026).
  • Initialization: Cold-start SFT (using teacher-generated data) is often critical when the student distribution is initially far from the teacher’s support, preventing noisy and destabilizing KL gradients (Bousselham et al., 27 Oct 2025).
  • Hyperparameter Tuning: Settings such as learning rate schedules, blending coefficients for reward extrapolation ($\alpha$), KL regularization, batch sizes, and rollout lengths are tuned empirically for stability and efficiency in large-scale experiments (Zhao et al., 26 Jan 2026, Yang et al., 12 Feb 2026).
  • Memory Management: For long-context models, trajectory-level scoring or discrete score distillation (OVD) dramatically reduces GPU memory and enables larger batch/horizon ratios, essential for interactive QA and reasoning (Xiong et al., 29 Jan 2026).
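The objective-selection point can be made concrete with a toy comparison of full-vocabulary ("logit") reverse KL against a sampled-token log-ratio estimator. The sampled estimator is unbiased under student sampling but noisier, which is one intuition for why full-vocabulary KL tends to train more stably; this is an illustrative sketch, not code from any cited paper:

```python
import math
import random

# Toy comparison: full-vocabulary reverse KL versus a Monte Carlo
# estimate from tokens sampled under the student. Both target the same
# quantity, but the sampled estimator carries extra variance per step.

def full_vocab_reverse_kl(p_s, p_t):
    """Exact reverse KL D(p_S || p_T) over the whole vocabulary."""
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))

def sampled_estimate(p_s, p_t, n, seed=0):
    """Average log-ratio on n tokens drawn from the student."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        tok = rng.choices(range(len(p_s)), weights=p_s)[0]
        total += math.log(p_s[tok] / p_t[tok])
    return total / n
```

With enough samples the two agree, but per-update the sampled version fluctuates, mirroring the stability gap between logit distillation and sampled-token policy-gradient objectives noted above.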

7. Impact, Limitations, and Future Directions

On-policy distillation frameworks now underpin state-of-the-art performance across reasoning, generation, multi-agent RL, multimodal grounding, and black-box distillation. The architectural and mathematical flexibility of these approaches allows for incorporation of dense feedback at token, step, or trajectory level; effective knowledge integration across heterogeneous teachers; stability via adaptive targets; and efficient resource allocation through scalable objectives.

Limitations persist where teacher distributions are highly miscalibrated, in settings with insufficient data to estimate per-step divergences, or in applications demanding very fine-grained feedback not amenable to quantization. Hyperparameter sensitivity (e.g., threshold choice in OVD, reward-extrapolation weight, guidance scales in diffusion models) can affect stability or generalization.

Future work may extend these frameworks to fully causal multi-agent coordination tasks, continuous feedback settings, open-ended multimodal dialogue systems, and broader theory characterizing the exploration vs. exploitation dynamics enabled by on-policy knowledge transfer. The success of cross-modal and context-internalizing distillation suggests broad applicability in self-improving, generalist AI models.
