The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Published 11 May 2026 in cs.AI | (2605.11182v1)

Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for LLMs, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.


Authors (5)

Summary

The paper presents a thorough empirical and theoretical dissection of on-policy distillation, identifying failures like teacher-student mismatch and biased Top-K reverse KL gradients.
It shows how specific challenges such as semantic conflicts and unstable loss approximations lead to model collapse, with practical fixes including stop-gradient approaches and RLVR teacher adaptation.
It reports that supervised fine-tuning (SFT) stabilizes the student during distillation, and that OPSD is effective for alignment and system prompt internalization but not for math reasoning in the tested settings.

Comprehensive Analysis of On-Policy Distillation: Failure Modes and Practical Remedies

Introduction and Motivation

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for LLMs, using the student’s own policy rollouts to obtain dense token-level supervision. OPD distills from an external teacher model, offering a route to knowledge transfer that can mitigate catastrophic forgetting and sample inefficiency. OPSD instead uses the student model itself, augmented with privileged information (PI), as the teacher, with the aim of distilling context or alignment behaviors. Results in practice are mixed: successes are documented in system prompt internalization and knowledge compression, but several recent studies report instability, degradation, and outright failure, particularly on reasoning tasks. The paper "The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes" (2605.11182) presents an empirical and theoretical study of when OPD and OPSD work, when they fail, and why, and proposes stabilizing fixes.

Figure 1: Mapping the OP(S)D design space, mechanisms of failure, and corresponding practical fixes.

Mechanisms of Failure in OPD and OPSD

Teacher-Student Distribution Mismatch

A fundamental limitation arises from conditioning the teacher on prefixes generated by the student. The student’s trajectory may diverge significantly from the reasoning path the teacher would have taken, leading to semantic conflict and locally incompatible token-level supervision. Empirically, the teacher’s accuracy drops substantially when it is forced to continue from truncated student trajectories rather than reasoning from scratch.

Figure 2: Local semantic conflict induced by student prefixes; teacher supervises branch switching rather than branch refinement, often via revision tokens.

Instability of Top-K Reverse KL Objectives

To make full-vocabulary KL objectives computationally tractable, Top-K approximations are typically employed. However, the unnormalized Top-K reverse KL yields biased gradients: the constant terms that cancel exactly when the sum runs over the full vocabulary cancel only partially once the sum is truncated to K tokens. The residual bias destabilizes optimization and can end in model collapse and repetitive outputs.
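The incomplete cancellation can be made explicit with a short derivation. This is a sketch under assumed notation, none of which appears in the summary itself: $p_\theta$ is the student's next-token distribution, $q$ the teacher's, $V$ the vocabulary, and $K \subset V$ the Top-K token set, treated as fixed when differentiating. It uses the identity $p_\theta \nabla_\theta \log p_\theta = \nabla_\theta p_\theta$.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{full}}(\theta)
  &= \sum_{v \in V} p_\theta(v)\,\bigl[\log p_\theta(v) - \log q(v)\bigr], \\
\nabla_\theta \mathcal{L}_{\mathrm{full}}
  &= \sum_{v \in V} \nabla_\theta p_\theta(v)\,\bigl[\log p_\theta(v) - \log q(v)\bigr]
     + \underbrace{\sum_{v \in V} \nabla_\theta p_\theta(v)}_{=\,\nabla_\theta 1\,=\,0}, \\
\nabla_\theta \mathcal{L}_{K}
  &= \sum_{v \in K} \nabla_\theta p_\theta(v)\,\bigl[\log p_\theta(v) - \log q(v)\bigr]
     + \underbrace{\nabla_\theta \sum_{v \in K} p_\theta(v)}_{\neq\,0 \text{ in general}}.
\end{aligned}
```

The leftover term is the gradient of the student's total Top-K mass. Gradient descent on the unnormalized objective therefore also pushes probability out of the Top-K set, regardless of what the teacher prefers.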

Figure 3: OPSD under math reasoning fails to yield improvement; student models collapse with verbose or degenerate outputs.

Aggregation Across Privileged Information in OPSD

OPSD attempts to marginalize across PI-conditioned teachers, learning a PI-free consensus policy. When PI is instance-specific rather than reflecting a shared latent rule, this aggregation causes suppression of outputs supported only in certain PI contexts and further weakens the distilled model relative to PI-conditioned teachers.
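A toy calculation illustrates the suppression effect. It assumes, purely for illustration, that the PI-free student minimizes the average reverse KL to the PI-conditioned teachers at a single token position; under that assumption the minimizer is the normalized geometric mean of the teacher distributions, so a token that any one teacher considers unlikely receives little mass. The function names and the two teacher distributions are made up for this sketch and are not taken from the paper.

```python
import math


def reverse_kl(p, q):
    """KL(p || q) for two distributions given as lists of probabilities."""
    return sum(pv * (math.log(pv) - math.log(qv)) for pv, qv in zip(p, q) if pv > 0)


def aggregate_teachers(teachers):
    """Minimizer of the average KL(student || teacher_i): normalized geometric mean.

    Every teacher probability must be strictly positive.
    """
    n = len(teachers)
    vocab = len(teachers[0])
    log_mean = [sum(math.log(t[v]) for t in teachers) / n for v in range(vocab)]
    m = max(log_mean)  # subtract the max for numerical stability
    unnorm = [math.exp(x - m) for x in log_mean]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def average_reverse_kl(student, teachers):
    """Average of KL(student || teacher_i) over all teachers."""
    return sum(reverse_kl(student, t) for t in teachers) / len(teachers)


# Hypothetical next-token distributions over [answer_A, answer_B, hedge].
# Each teacher sees different instance-specific PI and backs a different answer.
teacher_pi_1 = [0.90, 0.02, 0.08]
teacher_pi_2 = [0.02, 0.90, 0.08]
student = aggregate_teachers([teacher_pi_1, teacher_pi_2])
print([round(s, 3) for s in student])
```

In this toy case each teacher puts 0.90 on its own answer, while the aggregated student puts roughly 0.39 on each answer and raises the hedge token from 0.08 to about 0.23. If both teachers share the same PI and hence the same distribution, the aggregate equals that distribution, which matches the observation that shared-rule PI distills cleanly.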

Figure 4: Effectiveness of OPSD heavily depends on the nature and structure of privileged information.

Empirical Characterization Across Tasks

Mathematical Reasoning

OPSD is empirically ineffective for math reasoning: neither answer-only nor full-response PI yields improvement, and RLVR-trained teachers further exacerbate mismatch. OPD with stronger teachers provides initial gains but collapses as training continues, with response lengths exploding and token sequences repeating.

Figure 5: OPD, GRPO, and PPO comparison on style alignment tasks; OPSD demonstrates superior sample efficiency.

Alignment and Prompt Internalization

For tasks such as style alignment (CharacterBench, EmotionBench) and system prompt internalization, OPSD excels. Here, PI represents a fixed latent rule or prompt shared across instances, so the student can internalize the target behavior and converge faster than RL baselines.

Figure 6: OPSD versus GRPO on reasoning compression; OPSD provides similar accuracy but more efficient compression of response length.

Task-specific analyses highlight that effectiveness is determined by PI structure: global PI (system prompts, style instructions) admits successful distillation, while instance-specific PI induces conflicting supervision.

Practical Fixes and Stabilizing Strategies

Stop-Gradient and Renormalized Top-K KL Losses

Biased gradients in unnormalized Top-K reverse KL are mitigated with a stop-gradient formulation, renormalization within the Top-K set, or sampled-token policy-gradient objectives. In the reported experiments, these corrections stabilize training and avoid collapse.
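The difference between the biased and stop-gradient objectives can be checked numerically on a single token position. The sketch below is not the paper's implementation: it assumes the stop-gradient is applied to the log-ratio between student and teacher, assumes the Top-K set is chosen from the student's distribution, and computes gradients with respect to the student logits analytically instead of using an autodiff library. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_wrt_logits(p, coeff):
    """Gradient of sum_v coeff[v] * p[v] w.r.t. the logits, coeff held constant.

    Uses the softmax Jacobian dp[v]/dz[j] = p[v] * ((v == j) - p[j]).
    """
    mean = sum(c * pv for c, pv in zip(coeff, p))
    return [pj * (cj - mean) for pj, cj in zip(p, coeff)]


def topk_reverse_kl_grads(student_logits, teacher_probs, k):
    """Return (biased, stopgrad, residual) logit gradients for one position."""
    p = softmax(student_logits)
    # Assumption: the Top-K set is chosen from the student's distribution.
    top = set(sorted(range(len(p)), key=lambda v: p[v], reverse=True)[:k])
    mask = [1.0 if v in top else 0.0 for v in range(len(p))]
    log_ratio = [
        math.log(p[v]) - math.log(teacher_probs[v]) if v in top else 0.0
        for v in range(len(p))
    ]
    # Differentiating p * log p yields (log p + 1); the "+1" is the constant term.
    biased = grad_wrt_logits(p, [lr + m for lr, m in zip(log_ratio, mask)])
    # Treating the log-ratio as a constant (stop-gradient) drops the "+1" term.
    stopgrad = grad_wrt_logits(p, log_ratio)
    # What remains is the gradient of the student's total Top-K mass.
    residual = grad_wrt_logits(p, mask)
    return biased, stopgrad, residual


# Made-up example: the student already matches the teacher exactly.
teacher = [0.5, 0.3, 0.1, 0.1]
logits = [math.log(t) for t in teacher]
biased, stopgrad, residual = topk_reverse_kl_grads(logits, teacher, k=2)
print("biased  :", [round(g, 4) for g in biased])
print("stopgrad:", [round(g, 4) for g in stopgrad])
```

When the student already matches the teacher, the stop-gradient gradient vanishes while the unnormalized Top-K gradient does not: it is positive on the Top-K logits and negative elsewhere, so a descent step moves probability out of the Top-K set. With k equal to the vocabulary size the residual is zero and the two gradients coincide.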

Figure 7: Comparison between biased reverse KL and stopgrad-based approaches; stop-gradient variant achieves stable performance.

RLVR Teacher Adaptation

Adapting the teacher with RLVR on the student’s distribution aligns the teacher’s outputs more closely with student prefixes, decreasing mismatch and boosting distillation efficacy even when teacher benchmark accuracy is not superior.

Figure 8: RLVR-adapted teacher and standard teacher comparison; adapted teacher distribution aligns more tightly with student distribution.

Supervised Fine-Tuning (SFT) Stabilization

SFT on teacher-generated traces regularizes the student’s output space, prevents degenerate token sequences, and keeps rollouts in well-formed regions of that space during OPD. The benefit is most evident when the student initially generates non-semantic or garbled outputs.

Figure 9: SFT regularization prevents collapse and length explosion, maintaining output consistency on alignment and safety tasks.

Analysis of Supervision Signal and Token Dynamics

Teacher supervision is found to be correctness-skewed and position-dependent: the supervision signal is stronger on incorrect trajectories and early tokens, gradually fading on correct answers and longer responses. PI and teacher model scale modulate the distribution of supervision, but fundamental signal quality is governed by teacher capability.
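One way to inspect such a profile is to bucket a per-token signal by relative position and by rollout correctness. The sketch below assumes the signal is any non-negative per-token quantity, for example the absolute teacher-student log-ratio on the sampled token; that choice, the function name, and the example numbers are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


def signal_profile(rollouts, num_buckets=4):
    """Mean per-token signal, grouped by (is_correct, relative-position bucket).

    rollouts: list of (per_token_signal, is_correct) pairs, where
    per_token_signal is a list of floats, one per generated token.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for signal, is_correct in rollouts:
        n = len(signal)
        for t, value in enumerate(signal):
            # Map token index t in [0, n) to a bucket in [0, num_buckets).
            bucket = t * num_buckets // n
            sums[(is_correct, bucket)] += value
            counts[(is_correct, bucket)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up signals: one incorrect and one correct rollout of four tokens each.
example = [
    ([4.0, 2.0, 1.0, 1.0], False),
    ([1.0, 0.5, 0.0, 0.0], True),
]
for key, mean in sorted(signal_profile(example, num_buckets=2).items()):
    print(key, mean)
```

Using relative rather than absolute position keeps rollouts of different lengths comparable, at the cost of hiding effects tied to absolute token index.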

Figure 10: Heatmap visualization of token-level supervision; PI refines granularity but teacher scale dictates overall distribution.

Implications and Future Directions

The study shows that OPD and OPSD are not universally reliable: their efficacy depends on task structure, teacher adaptation, loss formulation, and PI properties. The findings point future research toward hybrid pipelines that combine SFT initialization, task-specific RL optimization, and OPD, which could enable iterative self-improvement for LLMs. The theoretical analysis also invites work on scalable distillation objectives and on teacher-student curriculum design that balances generalization and specialization.

Conclusion

"The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes" provides a granular analysis of OPD and OPSD, identifying three failure mechanisms and practical mitigations. Effectiveness is constrained by teacher-student distribution compatibility, loss formulation, and the structure of the privileged information. Stop-gradient KL surrogates, RLVR teacher adaptation, and SFT regularization mitigate these failures and improve training stability. These insights support more robust post-training protocols for LLM alignment, system prompt internalization, and reasoning compression.
