Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Published 11 May 2026 in cs.LG and cs.AI | (2605.10889v1)

Abstract: On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.


Authors (9)

Summary

The paper introduces gradient alignment as a diagnostic measure to compare the teacher’s guidance with the ideal token-level gradient.
It finds that teacher guidance most benefits student performance on incorrect trajectories, with efficacy varying by model size and demonstration clarity.
The study reveals that no universal teacher exists, advocating adaptive, context-aware configurations for optimizing reasoning in language models.

Diagnosing On-Policy Distillation: When and Why Teacher Guidance Improves Reasoning Models

Introduction

This work presents a comprehensive analysis of on-policy distillation (OPD) for LLM reasoning, with a focus on mathematically characterizing when teacher supervision actually improves student model performance at the token level. The authors propose a diagnostic pipeline that quantitatively distinguishes beneficial, neutral, and detrimental teacher guidance for each token decision, a resolution previously absent in distillation analysis. The framework defines an "ideal" gradient at each token: the direction that maximally increases the student’s probability of success, estimated empirically through targeted rollouts. The alignment between this ideal gradient and the actual distillation gradient provides a token-level diagnostic of how useful a given teacher configuration is.

This essay discusses the methodology for computing these alignment scores, summarizes the empirical results across tasks and model scales, and critically assesses the practical and theoretical implications.

Theoretical Framework and Methodology

The core contribution is the definition of the "gradient alignment score": a measure of how closely the teacher’s guidance direction matches the empirically optimal update at each token and context. Let $P_\theta^k$ be the student’s probability of token $k$ at node $u$, and let $P_{\mathrm{succ}}^k$ be the empirical probability that, after sampling token $k$ at node $u$, a student trajectory reaches a correct answer. The ideal objective is $L_\mathrm{ideal}(u) = \sum_{k} P_\theta^k\, P_{\mathrm{succ}}^k$, the student’s expected success probability at $u$. Differentiating it yields the per-token reference gradient for updating student logits (see Figure 1).

Figure 1: Computing the gradient alignment score at a branching node $u$; empirical estimates of $P_{\mathrm{succ}}^k$ inform the ideal gradient, which is compared in direction (cosine) to the distillation gradient for that teacher configuration.
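The computation in Figure 1 can be illustrated in logit space. The following is a minimal sketch, not the authors' implementation. It assumes the student distribution is a softmax over logits, in which case the gradient of $L_\mathrm{ideal}(u)$ with respect to logit $j$ is $P_\theta^j\,(P_{\mathrm{succ}}^j - L_\mathrm{ideal}(u))$. It further assumes a forward-KL distillation loss, whose descent direction in logit space is the teacher distribution minus the student distribution. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def ideal_logit_gradient(logits, p_succ):
    """Gradient of L_ideal = sum_k p_k * P_succ^k with respect to the logits."""
    p = softmax(logits)
    p_succ = np.asarray(p_succ, dtype=float)
    expected_success = float(p @ p_succ)  # value of L_ideal at this node
    return p * (p_succ - expected_success)

def distill_logit_direction(logits, teacher_probs):
    """Descent direction of KL(teacher || student) w.r.t. logits.

    Assumption: forward KL is used here for illustration only.
    """
    return np.asarray(teacher_probs, dtype=float) - softmax(logits)

def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def gradient_alignment(logits, p_succ, teacher_probs):
    """Cosine between the ideal gradient and the distillation direction."""
    return cosine(ideal_logit_gradient(logits, p_succ),
                  distill_logit_direction(logits, teacher_probs))
```

In this sketch, a teacher that places its mass on the token with the highest estimated success probability yields an alignment near 1, while a teacher that favors a failing token yields a negative score.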

The diagnostic is made computationally tractable by a targeted, exponentially windowed rollout scheme that concentrates sampling on the critical branches of the generation tree. The alignment score (cosine similarity) between the ideal and teacher-induced gradients provides a local, actionable signal about teacher utility, bypassing aggregate performance metrics that obscure such fine structure.
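To make the role of rollouts concrete, the sketch below estimates $P_{\mathrm{succ}}^k$ by plain, uniform Monte Carlo sampling. This is deliberately simpler than the targeted, exponentially windowed scheme described above, whose details are not given in this summary. The callback `rollout_is_correct` is a hypothetical stand-in for "force token $k$ at node $u$, let the student finish the trajectory, and check the answer".

```python
import random

def estimate_p_succ(candidate_tokens, rollout_is_correct, n_rollouts=32, seed=0):
    """Estimate the success probability after each candidate token.

    candidate_tokens: tokens to branch on at the node.
    rollout_is_correct: hypothetical callable (token, rng) -> bool that runs
        one student continuation after `token` and reports correctness.
    n_rollouts: number of continuations sampled per token (an assumption;
        the summary does not state a budget).
    """
    rng = random.Random(seed)
    estimates = {}
    for tok in candidate_tokens:
        wins = sum(bool(rollout_is_correct(tok, rng)) for _ in range(n_rollouts))
        estimates[tok] = wins / n_rollouts
    return estimates
```

A uniform budget like this wastes rollouts on branches whose outcome is nearly certain, which is presumably the inefficiency a targeted scheme is meant to avoid.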

Main Findings

1. Gradient Alignment is Highest on Incorrect Trajectories

A strong and consistent pattern is observed across model scales and benchmarks: the teacher’s gradient is significantly more aligned with the ideal reference on incorrect trajectories than on correct ones. That is, when the student deviates from a path leading to success, the teacher’s influence more reliably directs probability mass toward tokens that are empirically associated with correct completion.

Figure 2: Distribution of gradient alignment as a function of path correctness, showing that alignment is notably higher on incorrect rollouts across teacher types and evaluation metrics.

2. Teacher Efficacy Depends on Student Capacity and Context Comprehensibility

A nuanced relationship emerges between teacher configuration and student model capacity. For smaller students (e.g., Qwen3-0.6B), self-distillation using correct in-context demonstrations (preferably in the model’s own style) is the most reliably aligned teacher. Signals from larger external teachers are often harder for these students to interpret and hence less effective. In contrast, larger students (e.g., Qwen3-1.7B) benefit more from external teachers, as their capacity enables them to exploit distributional differences and nontrivial reasoning styles.

Summarizing the demonstrations adds a further complication: compressed demonstrations double alignment for larger students, whereas smaller students require full, stepwise traces. This points to a clear interaction between context complexity and student comprehension.

Figure 3: Teacher ranking by gradient alignment for Qwen3-0.6B student on MMLU, illustrating the dominance of self-distillation with correct demonstrations.

Figure 4: Teacher ranking in additional settings; for the 1.7B student on BoolQ, external teachers outperform self-distillation, while on MMLU, self-distillation is superior.

3. No Universally Optimal Teacher: Task and Instance Dependency

There is no single teacher or context design that is universally optimal. Optimal configuration shifts with model scale, dataset, and the inherent difficulty and reasoning structure of each task. For example, inclusion of contrastive (wrong) demonstrations may hurt on short-form tasks (BoolQ/MMLU) but help in complex settings (AIME 2025), where common errors are instructive. This mandates per-task and even per-instance diagnostic analysis.

Figure 5: Teacher ranking by gradient alignment for AIME 2025 questions; the best configuration varies with question complexity.

Additional Analyses

Predictors of Alignment

While one might hope for simple proxies (e.g., the KL divergence between student and teacher probabilities) to predict where teacher guidance is helpful, the study finds only weak correlations, suggesting that high divergence may be necessary but is not sufficient for positive alignment. Alignment also tends to improve with depth into a reasoning chain, where reasoning complexity is higher than in prompt or boilerplate regions.
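A proxy check of this kind can be sketched as follows. This is an illustrative assumption about how such an analysis might be set up, not the paper's procedure: it computes a per-token KL divergence (the direction, student relative to teacher, is chosen arbitrarily here since the summary does not specify it) and then the Pearson correlation between the proxy and the alignment scores.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def proxy_correlation(proxy_values, alignment_scores):
    """Pearson correlation between a per-token proxy and alignment scores."""
    return float(np.corrcoef(proxy_values, alignment_scores)[0, 1])
```

A correlation near zero over many tokens would indicate that the proxy carries little information about where the teacher helps, which is the pattern the study reports.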

Selective Distillation

An upper-bound analysis shows that, if one could restrict distillation updates to tokens with positive alignment, the effective gradient budget would improve substantially even when using only ~50% of the updates. Such oracle filtering is infeasible at training time, because it requires the ideal gradient, but it suggests promise in using divergence-based or outcome-sensitive heuristics as approximations.
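The oracle filter can be sketched in a few lines. The summary does not define "effective gradient budget", so this sketch reports mean alignment before and after filtering as a stand-in; that choice, and the function name, are assumptions.

```python
import numpy as np

def oracle_filter(alignment_scores):
    """Keep only tokens whose alignment with the ideal gradient is positive.

    Returns the fraction of updates kept and the mean alignment with and
    without filtering. Requires oracle alignment scores, so it is an
    upper-bound analysis rather than a usable training rule.
    """
    scores = np.asarray(alignment_scores, dtype=float)
    keep = scores > 0.0
    kept_mean = float(scores[keep].mean()) if keep.any() else float("nan")
    return {
        "kept_fraction": float(keep.mean()),
        "mean_alignment_all": float(scores.mean()),
        "mean_alignment_kept": kept_mean,
    }
```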

Implications and Future Directions

Practically, these results caution against monolithic distillation pipelines. Instead, they favor adaptive, diagnostic-driven selection of teacher configurations, possibly modulated at the per-token or per-task level. The heightened utility of teacher signal on incorrect trajectories motivates algorithms that upweight distillation loss on failing rollouts. Multi-teacher or mixture-of-expert configurations may further boost alignment by combining complementary strengths.
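One simple way to act on the incorrect-trajectory finding is to weight the per-token distillation loss by rollout outcome. The sketch below is a hypothetical illustration, not a method from the paper; the weights `w_incorrect` and `w_correct` are arbitrary placeholder values.

```python
import numpy as np

def weighted_distill_loss(per_token_loss, rollout_correct,
                          w_incorrect=1.0, w_correct=0.25):
    """Weighted mean of per-token distillation losses across rollouts.

    per_token_loss: list of 1-D sequences, one per rollout.
    rollout_correct: list of bools, True if the rollout reached a correct answer.
    w_incorrect, w_correct: hypothetical weights; failing rollouts get more.
    """
    total, weight_sum = 0.0, 0.0
    for losses, ok in zip(per_token_loss, rollout_correct):
        w = w_correct if ok else w_incorrect
        losses = np.asarray(losses, dtype=float)
        total += w * losses.sum()
        weight_sum += w * losses.size
    return total / weight_sum if weight_sum > 0 else 0.0
```

Setting both weights equal recovers the ordinary token-averaged loss, which makes the effect of the reweighting easy to ablate.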

Theoretically, all major distillation and reinforcement-style reward objectives are shown to share a common per-token gradient structure, supporting the use of alignment diagnostics for the entire family of token-level training algorithms. This unification will be fundamental as the community continues to search for efficient, scalable, and robust methods for post-training reasoning model improvement.

Conclusion

This paper establishes a rigorous, high-resolution diagnostic for evaluating on-policy distillation algorithms, providing concrete evidence that the benefit of teacher guidance is heterogeneous, primarily realized on student failures, and tightly coupled to capacity and context comprehensibility. No single teacher or context suffices across tasks and conditions. The framework is a strong candidate for a standard tool in analyzing and developing future distillation approaches, and it motivates (i) adaptive, context-aware pipelines, (ii) real-time diagnostic-informed training strategies, and (iii) rigorous ablation for teacher selection at the per-token level.

The approach also sets a foundation for more general mechanistic interpretability studies of gradient-based learning signals in LLM training regimes.
