Fast and Effective On-policy Distillation from Reasoning Prefixes

This presentation covers a method for making on-policy language model distillation more efficient. The authors supervise only the early tokens of each student response, reducing computational cost by up to 47 times while maintaining performance. By applying teacher feedback to reasoning prefixes and gradually increasing their length during training, the approach avoids the resource-intensive full-length sampling of traditional on-policy distillation without sacrificing the quality gains that make on-policy learning valuable.
Script
Training a smaller language model to match a larger teacher has always required a brutal tradeoff: use the teacher's pre-generated responses and risk exposure bias, or sample fresh responses on-policy and watch your compute bill explode. The authors of this paper found a way out by asking a deceptively simple question: what if we only need to supervise the beginning?
On-policy distillation solves exposure bias by letting the student learn from its own generated trajectories, receiving dense token-level feedback from the teacher. But there's a cost: every training step requires generating complete responses, and for reasoning tasks with long chains of thought, that cost becomes unsustainable.
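To make "dense token-level feedback" concrete, here is a minimal sketch of a per-token distillation loss on a student-sampled trajectory. It assumes the feedback is a reverse KL divergence, KL(student || teacher), averaged over positions. The script does not say which divergence the authors use, so treat that choice and the function names as illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (assumed divergence)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def per_token_losses(student_logits_seq, teacher_logits_seq):
    """One loss value per token of the student's own sampled response."""
    return [reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)]

def on_policy_distill_loss(student_logits_seq, teacher_logits_seq):
    """Mean token-level loss over the full trajectory."""
    losses = per_token_losses(student_logits_seq, teacher_logits_seq)
    return sum(losses) / len(losses)
```

Because the loss is computed at every position of a fully generated response, both sampling and teacher scoring scale with response length, which is the cost the narration describes.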
The breakthrough came from recognizing where the learning signal actually lives.
The authors discovered that aligning the first few tokens of a response can guide the entire generation toward correctness. By supervising only prefixes and terminating sampling early, they cut the computational overhead while preserving the on-policy learning benefit. A scheduled increase in prefix length ensures the student progressively learns longer-horizon reasoning.
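The prefix idea can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the linear schedule, the parameter names (`start_len`, `max_len`), and the example values are all hypothetical, since the script only says that prefix length increases on a schedule.

```python
def prefix_length(step, total_steps, start_len, max_len):
    """Supervised prefix length at a training step (assumed linear schedule)."""
    if total_steps <= 1:
        return max_len
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(start_len + frac * (max_len - start_len)))

def prefix_loss(token_losses, k):
    """Average the per-token losses over only the first k tokens."""
    prefix = token_losses[:k]
    return sum(prefix) / len(prefix) if prefix else 0.0

# Example: the prefix grows from 64 to 1024 tokens over 101 steps.
schedule = [prefix_length(s, 101, 64, 1024) for s in (0, 50, 100)]
```

In an actual training loop the student would also stop sampling after `k` tokens, for example by capping the number of newly generated tokens at `k`. That early termination, rather than the loss truncation alone, is where the generation savings come from.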
The contrast is stark. While traditional on-policy distillation supervises every token in lengthy responses, prefix distillation focuses feedback on where it matters most. The result is a method that preserves learning quality while slashing computational requirements by up to 47 times.
The authors validated their approach on mathematical reasoning tasks and beyond. Prefix on-policy distillation matched the performance of full on-policy methods while delivering computational savings that make the technique practical to train, even on constrained compute budgets.
But the method isn't without trade-offs.
Very short prefixes risk under-supervising the tail of responses, which could matter for safety-critical applications. Yet the efficiency gains democratize on-policy distillation, and future work on adaptive scheduling could refine when and how much supervision each segment of a response receives.
This work does more than accelerate a training technique. It suggests that reasoning in language models front-loads critical decisions into early tokens, and that insight reshapes how we think about supervision efficiency. By making on-policy distillation practical, the authors have opened a pathway to smaller, faster models that retain more of the reasoning depth of their larger teachers.
Supervision doesn't have to be exhaustive to be effective. Sometimes the first few words contain everything you need to get the rest right. Visit EmergentMind.com to explore more research that rethinks the fundamentals of how models learn.