Self-Distilled RLVR: Stable Credit Assignment Without Leakage

This presentation examines a critical failure mode in reinforcement learning for reasoning language models and introduces RLSD, a principled solution that decouples direction from magnitude in token-level credit assignment. The talk reveals how on-policy self-distillation suffers from irreducible information leakage that causes performance collapse, then demonstrates how RLSD structurally eliminates this pathology while achieving superior results on multimodal reasoning benchmarks.
Script
Training reasoning models with reinforcement learning runs into a hidden trap. When you let a model teach itself using privileged information it won't have at test time, something insidious happens: the model starts leaking that privileged context into its predictions, and performance collapses after initial gains.
On-policy self-distillation creates an asymmetric information flow. The teacher version of the model observes the correct answer while generating predictions, but the student must operate without it. This gap is not just a performance issue: it is formally irreducible, quantifiable as mutual information that cannot be matched away, and it directly contaminates gradient estimates with spurious mappings from input to privileged context.
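One standard way to make this irreducibility concrete (a sketch of the usual information-theoretic argument, not necessarily the paper's exact statement): let $x$ be the input, $a$ the privileged answer the teacher sees, and $y$ the generated tokens. The best any answer-blind student $q(y \mid x)$ can do against the answer-conditioned teacher $p(y \mid x, a)$ is bounded by a conditional mutual information:

$$ \min_{q(y \mid x)} \; \mathbb{E}_{x,a}\,\mathrm{KL}\!\big(p(y \mid x, a)\,\big\|\,q(y \mid x)\big) \;=\; I(y; a \mid x). $$

Whenever the answer carries information about the target tokens beyond what the input provides, this quantity is strictly positive, so the divergence cannot be driven to zero no matter how long the student trains.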
The researchers introduce RLSD to structurally resolve this failure mode.
RLSD preserves the directional integrity of standard reinforcement learning with verifier rewards, but exploits the privileged teacher signal exclusively for magnitude modulation. Per-token weights are computed as the exponential of the log probability difference between teacher and student, scaled by the sign of the sequence-level advantage. This Bayesian-grounded design ensures that tokens advancing correct reasoning receive proportionally stronger credit, while direction remains anchored to the external reward.
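The weighting rule just described can be sketched in a few lines. This is a minimal illustration assuming per-token log-probabilities are already available; the function name is hypothetical and the paper may apply additional normalization or clipping.

```python
import math

def rlsd_token_weights(teacher_logp, student_logp, advantage):
    """Per-token credit weights, as described in the talk (sketch only).

    Direction comes solely from the sign of the verifier-based
    sequence-level advantage; magnitude comes from the exponential of
    the teacher/student log-probability gap on each token.
    """
    direction = 1.0 if advantage > 0 else (-1.0 if advantage < 0 else 0.0)
    # exp(log p_teacher - log p_student) = p_teacher / p_student per token
    return [direction * math.exp(t - s)
            for t, s in zip(teacher_logp, student_logp)]

# Tokens the privileged teacher rates higher than the student receive
# amplified credit; tokens it rates lower receive damped credit.
teacher = [math.log(0.9), math.log(0.5), math.log(0.2)]
student = [math.log(0.3), math.log(0.5), math.log(0.4)]
weights = rlsd_token_weights(teacher, student, advantage=1.0)
# weights ≈ [3.0, 1.0, 0.5]
```

Note how the design separates concerns: flipping the advantage sign flips every weight's direction at once, while the teacher signal only rescales magnitudes and can never push a token's gradient against the external reward.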
The empirical evidence is striking. While on-policy self-distillation peaks early then deteriorates, and standard GRPO climbs slowly, RLSD combines the best of both worlds. It converges faster than GRPO while maintaining the stability that self-distillation sacrifices. At 200 training steps, RLSD already surpasses GRPO trained for twice as long, demonstrating both efficiency and a fundamentally higher performance ceiling on multimodal reasoning tasks.
On Qwen3-VL-8B-Instruct, RLSD achieves the highest accuracy across MathVista, MathVision, and related datasets, with the largest gains on tasks requiring fine-grained logical reasoning. Token-level attribution analysis reveals that RLSD precisely credits tokens controlling correct transitions and penalizes those introducing errors. Crucially, the framework is immune to leakage at both the gradient and support levels, resolving a structural trilemma that has constrained all prior distribution-matching self-distillation methods.
RLSD demonstrates that dense, privileged supervision and stable policy optimization are not mutually exclusive. By separating what guides direction from what tunes magnitude, the framework opens a path for scalable, verifiable reasoning without the computational overhead of external reward models. Visit EmergentMind.com to explore this paper further and create your own research video.