Self-Distillation Policy Optimization (SDPO)
- SDPO is a reinforcement learning paradigm where a single model acts as both student and teacher by utilizing rich, self-generated feedback signals.
- It incorporates KL divergence-based losses and per-token credit assignment across applications like sequence generation, diffusion models, and classic RL control.
- Empirical results show that SDPO improves sample efficiency, convergence speed, and robustness in tasks ranging from mathematical reasoning to biomolecular design.
Self-Distillation Policy Optimization (SDPO) refers to a family of reinforcement learning (RL) and fine-tuning algorithms in which a single model—by leveraging contextualization, privileged information, or delayed/lazy copies—serves concurrently as both the "student" and a form of "teacher." SDPO uses self-generated signals, such as rich feedback, ground-truth traces, or reward-weighted policy variants, as dense learning targets. This approach is applicable across discrete sequence generation, diffusion models, and classic RL control, and has been shown to improve sample efficiency, credit-assignment granularity, and robustness compared to conventional RL with sparse rewards or standard (teacher-led) distillation.
1. Problem Settings and Motivations
SDPO is motivated by the limitations of conventional RL methods that operate with terminal, often binary, scalar rewards. In sequence modeling domains (such as code generation or mathematical reasoning), feedback from the environment often contains rich, structured information—e.g., runtime errors, judge comments, or ground-truth reasoning traces—beyond mere pass/fail signals (Hübotter et al., 28 Jan 2026, Zhao et al., 26 Jan 2026). In high-dimensional generative modeling (e.g., biomolecular design via diffusion models), target rewards can be non-differentiable and specific to scientific desiderata (Su et al., 1 Jul 2025).
Typical SDPO problem formulations include:
- Sequence generation as an MDP:
- State $s$: prompt or context (e.g., a programming problem).
- Action sequence $y = (y_1, \ldots, y_T)$: the generated tokens.
- Environment transition: deterministic token appending.
- Reward $r(x, y)$: terminal binary or continuous signal.
- Reinforcement Learning with Rich Feedback (RLRF):
- Augments classic RLVR (Reinforcement Learning with Verifiable Rewards) with per-attempt textual feedback $f$, utilized for in-situ policy improvement (Hübotter et al., 28 Jan 2026).
- Diffusion policy fine-tuning:
- Model produces a sample $x_0$ via a sequence of denoising steps; the reward $r(x_0)$ is only revealed on completion (Su et al., 1 Jul 2025).
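The sequence-generation MDP above can be sketched as a tiny environment; the class and method names here are illustrative, not from any of the cited systems:

```python
class TokenAppendEnv:
    """Sequence generation as an MDP: state = prompt plus generated tokens,
    action = next token, transition = append, reward only at termination."""

    def __init__(self, prompt, reward_fn, eos=0, max_len=16):
        self.prompt, self.reward_fn = prompt, reward_fn
        self.eos, self.max_len = eos, max_len

    def reset(self):
        self.tokens = []
        return tuple(self.prompt)

    def step(self, token):
        # Transition: append the chosen token to the sequence so far.
        self.tokens.append(token)
        done = token == self.eos or len(self.tokens) >= self.max_len
        # Terminal-only reward, as in the RLVR setting.
        reward = self.reward_fn(self.tokens) if done else 0.0
        return tuple(self.prompt) + tuple(self.tokens), reward, done
```

The sparse terminal reward is exactly what makes credit assignment hard here, and what the self-distillation signal is meant to densify.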
The SDPO paradigm enables more informative, context-aware credit assignment and allows for self-improvement without the need for costly, separate, external teacher models.
2. SDPO Objectives and Algorithmic Variants
SDPO generalizes the policy optimization objective by integrating a KL divergence–based distillation loss computed between alternative conditionings or evolutions of the same model, optionally regularized or combined with traditional RL objectives.
2.1 RL Sequence Models with Self-Distillation
The typical SDPO loss in an RL-rich-feedback setting is

$$\mathcal{L}_{\mathrm{SDPO}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left( \pi_{\bar\theta}(\cdot \mid x, f, y_{<t}) \,\Big\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right) \right],$$

where the "self-teacher" $\pi_{\bar\theta}(\cdot \mid x, f, \cdot)$ conditions a lazy copy of the same model on both the original prompt $x$ and the feedback string $f$. The algorithm samples a batch of rollouts, collects feedback, computes log-probabilities under both the student and teacher conditionings, and minimizes the sum of per-token, per-sample KL divergences (Hübotter et al., 28 Jan 2026).
2.2 On-Policy Self-Distillation from Privileged Context
For settings where a ground-truth solution $g$ is available, such as mathematical reasoning, the loss takes the analogous per-token form

$$\mathcal{L}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left( \pi_{\bar\theta}(\cdot \mid x, g, y_{<t}) \,\Big\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right) \right],$$

with
- Student: $\pi_\theta(\cdot \mid x, y_{<t})$, conditioned on the prompt alone.
- Teacher: $\pi_{\bar\theta}(\cdot \mid x, g, y_{<t})$, additionally conditioned on the privileged solution trace.
This framework enforces token-level agreement between the student's own rollouts and the teacher's distribution, privileged by access to solution traces (Zhao et al., 26 Jan 2026).
2.3 Classic RL with Self-Distillation Regularization
Proximal Policy Distillation (PPD)—which specializes to SDPO when the student and teacher share an architecture—augments the PPO surrogate loss with a distillation term:

$$\mathcal{L}_{\mathrm{PPD}} = \mathcal{L}_{\mathrm{PPO}} + \lambda \, \mathbb{E}_{s_t} \left[ D_{\mathrm{KL}}\!\left( \pi_{\mathrm{teacher}}(\cdot \mid s_t) \,\|\, \pi_{\mathrm{student}}(\cdot \mid s_t) \right) \right].$$

Here $\lambda$ weights the distillation term, and rollouts are always collected with the student (Spigler, 2024).
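A minimal sketch of the combined PPD objective, assuming precomputed action log-probabilities, advantages, and full action distributions at the student-visited states (the function signature and the default clip value of 0.2 are illustrative assumptions):

```python
import numpy as np

def ppd_loss(logp_new, logp_old, adv, student_probs, teacher_probs,
             clip_eps=0.2, lam=1.0):
    """PPO clipped surrogate plus a self-distillation penalty (minimized).

    logp_new/logp_old: [T] log-probs of the taken actions under the current
        and rollout-time student policies.
    student_probs/teacher_probs: [T, A] action distributions at the states
        visited by the *student's* rollouts (PPD never rolls out the teacher).
    """
    # Standard PPO clipped surrogate, negated so that lower is better.
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    ppo_term = -surrogate.mean()
    # Distillation term: KL(teacher || student) on the same states.
    kl = (teacher_probs * (np.log(teacher_probs)
                           - np.log(student_probs))).sum(axis=-1)
    return ppo_term + lam * kl.mean()
```

When student and teacher distributions coincide, the distillation term vanishes and the loss reduces to plain PPO.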
2.4 Diffusion Models: Reward-Weighted Self-Distillation
Fine-tuning diffusion models under non-differentiable rewards uses an iterative (off-policy) self-distillation loop:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \tilde{\pi}} \left[ -\log \pi_\theta(x) \right], \qquad \tilde{\pi}(x) \propto \pi_{\theta_{\mathrm{old}}}(x) \, \exp\!\left( r(x)/\tau \right),$$

where $\tilde{\pi}$ is a reward-weighted teacher built from a frozen copy $\pi_{\theta_{\mathrm{old}}}$ of the student and $\tau$ is a softening temperature (Su et al., 1 Jul 2025). The policy being distilled is a delayed, value-weighted variant of the original student.
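In practice, the reward-weighted teacher can be realized by self-normalized importance weighting over samples drawn from the frozen student, then resampling a distillation batch. The following sketch assumes that interface (the function names are illustrative):

```python
import numpy as np

def teacher_weights(rewards, tau=1.0):
    """Self-normalized weights w_i proportional to exp(r_i / tau): these turn
    samples from the frozen student into (approximate) samples from the
    reward-weighted teacher via resampling."""
    z = np.asarray(rewards, dtype=float) / tau
    z -= z.max()                  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def resample_teacher_batch(samples, rewards, tau=1.0, rng=None):
    """Draw a distillation batch from the reward-weighted teacher."""
    rng = rng or np.random.default_rng(0)
    w = teacher_weights(rewards, tau)
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return [samples[i] for i in idx]
```

Larger `tau` softens the weighting toward uniform (preserving diversity); smaller `tau` sharpens it toward the highest-reward samples.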
3. Theoretical Properties and Credit Assignment
The distinctive feature of SDPO is its logit- or token-level credit assignment capability. Standard policy-gradient methods (e.g., REINFORCE, PPO) propagate a single sequence-level or group-relative advantage to all tokens in a rollout, which limits granularity and slows learning in long-horizon or sparse-reward settings (Hübotter et al., 28 Jan 2026).
By contrast, SDPO computes dense, per-token advantages of the form

$$A_t = \log \pi_{\bar\theta}(y_t \mid x, f, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})$$

for self-distillation from feedback, or the analogous difference with the privileged solution trace $g$ in place of the feedback $f$ for privileged-context distillation. These advantages permit sharply localized credit assignment—strengthening or suppressing specific logit activations according to the model's retrospective evaluation conditioned on richer feedback or solution traces (Zhao et al., 26 Jan 2026, Hübotter et al., 28 Jan 2026).
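This per-token advantage is just a difference of log-probabilities of the sampled tokens under the two conditionings; a one-function sketch (the name is illustrative):

```python
import numpy as np

def per_token_advantages(student_logprobs, teacher_logprobs):
    """A_t = log pi(y_t | x, f, y_<t) - log pi(y_t | x, y_<t).

    Positive where the feedback-conditioned teacher deems the sampled token
    more likely than the unconditioned student did (reinforce), negative
    where it deems it less likely (suppress)."""
    return np.asarray(teacher_logprobs) - np.asarray(student_logprobs)
```

Unlike a single sequence-level advantage, the sign and magnitude here vary token by token, which is the source of SDPO's finer credit assignment.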
SDPO with off-policy/forward-KL (e.g., in diffusion models) promotes mode covering and is less prone to mode collapse compared to on-policy/reverse-KL optimization, facilitating stable exploration and more reliable policy improvement (Su et al., 1 Jul 2025).
4. Empirical Results Across Domains
SDPO yields consistent benefits across diverse benchmarks and modalities.
| Domain/Task | Metric | SDPO Gain Over Baselines |
|---|---|---|
| Scientific reasoning | avg@16 | +5–25 pp accuracy, 3× shorter output |
| Competitive programming | pass@1, discovery@k | ¼ the generations, +7.6 pp accuracy |
| Tool use | avg@16 | Faster convergence |
| Mathematical reasoning | token-efficiency, acc | 4–8× better token efficiency |
| Atari, MuJoCo, Procgen | Geometric-mean returns | 1.01×–1.21× teacher (self-distillation) |
| Biomolecular design | Reward, diversity | 2×–10× reward, stable diversity |
Specific findings include:
- On chemistry reasoning, Qwen3-8B achieved 70.1% avg@16 after 5 h with SDPO versus 60.0% for GRPO; Olmo3-7B-Instruct reached 76.8% after 5 h (GRPO: 54.3%) (Hübotter et al., 28 Jan 2026).
- On LCB v6, SDPO attained 48.8% final versus GRPO 41.2%, matching GRPO's accuracy in ¼ of the generations (Hübotter et al., 28 Jan 2026).
- On mathematical reasoning, SDPO matched or exceeded GRPO while using only 1 rollout (vs. 8 for GRPO) and 4–8× fewer tokens (Zhao et al., 26 Jan 2026).
- In diffusion-model-based biomolecular design, reward metrics improved by 2×–10×, with convergence in half the queries of PPO baselines and no loss in diversity (Su et al., 1 Jul 2025).
- In classic RL tasks, SDPO achieved geometric mean performance of 1.19× teacher on Atari, with robustness to noisy teachers, exceeding both on-policy and off-policy vanilla distillation (Spigler, 2024).
5. Algorithmic Implementation and Hyperparameters
Key practical elements for successful SDPO deployment include:
- Batch size: Larger batches (32) aid stability; smaller (8–16) are effective for low-budget test-time distillation (Hübotter et al., 28 Jan 2026).
- KL computation: Token-wise/full-logit KL provides finer credit assignment versus sequence-level KL; top-K softmax approximations (K=20–100) offer memory savings with negligible accuracy drop (Hübotter et al., 28 Jan 2026).
- Teacher regularization: a small EMA rate for the lazy teacher copy (up to $0.05$), or explicit trust-region mixing with a reference checkpoint, stabilizes training (Hübotter et al., 28 Jan 2026).
- Optimizer: AdamW, with learning rates chosen per domain (substantially smaller for LLM fine-tuning than for classic RL and diffusion training) (Spigler, 2024, Hübotter et al., 28 Jan 2026, Su et al., 1 Jul 2025).
- Generation setup: In LLMs, prompts are truncated to a fixed maximum length, response lengths run up to 8192 tokens, and the sampling temperature is 1.0 during training and 0.6 for validation (Hübotter et al., 28 Jan 2026, Zhao et al., 26 Jan 2026).
- RL settings: standard PPO clipping, discount factors $\gamma$ up to $0.999$, and GAE advantage estimation (Spigler, 2024).
- Diffusion parameters: the softening temperature $\tau$ is tuned to trade off stability against reward sharpness, and a roll-in schedule anneals from exploration to exploitation (Su et al., 1 Jul 2025).
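Two of the stabilization mechanisms above, the top-K KL approximation and the EMA teacher update, can be sketched in a few lines of NumPy (function names and the renormalization convention are assumptions of the sketch):

```python
import numpy as np

def topk_kl(teacher_logits, student_logits, k=20):
    """Approximate per-position KL(teacher || student) using only the
    teacher's top-k tokens, renormalizing both distributions over that
    support; this cuts memory from O(V) to O(k) per position."""
    idx = np.argsort(teacher_logits)[-k:]   # teacher's top-k token indices

    def renorm(logits):
        z = logits[idx] - logits[idx].max()
        e = np.exp(z)
        return e / e.sum()

    p, q = renorm(teacher_logits), renorm(student_logits)
    return float((p * (np.log(p) - np.log(q))).sum())

def ema_update(teacher, student, rate=0.05):
    """Lazy-teacher regularization: theta_T <- (1-rate)*theta_T + rate*theta_S."""
    return [(1.0 - rate) * t + rate * s for t, s in zip(teacher, student)]
```

With `k` equal to the vocabulary size the approximation becomes exact, which is a useful sanity check when tuning `k`.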
Hardware implementations include 8×A100 GPUs, LoRA rank 8, bfloat16, gradient checkpointing, and FlashAttention2 for LLM-scale experiments (Zhao et al., 26 Jan 2026, Hübotter et al., 28 Jan 2026).
6. Extensions and Open Directions
SDPO is extensible in several directions:
- Hybridization with policy gradients: Linear interpolation between SDPO and group-normalized policy-gradient (GRPO) advantages can stabilize weaker models (Hübotter et al., 28 Jan 2026).
- Long-horizon and agentic feedback: SDPO frameworks accepting intermediate feedback (not just terminal) are being explored (Hübotter et al., 28 Jan 2026).
- Open-ended RL and non-verifiable tasks: Utilizing LLM-generated judge comments or other textual feedback in less structured environments (Hübotter et al., 28 Jan 2026).
- Scaling laws: Empirical gains from SDPO consistently increase with model scale, suggesting emergent benefits for very large models and multi-task RL (Hübotter et al., 28 Jan 2026, Zhao et al., 26 Jan 2026).
- Diffusion and off-policy RL: Off-policy SDPO variants using forward-KL objectives (as in diffusion model fine-tuning) maintain diversity and stability, outperforming on-policy RL in mode covering and sample efficiency (Su et al., 1 Jul 2025).
A plausible implication is that as environments and target objectives become more complex, the capacity for in-parameter, context-shifted "self-supervision" enables more resilient and scalable policy optimization than classical teacher-student or reward-only approaches.
7. Comparative Perspective and Practical Considerations
SDPO lies at the intersection of RL, knowledge distillation, and self-supervised learning. Its distinguishing features are:
- Self-improvement without external oracles: The teacher role is filled via feedback conditioning, solution traces, or reward reweighting applied to the model's own weights (Hübotter et al., 28 Jan 2026, Zhao et al., 26 Jan 2026, Su et al., 1 Jul 2025).
- Dense, per-token credit assignment: Improves convergence speed and reduces verbosity, particularly in sequence generation (Hübotter et al., 28 Jan 2026).
- Robustness: SDPO demonstrates resilience to imperfect signals and can recover or surpass original teacher performance in classic RL (Spigler, 2024).
- Sample/token efficiency: By exploiting rich, structured feedback, SDPO achieves higher accuracy and efficiency per environment query or token generated (Zhao et al., 26 Jan 2026, Su et al., 1 Jul 2025).
- Avoidance of mode collapse: Off-policy, forward-KL formulations in diffusion models maintain solution diversity even under sharp reward weighting (Su et al., 1 Jul 2025).
In summary, SDPO operationalizes the principle that models can meta-learn from their own errors or alternate perspectives, transforming environment feedback, privileged context, or reward-weighted subpolicies into dense, actionable learning signals—thereby accelerating and stabilizing policy improvement across diverse and challenging RL domains.