SDPO for alignment in open-ended or continuous‑reward settings
Determine whether Self-Distillation Policy Optimization (SDPO), which distills a feedback-conditioned self-teacher into the policy, improves alignment in open-ended text generation and in continuous-reward tasks that lack a ground-truth verifier, by empirically testing whether its retrospection-based credit assignment carries over to such settings.
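The excerpt does not spell out SDPO's training objective, so the following is only a minimal, hypothetical sketch of the idea the statement names: the same model, conditioned on textual feedback about a prior attempt, acts as a "self-teacher" whose token distributions the unconditioned policy is distilled toward. All names (`sdpo_style_distillation_loss`, its tensor arguments) and the choice of a token-level KL are assumptions, and the model is assumed to be a Hugging Face-style causal LM returning `.logits`; the actual SDPO loss and its retrospection-based credit assignment may differ.

```python
# Hypothetical sketch of feedback-conditioned self-distillation (not the
# authors' SDPO implementation; names, shapes, and loss choice are assumptions).
import torch
import torch.nn.functional as F

def sdpo_style_distillation_loss(model, prompt_ids, feedback_ids, response_ids):
    """Distill a feedback-conditioned self-teacher into the plain policy.

    Teacher: the same model, conditioned on prompt + textual feedback about a
    prior attempt (the "retrospection" context).
    Student: the model conditioned on the prompt alone.
    The loss pulls the student's next-token distributions toward the teacher's
    on the response tokens. Padding/masking is omitted for brevity.
    """
    resp_len = response_ids.size(-1)

    # Student context: prompt followed by the response being distilled.
    student_input = torch.cat([prompt_ids, response_ids], dim=-1)
    # Teacher context: prompt, textual feedback, then the same response.
    teacher_input = torch.cat([prompt_ids, feedback_ids, response_ids], dim=-1)

    # Logits at position t predict token t + 1, so the logits that predict the
    # response tokens are the last resp_len positions shifted left by one.
    student_logits = model(student_input).logits[:, -resp_len - 1:-1]
    with torch.no_grad():  # stop-gradient: the teacher only supplies targets
        teacher_logits = model(teacher_input).logits[:, -resp_len - 1:-1]

    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # KL(teacher || student), summed over the vocabulary, averaged over
    # batch and response positions.
    kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(dim=-1)
    return kl.mean()
```

In the settings this open question targets, `feedback_ids` would encode graded textual feedback (e.g., a critique, or a continuous score rendered as text) rather than a verifier's pass/fail signal, which is exactly the regime whose effect on alignment remains to be evaluated.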
References
While we focused on verifiable code generation, many tasks provide textual feedback without a ground-truth verifier. Investigating whether SDPO's retrospection mechanism can improve alignment in open-ended text generation or continuous-reward tasks remains an open empirical question.
— Reinforcement Learning via Self-Distillation
(2601.20802 - Hübotter et al., 28 Jan 2026) in Conclusion, Limitations, and Future Work — Future Work (Beyond verifiable rewards)