Stabilizing LUFFY on hard problems with human reference solutions
Develop training procedures or algorithmic modifications that stabilize the LUFFY (Learning to Reason under Off-Policy Guidance) method on hard reasoning problems that come with human reference solutions, so that reliable empirical comparison and evaluation become possible in this setting.
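To ground the setting: LUFFY mixes an off-policy oracle trace (here, the human reference solution) into the on-policy rollout group and, per the LUFFY paper, shapes the off-policy importance ratio with a regularized function of the form f(x) = x / (x + gamma). The sketch below is illustrative only: the function names, the group-normalized advantage form, and the choice of gamma are assumptions, not LUFFY's actual implementation.

```python
import numpy as np

def mixed_group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """GRPO-style group-normalized advantages over a mixed rollout group
    that contains both on-policy samples and off-policy (oracle) traces.
    Shapes and normalization are illustrative assumptions."""
    rewards = np.concatenate([on_policy_rewards, off_policy_rewards])
    # Normalize within the mixed group so the oracle trace's (typically
    # high) reward shifts the baseline for the on-policy samples.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def shaped_importance_weight(ratio, gamma=0.1):
    """Policy-shaping transform f(x) = x / (x + gamma) applied to the
    importance ratio on off-policy tokens; relative to the raw ratio it
    boosts low-probability tokens, which is the stabilization idea
    described in the LUFFY paper (gamma value here is illustrative)."""
    ratio = np.asarray(ratio, dtype=float)
    return ratio / (ratio + gamma)
```

On hard problems, one observed failure mode is that the oracle trace dominates the group baseline while its tokens remain low-probability under the policy, so how the shaped weight behaves at small ratios is exactly where stabilization work would focus.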
References
We also attempted to compare to LUFFY~\citep{yan2025learning}, which incorporates the oracle solution directly as a rollout during RL but were unable to make it train stably on our hard problems with human reference solutions; hence we skip this comparison for now.
— POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
(2601.18779 - Qu et al., 26 Jan 2026) in Section 6 (Experimental Evaluation), Result 4