Stabilizing LUFFY on hard problems with human reference solutions

Develop training procedures or algorithmic modifications that stabilize the LUFFY (Learning to Reason under Off-Policy Guidance) method on hard reasoning problems with human reference solutions, enabling reliable empirical comparison and evaluation in this setting.

Background

The authors sought to compare POPE against LUFFY, an approach that incorporates oracle solutions as rollouts during RL. However, in their hard-problem setup with human reference solutions, they could not achieve stable training for LUFFY and thus omitted the comparison.
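For concreteness, a minimal sketch of the mechanism described above, assuming a GRPO-style group: the oracle (human reference) solution is appended to the on-policy rollouts, advantages are normalized over the mixed group, and a policy-shaping transform reweights off-policy tokens. The function names, reward convention, and gamma value below are illustrative assumptions, not the paper's or LUFFY's exact implementation.

    import torch

    def mixed_group_advantages(on_policy_rewards: torch.Tensor,
                               oracle_reward: float = 1.0) -> torch.Tensor:
        # Append the oracle trace's reward (typically 1.0, since the human
        # reference solution is assumed correct) to the on-policy rollout
        # rewards, then compute group-normalized advantages over the mix.
        rewards = torch.cat([on_policy_rewards, torch.tensor([oracle_reward])])
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def shaped_off_policy_weight(ratio: torch.Tensor,
                                 gamma: float = 0.1) -> torch.Tensor:
        # LUFFY-style policy shaping f(r) = r / (r + gamma), applied to the
        # importance ratio of off-policy (oracle) tokens so that tokens the
        # current policy assigns low probability still get usable gradient.
        return ratio / (ratio + gamma)

    # Example: 7 on-policy rollouts that all fail (reward 0) plus the oracle.
    adv = mixed_group_advantages(torch.zeros(7))
    print(adv)  # the oracle trace receives the only positive advantage

As the example hints, on hard problems where every on-policy rollout fails, the oracle trace carries essentially all of the positive advantage, concentrating the gradient on off-policy tokens; this is a plausible, though unconfirmed, source of the instability the authors observed.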

Resolving this training instability would clarify LUFFY's performance relative to POPE in the same hard-problem regime and inform best practices for leveraging oracle rollouts.

References

We also attempted to compare to LUFFY~\citep{yan2025learning}, which incorporates the oracle solution directly as a rollout during RL but were unable to make it train stably on our hard problems with human reference solutions; hence we skip this comparison for now.

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (2601.18779 - Qu et al., 26 Jan 2026) in Section 6 (Experimental Evaluation), Result 4