Reliable assessment of intermediate reasoning steps without ground-truth labels
Establish robust methods for assessing the correctness of intermediate reasoning steps in solutions to mathematical reasoning and related tasks when step-level ground-truth labels are unavailable, so that these assessments can be used for training and evaluation in reinforcement learning with verifiable rewards.
References
This is because reliably assessing the correctness of individual steps remains an open challenge, particularly when these steps may lack ground-truth labels in real-world scenarios.
— Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
(2503.23829 - Su et al., 31 Mar 2025) in Section 2.1 (Related Work: Reward Estimation in Reinforcement Learning with Verifiable Rewards)