Reliable assessment of intermediate reasoning steps without ground-truth labels

Develop methods that reliably assess the correctness of intermediate reasoning steps in solutions to mathematical reasoning and related tasks when step-level ground-truth labels are unavailable, so that these assessments can serve as training and evaluation signals in reinforcement learning with verifiable rewards.

Background

In reinforcement learning with verifiable rewards (RLVR), prior work and benchmarks commonly judge correctness from the final answer alone, because step-level labels are rarely available in real-world settings. This practice reflects a broader difficulty: there is no established, reliable way to verify individual reasoning steps without ground truth.
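To make the practice concrete, here is a minimal sketch (not taken from the paper) of an outcome-based verifiable reward: only the final answer is checked against the gold label, and intermediate steps are never inspected. The `####` answer marker and all function names are illustrative assumptions.

```python
import re
from typing import Optional

def extract_final_answer(solution: str) -> Optional[str]:
    """Return the last marked answer in a model solution, if any.
    Assumes answers are flagged with a '####' marker (an illustrative convention)."""
    matches = re.findall(r"####\s*(\S+)", solution)
    return matches[-1] if matches else None

def outcome_reward(solution: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the final answer matches the gold label.
    Intermediate reasoning steps are never checked, so a solution with flawed
    reasoning but a correct final answer still receives full reward."""
    pred = extract_final_answer(solution)
    return 1.0 if pred is not None and pred == gold_answer else 0.0

solution = "Step 1: 6 * 7 = 42. Step 2: therefore the result is 42. #### 42"
print(outcome_reward(solution, "42"))  # prints 1.0
```

The sketch illustrates why step-level assessment is the open problem: this reward is blind to whether any individual step was actually correct.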

The paper explicitly notes that reliably assessing the correctness of individual reasoning steps remains an open challenge, particularly in scenarios lacking ground-truth labels for those steps, underscoring the need for principled methods to evaluate intermediate steps in reasoning-intensive tasks.

References

"This is because reliably assessing the correctness of individual steps remains an open challenge, particularly when these steps may lack ground-truth labels in real-world scenarios."

Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2503.23829 - Su et al., 31 Mar 2025), Section 2.1 (Related Work: Reward Estimation in Reinforcement Learning with Verifiable Rewards)