Process reward modeling for intermediate steps without direct supervision

Develop a principled process reward modeling scheme for reinforcement learning with verifiable rewards (RLVR) that assigns appropriate rewards to intermediate reasoning steps without direct step-level supervision, and in a manner that does not depend on the chosen step segmentation.

Background

Beyond final-answer verification, RLVR could benefit from process-level rewards that evaluate intermediate reasoning steps. When step-level ground truth is unavailable, however, it is unclear how to design reward-assignment policies that do not depend on an arbitrary choice of step segmentation.
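To make the segmentation issue concrete, the sketch below (not from the paper; `value_fn`, `toy_value_fn`, and the boundary indices are hypothetical) shows one well-known construction, potential-based step rewards, whose total telescopes to the same value no matter where step boundaries are placed:

```python
# Illustrative sketch (not the paper's method): potential-based step rewards
# telescope, so the total reward is invariant to how the trace is segmented.
# `value_fn` is a hypothetical learned scorer mapping a reasoning prefix to
# a scalar estimate of eventual success, e.g. P(final answer is correct).

def step_rewards(trace: str, boundaries: list[int], value_fn) -> list[float]:
    """Reward for each step = change in estimated value across the step."""
    cuts = [0] + sorted(boundaries) + [len(trace)]
    values = [value_fn(trace[:c]) for c in cuts]
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]

def toy_value_fn(prefix: str) -> float:
    return len(prefix) / 100.0  # stand-in for a learned value estimate

trace = "Step A ... Step B ... Step C ... final answer"
coarse = step_rewards(trace, [20], toy_value_fn)
fine = step_rewards(trace, [10, 20, 30], toy_value_fn)
# Rewards always sum to value_fn(trace) - value_fn(""), so segmentation
# only changes how credit is split, never the total credit assigned.
assert abs(sum(coarse) - sum(fine)) < 1e-9
```

Under this construction, segmentation affects only how credit is divided among steps, not the total credit, which is one precise sense in which a reward scheme can be segmentation-independent.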

The authors explicitly raise this as an open question tied to the broader uncertainty about the necessity of chain-of-thought (CoT) rationales, highlighting the need for methods that assign rewards to reasoning steps in the absence of direct supervision.
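One existing family of methods that fits this description estimates step values by Monte Carlo rollouts scored against the verifiable final-answer reward, in the spirit of Math-Shepherd-style process reward modeling. The sketch below is a minimal illustration of that idea, not the paper's proposal; `policy` and `verify` are hypothetical stand-ins for a generator and a final-answer checker:

```python
import random

def estimate_prefix_value(prefix, policy, verify, n_rollouts=8):
    """Estimate P(correct final answer | prefix) via sampled continuations."""
    wins = sum(verify(policy(prefix)) for _ in range(n_rollouts))
    return wins / n_rollouts

def mc_step_rewards(steps, policy, verify, n_rollouts=8):
    """Per-step reward = change in estimated success probability, so a step
    is rewarded for raising the chance the verifiable answer comes out right."""
    prefix = ""
    values = [estimate_prefix_value(prefix, policy, verify, n_rollouts)]
    for step in steps:
        prefix += step
        values.append(estimate_prefix_value(prefix, policy, verify, n_rollouts))
    return [values[i + 1] - values[i] for i in range(len(steps))]

# Toy stand-ins: a "policy" that succeeds more often given longer prefixes,
# and a verifier that checks the sampled completion.
policy = lambda p: "correct" if random.random() < min(1.0, len(p) / 40) else "wrong"
verify = lambda completion: completion == "correct"
print(mc_step_rewards(["step one. ", "step two. ", "answer: 42."], policy, verify))
```

Because only the final-answer verifier supplies supervision, the step-level signals here are noisy Monte Carlo estimates; note also that this estimator is again telescoping, so the summed step rewards depend only on the endpoints of the trace, not on the segmentation.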

References

While CoT has proven useful in both reference-based~\citep{team2025kimi} and reference-free~\citep{zhang2024generative} settings, it remains an open question how necessary in-depth rationales are for assessing semantic equivalence between reference answers and model responses in the same language, particularly when focusing on the conclusive part of each response. This also raises a related question for process reward modeling~\citep{lightman2023let} in RLVR: how should rewards be assigned when there is no direct supervision for intermediate steps, regardless of the step segmentation method?

Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2503.23829 - Su et al., 31 Mar 2025) in Section 7 (Discussions and Conclusions)