Process reward modeling for intermediate steps without direct supervision
Develop a principled process reward modeling scheme for reinforcement learning with verifiable rewards (RLVR) that assigns appropriate rewards to intermediate reasoning steps without direct step-level supervision, and whose credit assignment is independent of how the reasoning trace is segmented into steps.
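One candidate family the segmentation-invariance requirement points to is potential-based shaping: if each step is rewarded by the change in a value estimate across its boundary, the step rewards telescope, so the total credit for a trace is identical under any segmentation. The sketch below illustrates this property; the `value_fn` scorer and all names are illustrative assumptions, not a method from the cited work.

\begin{verbatim}
from typing import Callable, List

def potential_based_step_rewards(
    prefixes: List[str],
    value_fn: Callable[[str], float],
) -> List[float]:
    """Reward for step i is V(prefix_{i+1}) - V(prefix_i).

    The rewards telescope: their sum is
    V(prefixes[-1]) - V(prefixes[0]) for ANY cut of the same
    trace into steps, so total credit is segmentation-invariant.
    """
    values = [value_fn(p) for p in prefixes]
    return [b - a for a, b in zip(values, values[1:])]

if __name__ == "__main__":
    trace = "step_a step_b step_c step_d"
    # Toy value estimate standing in for a learned verifier/value model.
    toy_value = lambda prefix: len(prefix.split()) / 4.0

    coarse = ["", "step_a step_b", trace]                 # 2 steps
    fine = ["", "step_a", "step_a step_b",
            "step_a step_b step_c", trace]                # 4 steps

    print(sum(potential_based_step_rewards(coarse, toy_value)))  # 1.0
    print(sum(potential_based_step_rewards(fine, toy_value)))    # 1.0
\end{verbatim}

Only the per-step attribution changes with the segmentation here; the aggregate signal does not, which is one way to formalize the independence asked for above.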
References
While CoT has proven useful in both reference-based~\citep{team2025kimi} and reference-free~\citep{zhang2024generative} settings, it remains an open question how necessary in-depth rationales are for assessing semantic equivalence between reference answers and model responses in the same language, particularly when the comparison focuses on the conclusive part of each response. This raises a related question for process reward modeling~\citep{lightman2023let} in RLVR: how should rewards be assigned to intermediate steps when no direct supervision for them exists, regardless of the step segmentation method?
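To make the rationale-depth question concrete, the weakest baseline is a reference-based check on the conclusive part alone, with no chain of thought at all; a system could fall back to a judge model only when this string-level check fails. A minimal sketch follows; the marker pattern and normalization rules are illustrative assumptions, not taken from the cited work.

\begin{verbatim}
import re

def extract_conclusion(response: str) -> str:
    """Take the conclusive part of a response: the text after a
    'final answer' marker if present, else the last non-empty line."""
    match = re.search(r"(?:final answer|answer)\s*[:=]\s*(.+)",
                      response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def normalize(text: str) -> str:
    """Light normalization so surface variants compare equal."""
    return re.sub(r"[\s\.,;:]+", " ", text.lower()).strip()

def conclusion_matches(response: str, reference: str) -> bool:
    """Reference-based equivalence on conclusive parts only."""
    return normalize(extract_conclusion(response)) == normalize(reference)

# conclusion_matches("Thus the final answer: 42.", "42") -> True
\end{verbatim}

How often such a rationale-free check agrees with a CoT-based judge is exactly the empirical question the paragraph above leaves open.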