Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Published 2 Feb 2026 in cs.CL | (2602.02377v1)

Abstract: While LLMs have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "question-proof-check" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.