GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Published 1 Apr 2025 in cs.CL | (2504.00891v2)

Abstract: Recent advancements in LLMs have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available in https://ryanliu112.github.io/GenPRM.


Summary

The paper introduces a generative approach that reframes process reward models as verifiers using explicit chain-of-thought reasoning.
It integrates code-based verification for step-level evaluation, reducing dependency on extensive training data while improving output accuracy.
The model scales test-time compute effectively, enabling smaller LLMs to outperform larger models on challenging reasoning benchmarks.

Introduction

"GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning" proposes a generative approach to Process Reward Models (PRMs) for LLMs. Existing PRMs are limited in process supervision and generalization, in part because they predict a scalar value instead of using the generative abilities of LLMs, and their test-time compute cannot be scaled. GenPRM instead acts as a verifier that performs explicit Chain-of-Thought (CoT) reasoning, combined with code verification, before judging each reasoning step. Process supervision thus becomes a generative modeling problem, which makes it possible to scale the verifier's test-time compute.

GenPRM Framework

GenPRM introduces a generative process verification architecture, which incorporates CoT reasoning and code-based verification:

Chain-of-Thought Reasoning: Before judging a step, GenPRM writes out an explicit multi-step analysis of it in natural language, so the judgment rests on stated reasoning instead of a single scalar prediction.
Code Verification: Beyond natural language, GenPRM writes and executes code to check the correctness of each step. The execution results feed back into the reasoning that precedes the final judgment.
Relative Progress Estimation (RPE): To obtain high-quality process supervision labels, the authors propose RPE, which labels each step according to the progress it makes. RPE is paired with a rationale synthesis framework that incorporates code verification to produce the training rationales.
Figure 1: Overall framework of GenPRM.
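
The RPE idea in the list above can be made concrete with a small sketch. The summary does not spell out the estimator, so everything here is an assumption for illustration: success rates are estimated from Monte Carlo rollouts before and after a step, progress is taken as their ratio, and a step is labeled correct when that ratio reaches a hypothetical threshold of 0.8. The names `mc_score` and `rpe_label` are hypothetical as well.

```python
from typing import List


def mc_score(outcomes: List[bool]) -> float:
    """Fraction of rollouts that ended in a correct final answer."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def rpe_label(before: List[bool], after: List[bool], threshold: float = 0.8) -> int:
    """Label a step 1 (good) or 0 (bad) from rollout outcomes.

    `before` holds rollout outcomes from the state preceding the step,
    `after` holds outcomes once the step has been appended.
    The ratio rule and the 0.8 threshold are illustrative assumptions.
    """
    prev = mc_score(before)
    curr = mc_score(after)
    if prev == 0.0:
        # Assumed convention: with no success before the step, any
        # success afterwards counts as progress.
        return 1 if curr > 0.0 else 0
    return 1 if curr / prev >= threshold else 0


# A step that raises the success rate from 0.5 to 0.75 is labeled good.
helpful = rpe_label([True, True, False, False], [True, True, True, False])
# A step that drops the success rate from 1.0 to 0.25 is labeled bad.
harmful = rpe_label([True] * 4, [True, False, False, False])
```

Comparing the two rates, instead of thresholding the post-step rate alone, keeps a step from being penalized merely because the problem was already hard before it was taken.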

Experimentation and Results

GenPRM significantly outperforms prior classification-based PRMs on ProcessBench and several mathematical reasoning tasks. It does so with only 23K training examples drawn from the MATH dataset.

Performance Metrics: With test-time scaling, the 1.5B GenPRM outperforms GPT-4o, and the 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. The evaluation relies on techniques such as majority voting and Best-of-N (BoN) sampling.

Figure 2: BoN results with different generation models.
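
As a rough illustration of how a PRM is used in Best-of-N selection, the sketch below scores every step of each candidate solution and keeps the candidate with the highest aggregate score. The scorer `toy_score_steps` is a hypothetical stand-in for a real PRM, and aggregating with `min` (a solution is only as good as its weakest step) is an assumption, not a detail taken from the paper.

```python
from typing import Callable, List, Sequence

Solution = List[str]  # a candidate solution as a list of reasoning steps


def best_of_n(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], List[float]],
    aggregate: Callable[[List[float]], float] = min,
) -> Solution:
    """Return the candidate whose aggregated step scores are highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: aggregate(score_steps(c)))


def toy_score_steps(solution: Solution) -> List[float]:
    # Hypothetical scorer: pretend the PRM distrusts steps that guess.
    return [0.2 if "guess" in step else 0.9 for step in solution]


candidates = [
    ["x = 2 + 2", "guess x = 5"],
    ["x = 2 + 2", "x = 4"],
]
best = best_of_n(candidates, toy_score_steps)
```

Other aggregations, such as the score of the last step or the product of step scores, can be passed through the `aggregate` parameter.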

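Test-Time Scaling: Because GenPRM generates its verdict instead of predicting a single scalar, extra compute can be spent on verification at inference time, for example by sampling several verification rationales for a step and aggregating their judgments. This is what lets the smaller GenPRM models outperform much larger baselines.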
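
A minimal sketch of this kind of verifier-side scaling, assuming each sampled verification ends in a yes/no judgment: the step score is the fraction of samples that accept the step. The function `vote_step`, the 0.5 decision threshold, and the deterministic `toy_judge` are hypothetical; a real judge would sample a fresh rationale from the model on every call.

```python
from itertools import cycle
from typing import Callable


def vote_step(judge: Callable[[str], bool], step: str, n_samples: int = 8) -> float:
    """Sample `n_samples` judgments for one step and return the accept rate."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    votes = [judge(step) for _ in range(n_samples)]
    return sum(votes) / n_samples


def is_correct(score: float, threshold: float = 0.5) -> bool:
    """Majority decision: accept the step if at least half the votes do."""
    return score >= threshold


# Toy judge that cycles through a fixed pattern of verdicts.
_pattern = cycle([True, True, False, True])


def toy_judge(step: str) -> bool:
    return next(_pattern)


score = vote_step(toy_judge, "x = 4", n_samples=8)
accepted = is_correct(score)
```

Raising `n_samples` spends more compute per step in exchange for a less noisy score, which is the trade-off test-time scaling exploits.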

Practical Implications and Future Directions

Scalability: GenPRM shows that smaller models can compete with or surpass larger counterparts when verification compute is scaled at test time. This reduces dependence on very large models and can lower cost and resource use.
Critique Framework: Because it produces explicit rationales backed by code verification, GenPRM can also serve as a critic model whose feedback is used to refine a policy model's outputs. This extends its use from scoring finished solutions to guiding their revision.
Broader Applicability: While this work emphasizes mathematical reasoning, the generative reasoning approach holds promise for other domains reliant on process-level supervision, such as coding or multimodal tasks.
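
The critic use case mentioned above can be pictured as a simple generate, critique, revise loop. This is a sketch under stated assumptions, not the paper's procedure: `generate` and `critique` are hypothetical callables standing in for the policy model and for a GenPRM-style critic, and the cap of three rounds is arbitrary.

```python
from typing import Callable, Optional, Tuple


def refine(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Ask the policy for an answer, then revise it while the critic objects."""
    answer = generate(question, None)  # first attempt, no feedback yet
    for _ in range(max_rounds):
        ok, feedback = critique(question, answer)
        if ok:
            break
        answer = generate(question, feedback)  # revise using the critique
    return answer


# Toy policy: wrong on the first try, right once it receives feedback.
def toy_generate(question: str, feedback: Optional[str]) -> str:
    return "4" if feedback else "5"


# Toy critic: accepts only the correct answer to "2 + 2".
def toy_critique(question: str, answer: str) -> Tuple[bool, str]:
    return answer == "4", "recheck the addition"


final_answer = refine("2 + 2", toy_generate, toy_critique)
```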

Conclusion

GenPRM represents a shift towards generative reasoning mechanisms for validating and improving LLM performance. By leveraging the generative capacities of LLMs in conjunction with explicit reasoning techniques and code execution, GenPRM advances the scalability and usability of PRM frameworks. Future research can enhance this paradigm by exploring dynamic pruning methods and adaptations to other knowledge domains, thus further extending the impact of generative reasoning on LLM capabilities.