Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Published 16 Mar 2025 in cs.CL and cs.AI | (2503.13551v3)

Abstract: Recent studies show that LLMs achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

Summary

The paper introduces HRM, a model that evaluates both individual and sequential reasoning steps to correct errors and enhance decision accuracy.
It proposes HNC, a novel data augmentation technique that merges MCTS nodes, reducing the need for expensive human annotation while boosting robustness.
Experiments on datasets like PRM800K, GSM8K, and MATH500 demonstrate HRM’s superior stability and generalization compared to traditional reward models.

HRM: Enhancing Reasoning in LLMs via Hierarchical Multi-Step Reward Models

This paper introduces the Hierarchical Reward Model (HRM) to improve the reasoning capabilities of LLMs. HRM addresses the limitations of Process Reward Models (PRM) by evaluating both individual and consecutive reasoning steps, enabling better assessment of reasoning coherence and self-reflection. Additionally, the paper introduces Hierarchical Node Compression (HNC) to enhance the efficiency and robustness of Monte Carlo Tree Search (MCTS) data augmentation.

Addressing Limitations of Existing Reward Models

The paper identifies key limitations in existing reward models used to enhance LLM reasoning. Outcome Reward Models (ORM) suffer from delayed feedback and credit assignment issues, making it difficult to determine which intermediate steps contribute to the final outcome. PRMs, while offering more granular feedback, are prone to reward hacking and require extensive manual annotation, leading to high costs and potential unreliability. HRM is proposed to mitigate these issues by incorporating both fine-grained and coarse-grained reasoning evaluations, allowing for self-reflection and error correction.

Figure 1: Illustration of how ORM, PRM, and HRM handle reasoning processes, with HRM considering multiple consecutive steps for error correction.

The Hierarchical Reward Model (HRM)

HRM differs from PRM by evaluating not only individual reasoning steps but also pairs of consecutive steps, which lets the reward model assess multi-step reasoning coherence. In particular, the model can recognize when a later step rectifies an earlier error and reward the corrected sequence accordingly, leading to more robust and reliable evaluations. The training dataset for HRM therefore includes consecutive reasoning sequences in addition to single steps, so the model captures both fine-grained and coarse-grained reasoning consistency. Unlike PRM, HRM does not terminate evaluation upon encountering an error; it assesses whether subsequent steps correct the earlier mistake.
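The construction of HRM training data described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Step` and `Example` types and the `build_hrm_examples` function are hypothetical names, and the rule that a merged pair takes the label of its later step is an assumption chosen to match the idea that a correct follow-up step can repair an earlier mistake.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    is_correct: bool  # step-level label, e.g. from annotation or an automatic score


@dataclass
class Example:
    text: str
    label: int  # 1 = positive, 0 = negative


def build_hrm_examples(steps: List[Step]) -> List[Example]:
    """Build fine-grained (single-step) and coarse-grained (two-step) examples."""
    # Fine-grained examples: one per individual step, as a PRM would use.
    examples = [Example(s.text, int(s.is_correct)) for s in steps]

    # Coarse-grained examples: each pair of consecutive steps merged into one span.
    for prev, nxt in zip(steps, steps[1:]):
        merged_text = prev.text + "\n" + nxt.text
        # ASSUMPTION: the merged span inherits the label of the later step, so
        # "wrong then corrected" is positive and "right then wrong" is negative.
        examples.append(Example(merged_text, int(nxt.is_correct)))
    return examples
```

For a three-step trajectory labeled correct, incorrect, correct, this yields three single-step examples and two merged examples; the pair ending in the incorrect step is negative, and the pair in which the incorrect step is followed by a correct one is positive.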

Hierarchical Node Compression (HNC) for MCTS Data Augmentation

The paper addresses the high cost of human-annotated supervision in training PRMs by introducing Hierarchical Node Compression (HNC), a data augmentation method for MCTS. HNC merges two consecutive nodes in the MCTS tree into a single node, expanding the training dataset while maintaining minimal computational overhead.

Figure 2: Diagram illustrating how HNC transforms the MCTS structure by merging two consecutive nodes into one.

HNC introduces controlled noise by randomly removing or merging consecutive nodes, enhancing the robustness of MCTS-based scoring. By consolidating nodes, HNC redistributes weights among the remaining nodes, improving the resilience of the scoring mechanism and diversifying the generated reasoning data.
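To make the merging operation concrete, here is a small sketch of node compression on a generic reasoning tree. It is an illustration under stated assumptions rather than the authors' code: the `Node` class, `merge_with_child`, and `hnc_augment` are hypothetical names, the root is assumed to hold the question and is never merged, and sibling branches of the chosen child are simply dropped in the merged node. How scores are recomputed after merging is not modeled here.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)


def merge_with_child(node: Node, child_index: int) -> Node:
    """Merge a node with one of its children into a single two-step node."""
    child = node.children[child_index]
    # The merged node concatenates both steps and keeps the child's subtree.
    # ASSUMPTION: other children of `node` are discarded in this sketch.
    return Node(text=node.text + "\n" + child.text, children=list(child.children))


def hnc_augment(root: Node, rng: random.Random) -> Node:
    """Return a copy of the tree with one random parent-child pair merged."""
    tree = copy.deepcopy(root)  # leave the original tree untouched

    # Collect (parent, slot) pairs whose node at that slot has a child to merge with.
    candidates = []
    stack = [tree]
    while stack:
        parent = stack.pop()
        for slot, node in enumerate(parent.children):
            if node.children:
                candidates.append((parent, slot))
            stack.append(node)

    if not candidates:
        return tree  # nothing to merge (tree too shallow)

    parent, slot = rng.choice(candidates)
    node = parent.children[slot]
    child_index = rng.randrange(len(node.children))
    parent.children[slot] = merge_with_child(node, child_index)
    return tree
```

Calling `hnc_augment` repeatedly with different seeds on the same tree produces several compressed variants, each of which can be added to the training set alongside the original.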

Experimental Results and Analysis

The paper presents empirical results on the PRM800K dataset, demonstrating that HRM, in conjunction with HNC, achieves superior stability and reliability compared to PRM. Specifically, HRM's accuracy stabilizes at 80% as the number of reasoning trajectories ($N$) increases, while PRM and ORM exhibit significant performance fluctuations. The experiments also evaluate the generalization of HRM trained on PRM800K using auto-labeled reasoning processes from MCTS and HNC, showing that HRM achieves superior reasoning consistency and generalizes effectively across GSM8K and MATH500 datasets.
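One common way a reward model is used when $N$ candidate trajectories are available is to pick the highest-scoring one. The sketch below assumes that setup; the summary does not spell out the exact selection procedure, so `best_of_n` and `toy_score` are hypothetical names, and the scorer is a stand-in for whatever ORM, PRM, or HRM scoring function is actually used.

```python
from typing import Callable, List, Sequence

Trajectory = List[str]  # a trajectory is a list of reasoning steps


def best_of_n(
    candidates: Sequence[Trajectory],
    score: Callable[[Trajectory], float],
) -> Trajectory:
    """Return the candidate trajectory with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=score)


def toy_score(trajectory: Trajectory) -> float:
    # Placeholder scorer for demonstration only: prefers shorter trajectories.
    # A real scorer would call a trained reward model instead.
    return -float(len(trajectory))
```

Under this setup, a reward model that can be gamed tends to select worse trajectories as $N$ grows, because a larger pool is more likely to contain a candidate that exploits its weaknesses.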

Figure 3: An alternative illustration of Hierarchical Node Compression, emphasizing the transformation of the MCTS structure.

Self-Training with KL Divergence Regularization

The paper employs a self-training approach to filter high-quality reasoning data from MCTS, using either MC-Score or HRM to assign scores. To mitigate reward hacking, a high-quality data filter based on MC-Score is applied. The objective function combines causal language modeling loss with KL divergence regularization, using a weighting factor ($\lambda$) to balance task-specific adaptation and retention of general capabilities. The paper notes that without proper KL regularization, the KL divergence can grow unbounded, necessitating the logarithmic scaling of the KL term to stabilize the loss landscape.
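Read literally, the description above suggests an objective of roughly the following shape. This is a reconstruction from the prose, not a formula quoted from the paper: the symbols $\pi_\theta$ (the model being trained) and $\pi_{\mathrm{ref}}$ (a frozen reference model) are assumed notation, and the direction of the KL divergence is an assumption.

$$
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{LM}}(\theta) + \lambda \, \log D_{\mathrm{KL}}\!\left(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\right)
$$

Here $\mathcal{L}_{\mathrm{LM}}$ is the causal language modeling loss on the filtered data. Taking the logarithm compresses large KL values, so a spike in divergence adds only a modest amount to the total loss, while $\lambda$ still controls how strongly the model is pulled back toward the reference.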

Figure 4: Visualization of loss dynamics during training with different KL loss weightings, showing the impact on KL loss and causal language modeling loss.

Implications and Future Directions

The HRM framework offers a more robust and reliable approach to reward modeling in LLMs, addressing the limitations of PRM and ORM. The introduction of HNC further enhances the efficiency and diversity of MCTS-based data augmentation, reducing the reliance on expensive human annotations. The empirical results demonstrate the effectiveness of HRM in improving reasoning consistency and generalization across different domains. Future research could explore the application of HRM in more complex reasoning tasks and investigate alternative data augmentation strategies to further enhance the robustness and efficiency of reward model training.

Conclusion

The paper makes significant contributions to the field of LLM reasoning by introducing HRM and HNC. These methods enhance the robustness, reliability, and generalization capabilities of LLMs in reasoning-intensive tasks. The combination of hierarchical reward modeling and efficient data augmentation offers a promising direction for future research in this area.