RM-R1: Reward Modeling as Reasoning

Published 5 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.02387v3)

Abstract: Reward modeling is essential for aligning LLMs with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. To this end, we introduce a new class of generative reward models - Reasoning Reward Models (ReasRMs) - which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism - self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

Summary

  • The paper introduces Reasoning Reward Models that synthesize chain-of-thought traces to offer interpretable, task-dependent evaluations.
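  • It combines distillation of reasoning traces from oracle models with reinforcement learning (GRPO) on verifiable, correctness-based rewards.
  • Empirical results show state-of-the-art average performance across three reward model benchmarks, with gains of up to 4.9% over much larger models (as reported in the abstract), and robust generalization across chat, safety, and reasoning domains.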

Motivation and Background

Reward modeling is central to aligning LLM behavior with human preferences in RLHF. Traditional reward models fall into two classes: scalar RMs, which produce opaque numerical scores, and generative RMs (GenRMs), which produce textual judgments. The former lack interpretability and the latter tend to reason superficially, which impedes robust evaluation across diverse domains, especially in generalist settings where rubrics are nuanced and multidimensional. The paper "RM-R1: Reward Modeling as Reasoning" (2505.02387) introduces Reasoning Reward Models (ReasRMs), which cast reward modeling as a reasoning task. By integrating chain-of-thought (CoT) structures and explicit rubrics into the evaluation process, the approach aims to improve both interpretability and performance.

Reasoning-Based Training Pipeline

The RM-R1 methodology comprises two central stages:

  1. Distillation: Starting from an instruction-tuned LLM, the model is trained on long, structured reasoning traces synthesized by prompting strong oracle models (e.g., Claude, O3). Each trace supports its preference label with explicit, task-dependent rubrics and content-based evaluation.
  2. Reinforcement Learning (RL): Following distillation, the model is further optimized using preference datasets, leveraging verifiable rewards primarily based on correctness. Group Relative Policy Optimization (GRPO) is adopted for efficient policy refinement, with KL-regularization anchoring outputs to the reference model.
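As a rough illustration of the distillation stage, the sketch below turns oracle reasoning traces into supervised fine-tuning pairs. It assumes, as a simplification, that only traces whose final verdict agrees with the ground-truth preference label are kept; the `OracleTrace` fields, the prompt layout, and `build_distillation_set` are hypothetical names chosen for this example, not the paper's data format.

```python
from dataclasses import dataclass


@dataclass
class OracleTrace:
    """One oracle-generated judgment for a preference pair (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    reasoning: str  # long-form justification written by the oracle model
    verdict: str    # "A" or "B": the oracle's final judgment
    label: str      # "A" or "B": the ground-truth preference


def build_distillation_set(traces):
    """Keep traces whose verdict matches the label and format them as SFT pairs."""
    examples = []
    for t in traces:
        if t.verdict != t.label:
            continue  # assumption: drop traces that argue for the wrong response
        source = (
            f"Question: {t.prompt}\n"
            f"Response A: {t.response_a}\n"
            f"Response B: {t.response_b}"
        )
        target = f"{t.reasoning}\nFinal judgment: {t.verdict}"
        examples.append({"input": source, "target": target})
    return examples
```

The filtering step is the design choice worth noting: a trace that reasons fluently toward the wrong answer would teach the student model to imitate unsound judgments.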

Distillation is critical: RL alone biases the model toward superficial features and causes instability, as shown empirically (see training dynamics section).

Figure 1: RM-R1's pipeline combines distillation from high-quality reasoning traces with RL optimization, converting GenRMs into ReasRMs.
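The RL stage can be sketched in a few lines. The example below shows a correctness-based reward and the group-relative advantage normalization that gives GRPO its name. The +1/-1 reward values and the function names are assumptions made for illustration, and the KL penalty toward the reference model is omitted for brevity.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, label: str) -> float:
    """Verifiable reward: +1 if the judged winner matches the label, else -1 (assumed values)."""
    return 1.0 if predicted == label else -1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout receives the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one preference pair whose label is "A"
verdicts = ["A", "A", "B", "A"]
rewards = [correctness_reward(v, "A") for v in verdicts]
advantages = group_relative_advantages(rewards)
```

Rollouts that reach the correct verdict receive positive advantages and the incorrect one a negative advantage, so no separate value network is needed to form a baseline.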

Chain-of-Rubrics and Reasoning Architecture

RM-R1 first classifies each task as Reasoning or Chat. For chat tasks it generates nuanced, sample-level rubrics; for reasoning tasks it solves the problem directly and evaluates the candidates against its own solution. For each prompt, the model produces:

  • Task classification
  • Rubric set and weighted justification (for chat)
  • Stepwise evaluation, using CoT reasoning
  • Explicit final judgment

This structure supports fine-grained, interpretable reward signals, enabling rigorous evaluation of candidate responses beyond mere surface patterns.

Figure 2: Off-the-shelf models overfit superficial data patterns, whereas reasoning-augmented reward modeling generalizes to deep evaluative criteria, including emotional harm and context nuances.
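To make the structured output concrete, the sketch below parses a judgment into its four parts. The tag names (`<type>`, `<rubric>`, `<eval>`, `<answer>`) and the `[[A]]`/`[[B]]` verdict marker are assumptions for this example; the exact output format should be taken from the released code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    task_type: Optional[str]   # "Chat" or "Reasoning"
    rubric: Optional[str]      # rubric text (chat tasks)
    evaluation: Optional[str]  # stepwise evaluation of the candidates
    winner: Optional[str]      # "A" or "B", or None if no verdict was found


def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped contents of the first <name>...</name> block, if any."""
    m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def parse_judgment(text: str) -> Judgment:
    answer = _tag(text, "answer") or ""
    m = re.search(r"\[\[([AB])\]\]", answer)
    return Judgment(
        task_type=_tag(text, "type"),
        rubric=_tag(text, "rubric"),
        evaluation=_tag(text, "eval"),
        winner=m.group(1) if m else None,
    )


sample = (
    "<type>Chat</type>\n"
    "<rubric>1. Accuracy (60%) 2. Clarity (40%)</rubric>\n"
    "<eval>Response A is accurate; Response B omits a key caveat.</eval>\n"
    "<answer>[[A]]</answer>"
)
judgment = parse_judgment(sample)
```

Returning `None` for a missing verdict, rather than guessing, lets a training loop treat malformed outputs as incorrect.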

Empirical Performance and Analysis

RM-R1 achieves state-of-the-art average performance across RewardBench, RM-Bench, and RMB, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%, per the abstract. Its explicit reasoning helps it discriminate between responses that differ only subtly in content or style, and it generalizes well across chat, safety, and reasoning domains. The model achieves near-linear scaling improvements: larger models (32B) yield higher absolute gains and benefit more from increased inference-time compute (up to 8192 tokens).

Figure 3: Performance improvements as a function of model size confirm strong scaling laws for reasoning reward models.
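For readers unfamiliar with how pairwise reward model benchmarks are scored, here is a minimal sketch: accuracy is the fraction of pairs in which the model's verdict matches the human-preferred response, and an overall score averages the per-benchmark accuracies. The record layout and the unweighted macro average are assumptions; the individual benchmarks may define their own aggregation.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """records: iterable of (benchmark, predicted_winner, preferred_winner) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for benchmark, predicted, preferred in records:
        totals[benchmark] += 1
        hits[benchmark] += int(predicted == preferred)
    per_benchmark = {b: hits[b] / totals[b] for b in totals}
    # Unweighted mean over benchmarks (assumed aggregation)
    macro_average = sum(per_benchmark.values()) / len(per_benchmark)
    return per_benchmark, macro_average


# Placeholder records for illustration only, not results from the paper
toy_records = [
    ("bench_1", "A", "A"),
    ("bench_1", "B", "A"),
    ("bench_2", "B", "B"),
]
per_benchmark, macro_average = pairwise_accuracy(toy_records)
```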

Ablation and Case Study Insights

Ablation studies reveal that:

  • RL alone (cold start) is insufficient, causing instability and overfitting to length or superficial criteria.
  • Explicit task-type classification (chain-of-rubrics) significantly enhances reasoning performance.
  • Distillation from high-quality reasoning traces further boosts both generalization and safety assessment.
  • Reasoning-based models consistently outperform SFT-only models, even with limited data.

Case studies highlight RM-R1's capacity to generate task-dependent rubrics, prioritize high-impact criteria (e.g., accuracy for medical queries), and faithfully adhere to rubrics for content-based judgments. RM-R1 also produces long, coherent reasoning traces, improving clarity, stability, and reward quality.

Figure 4: Cold start RL fails to discover optimal rubrics and reasoning, underscoring the necessity of distillation for warm-start RL pipelines.

Practical and Theoretical Implications

The adoption of reasoning-centric reward modeling offers several practical and theoretical benefits:

  • Interpretability: Explicit reasoning traces and rubrics enable granular analysis, auditability, and debugging of reward signals.
  • Generalization and Robustness: Structured reasoning allows models to evaluate diverse tasks across domains with reliable adherence to human-like rubrics.
  • Scalability: Larger models and longer inference budgets translate into better reward modeling performance, consistent with scaling behavior for CoT-based supervision.
  • Data Efficiency: RM-R1 trains with substantially less data than comparable alternatives, lowering the barrier to building high-performing reward models.

Theoretically, formulating reward modeling as a reasoning process bridges the gap between opaque scalar feedback and interpretable judgment, establishing new standards for RLHF pipelines.

Future Directions

  • Rubric Library Induction: Automatic compilation and re-use of rubric sets may reduce rollout lengths and improve sample efficiency.
  • Active Preference Collection: ReasRMs can engage in active learning, querying human feedback only when the available rubrics are insufficient.
  • Multimodal and Agentic Extensions: Reward modeling methodologies can be expanded to accommodate multimodal evaluation and agentic behaviors in complex tasks.

Conclusion

RM-R1 demonstrates that reasoning-centric approaches substantially strengthen reward modeling, yielding interpretable judgments, robust performance, and reliable alignment across diverse domains. Its modular training pipeline synthesizes structured reasoning traces and leverages CoT rollouts through RL, setting new technical benchmarks in reward modeling. The findings motivate further investigations into rubric induction, preference collection, and multimodal alignment for future RLHF paradigms.
