R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Published 5 May 2025 in cs.CV and cs.CL | (2505.02835v2)

Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal LLMs (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves a $8.4\%$ improvement on the VL Reward-Bench and a $14.3\%$ improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.


Summary

The paper presents R1-Reward, a Multimodal Reward Model (MRM) trained with stable reinforcement learning that substantially improves accuracy on multimodal reward modeling benchmarks. The primary contribution is the StableReinforce algorithm, designed to address the limitations that existing reinforcement learning methods such as Reinforce++ show when applied to reward modeling for Multimodal LLMs (MLLMs).

Key Elements of the Study

Problem Redefinition and Approach:
- The study reframes multimodal reward modeling as a rule-based reinforcement learning task. It identifies a key deficiency of current RL methods in this setting: a propensity for training instability and even collapse. Algorithms such as PPO and variants like Reinforce++ often fail because of excessively large policy-ratio updates and unstable advantage normalization.
StableReinforce Algorithm:
- Pre-CLIP Strategy: StableReinforce clips the log-probability ratio before it is exponentiated in the loss computation, which mitigates the numerical instability caused by exponentiating large log-probability ratios.
- Advantage Filter: Statistical outliers are filtered out of the advantage estimates so that extreme values do not destabilize training.
- Consistency Reward: A consistency check is added to the reward calculation so that the model's final decision agrees with its preceding reasoning, correcting the observed mismatch between the reasoning process and the final answer.
Data Collection and Experimental Setup:
- The paper compiles a diverse multimodal preference dataset (R1-Reward-200K) and begins training with a supervised fine-tuning (SFT) stage on reasoning annotations generated by GPT-4o.
- A phased training paradigm is adopted: reinforcement learning follows the SFT stage and gradually enhances the model's reasoning capabilities on progressively harder data selected through sampling.
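
The three StableReinforce refinements listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The clipping bound `delta`, the 3-sigma cutoff `max_z`, the `<think>`/`<answer>` output format, the reward weights, and all function names are assumptions chosen for illustration, and the `reasoning_agrees` flag stands in for whatever consistency judge is used in practice.

```python
import math
import re
from statistics import mean, pstdev


def pre_clip_ratio(logp_new, logp_old, delta=3.0):
    """Clamp the log-ratio to [-delta, delta] BEFORE exponentiating (delta is assumed)."""
    log_ratio = max(-delta, min(delta, logp_new - logp_old))
    return math.exp(log_ratio)  # bounded by exp(delta), so exp() cannot overflow


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2, delta=3.0):
    """PPO-style clipped objective computed on the pre-clipped ratio."""
    ratio = pre_clip_ratio(logp_new, logp_old, delta)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def filter_advantages(rewards, max_z=3.0):
    """Standardize rewards and flag outliers beyond max_z standard deviations.

    Returns (advantages, keep_mask); entries with keep_mask False would be
    excluded from the loss. The 3-sigma cutoff is an assumption.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards), [True] * len(rewards)
    advantages = [(r - mu) / sigma for r in rewards]
    return advantages, [abs(a) <= max_z for a in advantages]


# Hypothetical output format: reasoning in <think>, choice "1" or "2" in <answer>.
ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>\s*([12])\s*</answer>", re.DOTALL
)


def rule_based_reward(output, gold, reasoning_agrees,
                      w_consistency=0.5, w_format=0.5):
    """Combine result, consistency, and format rewards (weights are illustrative)."""
    match = ANSWER_RE.fullmatch(output.strip())
    format_ok = match is not None
    result = 1.0 if format_ok and match.group(1) == gold else 0.0
    consistency = 1.0 if reasoning_agrees else 0.0
    # Consistency only adds to the reward when the final answer is correct.
    return result * (1.0 + w_consistency * consistency) + w_format * float(format_ok)
```

Two design points are worth noting. With a negative advantage, the standard clipped objective places no upper bound on the ratio, so bounding it by `exp(delta)` before the product keeps each term finite. Multiplying the consistency term by the result reward means reasoning that is consistent with a wrong answer earns nothing extra.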

Empirical Results

The R1-Reward model exhibits notable advancements across multiple established benchmarks:

VL Reward-Bench: R1-Reward achieves an 8.4% improvement over previous state-of-the-art models, suggesting data-efficient reward modeling.
Multimodal Reward Bench: The improvement is 14.3%, further underscoring the model's ability to generalize across diverse evaluation datasets.

Implications and Future Prospects

From a practical standpoint, R1-Reward's performance improves further with additional inference compute: sampling several judgments and combining them with a simple majority-voting strategy yields better selections at test time. This suggests a promising direction for optimizing MLLMs in real-world applications.
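
Majority voting at test time can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: `judge` is a hypothetical callable that returns one sampled preference label per call, and returning `None` on a tie is an assumption made here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label, or None if the list is empty or tied."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]


def vote_with_sampling(judge, k=5):
    """Call the (hypothetical) reward-model sampler k times and aggregate."""
    return majority_vote([judge() for _ in range(k)])
```

Using an odd `k` with two candidate labels avoids ties entirely, which is one simple way to sidestep the tie-handling choice.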

The paper marks a step forward in exploring reinforcement learning for MLLM alignment. It points to further research on refining RL methods for broader applications in artificial intelligence, including improving the interpretability and reasoning coherence of MLLMs.

In conclusion, this study provides substantial evidence that stable reinforcement learning benefits MRMs. Its methods lay groundwork for future efforts to combine long-term reasoning capabilities with reward modeling, potentially aligning model behavior more closely with human evaluative criteria.