- The paper presents InternLM-XComposer2.5-Reward, a multi-modal reward model that aligns LVLMs with human preferences via a diverse dataset spanning text, image, and video, achieving 70.0% on VL-RewardBench.
- The model employs a robust training strategy by freezing the vision encoder and fine-tuning the language model and score head, ensuring efficiency and precision.
- It supports key applications in reinforcement learning, test-time candidate selection, and data cleaning, outperforming previous generative reward models on benchmark evaluations.
This paper introduces InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a multi-modal reward model designed to align Large Vision-Language Models (LVLMs) with human preferences. The paper addresses the scarcity of publicly available multi-modal reward models and the lack of clarity surrounding the implementation details of proprietary models.
To ensure IXC-2.5-Reward's robustness and versatility, the authors constructed a high-quality multi-modal preference corpus. This corpus spans text, image, and video inputs across diverse domains, including instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. The model achieves 70.0% accuracy on the VL-RewardBench benchmark, surpassing previous generative RMs such as Gemini-1.5-Pro (62.5%) and GPT-4o (62.4%). Even on uni-modal (text-only) RM benchmarks, IXC-2.5-Reward remains competitive, with an average score of 88.6% on Reward-Bench and 68.8% on RM-Bench.
Key aspects of the paper include:
The paper demonstrates three key applications of IXC-2.5-Reward:
- RL Training: IXC-2.5-Reward provides the supervisory signal for reinforcement learning. The authors integrated IXC-2.5-Reward with Proximal Policy Optimization (PPO) to train IXC-2.5-Chat, which shows improvements in instruction following and multi-modal open-ended dialogue. PPO training samples a prompt from a prompt set; the policy model $\pi_\theta$ generates responses, and the reward model computes the reward score $r_t$ for each state $s_t$ at time-step $t$. The temporal-difference error $\delta_t$, the Generalized Advantage Estimation (GAE) $A_t$, and the returns $R_t$ are computed as:
- $\delta_t = r_t + \gamma \cdot V(s_{t+1}) - V(s_t)$
- $A_t = \delta_t + \gamma \cdot \beta \cdot A_{t+1}$
- $R_t = A_t + V(s_t)$
- $r_t$: reward score at state $s_t$ at time-step $t$
- $V$: critic model
- $\gamma$: discount factor
- $\beta$: parameter controlling the trade-off between bias and variance in advantage estimation
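The recursion above runs backward over the trajectory, since each $A_t$ depends on $A_{t+1}$. A minimal sketch in plain Python (the function name, list-based inputs, and the final bootstrap value $V(s_T)$ are my assumptions, not details from the paper):

```python
def compute_gae(rewards, values, gamma=0.99, beta=0.95):
    """Backward pass computing delta_t, GAE advantages A_t, and returns R_t.

    rewards: r_t for t = 0..T-1; values: V(s_t) for t = 0..T, where the
    final entry is a bootstrap value (an assumption, not stated in the paper).
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0  # A_T is taken to be 0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
        next_adv = delta + gamma * beta * next_adv              # A_t
        advantages[t] = next_adv
    returns = [a + v for a, v in zip(advantages, values)]       # R_t = A_t + V(s_t)
    return advantages, returns
```

With $\beta = 1$ this reduces to discounted Monte-Carlo advantages; with $\beta = 0$ it keeps only the one-step TD error, which is the bias–variance trade-off $\beta$ controls.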
Based on the advantage $A$, the policy gradient loss $\mathcal{L}_{\text{PG}}$ is computed to update the policy model $\pi_\theta$:

$\mathcal{L}_{\text{PG}} = \min\left(\frac{\pi_\theta}{\pi_{\text{ref}}} \cdot A,\ \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{ref}}},\ 1.0 - \epsilon,\ 1.0 + \epsilon\right) \cdot A\right)$
* $\mathcal{L}_{\text{PG}}$: policy gradient loss
* $\frac{\pi_\theta}{\pi_{\text{ref}}}$: probability ratio between the policy model $\pi_\theta$ and the reference model $\pi_{\text{ref}}$
* $\epsilon$: hyper-parameter that controls the clipping range
The critic model is updated via the Mean Squared Error (MSE) loss:
$\mathcal{L}_{\text{critic}} = \sum_{t} \text{MSE}\left(V(s_t),\ R_t\right)$

* $\mathcal{L}_{\text{critic}}$: critic loss
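Per sample, these two losses reduce to a few lines. A minimal sketch in plain Python (scalar inputs and function names are my assumptions; note that, as written above, $\mathcal{L}_{\text{PG}}$ is the clipped surrogate objective before negation for gradient ascent):

```python
def clipped_pg_objective(ratio, advantage, eps=0.2):
    """Clipped objective for one action: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio is pi_theta / pi_ref for the action; eps is the clipping
    hyper-parameter (0.2 is an illustrative default, not from the paper).
    """
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)


def critic_loss(values, returns):
    """Sum over t of the squared error between V(s_t) and R_t."""
    return sum((v - r) ** 2 for v, r in zip(values, returns))
```

The clip keeps a large ratio from inflating the update when the advantage is positive, and the outer `min` keeps the objective pessimistic when the advantage is negative.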
- Test-Time Scaling: IXC-2.5-Reward selects the best response from a set of candidate responses at test time. The authors use Best-of-N sampling with IXC-2.5-Reward, which yields further performance gains over the RL-trained IXC-2.5-Chat alone.
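The selection step itself is simple: generate N candidates and keep the one the reward model scores highest. A hedged sketch (the callables and their signatures are placeholders, not the paper's API):

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Best-of-N sampling: draw n candidates, return the highest-scoring one.

    generate(prompt) -> response and reward_model(prompt, response) -> float
    are assumed interfaces for illustration.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

Cost scales linearly with N at inference time, which is the trade-off "test-time scaling" names: more compute per query in exchange for a better selected response.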
- Data Cleaning: IXC-2.5-Reward filters outlier or noisy samples from existing image and video instruction tuning training data. The authors observe a correlation between low IXC-2.5-Reward scores and problematic samples, such as those exhibiting hallucinations or mismatched image/video and question/answer content.
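That correlation suggests a straightforward filtering pass: score every training sample with the reward model and drop those below a threshold. A minimal sketch (the sample format, scoring interface, and threshold are illustrative assumptions):

```python
def filter_by_reward(dataset, reward_model, threshold):
    """Split samples into (kept, dropped) by reward score.

    Low-scoring samples flag likely hallucinations or mismatched
    image/video and question/answer content, per the paper's observation.
    """
    kept, dropped = [], []
    for sample in dataset:
        (kept if reward_model(sample) >= threshold else dropped).append(sample)
    return kept, dropped
```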
The experimental results demonstrate that IXC-2.5-Reward achieves state-of-the-art performance on multi-modal reward model benchmarks and shows competitive performance on text-only reward model benchmarks. The authors also present visualization examples of IXC-2.5-Chat on a series of topics, such as instruction following and open-ended questions. These figures reveal that IXC-2.5-Chat demonstrates several key advantages, including superior organization and presentation, more comprehensive and in-depth answers, and more detailed explanations.