- The paper presents InternLM-XComposer2.5-Reward, a multi-modal reward model that aligns LVLMs with human preferences via a diverse dataset spanning text, image, and video, achieving 70.0% on VL-RewardBench.
- The model employs a robust training strategy by freezing the vision encoder and fine-tuning the language model and score head, ensuring efficiency and precision.
- It supports key applications in reinforcement learning, test-time candidate selection, and data cleaning, outperforming previous generative reward models on benchmark evaluations.
This paper introduces InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a multi-modal reward model designed to align Large Vision-Language Models (LVLMs) with human preferences. The paper addresses the scarcity of publicly available multi-modal reward models and the lack of clarity surrounding the implementation details of proprietary models.
To ensure IXC-2.5-Reward's robustness and versatility, the authors constructed a high-quality multi-modal preference corpus. This corpus spans text, image, and video inputs across diverse domains, including instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. The model achieves 70.0% accuracy on the VL-RewardBench benchmark, surpassing previous generative RMs such as Gemini-1.5-Pro (62.5%) and GPT-4o (62.4%). Even on uni-modal (text-only) RM benchmarks, IXC-2.5-Reward remains competitive, with an average score of 88.6% on Reward-Bench and 68.8% on RM-Bench.
Key aspects of the paper include:
The paper demonstrates three key applications of IXC-2.5-Reward:
- RL Training: IXC-2.5-Reward provides the supervisory signal for reinforcement learning. The authors integrated IXC-2.5-Reward with Proximal Policy Optimization (PPO) to train IXC-2.5-Chat, which shows improvements in instruction following and multi-modal open-ended dialogue. PPO training samples a prompt from a prompt set; the policy model $\pi_\theta$ generates responses, and the reward model computes the reward score $r_t$ for each state $s_t$ at time-step $t$. The temporal-difference error $\delta_t$, the Generalized Advantage Estimation (GAE) $A_t$, and the returns $R_t$ are computed as:
- $\delta_t = r_t + \gamma \cdot V(s_{t+1}) - V(s_t)$
- $A_t = \delta_t + \gamma \cdot \beta \cdot A_{t+1}$
- $R_t = A_t + V(s_t)$
- $r_t$: reward score at state $s_t$ at time-step $t$
- $V$: critic model
- $\gamma$: discount factor
- $\beta$: parameter controlling the trade-off between bias and variance in advantage estimation
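The recursion above runs backward over the trajectory, since each $A_t$ depends on $A_{t+1}$. A minimal sketch in plain Python (the function name, list-based inputs, and the final bootstrap value $V(s_T)$ are my assumptions, not details from the paper):

```python
def compute_gae(rewards, values, gamma=0.99, beta=0.95):
    """Backward pass computing delta_t, GAE advantages A_t, and returns R_t.

    rewards: r_t for t = 0..T-1; values: V(s_t) for t = 0..T, where the
    final entry is a bootstrap value (an assumption, not stated in the paper).
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0  # A_T is taken to be 0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
        next_adv = delta + gamma * beta * next_adv              # A_t
        advantages[t] = next_adv
    returns = [a + v for a, v in zip(advantages, values)]       # R_t = A_t + V(s_t)
    return advantages, returns
```

With $\beta = 1$ this reduces to discounted Monte-Carlo advantages; with $\beta = 0$ it keeps only the one-step TD error, which is the bias–variance trade-off $\beta$ controls.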
Based on the advantage $A$, the policy gradient loss $\mathcal{L}_{\text{PG}}$ is computed to update the policy model $\pi_\theta$:

$\mathcal{L}_{\text{PG}} = \min\left(\frac{\pi_\theta}{\pi_{\text{ref}}} \cdot A,\ \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{ref}}},\ 1.0 - \epsilon,\ 1.0 + \epsilon\right) \cdot A\right)$
* $\mathcal{L}_{\text{PG}}$: policy gradient loss
* $\frac{\pi_\theta}{\pi_{\text{ref}}}$: probability ratio between the policy model $\pi_\theta$ and the reference model $\pi_{\text{ref}}$
* $\epsilon$: hyper-parameter that controls the clipping range
The critic model is updated via the Mean Squared Error (MSE) loss:
$\mathcal{L}_{\text{critic}} = \sum_{t} \text{MSE}\left(V(s_t),\ R_t\right)$

* $\mathcal{L}_{\text{critic}}$: critic loss
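Per sample, these two losses reduce to a few lines. A minimal sketch in plain Python (scalar inputs and function names are my assumptions; note that, as written above, $\mathcal{L}_{\text{PG}}$ is the clipped surrogate objective before negation for gradient ascent):

```python
def clipped_pg_objective(ratio, advantage, eps=0.2):
    """Clipped objective for one action: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio is pi_theta / pi_ref for the action; eps is the clipping
    hyper-parameter (0.2 is an illustrative default, not from the paper).
    """
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)


def critic_loss(values, returns):
    """Sum over t of the squared error between V(s_t) and R_t."""
    return sum((v - r) ** 2 for v, r in zip(values, returns))
```

The clip keeps a large ratio from inflating the update when the advantage is positive, and the outer `min` keeps the objective pessimistic when the advantage is negative.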
- Test-Time Scaling: IXC-2.5-Reward selects the best response from a set of candidate responses at test time. The authors use Best-of-N sampling with IXC-2.5-Reward, which yields further performance gains over the RL-trained IXC-2.5-Chat alone.
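The selection step itself is simple: generate N candidates and keep the one the reward model scores highest. A hedged sketch (the callables and their signatures are placeholders, not the paper's API):

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Best-of-N sampling: draw n candidates, return the highest-scoring one.

    generate(prompt) -> response and reward_model(prompt, response) -> float
    are assumed interfaces for illustration.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

Cost scales linearly with N at inference time, which is the trade-off "test-time scaling" names: more compute per query in exchange for a better selected response.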
- Data Cleaning: IXC-2.5-Reward filters outlier or noisy samples from existing image and video instruction tuning training data. The authors observe a correlation between low IXC-2.5-Reward scores and problematic samples, such as those exhibiting hallucinations or mismatched image/video and question/answer content.
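That correlation suggests a straightforward filtering pass: score every training sample with the reward model and drop those below a threshold. A minimal sketch (the sample format, scoring interface, and threshold are illustrative assumptions):

```python
def filter_by_reward(dataset, reward_model, threshold):
    """Split samples into (kept, dropped) by reward score.

    Low-scoring samples flag likely hallucinations or mismatched
    image/video and question/answer content, per the paper's observation.
    """
    kept, dropped = [], []
    for sample in dataset:
        (kept if reward_model(sample) >= threshold else dropped).append(sample)
    return kept, dropped
```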
The experimental results demonstrate that IXC-2.5-Reward achieves state-of-the-art performance on multi-modal reward model benchmarks and shows competitive performance on text-only reward model benchmarks. The authors also present visualization examples of IXC-2.5-Chat on a series of topics, such as instruction following and open-ended questions. These figures reveal that IXC-2.5-Chat demonstrates several key advantages, including superior organization and presentation, more comprehensive and in-depth answers, and more detailed explanations.