
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Published 22 May 2025 in cs.CV | (2505.17018v1)

Abstract: Recent advances have shown success in eliciting strong reasoning abilities in multimodal LLMs (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.

Summary

  • The paper presents a novel thinking reward model integrated via Trust-GRPO to systematically enhance MLLMs’ reasoning.
  • It constructs a robust multimodal dataset to train and evaluate reasoning quality by combining text and visual data.
  • Experimental results demonstrate improved performance on reasoning benchmarks, validating the approach and mitigating reward hacking.

SophiaVL-R1 augments rule-based reinforcement learning (RL) for Multimodal LLMs (MLLMs) with thinking reward signals that supervise the reasoning process itself, not just the final answer. The paper presents methodologies for incorporating these thinking rewards to improve reasoning quality while addressing common pitfalls such as reward hacking through a trustworthiness-aware adjustment.

Introduction

SophiaVL-R1 tackles the challenge of inadequate supervision in traditional RL methods that focus solely on the final outcome. This limited oversight can lead to MLLMs adopting sub-optimal reasoning strategies, impacting their generalization capabilities. The paper proposes a thinking reward model to evaluate the quality of the thinking process and introduces Trust-GRPO to mitigate unreliable thinking rewards.

The introduction of thinking rewards aims to provide intermediate feedback on reasoning quality, encouraging MLLMs to favor systematic deduction over flawed thinking paths. Trust-GRPO incorporates a trustworthiness weight to adjust the influence of these rewards, relying less on unreliable signals.

Figure 1: Examples of model responses and their corresponding thinking rewards.

Methodology

Dataset Construction

SophiaVL-R1 employs a dataset composed of annotated samples aggregated from various multimodal reasoning tasks to train both the thinking reward model and SophiaVL-R1. The dataset intertwines text-based and multimodal data, ensuring robust performance across diverse scenarios.

Figure 2: Left: Composition of our aggregated dataset SophiaVL-R1-130k from public sources. Right: Distribution of the SophiaVL-R1-Thinking-156k dataset used to train the thinking reward model.

Thinking Reward Model

The thinking reward model evaluates reasoning quality holistically rather than step by step. Trained on annotated examples, it assesses responses against criteria such as logical soundness and consistency, distinguishing sound reasoning processes from flawed ones.
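A holistic judge of this kind can be sketched as a rubric prompt plus score parsing. This is an illustrative sketch only, not the paper's implementation: the rubric wording, the 0-to-1 scale, and the `judge` callable (any function mapping a prompt string to a text reply, e.g. a wrapped MLLM) are all assumptions.

```python
# Hypothetical rubric; the paper's actual scoring criteria and scale may differ.
THINKING_RUBRIC = """Rate the reasoning below from 0 to 1 for logical
soundness and internal consistency, judged as a whole (not step by step).
Question: {question}
Reasoning: {reasoning}
Score:"""


def score_thinking(question: str, reasoning: str, judge) -> float:
    """Score an entire reasoning trace with an LLM judge (illustrative).

    `judge` is any callable that takes a prompt string and returns a text
    reply; in practice it would wrap the trained thinking reward model.
    """
    reply = judge(THINKING_RUBRIC.format(question=question, reasoning=reasoning))
    try:
        # Clamp to [0, 1] so a malformed numeric reply cannot blow up the reward.
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparseable judge output earns no thinking reward
```

Scoring the whole trace at once, rather than per step, matches the paper's holistic framing: a chain can have locally plausible steps yet be globally inconsistent.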

Trust-GRPO Algorithm

Trust-GRPO minimizes reward hacking risks by introducing a trustworthiness weight, computed by contrasting thinking rewards of responses leading to correct versus incorrect answers. This adaptive strategy ensures reliable integration of thinking rewards in the RL process.

Figure 3: An illustration of our proposed Trust-GRPO.
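The trust-weighting and annealing ideas can be combined into one reward computation over a group of sampled responses. The sketch below is a simplified reading of the description above, not the paper's exact formulation: the clipped mean-difference mapping for the trust weight, the linear annealing schedule, and the function name are all illustrative assumptions.

```python
import numpy as np


def trust_weighted_reward(outcome_correct, thinking_rewards, step, total_steps):
    """Illustrative sketch of Trust-GRPO's reward shaping for one sample group.

    outcome_correct: per-response booleans, True if the final answer is
        correct (the rule-based outcome reward).
    thinking_rewards: per-response thinking-reward scores in [0, 1].
    step, total_steps: current and total training steps, for annealing.
    """
    outcome_correct = np.asarray(outcome_correct, dtype=bool)
    thinking_rewards = np.asarray(thinking_rewards, dtype=float)

    if outcome_correct.all() or (~outcome_correct).all():
        # No correct-vs-incorrect contrast available in this group; fall back
        # to full trust (an illustrative choice, not the paper's rule).
        trust = 1.0
    else:
        # The thinking reward is more trustworthy when correct-answer
        # responses also score higher thinking rewards on average.
        gap = thinking_rewards[outcome_correct].mean() - \
              thinking_rewards[~outcome_correct].mean()
        trust = float(np.clip(gap, 0.0, 1.0))  # illustrative mapping to [0, 1]

    # Annealing: the thinking reward's influence decays linearly so later
    # training relies mostly on the accurate rule-based outcome reward.
    anneal = max(0.0, 1.0 - step / total_steps)

    return outcome_correct.astype(float) + anneal * trust * thinking_rewards
```

The guard against all-correct or all-incorrect groups matters in practice: with no contrast between outcomes, the trust signal is undefined, and some fallback policy is needed.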

Experimental Results

SophiaVL-R1 demonstrates superior performance across various reasoning benchmarks (e.g., MathVista, MMMU), with SophiaVL-R1-7B outperforming LLaVA-OneVision-72B on most of them despite a 10x parameter gap. Both the thinking reward model and Trust-GRPO contribute to the gains in reasoning and generalization.

Ablation Study

Ablation studies reveal the crucial roles of the thinking reward model and the trustworthiness weight in optimizing reasoning performance. Removing or altering either component leads to notable drops in effectiveness, highlighting their importance.

Figure 4: Training curves of mean rule-based outcome reward across different methods.

Conclusion

SophiaVL-R1 effectively integrates thinking rewards with RL outcome rewards, offering improved guidance for reasoning in MLLMs. Through Trust-GRPO, the framework addresses reward signal reliability, paving the way for future enhancements in model reasoning capabilities.
