Mechanism by which RL improves multimodal reasoning capability in MLLMs

Ascertain the mechanism by which reinforcement learning–based post-training improves the multimodal reasoning capability of Multimodal Large Language Models, identifying whether the gains stem from enhanced visual grounding, improved textual reasoning, or other factors.

Background

Despite the notable accuracy improvements reported for RL-trained MLLMs, the specific pathways by which RL enhances multimodal reasoning remain unclear. Understanding this mechanism is essential for designing modality-aware RL strategies that genuinely improve visual reasoning.

The authors approach this by training and evaluating models under hallucination-inductive corruptions, which force the model to rely on hallucinated trajectories. Observations from these settings help separate improvements due to visual grounding from those due to textual priors.
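The comparison described above can be illustrated with a minimal sketch. It is not the authors' protocol: the corruption used here (removing the image entirely) is an assumed stand-in for a hallucination-inductive corruption, and the names `Example`, `Model`, `drop_image`, and `grounding_gap` are hypothetical. The idea is only that a model whose accuracy survives the corruption is probably answering from textual priors, while a large drop suggests it was using the image.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical types for illustration only.
# An example is (image_bytes, question, gold_answer); image None means "image removed".
Example = Tuple[Optional[bytes], str, str]
# A model maps (image, question) to an answer string.
Model = Callable[[Optional[bytes], str], str]


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Exact-match accuracy of `model` on `examples`."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(
        model(image, question).strip() == gold.strip()
        for image, question, gold in examples
    )
    return correct / len(examples)


def drop_image(example: Example) -> Example:
    """Assumed corruption: remove the visual input, keep question and answer."""
    _, question, gold = example
    return (None, question, gold)


def grounding_gap(model: Model, examples: Sequence[Example]) -> Dict[str, float]:
    """Compare accuracy on clean inputs with accuracy on corrupted inputs.

    A gap near zero means the answers do not depend on the image;
    a large gap means the model relied on the visual input.
    """
    clean = accuracy(model, examples)
    corrupted = accuracy(model, [drop_image(e) for e in examples])
    return {"clean": clean, "corrupted": corrupted, "gap": clean - corrupted}
```

Running `grounding_gap` on the same benchmark before and after RL post-training would show whether an accuracy gain is accompanied by a larger gap (consistent with better grounding) or an unchanged one (consistent with stronger textual priors). This is a diagnostic under the stated assumptions, not a substitute for the paper's analysis.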

References

Despite the impressive gains in reasoning accuracy reported in recent RL-trained MLLMs, how RL improved the multimodal reasoning capability is still unknown.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models (2604.03179 - Zhang et al., 3 Apr 2026), Section 1, Introduction (page 1)