
On the Faithfulness of Visual Thinking: Measurement and Enhancement

Published 27 Oct 2025 in cs.CV and cs.AI | (2510.23482v1)

Abstract: Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, though it still yields correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its textual reasoning steps without considering the correctness of that information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened upon. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To analyze the visual information further, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that SCCM is annotation-free and compatible with various RFT methods for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.

Summary

  • The paper identifies that LVLMs often use inaccurate visual cues in multimodal chain-of-thought reasoning, leading to unfaithful visual representations.
  • It introduces an automated evaluation metric using causal intervention to quantify the reliability and sufficiency of visual components.
  • The proposed Sufficient-Component Cause Model (SCCM) consistently improves visual evidence alignment with correct predictions across several benchmarks.

Introduction

The paper "On the Faithfulness of Visual Thinking: Measurement and Enhancement" (2510.23482) presents a comprehensive investigation into the faithfulness of visual reasoning in multimodal chain-of-thought (MCoT) traces generated by large vision-language models (LVLMs). The study identifies a prevalent issue: the visual components of MCoT traces are often inaccurate even when the final answers are correct, revealing a deficiency in the faithfulness of the model's visual reasoning.

Problem Identification and Analysis

When fine-tuned via reinforcement fine-tuning (RFT), LVLMs tend to optimize for the format of interleaved vision-text cues rather than the correctness of the visual information. The paper employs causal intervention techniques to assess how visual and textual interventions affect the model's predictions. Results indicate that textual interventions significantly alter predictions while visual interventions do not, suggesting that visual evidence is largely underutilized (Figure 1).

Figure 1: The mistakes present in the MCoT generated by current works.
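The intervention probe can be illustrated with a toy sketch. The predictor below deliberately ignores the visual thoughts, mimicking the paper's finding; the trace representation, the `toy_predict` function, and both intervention functions are simplifications invented here, not the paper's actual interface.

```python
def toy_predict(trace):
    """Stand-in for an LVLM's answer head: here it depends only on textual thoughts."""
    return "cat" if "whiskers" in trace["text"] else "dog"

def flip_rate(traces, intervene):
    """Fraction of examples whose prediction changes after an intervention."""
    flips = 0
    for trace in traces:
        before = toy_predict(trace)
        after = toy_predict(intervene(trace))
        flips += (before != after)
    return flips / len(traces)

traces = [{"visual": "crop of ears", "text": "I see whiskers"} for _ in range(10)]

# Visual intervention: corrupt the visual thought, keep the text intact.
visual_iv = lambda t: {**t, "visual": "unrelated crop"}
# Textual intervention: corrupt the textual thought, keep the visual intact.
text_iv = lambda t: {**t, "text": "I see a tail"}

print(flip_rate(traces, visual_iv))  # 0.0 -- visual evidence is ignored
print(flip_rate(traces, text_iv))    # 1.0 -- text drives the answer
```

A faithful reasoner would show a high flip rate under both interventions; the asymmetry above is the signature of unfaithfulness the paper reports.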

Evaluation Metric and Experimental Results

To quantify the identified unfaithfulness, the paper introduces an automated evaluation metric that scores visual cues along two axes: reliability and sufficiency. The metric is implemented with an external LVLM judge that evaluates the correctness of each visual component and whether the components suffice to derive the answer. The evaluation reveals that visual cues in current MCoT traces are frequently both unreliable and insufficient, underscoring the need for improved learning strategies.
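The two-axis score might be sketched as follows. The `MockJudge` class, its method names, and the scoring protocol are assumptions made for illustration; the paper's judge is an external LVLM with its own prompts.

```python
def score_trace(visual_cues, image, answer, judge):
    """Score one MCoT trace on (reliability, sufficiency) of its visual cues."""
    # Reliability: fraction of visual cues the judge deems correct for the image.
    reliable = sum(judge.cue_is_correct(c, image) for c in visual_cues)
    reliability = reliable / len(visual_cues) if visual_cues else 0.0
    # Sufficiency: can the correct answer be derived from the visual cues alone?
    sufficiency = 1.0 if judge.answer_from_cues(visual_cues, image) == answer else 0.0
    return reliability, sufficiency

class MockJudge:
    """Toy judge: a cue is 'correct' if it mentions the image tag."""
    def cue_is_correct(self, cue, image):
        return image in cue
    def answer_from_cues(self, cues, image):
        return "red" if any("red" in c for c in cues) else "unknown"

r, s = score_trace(["img1: a red ball", "img2: a blue box"], "img1", "red", MockJudge())
print(r, s)  # 0.5 1.0
```

Separating the two axes matters: a trace can cite perfectly accurate crops that are irrelevant to the question (reliable but insufficient), or derive the answer from cues that misdescribe the image (sufficient but unreliable).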

Proposed Solution: Sufficient-Component Cause Model (SCCM)

The paper proposes Sufficient-Component Cause Model (SCCM) learning to rectify these deficiencies. SCCM requires the visual components in MCoT to be both sufficient and minimal for deriving the correct answer, ensuring that visual information contributes meaningfully to the reasoning process. The approach is annotation-free and compatible with various LVLM training methodologies (Figure 2).

Figure 2: The overview of our proposed Sufficient-Component Cause Model (SCCM) learning to establish visual information as sufficient-component causes to correct answers.
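One way to operationalize "sufficient yet minimal" is a reward that pays out only when the visual components alone yield the correct answer, discounted for redundant components. This is a hedged sketch, not the paper's reward shaping; the `answers_from` helper and the penalty coefficient `lam` are hypothetical.

```python
def sccm_reward(visual_components, answer, answers_from, lam=0.1):
    """Reward sufficiency of the visual components, penalizing redundancy."""
    # Sufficiency: the full set of visual components must yield the answer.
    if answers_from(visual_components) != answer:
        return 0.0
    # Minimality: count components whose removal does not break sufficiency.
    redundant = sum(
        answers_from(visual_components[:i] + visual_components[i + 1:]) == answer
        for i in range(len(visual_components))
    )
    return max(0.0, 1.0 - lam * redundant)

# Toy answer head: answers "7" iff a component mentioning the dial is present.
answers_from = lambda comps: "7" if any("dial" in c for c in comps) else "?"

print(sccm_reward(["crop of dial"], "7", answers_from))                 # 1.0
print(sccm_reward(["crop of dial", "crop of sky"], "7", answers_from))  # 0.9
print(sccm_reward(["crop of sky"], "7", answers_from))                  # 0.0
```

Such a reward contrasts with the format-only RL reward the paper critiques, which would pay the second and third traces equally well as long as they interleave vision and text cues.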

Experiments and Results

SCCM was empirically validated across a range of fine-grained perception and reasoning benchmarks. The results indicate consistent improvements in the faithfulness of visual reasoning, as measured by the proposed evaluation metrics. SCCM not only enhances the interpretability of LVLMs but also aligns their visual reasoning more closely with human cognitive processes (Figure 3).

Figure 3: Training dynamics on V*Bench as test dataset, with different ablation reward schemes.

Conclusion

This paper delineates the issue of unfaithful visual reasoning in LVLMs and provides a robust framework, SCCM, for improving the accuracy and utility of visual components in MCoT reasoning. Future work could integrate SCCM into more diverse multimodal applications and extend its principles to other areas of AI that require visual-text reasoning. The improvements introduced by SCCM have significant implications for developing more interpretable and reliable multimodal AI systems.
