
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

Published 27 May 2025 in cs.CV and cs.AI | (2505.20753v1)

Abstract: Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g. grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only achieves advanced performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.

Summary

  • The paper introduces a unified understand-think-answer mechanism that processes visual reasoning in a single forward pass, enhancing efficiency over multi-step toolkit-based methods.
  • The paper details a novel training pipeline using 334K curated visual instruction samples and semi-automatic expert-human curation to overcome shortcut learning in LMMs.
  • The paper demonstrates that its Griffon-R model achieves state-of-the-art or competitive results on benchmarks like CLEVR, GQA, and TallyQA, validating its compositional reasoning improvements.

The paper "Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models" (2505.20753) proposes a unified visual reasoning mechanism for Large Multimodal Models (LMMs) to address their limitations in compositional visual reasoning. Existing LMMs often rely on shortcut learning, directly mapping questions to answers, which hinders their ability to perform complex reasoning tasks and can lead to hallucinations. While methods like Chain-of-Thought (CoT) and toolkit-based approaches attempt to improve reasoning, they often require multiple inference steps or external tools, leading to inefficiencies and reduced generality.

The proposed unified mechanism introduces a human-like "understand-think-answer" process that operates in a single forward pass.

  1. Understand: The model first analyzes the question and image to determine the necessary information for answering the question and plans how to acquire it using intrinsic capabilities (like visual grounding, object detection, text recognition, captioning, etc.). It generates structured instructions or cues to guide the gathering of relevant visual information. This is a flexible process tailored to the specific question, rather than relying on fixed reasoning paths or external tool calls. If relevant information is not found, the model indicates this, avoiding misleading outputs.
  2. Think: Based on the visual cues gathered during the understanding phase, the model engages in self-prompted contextual thinking. Leveraging the capabilities of the underlying LLM, it processes the visual information and the original question.
  3. Answer: Finally, the model generates the ultimate answer based on the understanding and thinking processes. This entire sequence, from understanding to thinking to answering, is performed in a single autoregressive generation pass; a minimal code sketch of this single-pass format follows this list.
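The exact target format is not reproduced in this summary, so the following is only a minimal sketch: it assumes hypothetical `Understanding:` / `Thinking:` / `Answer:` markers and a generic Hugging Face-style model/processor interface, and shows how all three stages can be produced in one autoregressive generation call and the final answer parsed out afterwards.

```python
import re

# Hypothetical prompt and section markers; the paper's actual templates may differ.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "First state the visual evidence needed (Understanding:), then reason over it "
    "(Thinking:), then give the final result (Answer:)."
)

def understand_think_answer(model, processor, image, question, max_new_tokens=512):
    """Run understanding, thinking, and answering in one autoregressive pass.

    `model` and `processor` are assumed to expose a generic Hugging Face-style
    vision-language interface; no external tools or extra inference calls are used.
    """
    inputs = processor(images=image,
                       text=PROMPT_TEMPLATE.format(question=question),
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

    # The three stages are parsed from the single generated sequence.
    match = re.search(r"Understanding:(.*?)Thinking:(.*?)Answer:(.*)", text, re.S)
    if match is None:
        return {"understanding": None, "thinking": None, "answer": text.strip()}
    understanding, thinking, answer = (part.strip() for part in match.groups())
    return {"understanding": understanding, "thinking": thinking, "answer": answer}
```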

To train LMMs to follow this mechanism, the authors curated a dataset of 334K visual instruction samples. They developed a semi-automatic expert-supervised data engine for this purpose; a code sketch of the curation loop follows the list below. The process involves:

  • Progressive Annotation with AI Expert: Using a state-of-the-art LMM (Qwen2-VL-72B) as an AI expert to analyze questions, plan information acquisition steps, and annotate tasks it excels at (e.g., global captions, text recognition). Instructions guide the AI to identify key objects/entities and the necessary intrinsic capabilities.
  • Curation with Human Expert: Human experts complete tasks the AI struggles with, particularly multi-object visual grounding or partial text recognition. They also review the AI-generated annotations for quality, ensuring the understanding process is logical and annotations are accurate, and filtering out overly simplistic samples.
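As a rough illustration of this pipeline, the sketch below encodes the two curation roles described above. The callables `ai_expert_annotate`, `needs_human_expert`, `human_review`, and `is_too_simple` are hypothetical stand-ins for the Qwen2-VL-72B annotation step and the human-expert curation steps, not functions from the released code.

```python
def build_reasoning_sample(image, question, answer,
                           ai_expert_annotate, needs_human_expert,
                           human_review, is_too_simple):
    """Produce one understand-think-answer training sample (hypothetical sketch)."""
    # 1) Progressive annotation with the AI expert: identify key objects/entities,
    #    plan which intrinsic capabilities are needed, and annotate the tasks the
    #    expert handles well (global captions, text recognition, ...).
    draft = ai_expert_annotate(image, question)

    # 2) Hand tasks the AI expert struggles with (multi-object grounding, partial
    #    text recognition) to a human expert, who also verifies the reasoning chain.
    if needs_human_expert(draft):
        draft = human_review(image, question, draft)

    # 3) Drop overly simplistic samples before adding them to the curated set.
    if is_too_simple(draft):
        return None

    return {"image": image, "question": question,
            "understanding": draft, "answer": answer}
```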

The curated dataset covers both general scenes and text-rich scenes and incorporates multi-task instruction-following data from various public datasets, including VQA datasets (GQA, VAW, VizWiz, ChartQA, DUE_Benchmark, TextVQA), instruction data (LLaVA, ALLaVA, LVIS-Instruct4V), and caption data (ShareGPT-4V).
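Restated as a hypothetical configuration (the authors' actual mixture ratios and file layout are not specified in this summary), the data sources group roughly as follows.

```python
# Hypothetical grouping of the public data sources named above; mixture ratios
# and file paths are assumptions left out here.
DATA_SOURCES = {
    "vqa": ["GQA", "VAW", "VizWiz", "ChartQA", "DUE_Benchmark", "TextVQA"],
    "instruction": ["LLaVA", "ALLaVA", "LVIS-Instruct4V"],
    "caption": ["ShareGPT-4V"],
    "curated_reasoning": ["334K understand-think-answer samples "
                          "(general and text-rich scenes)"],
}
```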

Based on this mechanism and data, the authors developed Griffon-R, an LMM built upon the Griffon v2 architecture (Zhan et al., 2024). Griffon-R utilizes a single-branch, high-resolution structure with a visual encoder (CLIP-ViT-L/14-336), a vision-language connector, and an LLM (Gemma-9B). It is trained in a multi-stage process:

  1. Stage 1: Pretraining the vision-language connector with visual captioning data (ShareGPT-4V).
  2. Stage 2: Pretraining the whole model on a diverse set of perception tasks including Referring Expression Comprehension/Generation, Visual Grounding, and Object Detection, as well as general language and instruction following data.
  3. Stage 3: Fine-tuning the whole model using the curated visual reasoning data combined with other general VQA and instruction data. Standard cross-entropy loss is used. A schematic of the full schedule follows this list.
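The schedule can be summarized as a simple configuration. The module names and the assumption that only the connector is trainable in Stage 1 go beyond what the summary states (and are a common convention rather than a confirmed detail); Stages 2 and 3 train the whole model as described above.

```python
# Schematic of the three-stage schedule; Stage 1's freeze/train split is an assumption.
TRAINING_STAGES = [
    {
        "name": "stage1_connector_pretrain",
        "trainable": ["connector"],          # visual encoder and LLM assumed frozen
        "data": ["ShareGPT-4V captions"],
    },
    {
        "name": "stage2_perception_pretrain",
        "trainable": ["visual_encoder", "connector", "llm"],
        "data": ["REC/REG", "visual grounding", "object detection",
                 "general language and instruction following"],
    },
    {
        "name": "stage3_reasoning_finetune",
        "trainable": ["visual_encoder", "connector", "llm"],
        "data": ["curated 334K reasoning samples", "general VQA", "instruction data"],
    },
]

# Every stage optimizes the standard next-token cross-entropy loss, e.g.
#   loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
```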

Experiments were conducted on various visual reasoning benchmarks (CLEVR, VSR, GQA, V-Star (Spat.), TallyQA) and general multimodal benchmarks (MMBench, ScienceQA, TextVQA, SEED, LLaVA Bench, POPE).

Results demonstrate that Griffon-R achieves state-of-the-art or competitive performance across these benchmarks, particularly excelling in compositional visual reasoning tasks like VSR, CLEVR, V-Star (Spat.), and TallyQA. It also shows strong performance on general multimodal tasks and text-rich VQA, surpassing many advanced LMMs and methods specifically designed with CoT or toolkits. Ablation studies confirm the effectiveness of the proposed mechanism and the curated data. The understanding quality, measured by REC performance, is high. The mechanism shows better accuracy and significantly faster inference time compared to a toolkit-based approach on V-Star (Spat.). Training with the curated annotations aligned with the mechanism is shown to be crucial for achieving robust visual reasoning.
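REC accuracy is conventionally computed as the fraction of referring expressions whose predicted box overlaps the ground-truth box with IoU above 0.5; whether the paper uses exactly this threshold is an assumption here. A minimal sketch of that conventional metric:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def rec_accuracy(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Fraction of referred objects localized with IoU above the threshold."""
    hits = sum(box_iou(p, g) > iou_threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```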

The authors note limitations, including potentially longer response times in complex scenarios involving associated objects, since the generated output sequence grows, and the inheritance of limitations from the AI expert model used in data curation, such as occasional inaccuracies (mitigated by human verification). They also highlight data usage restrictions inherited from the source models and datasets.

In conclusion, the paper presents a novel "understand-think-answer" mechanism and a corresponding high-quality dataset for training LMMs in compositional visual reasoning. The resulting model, Griffon-R, demonstrates improved reasoning capabilities and overall multimodal performance in an efficient, end-to-end manner without external tool reliance.
