Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

Published 5 Jan 2025 in cs.CV, cs.CL, and cs.LG (arXiv:2501.02669v2)

Abstract: Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning, even compared to LLMs on the same tasks presented in text form, giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon, including measures of gradient alignment that can identify training strategies promoting better S2H generalization. Ablations highlight the importance of chain-of-thought.

Summary

  • The paper demonstrates that VLMs trained on SIMPLE tasks suffer a sharp performance drop on the corresponding HARD tasks, with the drop far larger for image inputs than for text, evidencing modality imbalance.
  • It introduces three synthetic tasks (Table Readout, Grid Navigation, and Visual Analogy) that systematically evaluate cross-modality reasoning by comparing matched text-only and image-only variants.
  • The study shows that mixed supervision and explicit image-to-text conversion significantly boost VLM performance on complex visual reasoning tasks.

Addressing Modality Imbalance in Vision Language Models Through Synthetic Tasks

In contemporary discussions surrounding Vision Language Models (VLMs), issues of "modality imbalance" have gained significant attention, particularly when VLMs are applied to visual reasoning tasks. The paper "Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?" addresses these concerns methodically, proposing a synthetic framework and associated tasks to scrutinize and mitigate VLM performance discrepancies in cross-modality reasoning.

The overarching goal of the study is to illuminate why VLMs, while adept at tasks like visual question answering (VQA) and image captioning, often perform worse on multi-step reasoning in visual contexts. This is framed as a modality imbalance: the models' reasoning ability differs depending on whether the same information is conveyed visually or textually.

Synthetic Framework and Tasks

The authors propose a suite of tasks aimed at systematic evaluation of algorithmic visual reasoning (AVR). These include:

  1. Table Readout: Sequentially read out numbers along a path from a starting cell to an ending cell in a grid, with the input provided either as an image or as LaTeX text. The task tests the VLMs' ability to comprehend and navigate tabular data.
  2. Grid Navigation: Navigate a graphical grid from a start point to a destination while collecting specified objects and avoiding obstacles, requiring spatial reasoning akin to pathfinding.
  3. Visual Analogy: Mimicking human IQ tests, identify the pattern relating geometric shapes, with attributes such as color and size, across panels, and apply it to a new instance. This requires mapping abstract relations and logical reasoning.

Each task comes in a SIMPLE and a HARD variant to test whether models can generalize from simple to complex instances. Crucially, each variant also has an equivalent text-only counterpart, allowing a direct comparison between text and image presentations and thereby a quantification of the modality imbalance.
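As a concrete illustration of the setup, here is a minimal, hypothetical sketch of how a Table Readout instance could be generated. The function name, grid sizes, and self-avoiding-path scheme are assumptions for illustration, not the paper's actual generator:

```python
import random

def make_table_readout(rows, cols, path_len, seed=0):
    """Generate a toy Table Readout instance: a grid of random numbers
    and a connected path whose cell values must be read out in order.
    (Hypothetical sketch; the paper's generator may differ.)"""
    rng = random.Random(seed)
    grid = [[rng.randint(0, 99) for _ in range(cols)] for _ in range(rows)]
    r, c = rng.randrange(rows), rng.randrange(cols)
    path = [(r, c)]
    while len(path) < path_len:
        # Candidate next cells: in-bounds neighbors not yet visited.
        moves = [(r + dr, c + dc)
                 for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
                 if 0 <= r + dr < rows and 0 <= c + dc < cols
                 and (r + dr, c + dc) not in path]
        if not moves:  # dead end: stop early with a shorter path
            break
        r, c = rng.choice(moves)
        path.append((r, c))
    answer = [grid[r][c] for r, c in path]  # ground-truth readout sequence
    return grid, path, answer

# SIMPLE vs. HARD would differ only in scale: larger grids, longer paths.
simple = make_table_readout(rows=4, cols=4, path_len=5)
hard = make_table_readout(rows=10, cols=10, path_len=20)
```

The same instance can then be rendered either as an image or as LaTeX tabular source, giving the matched image/text pair the framework relies on.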

Key Findings and Approaches

Generalization Insights: A fundamental observation in the study is that VLMs trained on SIMPLE tasks demonstrate a significant drop in performance when generalizing to HARD tasks, particularly when the tasks are presented in image format. This suggests a modality gap rooted in the differences in processing visual and textual data.
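Both quantities discussed here, the SIMPLE-to-HARD drop within a modality and the imbalance between modalities, reduce to simple accuracy differences. A hypothetical bookkeeping sketch (function names are illustrative, not from the paper):

```python
def accuracy(preds, labels):
    """Fraction of exact-match answers."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def s2h_drop(simple_acc, hard_acc):
    """Accuracy drop going from SIMPLE to HARD within one modality."""
    return simple_acc - hard_acc

def modality_imbalance(text_hard_acc, image_hard_acc):
    """Gap between text-only and image-only accuracy on the HARD split."""
    return text_hard_acc - image_hard_acc

# Toy numbers: a model that transfers well in text but poorly in images.
text_gap = s2h_drop(simple_acc=0.95, hard_acc=0.80)
image_gap = s2h_drop(simple_acc=0.90, hard_acc=0.30)
imbalance = modality_imbalance(text_hard_acc=0.80, image_hard_acc=0.30)
```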

Mitigation Strategies: The researchers propose several training strategies incorporating mixed supervision (both text and image data) to bridge the modality gap:

  • Image Reasoning via Text Conversion: Train the model to first convert the image content to a textual representation and then reason over that text, leveraging the stronger reasoning capabilities available in the text modality.
  • Mix Supervision: Integrate text and image inputs together with image-to-text conversion tasks to create cross-modality synergy. The study reports that this approach significantly improves VLM performance on HARD tasks presented as images.
  • Alignment-Focused Training: Motivated by the gradient-alignment measurements, insert an initial alignment phase on SIMPLE tasks to bring text and image reasoning into agreement before tackling HARD tasks.
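The gradient-alignment measurements mentioned above can be illustrated under the assumption that alignment is computed as cosine similarity between flattened gradients from a text-format batch and an image-format batch; the paper's exact measure may differ:

```python
import math

def gradient_alignment(grad_text, grad_image):
    """Cosine similarity between two flattened gradient vectors.
    Values near 1 mean the text and image losses pull the parameters
    in similar directions; values near -1 mean they conflict."""
    dot = sum(a * b for a, b in zip(grad_text, grad_image))
    denom = (math.sqrt(sum(a * a for a in grad_text))
             * math.sqrt(sum(b * b for b in grad_image)))
    return dot / denom if denom else 0.0

# Toy stand-ins for gradients from a text batch and an image batch;
# in practice these would come from backprop through the VLM.
aligned = gradient_alignment([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # ≈ 1.0 (parallel)
conflicting = gradient_alignment([1.0, 0.0], [-1.0, 0.0])        # ≈ -1.0 (opposed)
```

A training strategy whose text and image gradients stay well aligned would, by the paper's argument, be a better candidate for promoting S2H generalization on images.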

Implications and Future Directions

This research underscores the value of synthetic, structured tasks in dissecting the capabilities and limitations of VLMs. Addressing modality imbalance holds substantial implications for enhancing the robustness of multimodal AI systems in real-world applications — from autonomous systems requiring intricate spatial reasoning to visual data integration in complex analytics.

Future research directions include internalizing the conversion-based reasoning process within VLMs so that inference-time efficiency does not depend on exhaustive input-to-text conversion, and expanding evaluation to more varied datasets reflective of realistic environments. Combining such theoretical insights with empirical method development could substantially change how AI systems comprehend and integrate diverse modal inputs.
