
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Published 21 Apr 2024 in cs.CV and cs.LG | (2404.13591v2)

Abstract: While multi-modal LLMs (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks consider only a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even to count the panels in the puzzle (<45%), hindering their ability to reason abstractly. We release our entire code and dataset.


Summary

  • The paper introduces MARVEL, a benchmark that evaluates MLLMs' ability to solve abstract visual reasoning puzzles using diverse patterns and task configurations.
  • It employs a hierarchical evaluation framework with both coarse and fine-grained perception questions to assess visual details and spatial relationships.
  • Experimental results show near-random performance by MLLMs and a 40% human advantage, underscoring the need for improved abstraction mechanisms in AI.

Introduction

"MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning" (2404.13591) introduces a novel benchmark designed to evaluate the abstract visual reasoning (AVR) capabilities of multi-modal LLMs (MLLMs). Unlike traditional visual reasoning tasks, AVR requires models to identify high-level patterns that dictate the arrangement of input shapes in specified configurations. The paper critiques the limitations of existing AVR benchmarks, which predominantly focus on simplistic patterns and predefined configurations, and proposes MARVEL as a comprehensive alternative.

Benchmark Design

MARVEL is characterized by its diversity across six foundational knowledge patterns, geometric and abstract input shapes, and five distinct task configurations:

  • Patterns: Testing spans six patterns derived from core knowledge theory in human cognition: Temporal Movement, Spatial Relationship, Quantities, Mathematical Operations, 2D-Geometry, and 3D-Geometry.
  • Input Shapes: Puzzles contain both geometric and abstract shapes, the latter being rarely encountered in typical natural language processing tasks, to ensure a robust evaluation of MLLMs.
  • Task Configurations: The benchmark incorporates five configurations (Sequence, Two-row, Matrix, Group, and Reassembling) to present puzzles in varied forms.

    Figure 1: An abstract visual reasoning puzzle in MARVEL. The puzzle contains a mathematical pattern governing the number of elements in geometric shapes, presented in the Two-row task configuration.
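The taxonomy above can be sketched as a simple data record. This is an illustrative schema only; the field and constant names are assumptions, not the released dataset's actual format.

```python
from dataclasses import dataclass
from typing import List

# Taxonomy from the benchmark description; names are illustrative.
PATTERNS = ["Temporal Movement", "Spatial Relationship", "Quantities",
            "Mathematical Operations", "2D-Geometry", "3D-Geometry"]
CONFIGURATIONS = ["Sequence", "Two-row", "Matrix", "Group", "Reassembling"]

@dataclass
class AVRPuzzle:
    """One hypothetical MARVEL-style puzzle record."""
    pattern: str        # one of the six core knowledge patterns
    shape_type: str     # "geometric" or "abstract"
    configuration: str  # one of the five task configurations
    choices: List[str]  # answer panel labels, e.g. ["A", "B", "C", "D"]
    answer: str         # gold answer label

    def __post_init__(self):
        # Validate against the taxonomy on construction.
        assert self.pattern in PATTERNS
        assert self.configuration in CONFIGURATIONS
        assert self.answer in self.choices

# Example mirroring the puzzle in Figure 1.
puzzle = AVRPuzzle(
    pattern="Quantities",
    shape_type="geometric",
    configuration="Two-row",
    choices=["A", "B", "C", "D"],
    answer="B",
)
```

A record like this makes it easy to slice accuracy by pattern or configuration, which is how the paper reports its results.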

Hierarchical Evaluation Framework

MARVEL employs a hierarchical evaluation framework that includes perception questions designed to assess models' understanding of visual details. These questions are categorized into:

  • Coarse-Grained Perception: Open-ended questions that gauge models' ability to count panels, grids, or shapes accurately.
  • Fine-Grained Perception: Binary-choice questions focusing on detailed visual features such as shape attributes or spatial relationships.

Perception questions help determine whether inadequate visual understanding affects the models' abstract reasoning capabilities.

Figure 2: MLLMs and human performance across patterns and task configurations.
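The hierarchical framework's core idea can be sketched as a scoring rule: an AVR answer only counts as "grounded" when the model also passes the paired perception questions. The function and key names below are illustrative, not the paper's released evaluation API.

```python
def score(records):
    """Score a list of per-puzzle result dicts with boolean fields
    'avr_correct', 'coarse_correct', and 'fine_correct'."""
    n = len(records)
    # Raw AVR accuracy, ignoring perception.
    avr = sum(r["avr_correct"] for r in records) / n
    # Perception accuracy: both coarse- and fine-grained questions correct.
    perception = sum(r["coarse_correct"] and r["fine_correct"]
                     for r in records) / n
    # Grounded accuracy: AVR answer correct AND perception verified,
    # filtering out lucky guesses without visual understanding.
    grounded = sum(r["avr_correct"] and r["coarse_correct"] and r["fine_correct"]
                   for r in records) / n
    return {"avr_acc": avr, "perception_acc": perception,
            "grounded_acc": grounded}

results = score([
    {"avr_correct": True,  "coarse_correct": True,  "fine_correct": True},
    {"avr_correct": True,  "coarse_correct": False, "fine_correct": True},
    {"avr_correct": False, "coarse_correct": True,  "fine_correct": True},
    {"avr_correct": False, "coarse_correct": False, "fine_correct": False},
])
```

Separating `grounded_acc` from `avr_acc` is what lets the paper attribute near-random AVR performance to perception failures rather than reasoning alone.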

Experimental Results

Experiments conducted on MARVEL involved nine MLLMs in zero-shot and few-shot settings. The key findings include:

  • Models' Performance: All tested MLLMs demonstrated near-random performance in solving AVR puzzles, indicating significant deficits in abstract reasoning: humans outperformed the models by roughly 40%.
  • Impact of Few-Shot Learning: Implementing few-shot Chain-of-Thought (CoT) prompting showed minimal improvement. Models struggled to generalize abstract patterns from demonstrations, particularly with the diverse and novel input shapes in MARVEL.

    Figure 3: The example is formatted in Sequence configuration with the Quantities pattern. The answer to this puzzle is B.
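A few-shot CoT prompt for an AVR query is typically assembled by prepending worked demonstrations to the test question. The sketch below shows the general pattern; the exact prompt wording and demonstration format used in the paper may differ.

```python
def build_cot_prompt(demos, question, choices):
    """Assemble a few-shot chain-of-thought prompt.

    demos: list of (question, rationale, answer) tuples serving as
    worked examples; question/choices describe the test puzzle.
    """
    parts = []
    # Each demonstration shows the reasoning chain before the answer.
    for q, rationale, ans in demos:
        parts.append(f"Question: {q}\nReasoning: {rationale}\nAnswer: {ans}")
    # The test query ends with a CoT trigger and no answer.
    option_str = " ".join(f"({c})" for c in choices)
    parts.append(f"Question: {question}\nOptions: {option_str}\n"
                 "Reasoning: Let's think step by step.")
    return "\n\n".join(parts)

# Hypothetical demonstration and test question for illustration.
prompt = build_cot_prompt(
    demos=[("Which panel continues the sequence?",
            "Each panel adds one dot, so the next panel has four dots.",
            "C")],
    question="Which panel follows the pattern in the two rows?",
    choices=["A", "B", "C", "D"],
)
```

As the results above note, such demonstrations helped little: the models failed to transfer the demonstrated pattern to puzzles with novel shapes.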

Implications and Future Research

The findings from MARVEL suggest that current MLLMs lack the efficient visual abstraction mechanisms crucial for robust AVR. The performance gap signals the need for training data covering more diverse visual patterns and for model architectures that interpret visual nuances more accurately.

Figure 4: The example is formatted in Sequence configuration with the Temporal Movement pattern. The answer to this puzzle is C.

Conclusion

MARVEL presents a challenging yet necessary step forward in evaluating and improving MLLMs' AVR capabilities. By highlighting the perceptual limitations and reasoning inconsistencies of state-of-the-art models, the benchmark serves as a catalyst for further research aimed at refining visual reasoning in AI systems.

Figure 5: The example is formatted in Two-row configuration with the Spatial Relationship pattern. The answer to this puzzle is B.

The paper's contribution to the AVR domain is an essential addition to multimodal AI research, urging the development of models whose abstract reasoning approaches human level. The insights gained can guide future innovations in training methodologies and model design.
