MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Published 14 Oct 2024 in cs.CV, cs.CL, and cs.LG | (2410.10139v2)

Abstract: Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code in https://mmie-bench.github.io/.

Summary

The paper introduces MMIE—a novel, large-scale benchmark for assessing interleaved multimodal comprehension across 20K queries in 12 fields and 102 subfields.
The paper demonstrates that current LVLMs struggle with complex interleaved inputs and outputs: even the top model scores only 65.47%.
The paper proposes a reliable automated scoring model, fine-tuned with human annotations, that aligns closely with human evaluation and reduces assessment biases.

The paper introduces MMIE (Massive Multimodal Interleaved Comprehension Evaluation), a novel benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). The benchmark comprises 20K multimodal queries across 3 categories, 12 fields, and 102 subfields, covering mathematics, coding, physics, literature, health, and arts, among others. MMIE supports interleaved inputs and outputs in both multiple-choice and open-ended question formats. The authors also propose an automated evaluation metric based on a scoring model fine-tuned with human-annotated data and systematic evaluation criteria.

The primary contributions of the paper are:

The introduction of MMIE, a large-scale interleaved multimodal benchmark for evaluating LVLMs.
An empirical demonstration of MMIE's difficulty: the best-performing model (GPT-4o + SDXL) achieves a score of 65.47%, indicating significant room for improvement.
A proposed scoring model that is shown to be reliable and comparable to human evaluation.

The paper addresses two key challenges in the evaluation of interleaved multimodal generation:

The difficulty in constructing modality-coherent benchmarks.
The lack of automated evaluation metrics.

To address these challenges, the MMIE benchmark was built from four multimodal datasets and organized into three categories: situational analysis, project-based learning, and multi-step reasoning. Data curation involved collecting existing datasets and restructuring them into an interleaved image-and-text format. A multi-step quality control process was then applied to ensure the integrity and consistency of the dataset.
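The interleaved image-and-text format described above can be pictured as an ordered sequence of text and image segments. The following is a minimal sketch of one way to represent such a query; the class names, field names, and file path are hypothetical illustrations and are not taken from the MMIE release, whose actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    path: str  # hypothetical local path or URL to the image


Segment = Union[TextSegment, ImageSegment]


@dataclass
class InterleavedQuery:
    # The three category names follow the summary; the schema itself is assumed.
    category: Literal[
        "situational_analysis", "project_based_learning", "multi_step_reasoning"
    ]
    field_name: str
    question: List[Segment]
    answer_format: Literal["multiple_choice", "open_ended"]
    reference: List[Segment] = field(default_factory=list)


def modality_pattern(segments: List[Segment]) -> str:
    """Return a string such as 'TIT' describing the text/image order."""
    return "".join("T" if isinstance(s, TextSegment) else "I" for s in segments)


def is_interleaved(segments: List[Segment]) -> bool:
    """True when the sequence contains both text and image segments."""
    kinds = {type(s) for s in segments}
    return TextSegment in kinds and ImageSegment in kinds


# Made-up example query, for illustration only.
query = InterleavedQuery(
    category="multi_step_reasoning",
    field_name="physics",
    question=[
        TextSegment("A block slides down the incline shown here."),
        ImageSegment("images/incline.png"),
        TextSegment("What is its acceleration?"),
    ],
    answer_format="multiple_choice",
)
print(modality_pattern(query.question), is_interleaved(query.question))
```

A check like `is_interleaved` is the kind of simple structural test a quality-control pass could run on each restructured sample, although the summary does not say which checks the authors actually used.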

The automated evaluation metric involves fine-tuning InternVL-2-4B with a high-quality multimodal scoring dataset, accompanied by detailed scoring criteria and reference answers. The fine-tuned model is then used as the scoring model.

The experimental setup involves benchmarking four open-source interleaved LVLMs (MiniGPT-5, EMU-2, GILL, and Anole) alongside integrated LVLMs such as GPT-4o + SDXL. The models were evaluated using the proposed metric, and the metric's scores were compared with human annotations using cosine similarity, mean squared error (MSE), mean absolute error (MAE), and the Pearson correlation coefficient.
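The four agreement measures named above can be computed directly from two aligned lists of scores. The following is a minimal sketch using their standard definitions; the function name, the score scale, and the example numbers are assumptions made for illustration and are not drawn from the paper.

```python
import math
from typing import Dict, Sequence


def agreement_metrics(
    model_scores: Sequence[float], human_scores: Sequence[float]
) -> Dict[str, float]:
    """Compare automated scores with human scores for the same samples."""
    if len(model_scores) != len(human_scores) or len(model_scores) < 2:
        raise ValueError("need two equally sized lists with at least 2 items")
    n = len(model_scores)
    pairs = list(zip(model_scores, human_scores))

    # Cosine similarity: angle between the two score vectors.
    dot = sum(m * h for m, h in pairs)
    norm_m = math.sqrt(sum(m * m for m in model_scores))
    norm_h = math.sqrt(sum(h * h for h in human_scores))
    if norm_m == 0 or norm_h == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")

    # MSE and MAE: average squared and absolute disagreement per sample.
    mse = sum((m - h) ** 2 for m, h in pairs) / n
    mae = sum(abs(m - h) for m, h in pairs) / n

    # Pearson correlation: covariance normalized by both standard deviations.
    mean_m = sum(model_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((m - mean_m) * (h - mean_h) for m, h in pairs)
    var_m = sum((m - mean_m) ** 2 for m in model_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_m == 0 or var_h == 0:
        raise ValueError("Pearson correlation is undefined for constant scores")

    return {
        "cosine": dot / (norm_m * norm_h),
        "mse": mse,
        "mae": mae,
        "pearson": cov / math.sqrt(var_m * var_h),
    }


# Made-up scores on an assumed 0-6 scale, for illustration only.
scorer = [4.0, 2.0, 5.0, 3.0, 1.0]
human = [4.0, 3.0, 5.0, 2.0, 1.0]
print(agreement_metrics(scorer, human))
```

Higher cosine similarity and Pearson correlation, together with lower MSE and MAE, indicate closer agreement between an automated scorer and the human annotators.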

Key findings from the experiments include:

The evaluated interleaved LVLMs achieve an average score of only 50.80%, highlighting the difficulty of the benchmark.
Integrated LVLMs, which pair a text-output LVLM with a text-to-image generator (for example, GPT-4o + SDXL), outperform open-source interleaved LVLMs by an average of 25.2%.
The integrated models exceed the best interleaved model's score by 14.6%, 26.3%, and 16.1% in situational analysis, project-based learning, and multi-step reasoning, respectively.
The fine-tuned scoring model aligns most closely with human evaluation results, making it the most reliable of the metrics compared.

Error analysis reveals two main failure types: temporal understanding and reasoning ability. Temporal understanding errors relate to multimodal information comprehension and cross-modality coherence, while reasoning errors involve complex reasoning and generation capabilities. The specific error categories the authors identify are cross-modality coherence, generation adaptability, multimodal information comprehension, and complex reasoning.

The paper concludes by highlighting the challenges and opportunities in interleaved multimodal tasks and states that the proposed metrics provide robust, human-like evaluation performance, significantly reducing errors and biases.