MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Published 14 Oct 2024 in cs.CV | (2410.10563v2)

Abstract: We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.


Authors (16)


Summary

The paper introduces MEGA-Bench, a novel benchmark that systematically evaluates vision-language models on 505 real-world multimodal tasks.
It accepts varied output formats and draws on tasks collected by 16 expert annotators to provide detailed performance insights across applications such as coding, perception, and planning.
Results show GPT-4o as the top-performing model among those evaluated and indicate that chain-of-thought prompting significantly benefits proprietary models, while open-source models show mixed results.

Comprehensive Evaluation of Multimodal Models with MEGA-Bench

The paper presents MEGA-Bench, a multimodal evaluation suite for systematically assessing vision-language models (VLMs). It covers over 500 real-world tasks while keeping evaluation cost-effective, and so offers broader coverage than existing benchmarks, which often focus on a single task or a narrow range of tasks.

Key Features

MEGA-Bench is structured to report model capability along several dimensions (application, input type, output format, and skill). Unlike prior benchmarks that rely heavily on multiple-choice questions, it accepts many output formats, including numerical, structured, open-ended, and contextual answers, and scores them with over 40 metrics. The benchmark comprises 505 tasks with more than 8,000 samples, collected by 16 expert annotators.
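Scoring answers in several output formats amounts to dispatching each sample to a metric suited to its format. The following is a minimal sketch of that idea, not MEGA-Bench's actual metric code: the function names, the format labels, the 1% numeric tolerance, and the field-wise JSON scoring rule are all assumptions made for illustration.

```python
import json
import math


def score_phrase(pred: str, ref: str) -> float:
    """Exact match after lowercasing and collapsing whitespace."""
    normalize = lambda s: " ".join(s.lower().split())
    return float(normalize(pred) == normalize(ref))


def score_number(pred: str, ref: str, rel_tol: float = 1e-2) -> float:
    """1.0 if the prediction parses as a number within rel_tol of the reference."""
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=rel_tol))
    except ValueError:
        # Unparseable predictions simply score zero.
        return 0.0


def score_json(pred: str, ref: str) -> float:
    """Fraction of reference fields reproduced exactly.

    Assumes the reference is a non-nested JSON object (hypothetical rule).
    """
    try:
        pred_obj = json.loads(pred)
    except json.JSONDecodeError:
        return 0.0
    ref_obj = json.loads(ref)
    if not isinstance(pred_obj, dict) or not ref_obj:
        return 0.0
    hits = sum(1 for key, value in ref_obj.items() if pred_obj.get(key) == value)
    return hits / len(ref_obj)


# Hypothetical registry mapping an output-format label to its metric.
METRICS = {
    "phrase": score_phrase,
    "number": score_number,
    "json": score_json,
}


def score(output_format: str, pred: str, ref: str) -> float:
    """Dispatch a prediction to the metric registered for its output format."""
    return METRICS[output_format](pred, ref)
```

A registry like this keeps each metric small and lets new formats be added without touching the dispatch logic, which is one plausible way to manage dozens of metrics.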

Evaluation and Findings

The paper evaluates a range of state-of-the-art models, including proprietary models like GPT-4o and open-source models such as Qwen2-VL-72B. Key findings include:

Performance Hierarchy: GPT-4o is the top-performing model among those evaluated, leading on several skill dimensions. This is attributed to its stronger performance on tasks that require multimodal alignment and logical reasoning.
Chain-of-Thought (CoT) Prompting: Proprietary models benefit significantly from CoT prompting, which supports their reasoning, whereas open-source models show mixed results and often struggle to generate coherent reasoning chains.
Diverse Task Coverage: The benchmark's extensive task taxonomy covers applications such as coding, information extraction, perception, and planning, which exposes each model's strengths and shortcomings per application.
Inference Efficiency: To limit compute, the benchmark expands task diversity rather than the number of samples per task (about 16 per task on average, given over 8,000 samples across 505 tasks), aiming for reliable scores from few examples.
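When a benchmark has many tasks with few samples each, one natural way to summarize it is a macro average: average the sample scores within each task, then average those task means so every task counts equally. The sketch below illustrates that aggregation under the assumption that each result is a hypothetical `(task_name, score)` pair; it is not presented as the paper's exact scoring formula.

```python
from collections import defaultdict
from statistics import mean


def macro_average(results):
    """Return (overall score, per-task means) for (task_name, score) pairs.

    Each task contributes equally to the overall score, regardless of how
    many samples it has.
    """
    per_task = defaultdict(list)
    for task, sample_score in results:
        per_task[task].append(sample_score)
    task_means = {task: mean(scores) for task, scores in per_task.items()}
    return mean(task_means.values()), task_means


# Hypothetical results: one task with two samples, one task with a single sample.
results = [("chart_qa", 1.0), ("chart_qa", 0.0), ("ocr", 1.0)]
overall, task_means = macro_average(results)
```

In this toy input the macro average is 0.75, whereas pooling all three samples would give 2/3; the difference shows how macro averaging keeps larger tasks from dominating the overall score.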

Implications and Future Directions

MEGA-Bench gives a granular view of model competencies across multiple dimensions. This breadth helps developers identify where a model needs improvement and tailor models to specific applications. Its format-specific evaluation metrics also indicate how useful these models are in real-world scenarios.
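A per-dimension view of this kind can be produced by tagging each task along every dimension and averaging task scores within each tag. The sketch below assumes hypothetical task names, tag values, and a simple dictionary layout; the dimension names come from the abstract, but the data structures are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def capability_report(task_scores, task_tags):
    """Average task scores per tag within each dimension.

    task_scores: {task_name: score}
    task_tags:   {task_name: {dimension: tag}}
    Returns:     {dimension: {tag: mean score}}
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for task, task_score in task_scores.items():
        for dimension, tag in task_tags[task].items():
            buckets[dimension][tag].append(task_score)
    return {
        dimension: {tag: mean(scores) for tag, scores in tags.items()}
        for dimension, tags in buckets.items()
    }


# Hypothetical tasks and tags, for illustration only.
task_scores = {"chart_qa": 0.5, "ui_code": 1.0, "ocr": 0.0}
task_tags = {
    "chart_qa": {"application": "perception", "output_format": "number"},
    "ui_code": {"application": "coding", "output_format": "code"},
    "ocr": {"application": "perception", "output_format": "phrase"},
}
report = capability_report(task_scores, task_tags)
```

Here `report["application"]` separates perception from coding, while `report["output_format"]` slices the same task scores by answer format, so one evaluation run yields several complementary breakdowns.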

Going forward, MEGA-Bench suggests several avenues for future research. Models, particularly open-source ones, could be refined to use CoT prompting more effectively. The benchmark could also evolve to include more interactive, real-time evaluations that simulate realistic application environments.

In conclusion, MEGA-Bench is a substantial step forward in evaluating multimodal models, giving the AI research community a robust tool for developing more capable and versatile vision-language models.
