
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Published 30 Jul 2023 in cs.CL and cs.CV | arXiv:2307.16125v2

Abstract: Based on powerful LLMs, recent generative Multimodal LLMs (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (×6 larger than existing benchmarks), which span 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with ground-truth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights that motivate future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.


Summary

  • The paper introduces SEED-Bench, which rigorously evaluates MLLMs' generative comprehension across spatial and temporal tasks with 19,000 annotated questions.
  • Its sophisticated pipeline combines automated extraction with manual validation to ensure high-quality questions and objective model evaluations.
  • Evaluation reveals that while spatial understanding is robust, VideoLLMs often lag in temporal tasks, highlighting key areas for further research.

An Expert Review of SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

The development of robust evaluation frameworks for Multimodal LLMs (MLLMs) is imperative as these models increasingly extend their capabilities across numerous modalities. The paper "SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension" introduces a noteworthy benchmark called SEED-Bench, designed specifically to evaluate MLLMs with an emphasis on generative comprehension across both spatial and temporal understanding. This work is well-structured, outlining a new framework with significant contributions to the field.

Fundamentally, SEED-Bench is a robust benchmarking tool specifically designed to perform objective and comprehensive evaluations of MLLMs. The benchmark comprises 19,000 human-annotated multiple-choice questions spanning 12 diverse evaluation dimensions. These dimensions cover both spatial and temporal understanding, including scene comprehension, instance identity, visual reasoning, action recognition, and more. Such a scale (×6 larger than prior benchmarks) provides a more comprehensive testbed for evaluating the breadth and depth of models' capabilities.
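Because every question ships with a ground-truth option, a model can be scored objectively without any human or GPT judge. A minimal sketch of the likelihood-ranking idea used by benchmarks of this kind (where the option the model finds most probable is taken as its answer) might look like the following; `option_loss` is a hypothetical stand-in for a real MLLM's forward pass, not SEED-Bench's actual implementation:

```python
# Likelihood-based multiple-choice scoring: the option whose text the
# model assigns the lowest language-modeling loss is the prediction.
# `option_loss` is a dummy placeholder for a real MLLM forward pass.

def option_loss(question: str, option: str) -> float:
    # A real implementation would return the model's average per-token
    # negative log-likelihood of `option` given the image/video and question.
    return float(len(option))  # dummy scoring, for illustration only

def predict(question: str, options: dict[str, str]) -> str:
    # Pick the option key (e.g. "A".."D") with the lowest loss.
    return min(options, key=lambda k: option_loss(question, options[k]))

answers = {"A": "a dog", "B": "a cat sleeping",
           "C": "an empty room", "D": "two people talking"}
print(predict("What is shown in the image?", answers))
```

Comparing option likelihoods rather than parsing free-form generations is what makes the evaluation deterministic and reproducible across models with very different output styles.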

A key strength of the paper lies in the methodical construction of multiple-choice questions. SEED-Bench employs a sophisticated pipeline incorporating both automated processes and manual verification to generate and validate questions. This pipeline integrates foundational models to extract visual information and leverages advanced LLMs (e.g., ChatGPT/GPT-4) to generate and filter potential questions, ensuring they effectively evaluate the model's comprehension capabilities. The dual emphasis on human annotation and automatic filtering ensures high question quality and objective evaluations, a significant improvement over benchmarks relying heavily on subjective measures.
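One filtering step the paper describes is discarding questions that a text-only model can already answer without the visual input, since those do not test visual comprehension. A hedged sketch of that stage (the `text_only_answer` helper and record fields are hypothetical) could be:

```python
# Sketch of an automatic-filtering stage: drop generated questions that
# a text-only LLM answers correctly without seeing the image, since they
# do not require visual understanding. `text_only_answer` is hypothetical.

def text_only_answer(question: str, options: list[str]) -> int:
    # Placeholder for querying a text-only LLM; a real pipeline would
    # call an LLM API and parse the chosen option index.
    return 0  # dummy: always guesses the first option

def filter_questions(items):
    kept = []
    for q in items:
        guessed = text_only_answer(q["question"], q["options"])
        if guessed != q["answer_idx"]:
            # The LLM could not solve it blind -> likely needs the image.
            kept.append(q)
    return kept

pool = [
    {"question": "What color is the sky in the photo?",
     "options": ["blue", "red"], "answer_idx": 0},
    {"question": "How many people are visible?",
     "options": ["two", "five"], "answer_idx": 1},
]
print(len(filter_questions(pool)))  # keeps only the second item here
```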

When applied to 18 models, spanning LLMs, ImageLLMs, and VideoLLMs, SEED-Bench yields several notable observations. For instance, the BLIP-series models demonstrate robust performance on spatial understanding tasks, while, surprisingly, VideoLLMs often fail to outperform ImageLLMs on temporal tasks despite being trained on video data. Such insights underscore the complexity and multidisciplinary nature of MLLMs, highlighting where models are proficient and where further research is needed.
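Reporting results separately for each of the 12 dimensions is what surfaces contrasts like the spatial-versus-temporal gap above. A minimal per-dimension accuracy tally (record field names are illustrative assumptions, not SEED-Bench's schema) might be:

```python
# Aggregate per-dimension accuracy from a list of prediction records.
from collections import defaultdict

def accuracy_by_dimension(records):
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["pred"] == r["answer"])
    return {d: correct[d] / total[d] for d in total}

results = [
    {"dimension": "scene understanding", "pred": "A", "answer": "A"},
    {"dimension": "scene understanding", "pred": "B", "answer": "C"},
    {"dimension": "action recognition", "pred": "D", "answer": "D"},
]
print(accuracy_by_dimension(results))
# -> {'scene understanding': 0.5, 'action recognition': 1.0}
```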

The implications of SEED-Bench are twofold. Practically, it provides the research community with a reliable benchmark that offers a more detailed and nuanced evaluation of MLLMs across various tasks. Theoretically, the benchmark stimulates research into better understanding and improving the generative comprehension abilities of multimodal models. It presents a clear step towards quantifying model performance in a way that mirrors real-world applicability more accurately than many existing benchmarks.

Looking forward, SEED-Bench sets a high standard for the future development of benchmarks for MLLMs. It encourages the continued expansion of evaluation dimensions and datasets, emphasizing the need to continually adapt evaluation metrics to emerging model capabilities. The launch of a publicly maintained leaderboard further stimulates progress, providing a platform for researchers to track advancements and identify persisting challenges in multimodal AI research.

In conclusion, SEED-Bench represents a significant advancement in effectively measuring the generative comprehension abilities of MLLMs. By addressing past limitations in existing benchmarks and offering a comprehensive evaluative framework, it not only provides a valuable resource for current research but also establishes a groundwork for future exploration and development in the field of MLLMs.
