
OmniBench: Towards The Future of Universal Omni-Language Models

Published 23 Sep 2024 in cs.CL, cs.AI, and cs.CV | (2409.15272v4)

Abstract: Recent advancements in multimodal LLMs (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define LLMs capable of such tri-modal processing as omni-LLMs (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at our repo (https://github.com/multimodal-art-projection/OmniBench).


Summary

  • The paper introduces OmniBench, a benchmark that rigorously evaluates multimodal language models on integrated visual, acoustic, and textual inputs.
  • It employs a multi-stage human annotation process to ensure that all responses require a combined understanding and detailed reasoning across modalities.
  • Experimental results show that current models struggle with simultaneous processing of all modalities, highlighting the need for advanced integration techniques.

An Analysis of "OmniBench: Towards The Future of Universal Omni-Language Models"

The paper "OmniBench: Towards The Future of Universal Omni-LLMs" by Yizhi Li et al. introduces a comprehensive benchmark, OmniBench, designed to evaluate multimodal LLMs' (MLLMs) capability to concurrently recognize, interpret, and reason across visual, acoustic, and textual inputs. The authors define models capable of such tri-modal processing as omni-LLMs (OLMs). This benchmark stands apart due to its high-quality human annotations, ensuring that accurate responses necessitate integrated understanding and reasoning across all three modalities.

Key Contributions

  1. Benchmark Design and Rigorous Evaluation: The OmniBench benchmark encompasses a diverse range of task types, progressing from fundamental perception (Object Identification) to complex inference (Contextual and Environmental). These tasks necessitate human-like cognitive abilities, such as temporal and logical order understanding, spatial awareness, entity recognition, symbolic processing, and quantitative reasoning. This taxonomy aims to test a wide spectrum of reasoning and cognitive abilities, providing a holistic assessment of MLLMs.
  2. Annotation and Quality Control: The authors employed a stringent annotation protocol, involving three stages: initial annotation, human inspection, and model inspection. This process ensured that all annotated instruction-response pairs required information from both image and audio components to be accurately answered. The annotations included detailed rationales for correct answers, explaining the specific information derived from each modality. This rigor in design and quality control underscores the challenge and depth of the benchmark.
  3. Experimental Results: The results demonstrated that existing open-source OLMs, such as the UnifiedIO2 series, show critical limitations in integrating tri-modal information. For instance, even the best-performing open-source OLMs processed visual and acoustic information separately and struggled to leverage increased model capacity effectively. The results also revealed a general bias towards speech data, suggesting the need for more balanced training paradigms in future research.
  4. Textual Approximation Experiments: To extend the evaluation framework, the authors conducted textual approximation experiments in which audio clips and images were replaced with text transcripts and captions, respectively. Vision-language models outperformed audio-language models in this setting, indicating a potential direction for developing more robust OLMs. These findings highlight the distinct challenges and opportunities in processing combined multimodal information.
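The evaluation described above ultimately reduces to multiple-choice accuracy over samples that pair an image, an audio clip, and a question. A minimal sketch of such a scoring loop follows; the sample schema and field names here are illustrative assumptions, not the benchmark's actual data format.

```python
# Hedged sketch: scoring a tri-modal multiple-choice benchmark.
# The sample fields (image/audio paths, question, options, answer index)
# are assumed for illustration only.

def score_samples(samples, predict):
    """Return overall accuracy, given predict(sample) -> chosen option index."""
    if not samples:
        return 0.0
    correct = sum(1 for s in samples if predict(s) == s["answer"])
    return correct / len(samples)

# Toy samples in the assumed schema.
samples = [
    {"image": "img_1.png", "audio": "clip_1.wav",
     "question": "What instrument is playing in the scene?",
     "options": ["piano", "guitar", "violin", "drums"], "answer": 0},
    {"image": "img_2.png", "audio": "clip_2.wav",
     "question": "How many speakers are audible?",
     "options": ["one", "two", "three", "four"], "answer": 2},
]

# A trivial baseline that ignores every modality and always picks option 0.
baseline = lambda s: 0
print(score_samples(samples, baseline))  # 0.5 on this toy set
```

A real evaluation harness would replace `baseline` with a call into an OLM that consumes the image, audio, and question together, which is exactly the joint processing the paper finds current open-source models lack.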

Implications and Future Directions

The findings from the OmniBench benchmark reveal substantial gaps in the current state of multimodal LLMs. Several important implications arise from this research:

  1. Need for Advanced Multimodal Integration: The performance of existing models indicates a significant need for developing new architectures and methodologies that can seamlessly integrate and reason across multiple modalities. Future research should focus on designing models that inherently understand and process multimodal inputs holistically rather than as separate entities.
  2. Balanced and Diverse Training Data: The observed bias towards speech data suggests that the current training datasets may not be adequately balanced across modalities. There is a need for more diverse and representative datasets that encompass a wide array of real-world scenarios involving visual, acoustic, and textual data.
  3. Future Benchmarks and Metrics: The comprehensive nature of OmniBench sets a new standard for evaluating multimodal models. Future benchmarks should continue to build on this foundation, incorporating more complex and nuanced tasks that mirror the real-world demands on AI systems.
  4. Toward Human-like Multimodal Understanding: The ultimate goal of OmniBench is to drive progress towards models that approach human-like understanding and reasoning with multimodal data. This paper underscores the importance of challenging benchmarks in accelerating the development of such advanced AI systems.

In conclusion, the introduction of OmniBench represents a significant step forward in the evaluation of multimodal LLMs. While current models exhibit notable limitations, OmniBench provides a critical tool for identifying areas of improvement and guiding future research efforts. The comprehensive and rigorous nature of OmniBench will undoubtedly play a pivotal role in the continued advancement of AI towards achieving true omni-understanding capabilities.
