HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Published 23 Oct 2023 in cs.CV and cs.CL | (2310.14566v5)

Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

Summary

The paper introduces HallusionBench, a diagnostic benchmark with 346 images and 1129 questions designed to analyze LVLMs' hallucination and visual illusion errors.
It categorizes questions into visual-dependent and visual-supplement types, offering controlled experiments to assess logical consistency and reasoning capabilities.
Empirical results show that GPT-4V reaches only 31.42% question-pair accuracy, and they highlight broader limitations in balancing language priors with visual inputs, pointing to key areas for improvement.

An Evaluation of HallusionBench: Diagnosing Hallucination and Illusion in Vision-Language Models

The paper presents HallusionBench, a diagnostic benchmark suite for evaluating how well large vision-language models (LVLMs) perform image-context reasoning. Its stated aim is to quantify the language hallucination and visual illusion failures of LVLMs, particularly models built on strong LLMs. The benchmark is challenging even for state-of-the-art models such as GPT-4V and Claude 3, because it probes logical consistency and contextual reasoning rather than only aggregate accuracy.

Structure and Content of HallusionBench

HallusionBench consists of 346 images paired with 1129 questions, all crafted by human experts. The questions are organized into visual question (VQ) pairs that act as control groups, so that a model's answers to related questions can be compared directly. This structure permits a granular investigation of specific failure modes and response tendencies, and it makes logical consistency and robustness measurable.

The paper evaluates two question types separately: visual-dependent questions, which cannot be answered logically without the visual context, and visual-supplement questions, in which the image supplies supplementary information to an otherwise abstract query. This two-part structure supports a multi-faceted analysis of how LVLMs process visual data, especially when a language bias could overshadow the visual input.
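The two question types and the control-group structure can be illustrated with a small data model. This is only a minimal sketch: the field names (`pair_id`, `category`, `image`), the yes/no answer format, and the example questions are all hypothetical and are not taken from the benchmark's released files.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VisualQuestion:
    pair_id: str           # hypothetical key linking the questions of one control group
    category: str          # "visual_dependent" or "visual_supplement"
    image: Optional[str]   # hypothetical image path; None when no image is shown
    question: str
    answer: bool           # ground-truth yes/no label (assumed format)


def group_by_pair(questions: List[VisualQuestion]) -> Dict[str, List[VisualQuestion]]:
    """Collect questions that share a pair_id so they can be scored together."""
    groups: Dict[str, List[VisualQuestion]] = defaultdict(list)
    for q in questions:
        groups[q.pair_id].append(q)
    return dict(groups)


# Invented examples, for illustration only.
questions = [
    VisualQuestion("vd-001", "visual_dependent", "lines_original.png",
                   "Are the two lines the same length?", True),
    VisualQuestion("vd-001", "visual_dependent", "lines_edited.png",
                   "Are the two lines the same length?", False),
    VisualQuestion("vs-001", "visual_supplement", None,
                   "Is the chart's largest value above 50?", True),
    VisualQuestion("vs-001", "visual_supplement", "chart_edited.png",
                   "Is the chart's largest value above 50?", False),
]

groups = group_by_pair(questions)
for pair_id, members in groups.items():
    print(pair_id, [m.image for m in members])
```

Grouping by a shared key is what allows the same question to be compared across a control image and its variant, which is the kind of analysis the paired structure is meant to enable.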

Major Findings and Model Performance

The empirical evaluation covers 15 models. GPT-4V, the strongest of them, achieved a question-pair accuracy of only 31.42%, which shows both its lead over other models and its limits on complex reasoning tasks. All other evaluated models scored below 16%, leaving significant room for improvement in mitigating hallucination errors.
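To make the question-pair metric concrete, the sketch below contrasts it with plain per-question accuracy. It assumes that a pair counts as correct only when every question in that pair is answered correctly; this reading, the function names, and the toy results are assumptions for illustration, not the paper's evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each result is (pair_id, is_correct); pair_id is a hypothetical grouping key.
Result = Tuple[str, bool]


def question_accuracy(results: List[Result]) -> float:
    """Fraction of individual questions answered correctly."""
    if not results:
        return 0.0
    return sum(ok for _, ok in results) / len(results)


def question_pair_accuracy(results: List[Result]) -> float:
    """Fraction of pairs in which all questions were answered correctly (assumed definition)."""
    by_pair: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, ok in results:
        by_pair[pair_id].append(ok)
    if not by_pair:
        return 0.0
    return sum(all(oks) for oks in by_pair.values()) / len(by_pair)


# Toy data: pair A fully correct, pair B half correct, pair C fully wrong.
toy_results = [
    ("A", True), ("A", True),
    ("B", True), ("B", False),
    ("C", False), ("C", False),
]

print(f"per-question accuracy: {question_accuracy(toy_results):.2f}")
print(f"question-pair accuracy: {question_pair_accuracy(toy_results):.2f}")
```

Under this assumed definition the pair-level score can never exceed the per-question score, which is one reason a pair-level figure such as 31.42% can look low next to conventional accuracy numbers.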

The study further shows that LVLMs, while proficient in some aspects of visual understanding, are prone to two main failure modes: visual illusion, where the model misinterprets the image itself, and language hallucination, where the model's language biases conflict with and override the visual input. Both are especially visible in the controlled settings of HallusionBench, which expose how these biases affect accuracy, logical consistency, and robustness across diverse visual modalities.

Implications and Future Directions

A significant implication stemming from this research is the necessity for enhanced training datasets and methodologies that prioritize the balance between language and visual understanding. Developing models with robust visual validation techniques to counteract hallucinatory tendencies could mitigate these issues. By examining identified failure modes, we can formulate strategies for targeted improvements in model architectures to optimize the handling of nuanced contexts.

HallusionBench sets a new benchmark standard for evaluating LVLMs, pushing future research to explore improved approaches that account for the intricacies highlighted by this paper. Potential development might focus on refining models to adeptly balance parametric memory with real-time visual inputs, honing the capacity for temporal reasoning and context-sensitive understanding. The work encourages the continued evolution of benchmark suites that can adapt and challenge models in increasingly complex scenarios, ultimately improving the next generation of LVLMs.

In summary, HallusionBench is a diagnostic tool for understanding the limitations of current vision-language models, and it can inform new strategies for model training and architecture refinement. Through its case studies and metric analyses, the paper offers a detailed view of the open challenges in vision-language integration.
