Hallucination of Multimodal Large Language Models: A Survey

Published 29 Apr 2024 in cs.CV | (2404.18930v2)

Abstract: This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal LLMs (MLLMs), also known as Large Vision-LLMs (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: https://github.com/showlab/Awesome-MLLM-Hallucination.

Abstract PDF Upgrade to Chat

Citations (75)

View on Semantic Scholar

Summary

The paper outlines the primary causes of hallucination in multimodal LLMs, emphasizing data quality, model architecture, and inference challenges.
The paper details evaluation metrics and benchmarks such as CHAIR, POPE, AMBER, and HallusionBench to quantify hallucination effectively.
The paper discusses mitigation strategies including data enhancement, improved cross-modal alignment, and inference adjustments to ensure robust outputs.

Hallucination of Multimodal LLMs: A Survey

Introduction

The phenomenon of hallucination in multimodal LLMs (MLLMs), particularly in Large Vision-LLMs (LVLMs), has emerged as a significant concern, impacting their deployment and reliability in real-world applications. Despite their advanced capabilities in tasks such as image captioning and visual question answering, MLLMs exhibit a propensity to generate outputs misaligned with the provided visual content, termed hallucination. This paper provides an extensive survey of recent advancements in identifying, evaluating, and mitigating hallucinations in MLLMs to enhance their robustness and reliability.

Defining Hallucination in MLLMs

Hallucination in MLLMs refers to the discrepancy between text generated by the model and the visual content it is supposed to describe. This phenomenon can manifest as factuality hallucination, which involves discrepancies between the model output and real-world facts, or as faithfulness hallucination, where the output diverges from user instructions or context. In MLLMs, the focus is primarily on cross-modal inconsistencies, notably object hallucinations, which are further categorized into incorrect identification of object categories, attributes, and relations.

Figure 1: Three types of typical hallucination.

Causes of Hallucination

Hallucination in MLLMs stems from various factors during data acquisition, model training, and inference:

Data-Related Causes: Hallucinations can arise due to insufficient data, low-quality data which may include noise, and a lack of diversity in training datasets, contributing to biases. The frequent co-occurrence of certain objects within datasets also exacerbates hallucination issues.
Model Architecture and Training: The architecture of MLLMs, often built with separate pre-trained vision and LLMs linked by a cross-modal interface, can lead to hallucinations if the interface alignment is inadequate. Training strategies such as auto-regressive next token prediction, commonly used in LLMs, are insufficient for MLLMs due to their complex spatial data structures.
Inference Issues: During inference, the dilution of visual attention over long sequence generation can result in outputs that are less grounded in visual reality, contributing to hallucination.

Evaluation Metrics and Benchmarks

Assessing hallucination requires meticulous metrics and benchmarks. These include CHAIR for measuring hallucinations in captions and POPE which evaluates object hallucination in a binary manner. The use of LLMs like GPT-4 for assessing generated responses is also common, though it involves substantial computational costs. Newer benchmarks like AMBER and HallusionBench offer more nuanced evaluations by integrating vision-based and textual assessments.

Mitigation Strategies

To address hallucinations, diverse strategies are employed:

Data-Centric Mitigations: Enhancing data quality and diversity through counterfactual data generation and noise reduction processes are crucial approaches.
Model Enhancements: Improving cross-modal alignment through enhanced model architectures and scaling up the visual resolution of encoders help reduce hallucinations.
Training Adjustments: Incorporating additional learning objectives such as auxiliary supervision and applying reinforcement learning techniques offer pathways to reduce hallucination.
Inference Adjustments: Techniques such as contrastive decoding and guided sentence generation refine output generations to improve consistency with visual inputs.

Future Directions

Future research directions include improving data acquisition strategies, enhancing model architectures for better cross-modal alignment, and developing standardized benchmarks for consistent evaluation. Moreover, leveraging hallucination capabilities tactically as a feature rather than strictly a flaw, could uncover creative applications for MLLMs beyond current use cases.

Conclusion

This survey underscores the challenges hallucination poses to the deployment and trustworthiness of MLLMs. By exploring its causes, evaluation methods, and mitigation strategies, we aim to inspire further research for more reliable and robust multimodal systems. Such advances would significantly contribute to the practical and ethical deployment of these sophisticated AI models in real-world scenarios.