Woodpecker: Hallucination Correction for Multimodal Large Language Models

Published 24 Oct 2023 in cs.CV, cs.AI, cs.CL, and cs.LG | (2310.16045v2)

Abstract: Hallucination is a big shadow hanging over the rapidly evolving Multimodal LLMs (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. In order to mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.

Summary

The paper introduces Woodpecker, a training-free framework that uses pre-trained models to correct hallucinations in MLLM outputs, improving POPE accuracy by up to 30.66% over the MiniGPT-4 baseline.
The method runs five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction, with bounding boxes attached to referenced objects for interpretability.
Evaluation on benchmarks such as POPE, MME, and LLaVA-QA90 demonstrates enhanced object and attribute-level accuracy in multimodal outputs.

"Woodpecker: Hallucination Correction for Multimodal LLMs" Analysis

Introduction

The paper "Woodpecker: Hallucination Correction for Multimodal LLMs" (2310.16045) addresses hallucination in Multimodal LLMs (MLLMs), where the generated text is inconsistent with the content of the input image. This limits how far MLLM output can be trusted in practice. Earlier approaches mainly rely on instruction tuning, which requires retraining the model on specific data. This paper instead introduces Woodpecker, a training-free method that corrects hallucinations after the text has been generated.

Methodology

Woodpecker is designed to be a training-free framework that leverages pre-trained models for hallucination correction across five distinct stages:

Key Concept Extraction: Identifies the main objects and concepts mentioned in the generated text, since these are the items most likely to be hallucinated. An LLM performs this extraction.
Question Formulation: Constructs questions around the extracted key concepts, targeting both object-level and attribute-level hallucinations. Questions are formulated to validate the existence, number, and attributes of objects mentioned in the text.
Visual Knowledge Validation: Utilizes expert models such as open-set object detectors and Visual Question Answering (VQA) systems to answer the formulated questions. These models provide information about object existence and attributes without the need for retraining.
Visual Claim Generation: Converts the validated question-answer pairs into a structured visual knowledge base. This base serves as a comprehensive repository for object-level and attribute-level claims about the image.
Hallucination Correction: An LLM uses the visual knowledge base to correct hallucinated content in the original text. Bounding boxes are attached to the referenced objects so that each correction can be checked against the image.
Figure 1: Framework of Woodpecker. Given an image and a query, an MLLM outputs the corresponding response. Through various steps, a visual knowledge base is created for hallucination correction.
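
The five stages above can be sketched as a small pipeline. This is a minimal illustration rather than the authors' implementation: the names `VisualKnowledgeBase`, `correct_response`, and every `toy_*` function are hypothetical, and the stage callables stand in for the LLM, open-set detector, and VQA model, which are not included here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class VisualKnowledgeBase:
    """Hypothetical container for validated facts about one image."""
    object_counts: Dict[str, int] = field(default_factory=dict)
    attribute_qa: List[Tuple[str, str]] = field(default_factory=list)

    def claims(self) -> List[str]:
        """Stage 4: turn validated answers into plain-text claims."""
        lines = []
        for name, count in self.object_counts.items():
            if count == 0:
                lines.append(f"There is no {name} in the image.")
            else:
                lines.append(f"Count of {name} in the image: {count}.")
        lines.extend(f"{q} {a}" for q, a in self.attribute_qa)
        return lines


def correct_response(
    image: Any,
    response: str,
    extract_concepts: Callable[[str], List[str]],
    formulate_questions: Callable[[str, List[str]], List[str]],
    count_objects: Callable[[Any, str], int],
    answer_question: Callable[[Any, str], str],
    rewrite: Callable[[str, VisualKnowledgeBase], str],
) -> Tuple[str, VisualKnowledgeBase]:
    """Run the five stages with caller-supplied (assumed) stage functions."""
    kb = VisualKnowledgeBase()
    concepts = extract_concepts(response)                # stage 1
    questions = formulate_questions(response, concepts)  # stage 2
    for concept in concepts:                             # stage 3, object level
        kb.object_counts[concept] = count_objects(image, concept)
    for question in questions:                           # stage 3, attribute level
        kb.attribute_qa.append((question, answer_question(image, question)))
    corrected = rewrite(response, kb)                    # stages 4 and 5
    return corrected, kb


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_extract(response: str) -> List[str]:
    return [w for w in ("dog", "frisbee") if w in response.lower()]


def toy_questions(response: str, concepts: List[str]) -> List[str]:
    return ["What color is the dog?"] if "dog" in concepts else []


def toy_count(image: Dict[str, int], concept: str) -> int:
    return image.get(concept, 0)  # the "image" is just a dict of object counts


def toy_vqa(image: Dict[str, int], question: str) -> str:
    return "Brown."


def toy_rewrite(response: str, kb: VisualKnowledgeBase) -> str:
    # Drop every sentence that mentions an object validated as absent.
    absent = [name for name, n in kb.object_counts.items() if n == 0]
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    kept = [s for s in sentences if not any(a in s.lower() for a in absent)]
    return ". ".join(kept) + "."


corrected, kb = correct_response(
    {"dog": 1},
    "A dog runs on the grass. A frisbee lies nearby.",
    toy_extract, toy_questions, toy_count, toy_vqa, toy_rewrite,
)
print(corrected)
print(kb.claims())
```

Passing the stages in as callables mirrors the training-free design: each expert model can be swapped without touching the pipeline, and the intermediate knowledge base stays available for inspection.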

Results

The Woodpecker framework was evaluated on multiple datasets including POPE, MME, and LLaVA-QA90 to determine its efficacy in correcting hallucinations.

POPE Benchmark: Across different settings (random, popular, and adversarial), the Woodpecker framework enhanced the accuracy significantly for baseline MLLMs. Notably, the accuracy improvements were 30.66% and 24.33% for MiniGPT-4 and mPLUG-Owl, respectively.
MME Benchmark: The framework was tested on both object-level and attribute-level hallucination. It improved scores in the existence and count categories, which indicates that it is effective at correcting object-level hallucinations.
Figure 2: Results on MME with different framework variants. Comparison of default models and those utilizing components of the Woodpecker framework.
LLaVA-QA90 Evaluation: Using a GPT-4V-based evaluation, the framework consistently improved the accuracy and detailedness of corrected responses. This evaluation underscored the framework’s ability to refine descriptions with enhanced precision and additional detail.
Figure 3: Illustration of GPT-4V-aided evaluation.
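
For the POPE accuracy figures quoted above, the following sketch shows how such scores can be computed, assuming POPE-style probing questions that each have a yes or no ground-truth answer. The function name `yes_no_metrics` and the sample answers are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List


def yes_no_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Score free-text yes/no answers against yes/no ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")

    # Assumption: an answer counts as "yes" if it starts with "yes".
    pred = [p.strip().lower().startswith("yes") for p in predictions]
    gold = [l.strip().lower() == "yes" for l in labels]

    tp = sum(p and g for p, g in zip(pred, gold))
    tn = sum((not p) and (not g) for p, g in zip(pred, gold))
    fp = sum(p and (not g) for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": sum(pred) / len(pred),  # share of "yes" answers
    }


# Made-up answers: one false positive out of four questions.
scores = yes_no_metrics(
    ["Yes, there is.", "yes", "No.", "yes"],
    ["yes", "no", "no", "yes"],
)
print(scores)
```

Computing the same metrics on the responses before and after correction gives the kind of accuracy difference the benchmark comparison reports.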

Implications and Future Work

Woodpecker addresses hallucinations without expensive retraining by relying on existing pre-trained models. This matters in practice: the same correction pipeline can be applied to different MLLMs, and its intermediate outputs make the corrections more reliable and easier to interpret.

Future directions of this work could explore the expansion of the framework to handle broader ranges of hallucinations across more diverse datasets. Additionally, enhancing the capacity of the system to interpret more complex interactions and relationships within images can further bolster the practical utility of MLLMs.

Conclusion

"Woodpecker: Hallucination Correction for Multimodal LLMs" presents a compelling approach to tackling hallucinations in generated text descriptions from MLLMs. The proposed framework combines interpretability with practicality by utilizing existing models and avoiding retraining. Substantial improvements across multiple datasets affirm Woodpecker’s potential as an invaluable tool for refining multimodal outputs and advancing the state of MLLM reliability.
