Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

Published 11 May 2025 in cs.CV and cs.LG (arXiv:2505.07001v2)

Abstract: Vision-language models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLMs is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.

Summary

This paper addresses the critical issue of hallucination in Vision-Language Models (VLMs) applied to medical data, specifically gastrointestinal (GI) image analysis. Hallucination, in this context, refers to VLM-generated descriptions that do not align with the actual visual content of medical images, a failure mode with potentially serious clinical consequences. To investigate this phenomenon and provide a platform for future research, the authors introduce Gut-VLM, a curated multimodal image-text dataset annotated at the sentence level for hallucination study.

Gut-VLM is constructed through a two-stage process. First, descriptive medical reports for Kvasir-v2 images are generated with ChatGPT, which, while effective, introduces some degree of hallucination. These preliminary reports then undergo expert validation, in which gastroenterologists identify and rectify inaccuracies, flagging hallucinated sentences and recording the corresponding corrections. The result is a dataset that offers both descriptive diagnostic reports and explicit annotations of VLM-induced hallucinations, facilitating research aimed at improving the reliability of VLMs in medical contexts.
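To make this concrete, here is a minimal Python sketch of what a record in such a dataset could look like, with per-sentence hallucination flags and expert corrections. The class names, fields, and example content are illustrative assumptions made for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SentenceAnnotation:
    """One sentence of the ChatGPT-generated report, as reviewed by an expert."""
    text: str                      # sentence from the stage-1 report
    hallucinated: bool             # expert flag: does it contradict the image?
    correction: str | None = None  # expert rewrite, if the sentence was flagged

@dataclass
class GutVLMRecord:
    """Hypothetical record: one Kvasir-v2 image plus its reviewed report."""
    image_id: str
    sentences: list[SentenceAnnotation] = field(default_factory=list)

    def corrected_report(self) -> str:
        """Assemble the expert-verified report by substituting flagged sentences."""
        return " ".join(
            s.correction if (s.hallucinated and s.correction) else s.text
            for s in self.sentences
        )

# Illustrative example only; the content is invented, not taken from Gut-VLM.
record = GutVLMRecord(
    image_id="kvasir_polyp_0001",
    sentences=[
        SentenceAnnotation("A sessile polyp is visible in the lumen.", False),
        SentenceAnnotation("Active bleeding is observed.", True,
                           "No active bleeding is observed."),
    ],
)
print(record.corrected_report())
```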

A noteworthy aspect of this research is the introduction of the hallucination-aware finetuning strategy. Unlike traditional finetuning methods that focus solely on generating correct textual reports from images, this strategy trains VLMs to specifically detect and correct hallucinated information. This methodology arguably mirrors human learning processes, wherein active correction leads to a deeper understanding rather than mere passive recapitulation of data.
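A rough sketch of how the two finetuning targets could be built from such annotations is shown below; it reuses the hypothetical GutVLMRecord from the previous sketch. The prompt wording, the helper names, and the [OK]/[HALLUCINATED] target tags are assumptions for illustration only; the paper's actual prompt and target formats may differ.

```python
def build_standard_example(record):
    """Conventional finetuning: image + instruction -> expert-verified report."""
    prompt = "Describe the endoscopic findings in this image."
    return {"image_id": record.image_id,
            "prompt": prompt,
            "target": record.corrected_report()}

def build_hallucination_aware_example(record):
    """Hallucination-aware finetuning: the model is shown the (possibly flawed)
    generated report and must flag and correct hallucinated sentences."""
    draft = " ".join(s.text for s in record.sentences)
    prompt = ("Review the following report against the image. "
              "Mark each hallucinated sentence and provide a correction.\n"
              f"Report: {draft}")
    target_lines = [
        f"[HALLUCINATED] {s.text} -> {s.correction}" if s.hallucinated
        else f"[OK] {s.text}"
        for s in record.sentences
    ]
    return {"image_id": record.image_id,
            "prompt": prompt,
            "target": "\n".join(target_lines)}
```

The intuition behind the second setup is that keeping the draft report in the prompt asks the model to verify claims against the image rather than regenerate a description from scratch.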

The study evaluates several state-of-the-art VLMs, including LLaVA-1.6-7B, Qwen2-7B, mPLUG-Owl-2B, and DeepSeek-7B-VL, on the Gut-VLM dataset. The models undergo both standard finetuning and the proposed hallucination-aware finetuning. Results indicate that the latter substantially improves the models' ability to produce accurate and clinically relevant outputs. For instance, LLaVA-1.6-7B finetuned with the hallucination-aware strategy achieved a Question Answering Accuracy Score (QAAS) of 90.89%, compared with 83.07% under standard finetuning.

Furthermore, the authors propose novel evaluation metrics, namely R-Sim for semantic similarity and QAAS for accuracy in addressing diagnostic questions. These metrics serve to comprehensively assess the VLM outputs against expert-verified ground truths, ensuring more meaningful performance measurement in medical applications.
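The sketch below shows one plausible way to compute such scores: embedding-based cosine similarity between generated and expert-verified reports as a stand-in for R-Sim, and exact-match scoring over diagnostic question answers as a stand-in for QAAS. The paper's exact metric definitions may differ (for instance, QAAS may rely on an LLM-based judge rather than exact match), so treat this as an approximation, not the authors' implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works for illustration; this choice is arbitrary.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def report_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between embeddings of the generated and expert reports."""
    emb = _encoder.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def qa_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of diagnostic questions answered correctly (exact match)."""
    assert len(predicted) == len(reference)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predicted, reference))
    return correct / len(reference)
```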

While the paper provides a substantial contribution to the field, it acknowledges limitations such as dataset bias towards ChatGPT's output structure and the granularity of hallucination annotations. Future work may focus on expanding the dataset with diverse VLM outputs and finer-grained annotation techniques. Additionally, exploring other strategies to mitigate hallucination, such as uncertainty estimation and improved feature integration, remains an open avenue for research.

Overall, the paper's findings underscore the potential for hallucination-aware training strategies to significantly enhance the applicability and trustworthiness of VLMs in medical diagnostics, particularly in the critical area of gastrointestinal image analysis. The Gut-VLM dataset and methodologies outlined here could also serve as a foundation for broader applications in medical AI, paving the way for more robust, reliable systems in clinical settings.
