Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
Abstract: Despite the remarkable image-comprehension abilities of large vision-language models (LVLMs), these models frequently generate plausible yet factually incorrect responses, a phenomenon known as hallucination. In the text-only setting, augmenting large language models (LLMs) with information retrieved from external knowledge resources has recently proven a promising way to mitigate hallucinations. However, retrieval augmentation for LVLMs significantly lags behind the widespread adoption of LVLMs themselves; moreover, when transferred naively to LVLMs, retrieval can sometimes even exacerbate hallucination. Motivated by this research gap and counter-intuitive phenomenon, we introduce a novel framework, the Active Retrieval-Augmented large vision-language model (ARA), specifically designed to address hallucinations along three critical dimensions: (i) dissecting retrieval targets according to the inherent hierarchical structure of images; (ii) pinpointing the most effective retrieval methods and filtering retrieval results to retain reliable ones; and (iii) timing retrieval to coincide with episodes of low generation certainty while skipping unnecessary retrieval during periods of high certainty. To assess the capability of ARA in reducing hallucination, we apply it to three widely used LVLMs (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks. Our empirical observations suggest that fitting retrieval mechanisms, invoked at judiciously chosen moments, effectively mitigate the hallucination problem. We hope this study provides deeper insight into how to adapt retrieval augmentation to LVLMs, reducing hallucinations with more effective retrieval and fewer retrieval calls.
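The abstract's third dimension, retrieving only when the model is uncertain, can be pictured with a minimal sketch of an uncertainty gate over per-token probabilities. This is an illustration only: the threshold value, the min-token-probability criterion, and the function names here are assumptions for exposition, not the paper's exact mechanism.

```python
import math
from typing import List

def should_retrieve(token_logprobs: List[float], threshold: float = 0.6) -> bool:
    """Decide whether to trigger retrieval for the current generation step.

    Gate on the least confident token in the recent span: if any token's
    probability falls below the (hypothetical) threshold, treat the model
    as uncertain and fetch external evidence before continuing to decode.
    """
    return any(math.exp(lp) < threshold for lp in token_logprobs)

# Demo with dummy per-token log-probabilities for two candidate spans.
confident = [-0.05, -0.10, -0.02]   # token probs ~0.95, 0.90, 0.98
uncertain = [-0.05, -1.20, -0.02]   # middle token prob ~0.30

print(should_retrieve(confident))   # False -> skip retrieval, keep decoding
print(should_retrieve(uncertain))   # True  -> retrieve evidence, then regenerate
```

In a decoding loop, such a gate would wrap each generation step: decode a span, check the gate, and only on a True result query the retriever and condition regeneration on the returned evidence, keeping retrieval calls to a minimum as the abstract advocates.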