
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Published 23 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.MM | arXiv:2404.15406v2

Abstract: Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. While much ongoing research targets novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the ability to answer questions that require external knowledge. Our approach, termed Wiki-LLaVA, integrates an external knowledge source of multimodal documents, accessed through a hierarchical retrieval pipeline. Relevant passages are retrieved from this source and employed as additional context for the LLM, improving the effectiveness and precision of the generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
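The abstract describes a two-stage ("hierarchical") retrieval pipeline: first retrieve the most relevant documents from the external knowledge base given the input image, then rank passages within those documents against the question, and finally prepend the selected passages to the multimodal LLM's prompt. The sketch below is a minimal, hypothetical illustration of that flow, not the paper's implementation: `embed_image` and `embed_text` are random stand-ins for the encoders one would actually use (e.g., a CLIP image/text encoder for document retrieval and a text retriever such as Contriever for passage ranking), and the toy knowledge base is invented for the example.

```python
# Hypothetical sketch of a hierarchical retrieval-augmented pipeline in the
# spirit of Wiki-LLaVA. Stage 1: image -> top-k documents. Stage 2: question
# -> top passages within those documents. The embeddings are random stand-ins
# so the sketch runs on its own; retrieval scores here are meaningless.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding dimensionality (arbitrary for the sketch)

def embed_image(image_path: str) -> np.ndarray:
    # Stand-in for an image encoder (e.g., CLIP); returns a unit-norm vector.
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    # Stand-in for a text encoder (e.g., CLIP text tower or Contriever).
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def top_k(query: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    # Cosine similarity (all rows are unit-norm); highest-scoring k indices.
    scores = matrix @ query
    return np.argsort(scores)[::-1][:k]

# Toy knowledge base: each document has a title (embedded for stage 1)
# and a list of passages (re-ranked in stage 2).
knowledge_base = [
    {"title": "Golden Gate Bridge",
     "passages": ["The bridge opened in 1937.",
                  "Its main span is 1,280 m long."]},
    {"title": "Eiffel Tower",
     "passages": ["Completed in 1889 for the World's Fair.",
                  "It is 330 m tall."]},
]
doc_embs = np.stack([embed_text(d["title"]) for d in knowledge_base])

def retrieve_context(image_path: str, question: str,
                     k_docs: int = 1, k_passages: int = 2) -> str:
    # Stage 1: retrieve whole documents by image-to-document similarity.
    img_emb = embed_image(image_path)
    doc_ids = top_k(img_emb, doc_embs, k_docs)
    # Stage 2: rank passages inside the retrieved documents by the question.
    passages = [p for i in doc_ids for p in knowledge_base[i]["passages"]]
    pas_embs = np.stack([embed_text(p) for p in passages])
    q_emb = embed_text(question)
    best = top_k(q_emb, pas_embs, k_passages)
    return "\n".join(passages[i] for i in best)

# The retrieved passages are prepended to the MLLM prompt as extra context.
question = "When did this bridge open?"
context = retrieve_context("photo.jpg", question)
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

In the actual system, the stand-in encoders would be replaced with pretrained models and the toy knowledge base with an indexed collection of Wikipedia documents; the two-stage design keeps the expensive passage-level ranking restricted to the few documents that match the image.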
