Papers
Topics
Authors
Recent
Search
2000 character limit reached

Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Published 16 Dec 2024 in cs.CV | (2412.11396v1)

Abstract: Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-LLMs (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP's ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.