EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
Abstract: LLM-based image captioning can describe objects not explicitly observed in the training data; however, novel objects appear constantly, so open-world comprehension requires keeping object knowledge up to date. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an External Visual-name memory (EVCap). We build an ever-changing object-knowledge memory from objects' visuals and names, which lets us (i) update the memory at minimal cost and (ii) effortlessly augment LLMs with the retrieved object names using a lightweight, fast-to-train model. Our model, trained only on the COCO dataset, adapts to out-of-domain data without additional fine-tuning or re-training. Experiments on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, outperforms other methods built on frozen pre-trained LLMs, and its performance is competitive with specialist SOTAs that require extensive training.
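The core mechanism described above, a visual-name memory that can be updated without re-training and queried to fetch object names for the LLM prompt, can be illustrated with a minimal sketch. This is a hypothetical simplification: it uses toy 4-dimensional embeddings and brute-force cosine similarity with NumPy, whereas a real system would use a frozen image encoder and an approximate nearest-neighbor index. All class and function names here are illustrative, not from the paper's code.

```python
import numpy as np

class VisualNameMemory:
    """Toy external visual-name memory: object embeddings paired with names."""

    def __init__(self, dim=4):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.names = []

    def add(self, embedding, name):
        # Updating the memory is just appending a (visual, name) pair:
        # no re-training is needed, which is the point of the design.
        v = np.asarray(embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)  # L2-normalize so dot product = cosine similarity
        self.embeddings = np.vstack([self.embeddings, v[None, :]])
        self.names.append(name)

    def retrieve(self, query, k=2):
        # Rank stored objects by cosine similarity to the query embedding.
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.embeddings @ q
        top = np.argsort(-sims)[:k]
        return [self.names[i] for i in top]

def build_prompt(retrieved_names):
    # Retrieved object names are injected into the LLM prompt as hints.
    return ("Objects possibly in the image: "
            + ", ".join(retrieved_names)
            + ". Describe the image.")

# Populate the memory with (visual embedding, name) pairs.
memory = VisualNameMemory()
memory.add([1.0, 0.0, 0.0, 0.1], "zebra")
memory.add([0.0, 1.0, 0.0, 0.1], "rickshaw")
memory.add([0.9, 0.1, 0.0, 0.0], "horse")

# A query embedding close to "zebra" retrieves the relevant names.
names = memory.retrieve([1.0, 0.05, 0.0, 0.05], k=2)
print(build_prompt(names))
```

Because the memory is external, adding a newly encountered object is a single `add` call, in contrast to approaches that bake object knowledge into model weights and must be re-trained.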