EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
Abstract: LLM-based image captioning can describe objects not explicitly observed in the training data; however, novel objects appear constantly, so open-world comprehension requires keeping object knowledge up to date. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an External Visual-name memory (EVCap). We build an ever-changing object-knowledge memory from objects' visuals and names, which lets us (i) update the memory at minimal cost and (ii) effortlessly augment LLMs with the retrieved object names using a lightweight, fast-to-train model. Our model, trained only on the COCO dataset, adapts to out-of-domain data without additional fine-tuning or re-training. Experiments on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, outperforms other methods built on frozen pre-trained LLMs, and its performance is competitive with specialist SOTAs that require extensive training.
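The core mechanism described above, a visual-name memory that can be updated without re-training and queried to fetch object names for the LLM prompt, can be illustrated with a minimal sketch. This is a hypothetical simplification: it uses toy 4-dimensional embeddings and brute-force cosine similarity with NumPy, whereas a real system would use a frozen image encoder and an approximate nearest-neighbor index. All class and function names here are illustrative, not from the paper's code.

```python
import numpy as np

class VisualNameMemory:
    """Toy external visual-name memory: object embeddings paired with names."""

    def __init__(self, dim=4):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.names = []

    def add(self, embedding, name):
        # Updating the memory is just appending a (visual, name) pair:
        # no re-training is needed, which is the point of the design.
        v = np.asarray(embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)  # L2-normalize so dot product = cosine similarity
        self.embeddings = np.vstack([self.embeddings, v[None, :]])
        self.names.append(name)

    def retrieve(self, query, k=2):
        # Rank stored objects by cosine similarity to the query embedding.
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.embeddings @ q
        top = np.argsort(-sims)[:k]
        return [self.names[i] for i in top]

def build_prompt(retrieved_names):
    # Retrieved object names are injected into the LLM prompt as hints.
    return ("Objects possibly in the image: "
            + ", ".join(retrieved_names)
            + ". Describe the image.")

# Populate the memory with (visual embedding, name) pairs.
memory = VisualNameMemory()
memory.add([1.0, 0.0, 0.0, 0.1], "zebra")
memory.add([0.0, 1.0, 0.0, 0.1], "rickshaw")
memory.add([0.9, 0.1, 0.0, 0.0], "horse")

# A query embedding close to "zebra" retrieves the relevant names.
names = memory.retrieve([1.0, 0.05, 0.0, 0.05], k=2)
print(build_prompt(names))
```

Because the memory is external, adding a newly encountered object is a single `add` call, in contrast to approaches that bake object knowledge into model weights and must be re-trained.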