
LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

Published 30 Jun 2025 in cs.CV (arXiv:2506.23502v2)

Abstract: Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. However, due to its image-level vision-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by LLMs. Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.

Summary

  • The paper introduces an LLM-enhanced method that integrates action triplet prompts with CLIP to capture fine-grained action semantics for image-text matching.
  • Multi-modal prompt tuning and an adaptive interaction module refine visual-text alignment, demonstrating significant improvements in retrieval metrics on COCO and Flickr30K.
  • Experimental results showcase enhanced R@1, R@5, and R@10 performance, validating the method's capability to mitigate action-level mismatches in complex scenarios.

Introduction

The paper "LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching" presents a novel approach to image-text matching that leverages the capabilities of LLMs alongside CLIP. The method addresses a key limitation of standard CLIP models: their inability to grasp the fine-grained action-level semantics crucial for aligning images and text. The authors integrate action-related knowledge derived from LLMs into the CLIP framework via multi-modal prompt tuning, enriching the semantic understanding of actions and improving image-text matching performance.

Figure 1: Image-to-Text Matching.

Methodology

The proposed method enhances CLIP by introducing LLM-generated action triplets and state awareness, which are then embedded into the multi-modal framework through designed prompts. The process involves several critical steps:

  1. Action Knowledge Generation: Utilizing the in-context learning capabilities of GPT-3.5, the system generates action triplets and state descriptions. These are used to construct prompts that represent fine-grained action semantics implicitly stored within LLMs. Each prompt is infused with detailed compositional and causal information about object actions.
  2. Multi-modal Prompt Tuning: The methodology injects action triplet prompts and action state prompts into the image encoder. These prompts are tailored to guide the encoder to focus on significant action cues within the visual content, effectively bridging the gap between high-level textual instructions and the visual representations drawn by CLIP.
  3. Adaptive Interaction Module: This module enhances feature extraction by attending only to salient action cues relevant to the visual content, reducing interference from noise and irrelevant information. It enables adaptive alignment between visual and textual representations conditioned on action-aware prompted knowledge.

    Figure 2: Overview of the proposed method.
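The first step, action knowledge generation, can be sketched roughly as follows. The exact prompt wording is not published in this summary, so the few-shot template and the triplet format below are assumptions; only the overall pattern (in-context examples, then a query caption, then parsing the LLM reply into (subject, action, object) triplets) follows the paper's description.

```python
# Sketch of in-context triplet extraction with an LLM. The FEW_SHOT template
# and the parenthesized triplet format are hypothetical; the paper uses
# GPT-3.5 but does not publish its exact prompt here.

FEW_SHOT = (
    "Extract (subject, action, object) triplets from the caption.\n"
    "Caption: A man is riding a horse on the beach.\n"
    "Triplets: (man, riding, horse)\n"
)

def build_prompt(caption: str) -> str:
    """Compose the in-context prompt sent to the LLM for one caption."""
    return f"{FEW_SHOT}Caption: {caption}\nTriplets:"

def parse_triplets(reply: str):
    """Parse an LLM reply like '(dog, chasing, ball)' into 3-tuples."""
    triplets = []
    for chunk in reply.split(")"):
        chunk = chunk.strip().lstrip("(").strip()
        if not chunk:
            continue
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets
```

The parsed triplets would then be verbalized into the action triplet and action state prompts that condition the encoders.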

Experimental Results

Experiments conducted on the COCO and Flickr30K datasets demonstrate the method's efficacy. The proposed system outperforms existing methods on standard benchmarks by significantly improving retrieval accuracy. The metrics reveal robust performance in both image-to-text and text-to-image retrieval tasks, attributed to the enhanced action-aware patterns captured in the learned embeddings.

  • Quantitative Analysis: The results show a notable increase in retrieval accuracy across various backbone architectures. The model improves R@1, R@5, and R@10, reflecting greater retrieval robustness from incorporating action-specific knowledge. These gains highlight the value of integrating LLM-derived knowledge with vision-language models.

    Figure 3: The statistical analysis of inconsistent actions between the query and the candidate in image-text matching using CLIP on the Flickr30K test set.

  • Qualitative Analysis: Visualization of retrieval examples indicates that the method effectively captures and discriminates action contexts, correcting mismatches prevalent in traditional CLIP approaches. The qualitative assessments underline the model's capacity to discern fine-grained action semantics, especially in complex scenarios.
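For reference, the R@K metrics reported above have a simple definition: given a similarity matrix between queries and candidates, R@K is the fraction of queries whose ground-truth match appears among the top-K scored candidates. A minimal sketch (assuming, for illustration, that candidate i is the true match for query i):

```python
# Recall@K for retrieval: sim[i][j] is the similarity between query i and
# candidate j, with candidate i assumed to be the ground-truth match for
# query i (a simplifying convention for this sketch).

def recall_at_k(sim, k):
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

sim = [
    [0.9, 0.2, 0.1],  # query 0: true candidate ranked 1st
    [0.8, 0.3, 0.4],  # query 1: true candidate ranked 3rd
    [0.1, 0.2, 0.7],  # query 2: true candidate ranked 1st
]
```

With this toy matrix, R@1 is 2/3 and R@3 is 1.0; benchmark evaluation applies the same computation in both retrieval directions.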

Implementation Considerations

Implementing this method requires access to powerful LLMs like GPT-3.5 or its successors for generating action-related prompts. The integration into existing CLIP architectures demands modifications in prompt handling and interpretation to accommodate action-specific data. Additionally, the adaptive interaction module necessitates careful tuning of hyperparameters to balance between action-related and general visual information.
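To make the adaptive interaction idea concrete, the following is a simplified, single-head sketch, not the authors' code: action-aware prompt vectors attend over visual patch features via scaled dot-product attention, so the aggregated representation is weighted toward action-relevant regions. The dimensions, the single-head formulation, and the function names are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_interaction(prompts, patches):
    """prompts: (P, d) action-aware prompt embeddings (queries).
    patches: (N, d) visual patch features (keys/values).
    Returns (P, d): per-prompt attentive aggregation of patch features."""
    d = patches.shape[-1]
    attn = softmax(prompts @ patches.T / np.sqrt(d), axis=-1)  # (P, N)
    return attn @ patches  # weighted sum over patches

rng = np.random.default_rng(0)
prompts = rng.standard_normal((4, 8))   # 4 action-aware prompts, dim 8
patches = rng.standard_normal((16, 8))  # 16 patch features, dim 8
out = adaptive_interaction(prompts, patches)
```

In the full method, the output would be fused with CLIP's visual representation before computing image-text similarity; the attention weights are what suppress patches irrelevant to the prompted actions.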

Future Directions

The integration of LLMs into vision-language tasks opens avenues for further exploration of context-specific interactions and learning paradigms. Extending this approach to more dynamic datasets involving temporal actions in videos could leverage action-aware perceptions for broader applications including video-text matching and action recognition. Moreover, improving LLM prompt designs could further bridge semantic gaps in other AI-driven multi-modal tasks.

Conclusion

The paper contributes a significant advancement in image-text matching by leveraging the strengths of LLM-enhanced prompting techniques. By addressing the limitations of traditional image-text encoders in action perception, this research paves the way for more sophisticated integration of LLMs with visual understanding systems. This enhancement promises improvements in various applications within the multi-modal domain, utilizing fine-grained action knowledge to better contextualize and align images with textual descriptions.
