
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Published 2 Dec 2024 in cs.CV and cs.AI | (2412.01818v2)

Abstract: Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the LLM. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the LLM and find that this score is not an ideal indicator for token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.
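The two-stage procedure described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the released code: the function name, the similarity threshold, and the score interface are all hypothetical stand-ins for the paper's actual design.

```python
import numpy as np

def visual_token_prune(tokens, attn_scores, n_keep, n_important, sim_thresh=0.9):
    """Two-stage pruning in the spirit of the abstract (illustrative sketch).

    tokens:      (N, D) visual token features
    attn_scores: (N,)   importance score per token (e.g. visual attention)
    """
    # Stage 1: keep the n_important highest-scoring tokens outright.
    order = np.argsort(-attn_scores)
    important = order[:n_important]

    # Stage 2: scan the remaining tokens in score order and keep only those
    # not too similar (cosine) to anything already kept, so the final set
    # stays diverse and preserves more of the image's visual information.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = list(important)
    for idx in order[n_important:]:
        if len(kept) == n_keep:
            break
        if (normed[kept] @ normed[idx]).max() < sim_thresh:
            kept.append(idx)

    kept = np.sort(np.array(kept))          # restore image order
    return tokens[kept], kept
```

The diversity pass is what distinguishes this from plain top-k selection: near-duplicate background patches are dropped in favor of tokens that add new information.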

Summary

  • The paper introduces FasterVLM, a training-free method that leverages [CLS]-based attention to prune up to 95% of visual tokens with minimal performance loss.
  • It addresses attention shift and dispersion issues, reducing computational overhead and maintaining nearly 90% of original performance.
  • Empirical tests on VLMs like LLaVA demonstrate faster inference and robust performance across multiple benchmarks.

Analysis of "Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster"

The paper addresses a central challenge in the efficient inference of large vision-language models (VLMs). Such models, which pair a visual encoder like CLIP with an LLM such as Vicuna (the combination used in LLaVA), often encounter computational bottlenecks because images expand into long sequences of visual tokens. This work introduces a method termed FasterVLM (renamed VisPruner in the paper's latest revision), which markedly improves visual token pruning to accelerate VLM inference.

Central Contribution and Findings

The primary contention of the paper is that existing methods, which rank visual tokens by their text-visual cross-attention scores inside the LLM, rely on a signal that is poorly aligned with the tokens' actual importance. This misalignment manifests as two phenomena: attention shift and attention dispersion. Attention shift is a positional bias in which textual attention concentrates on visual tokens late in the sequence, overlooking critical information carried by earlier tokens. Attention dispersion is a lack of concentration, with attention spread diffusely across many visual tokens rather than focused on the informative ones. FasterVLM counteracts both problems by instead using the cross-attention between the [CLS] token and the image patch tokens inside the visual encoder.
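The two failure modes can be made concrete with simple summary statistics over one text token's attention distribution across the visual tokens. These metrics are illustrative, not the paper's exact definitions:

```python
import numpy as np

def attention_diagnostics(attn):
    """Illustrative shift/dispersion metrics for a text token's attention
    over N visual tokens (in image order, assumed to sum to 1)."""
    n = len(attn)
    positions = np.arange(n) / (n - 1)
    # Attention shift: centroid of the attention mass; values above 0.5
    # mean the text token leans toward visual tokens late in the sequence.
    shift = float(positions @ attn)
    # Attention dispersion: normalized entropy; 1.0 means attention is
    # spread uniformly rather than concentrated on a few tokens.
    entropy = -float((attn * np.log(attn + 1e-12)).sum())
    dispersion = entropy / np.log(n)
    return shift, dispersion
```

Under these definitions, perfectly uniform attention gives shift 0.5 and dispersion 1.0, while attention concentrated entirely on the last token gives shift 1.0 and dispersion near 0.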

The approach is training-free and applies pruning before the visual tokens ever reach the LLM, achieving substantial reductions in processing time while maintaining high levels of performance. Notably, it can prune approximately 95% of visual tokens while retaining about 90% of LLaVA-1.5-7B's performance, underscoring the efficacy of the [CLS]-based attention methodology.
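A back-of-envelope FLOPs estimate shows why pruning before the LLM pays off. The sketch below assumes a LLaMA/Vicuna-7B-style decoder (hidden size 4096, 32 layers, FFN size 11008) and an assumed prompt of 64 text tokens; exact savings depend on how FLOPs are counted:

```python
def prefill_flops(n_tokens, d=4096, d_ff=11008, n_layers=32):
    """Back-of-envelope prefill FLOPs for a LLaMA/Vicuna-7B-style decoder.
    Counts the QKVO projections, the attention score/value matmuls, and the
    gated MLP; ignores embeddings, norms, and the LM head."""
    per_layer = n_tokens * (8 * d**2 + 6 * d * d_ff) + 4 * n_tokens**2 * d
    return n_layers * per_layer

n_text = 64                            # assumed prompt length, for illustration
full = prefill_flops(576 + n_text)     # all 576 LLaVA-1.5 visual tokens
pruned = prefill_flops(29 + n_text)    # ~5% of visual tokens kept
reduction = 1 - pruned / full          # roughly 0.85 under these assumptions
```

At this scale the linear (projection and MLP) terms dominate, so prefill cost falls almost in proportion to the number of tokens removed, which is consistent in spirit with the large FLOPs reductions reported in the paper.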

Experimental Validation

Empirical evidence is provided through experiments across VLM architectures including LLaVA, LLaVA-NeXT, and Video-LLaVA. The results show that FasterVLM consistently outperforms text-visual attention-based approaches at high reduction ratios, and that it remains robust across a range of multi-modal benchmarks while delivering considerable reductions in FLOPs and computational overhead. For instance, at a 95% token pruning ratio it retains 89.41% of average performance.

Broader Implications

On a practical level, the introduction of FasterVLM advances the ability of VLMs to operate efficiently even under constrained resource settings, making them more viable for deployment in real-world applications that require rapid, on-the-fly inference. From a theoretical standpoint, this paradigm challenges the prevailing dependence on text-visual cross-attentions, advocating instead for a more visually centered attention strategy using [CLS] tokens. This could stimulate new directions in the optimization of VLM architectures and potentially impact how multi-modal integrations are approached in both academic research and industry applications.

Future Directions

The research opens several avenues for further investigation: probing how attention shift and dispersion evolve across processing layers, integrating more sophisticated alignment methods between the visual and language modalities, and extending token pruning strategies to input types beyond static images, such as dynamic video frames.

In conclusion, this paper offers a substantive contribution to the domain of efficient multi-modal inference, advocating for a shift in how significance is derived from visual information in the context of language-vision interaction. The FasterVLM method's ability to streamline the inference process without sacrificing performance paves the way for more agile and effective VLM applications.
