TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

Published 13 Mar 2025 in cs.CV | (2503.10501v1)

Abstract: Multimodal LLMs (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLM closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a training-free, plug-and-play framework that preserves essential visual information during token compression.
It employs a two-stage process using attention-guided token selection and merging to reduce tokens to 22.2% of the original with a 1.54% accuracy drop.
The method achieves a 1.23× inference speedup and cuts KV cache storage by 64%, enhancing computational efficiency in MLLMs.

TokenCarve: Information-Preserving Visual Token Compression in Multimodal LLMs

Introduction

The paper "TokenCarve: Information-Preserving Visual Token Compression in Multimodal LLMs" (2503.10501) presents an innovative approach to the challenge of efficiently compressing visual tokens in Multimodal LLMs (MLLMs). The primary issue addressed is the significant computational overhead associated with processing visual tokens, which form a major component of the input data in MLLMs. Traditional methods either require extensive retraining or fail to preserve performance under high compression rates.

Methodology

The authors propose TokenCarve, a training-free, plug-and-play framework designed to compress visual tokens significantly while preserving the essential information necessary for maintaining model performance. The framework comprises two stages:

Information-Preservation-Guided Selection (IPGS): This stage utilizes the rank of the attention output matrix to identify and prune low-information tokens. By combining information contribution scores, derived via singular value decomposition, with attention scores from the transformer layers, IPGS effectively retains tokens that are critical for information richness.
Token Merging: In this stage, tokens are further compressed through merging based on cosine similarity and contribution scores, thus minimizing information loss. This is achieved by categorizing tokens into a higher-scored Set A and a lower-scored Set B, merging the latter with similar tokens in the former.
Figure 1: The pipeline of the proposed TokenCarve framework.

Experimental Results

The experimental evaluation of TokenCarve covers 11 datasets, demonstrating its capability to reduce visual token counts to 22.2% of the original while maintaining high accuracy. The key findings include:

Performance Reduction: Only a 1.54% drop in accuracy is observed despite the aggressive token reduction.
Inference Speedup: Achieving an inference speedup of 1.23×, coupled with a 64% reduction in KV cache storage, substantiates the framework's efficiency.
Resilience Under Compression: TokenCarve consistently outperforms existing training-free methods, especially when significant token reductions are applied.
Figure 2: TokenCarve's performance retains close proximity to the uncompressed version even with substantial token reduction.

Figure 3: Correlation between MLLM performance and visual token information under compression, indicating robustness in performance.

Implications and Future Directions

The implications of this work are profound for the deployment of MLLMs, where computational efficiency is critical. By minimizing the information loss during token compression, TokenCarve not only reduces computational demands but also maintains high performance, paving the way for more scalable multimodal AI applications.

Future research directions could explore optimizing the architectural integration of TokenCarve to further enhance the inference speed, possibly through hardware enhancements or more efficient computation of positional encodings. Moreover, expanding the framework's applicability to other model architectures could generalize its benefits across different multimodal tasks.

Conclusion

TokenCarve presents a significant advancement in the field of token compression for MLLMs by addressing the critical balance between computational efficiency and information preservation. This framework sets a new standard for token reduction methodologies by achieving substantial compression with minimal performance degradation, offering a viable path forward for the efficient deployment of multimodal AI systems.

Markdown Report Issue