CoreMatching: A Co-adaptive Sparse Inference Framework
The paper titled "CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models" explores methods to accelerate inference in Vision-Language Models (VLMs) by synergistically utilizing sparsity in both token usage and neuron activation. The authors address the inefficiencies inherent in VLMs due to the high computational demands imposed by long image-token input sequences, which exceed those typically seen in Large Language Models (LLMs).
Key Contributions and Findings
Co-adaptive Sparse Inference Framework: The authors introduce CoreMatching, a framework that combines token sparsity and neuron sparsity for more efficient VLM inference. By pruning redundant tokens and restricting computation to a small set of active neurons in the same pass, it cuts cost along both the sequence-length and hidden-dimension axes, accelerating the entire inference pipeline.
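To make the combination concrete, the sketch below shows, under simplifying assumptions, how the two forms of sparsity can compose inside a single feed-forward block: pruned tokens are dropped from the sequence, and the FFN projections are restricted to a core-neuron subset. This is an illustrative PyTorch sketch, not the authors' implementation; the ReLU activation, the toy shapes, and the function name `sparse_ffn` are assumptions, and how the masks themselves might be chosen is sketched after the "Core Tokens and Core Neurons" paragraph below.

```python
# Illustrative sketch only (not the paper's code): apply a token mask and a
# neuron mask inside one FFN. ReLU stands in for the model's real activation.
import torch
import torch.nn.functional as F

def sparse_ffn(hidden: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor,
               token_mask: torch.Tensor, neuron_mask: torch.Tensor) -> torch.Tensor:
    """FFN forward over kept tokens and core neurons only.

    hidden:      (num_tokens, d_model) block input
    w_up:        (d_model, d_ffn)      up projection
    w_down:      (d_ffn, d_model)      down projection
    token_mask:  (num_tokens,) bool,   True = keep this token
    neuron_mask: (d_ffn,) bool,        True = core neuron
    """
    x = hidden[token_mask]                      # drop pruned tokens entirely
    up = x @ w_up[:, neuron_mask]               # compute only the core neurons
    return F.relu(up) @ w_down[neuron_mask, :]  # down-project from the core set

# Toy shapes loosely modeled on a 7B-scale decoder (d_model=4096, d_ffn=11008)
# with 576 image tokens; the random masks here are placeholders.
h = torch.randn(576, 4096)
wu, wd = torch.randn(4096, 11008), torch.randn(11008, 4096)
tmask = torch.rand(576) < 0.25     # keep roughly 25% of tokens
nmask = torch.rand(11008) < 0.30   # keep roughly 30% of neurons
print(sparse_ffn(h, wu, wd, tmask, nmask).shape)  # (num_kept_tokens, 4096)
```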
Interplay Between Token and Neuron Sparsity: Token sparsity and neuron sparsity have traditionally evolved as separate paradigms. The paper challenges the assumption that they are independent by proposing a matching mechanism between core neurons and core tokens, and its findings suggest that the two mutually reinforce each other, allowing for a more efficient inference process.
Core Tokens and Core Neurons: Core neurons are defined as the subset of neurons that are activated most frequently; core tokens are the tokens that activate the largest number of core neurons. Experimental evidence shows that retaining only the core tokens yields nearly lossless performance, and that restricting computation to the core neurons achieves significant computational reductions without sacrificing output quality.
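As a companion to these definitions, here is a minimal PyTorch sketch of the two selection steps. It is not the authors' algorithm: counting a neuron as "activated" when its pre-activation is positive, the top-k keep ratios, and the function names `select_core_neurons` and `select_core_tokens` are all assumptions made for illustration.

```python
# Hypothetical selection sketch: core neurons = most frequently fired,
# core tokens = tokens that fire the most core neurons.
import torch

def select_core_neurons(acts: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """acts: (num_tokens, num_neurons) FFN pre-activations from prefill.
    Returns a (num_neurons,) bool mask over the most frequently fired neurons."""
    fire_counts = (acts > 0).sum(dim=0)              # how often each neuron fires
    k = max(1, int(keep_ratio * acts.shape[1]))
    core = torch.zeros(acts.shape[1], dtype=torch.bool)
    core[fire_counts.topk(k).indices] = True
    return core

def select_core_tokens(acts: torch.Tensor, core_neurons: torch.Tensor,
                       keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the tokens that activate the largest number of core neurons."""
    scores = (acts[:, core_neurons] > 0).sum(dim=1)  # core neurons fired per token
    k = max(1, int(keep_ratio * acts.shape[0]))
    keep = torch.zeros(acts.shape[0], dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return keep

# Toy usage: 576 image tokens, an 11008-wide FFN; random stand-in activations.
acts = torch.randn(576, 11008)
core_n = select_core_neurons(acts)
core_t = select_core_tokens(acts, core_n)
print(core_n.sum().item(), "core neurons;", core_t.sum().item(), "core tokens")
```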
Theoretical and Empirical Validation: CoreMatching is validated both theoretically and empirically across multiple tasks and hardware platforms. On an NVIDIA Titan Xp, it achieves a 5× reduction in floating-point operations (FLOPs) and a 10× speedup in overall inference time, surpassing state-of-the-art baselines on ten image-understanding tasks.
Implications for Future Research and Applications
CoreMatching opens avenues for efficiently deploying VLMs on resource-constrained devices by reducing memory usage and computational demands. It demonstrates the potential to achieve substantial inference acceleration across diverse tasks and hardware architectures. The paper provides valuable insights into the interplay between token and neuron sparsity, suggesting that future work might further explore integrated sparsity approaches in other models and architectures.
Speculation on Future Developments
Exploring the synergy between different forms of sparsity in machine learning models holds promise for further improving the performance and efficiency of AI applications. As models continue to scale, understanding and leveraging sparsity could lead to breakthroughs not only in vision-language tasks but also in other domains such as natural language processing, computational biology, and beyond. Future studies might delve deeper into adaptive mechanisms that dynamically select and prune computational elements at runtime, driven by real-time inputs and context.
In summary, the paper makes significant contributions to the efficient handling of computational loads in VLMs using a co-adaptive sparse inference strategy, paving the way for broader applications and continued innovations in efficient AI model deployment.