Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Abstract: Despite the remarkable success of Vision Transformers (ViTs) on various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (**VoMix**), a plug-and-play, parameter-free token reduction method that can be readily applied to off-the-shelf ViT models *without any training*. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token-similarity voting mechanism. The selected tokens are then mixed into the retained set, thereby preserving visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× throughput increase for an existing ViT-H on ImageNet-1K and a 2.4× increase for an existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
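Concretely, the mechanism can be pictured as a two-step reduction applied per layer: vote, then mix. The sketch below is a minimal PyTorch reconstruction from the abstract alone; the function name `vomix_reduce`, the cosine-similarity metric, the max-similarity voting rule, and the uniform-average mixing weights are all assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def vomix_reduce(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce (B, N, D) tokens to (B, N - r, D) by similarity voting + mixing.

    Illustrative sketch only: the exact vote and mixing rules are not
    specified in the abstract, so cosine similarity, a max-similarity
    vote, and uniform averaging are assumed here.
    """
    B, N, D = x.shape
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.transpose(1, 2)                    # (B, N, N) cosine similarity
    sim = sim - 2.0 * torch.eye(N, device=x.device)  # mask out self-similarity

    # Vote: score each token by its peak similarity to any other token;
    # high scores mark homogeneous (redundant) tokens.
    score = sim.max(dim=-1).values                   # (B, N)
    drop = score.topk(r, dim=-1).indices             # (B, r) tokens to mix away
    keep = score.topk(N - r, dim=-1, largest=False).indices.sort(dim=-1).values

    expand = lambda idx: idx.unsqueeze(-1).expand(-1, -1, D)
    x_keep = x.gather(1, expand(keep))               # (B, N - r, D)
    x_drop = x.gather(1, expand(drop))               # (B, r, D)

    # Mix: fold each dropped token into its most similar retained token,
    # so its visual information is preserved rather than discarded.
    sim_rows = sim.gather(1, drop.unsqueeze(-1).expand(-1, -1, N))    # (B, r, N)
    sim_dk = sim_rows.gather(2, keep.unsqueeze(1).expand(-1, r, -1))  # (B, r, N - r)
    target = sim_dk.argmax(dim=-1)                   # (B, r) index into kept tokens

    out = x_keep.clone()
    counts = torch.ones(B, N - r, 1, device=x.device)
    out.scatter_add_(1, expand(target), x_drop)
    counts.scatter_add_(1, target.unsqueeze(-1), torch.ones(B, r, 1, device=x.device))
    return out / counts                              # uniform-average mixing
```

Under these assumptions, inserting `vomix_reduce` before each Transformer block and shrinking the token count layer by layer would realize the layer-wise, training-free schedule the abstract describes: attention cost falls with N at every layer while dropped tokens still contribute through the retained tokens they were mixed into.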