Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Published 30 Aug 2024 in cs.CV (arXiv:2408.17062v1)

Abstract: Despite the remarkable success of Vision Transformers (ViTs) on various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play, parameter-free token reduction method that can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. The selected tokens are then mixed into the retained set, thereby preserving their visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× increase in the throughput of an existing ViT-H on ImageNet-1K and a 2.4× increase in the throughput of an existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
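
To make the mechanism concrete, below is a minimal PyTorch sketch of one VoMix-style reduction step for a single sample. It is a sketch under stated assumptions, not the paper's implementation: the function name vote_and_mix, the one-vote-per-token rule, the sigmoid similarity weighting, and the per-layer reduction count r are all illustrative choices; only the overall vote-then-mix structure follows the abstract.

import torch
import torch.nn.functional as F

def vote_and_mix(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce N tokens to N - r by similarity voting, then mixing (sketch)."""
    N, D = tokens.shape

    # Pairwise cosine similarity between all tokens in this layer.
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T                     # (N, N)
    sim.fill_diagonal_(float("-inf"))           # a token cannot vote for itself

    # Voting: each token votes for its most similar peer. Tokens that
    # collect many votes sit in homogeneous regions, i.e. are redundant.
    votes = torch.zeros(N)
    votes.scatter_add_(0, sim.argmax(dim=-1), torch.ones(N))

    # Select the r most-voted tokens for removal; retain the rest.
    removed = votes.topk(r).indices
    keep = torch.ones(N, dtype=torch.bool)
    keep[removed] = False
    kept_idx = keep.nonzero(as_tuple=True)[0]

    # Mixing: fold each removed token into its most similar retained token,
    # so its visual information is preserved rather than discarded.
    out = tokens[kept_idx].clone()
    for i in removed.tolist():
        j = sim[i, kept_idx].argmax()           # nearest retained token
        w = torch.sigmoid(sim[i, kept_idx][j])  # assumed similarity-to-weight map
        out[j] = (1.0 - w) * out[j] + w * tokens[i]
    return out

# Usage: reduce the 197 tokens of a ViT-B/16 layer by 32.
x = torch.randn(197, 768)
y = vote_and_mix(x, r=32)                       # y.shape == (165, 768)

Because the step is parameter-free and acts only on the token set, it can in principle be inserted between the layers of a pretrained ViT at inference time, with the per-layer reduction count r tuned to hit a target throughput; this is what makes the method plug-and-play.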

