Vision Transformer with Super Token Sampling
Abstract: Vision transformers have achieved impressive performance on many vision tasks. However, they may suffer from high redundancy when capturing local features in shallow layers. Local self-attention or early-stage convolutions are thus employed, sacrificing the capacity to capture long-range dependencies. A challenge then arises: can we achieve efficient and effective global context modeling in the early stages of a neural network? To address this issue, we draw inspiration from the design of superpixels, which reduce the number of image primitives in subsequent processing, and introduce super tokens into the vision transformer. Super tokens aim to provide a semantically meaningful tessellation of visual content, reducing the number of tokens in self-attention while preserving global modeling. Specifically, we propose a simple yet strong super token attention (STA) mechanism with three steps: the first samples super tokens from visual tokens via sparse association learning, the second performs self-attention on the super tokens, and the last maps them back to the original token space. STA decomposes vanilla global attention into the product of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies. Based on STA, we develop a hierarchical vision transformer. Extensive experiments demonstrate its strong performance on various vision tasks. In particular, without any extra training data or labels, it achieves 86.4% top-1 accuracy on ImageNet-1K with fewer than 100M parameters. It also achieves 53.9 box AP and 46.8 mask AP on COCO detection, and 51.9 mIoU on ADE20K semantic segmentation. Code is released at https://github.com/hhb072/STViT.
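To make the three STA steps concrete, here is a minimal PyTorch sketch of how they might compose. It is an illustration under simplifying assumptions, not the released implementation: it uses a dense softmax association where the paper learns a sparse one restricted to each token's neighboring super tokens, a single attention head, and no learned q/k/v projections; the names `super_token_attention`, `grid`, and `n_iter` are made up for this example.

```python
import torch
import torch.nn.functional as F

def super_token_attention(x, H, W, grid=8, n_iter=1):
    """x: (B, N, C) visual tokens on an H x W grid, with N = H * W."""
    B, N, C = x.shape
    scale = C ** -0.5

    # Step 1: sample super tokens. Initialize a grid x grid set of super
    # tokens by average-pooling the token grid, then refine them with a soft
    # token-to-super-token association map q of shape (B, N, M).
    s = F.adaptive_avg_pool2d(x.transpose(1, 2).reshape(B, C, H, W), (grid, grid))
    s = s.flatten(2).transpose(1, 2)                         # (B, M, C), M = grid * grid
    for _ in range(n_iter):
        q = (x @ s.transpose(1, 2) * scale).softmax(dim=-1)  # (B, N, M)
        # Re-estimate each super token as the association-weighted mean of tokens.
        s = (q.transpose(1, 2) @ x) / (q.sum(dim=1).unsqueeze(-1) + 1e-6)
    q = (x @ s.transpose(1, 2) * scale).softmax(dim=-1)      # final association map

    # Step 2: self-attention among the M << N super tokens
    # (single head, no learned projections, for brevity).
    attn = (s @ s.transpose(1, 2) * scale).softmax(dim=-1)   # (B, M, M)
    s = attn @ s

    # Step 3: map the attended super tokens back to the original
    # token space through the same association map.
    return q @ s                                             # (B, N, C)

# Usage: a 14x14 grid of 64-dim tokens summarized by 8x8 super tokens.
tokens = torch.randn(2, 14 * 14, 64)
out = super_token_attention(tokens, H=14, W=14)
print(out.shape)  # torch.Size([2, 196, 64])
```

In this toy configuration the attention in step 2 runs over M = 64 super tokens instead of N = 196 visual tokens, so its cost scales with M^2 rather than N^2, which is the source of the efficiency gain the abstract describes.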