Vision Transformer with Super Token Sampling
Abstract: Vision transformers have achieved impressive performance on many vision tasks. However, they may suffer from high redundancy when capturing local features in shallow layers. Local self-attention or early-stage convolutions are thus employed, sacrificing the capacity to capture long-range dependencies. A challenge then arises: can we achieve efficient and effective global context modeling in the early stages of a neural network? To address this issue, we draw inspiration from the design of superpixels, which reduce the number of image primitives in subsequent processing, and introduce super tokens into the vision transformer. Super tokens aim to provide a semantically meaningful tessellation of visual content, reducing the number of tokens in self-attention while preserving global modeling. Specifically, we propose a simple yet strong super token attention (STA) mechanism with three steps: the first samples super tokens from visual tokens via sparse association learning, the second performs self-attention on the super tokens, and the last maps them back to the original token space. STA decomposes vanilla global attention into the product of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies. Based on STA, we develop a hierarchical vision transformer. Extensive experiments demonstrate its strong performance on various vision tasks. In particular, without any extra training data or labels, it achieves 86.4% top-1 accuracy on ImageNet-1K with fewer than 100M parameters. It also achieves 53.9 box AP and 46.8 mask AP on COCO detection, and 51.9 mIoU on ADE20K semantic segmentation. Code is released at https://github.com/hhb072/STViT.
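To make the three STA steps concrete, here is a minimal PyTorch sketch of how they might compose. It is an illustration under simplifying assumptions, not the released implementation: it uses a dense softmax association where the paper learns a sparse one restricted to each token's neighboring super tokens, a single attention head, and no learned q/k/v projections; the names `super_token_attention`, `grid`, and `n_iter` are made up for this example.

```python
import torch
import torch.nn.functional as F

def super_token_attention(x, H, W, grid=8, n_iter=1):
    """x: (B, N, C) visual tokens on an H x W grid, with N = H * W."""
    B, N, C = x.shape
    scale = C ** -0.5

    # Step 1: sample super tokens. Initialize a grid x grid set of super
    # tokens by average-pooling the token grid, then refine them with a soft
    # token-to-super-token association map q of shape (B, N, M).
    s = F.adaptive_avg_pool2d(x.transpose(1, 2).reshape(B, C, H, W), (grid, grid))
    s = s.flatten(2).transpose(1, 2)                         # (B, M, C), M = grid * grid
    for _ in range(n_iter):
        q = (x @ s.transpose(1, 2) * scale).softmax(dim=-1)  # (B, N, M)
        # Re-estimate each super token as the association-weighted mean of tokens.
        s = (q.transpose(1, 2) @ x) / (q.sum(dim=1).unsqueeze(-1) + 1e-6)
    q = (x @ s.transpose(1, 2) * scale).softmax(dim=-1)      # final association map

    # Step 2: self-attention among the M << N super tokens
    # (single head, no learned projections, for brevity).
    attn = (s @ s.transpose(1, 2) * scale).softmax(dim=-1)   # (B, M, M)
    s = attn @ s

    # Step 3: map the attended super tokens back to the original
    # token space through the same association map.
    return q @ s                                             # (B, N, C)

# Usage: a 14x14 grid of 64-dim tokens summarized by 8x8 super tokens.
tokens = torch.randn(2, 14 * 14, 64)
out = super_token_attention(tokens, H=14, W=14)
print(out.shape)  # torch.Size([2, 196, 64])
```

In this toy configuration the attention in step 2 runs over M = 64 super tokens instead of N = 196 visual tokens, so its cost scales with M^2 rather than N^2, which is the source of the efficiency gain the abstract describes.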