Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

Published 7 Jun 2021 in cs.CV (arXiv:2106.03650v1)

Abstract: Very recently, Window-based Transformers, which computed self-attention within non-overlapping local windows, demonstrated promising results on image classification, semantic segmentation, and object detection. However, less study has been devoted to the cross-window connection which is the key element to improve the representation ability. In this work, we revisit the spatial shuffle as an efficient way to build connections among windows. As a result, we propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code. Furthermore, the depth-wise convolution is introduced to complement the spatial shuffle for enhancing neighbor-window connections. The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation. Code will be released for reproduction.

Citations (168)

Summary

  • The paper introduces a novel architecture that uses spatial shuffle operations to enhance long-range connectivity in Vision Transformers.
  • It integrates window-based self-attention with spatial alignment and depth-wise convolutions to overcome the grid issue in high-resolution images.
  • Empirical results demonstrate superior top-1 accuracy on ImageNet and improved mIoU and AP metrics on segmentation tasks.

The paper presents the Shuffle Transformer, a Vision Transformer (ViT) architecture that revisits spatial shuffle as a means of building cross-window connections in vision tasks. The topic is timely: window-based ViTs such as Swin achieve strong efficiency by computing self-attention within non-overlapping local windows, but this windowing limits information exchange across windows, which matters for tasks that demand large receptive fields.
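To make the efficiency trade-off concrete, the following back-of-the-envelope comparison (with illustrative numbers, not figures taken from the paper) contrasts the pairwise-interaction counts of global attention and window-based attention on a Swin-like feature map:

```python
# Rough count of pairwise attention interactions (per head, ignoring
# channel dimension and constant factors). Numbers are illustrative:
# a 56x56 token grid with 7x7 windows, as in typical Swin-like stages.
H, W, ws = 56, 56, 7

tokens = H * W                       # 3136 tokens in the feature map
global_pairs = tokens ** 2           # every token attends to every token
window_pairs = tokens * ws ** 2      # each token attends only within its window

print(global_pairs // window_pairs)  # windowing is (H*W)/ws^2 = 64x cheaper here
```

Windowing makes the cost linear in the number of tokens, but no information crosses window borders, which is the gap the spatial shuffle is designed to close.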

Conceptual Approach

The Shuffle Transformer builds on the window-based multi-head self-attention mechanism by introducing a spatial shuffle operation reminiscent of ShuffleNet's channel shuffle. The shuffle enables long-range information flow across non-overlapping windows without significantly increasing computational complexity, and it is paired with its inverse, termed the spatial alignment operation, which realigns features with the image content.
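The shuffle/align pair can be sketched with plain reshape and transpose calls, mirroring ShuffleNet's channel shuffle in the spatial dimensions. This is a minimal single-image NumPy sketch (the paper's implementation operates on batched tensors and, per the abstract, amounts to a two-line modification); `ws` is the window size and is assumed to divide both spatial dimensions:

```python
import numpy as np

def spatial_shuffle(x, ws):
    """Regroup tokens so that each ws x ws window gathers one token
    from many different original windows, enabling cross-window mixing
    when window attention is applied afterwards."""
    H, W, C = x.shape
    x = x.reshape(ws, H // ws, ws, W // ws, C)  # split axes into (in-window, window) indices
    x = x.transpose(1, 0, 3, 2, 4)              # swap the window and in-window roles
    return x.reshape(H, W, C)

def spatial_align(x, ws):
    """Exact inverse of spatial_shuffle: restore the original layout."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(1, 0, 3, 2, 4)
    return x.reshape(H, W, C)

tokens = np.arange(8 * 8 * 2).reshape(8, 8, 2)
restored = spatial_align(spatial_shuffle(tokens, 4), 4)
print(np.array_equal(restored, tokens))  # True: align undoes shuffle
```

Because `spatial_align` exactly inverts `spatial_shuffle`, alternating shuffled and plain window-attention blocks keeps features registered with the underlying image content.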

Architectural Enhancements

The authors propose integrating a depth-wise convolution layer accompanied by residual connections after the window-based self-attention module. This enhances neighbor-window connections, addressing the "grid issue" which surfaces when image resolution significantly exceeds window size. The resulting structure, known as the Shuffle Transformer Block, alternates between the basic window multi-head self-attention and the shuffle-enhanced variant, achieving linear computational complexity relative to the number of input tokens.
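The neighbor-window enhancement can be illustrated with a depth-wise convolution, which filters each channel independently so that adjacent windows exchange information across their borders. Below is a minimal NumPy sketch of a 3x3 depth-wise convolution with the residual connection described above (single image, zero padding; the kernel values are illustrative, not learned weights):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Apply one 3x3 filter per channel (no cross-channel mixing).
    x: (H, W, C) feature map, kernels: (3, 3, C) per-channel filters."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad spatial borders
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            # each output pixel mixes its 3x3 neighborhood, channel-wise
            out[i, j] = np.einsum('hwc,hwc->c', xp[i:i + 3, j:j + 3], kernels)
    return out

# Residual form around the attention module: y = x + DWConv(x)
x = np.random.rand(8, 8, 4)
kernels = np.random.rand(3, 3, 4) * 0.1
y = x + depthwise_conv3x3(x, kernels)
```

Because the receptive field straddles window boundaries, this cheap per-channel operation patches the "grid issue" that window attention alone leaves behind.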

Empirical Evaluations

Extensive experiments underscore the performance advantages of the Shuffle Transformer. On ImageNet-1K for image classification, Shuffle Transformer variants achieve top-1 accuracy superior to previous architectures with comparable computational demands, notably outperforming Swin Transformers. Similarly, when evaluated for semantic segmentation on ADE20K and instance segmentation on COCO, the Shuffle Transformer exhibits superior mIoU and AP metrics, reinforcing its effectiveness across various vision challenges.

Implications and Future Trajectories

The proposed Shuffle Transformer represents a significant step toward more efficient and accurate ViTs, with practical applications in fields that require high-fidelity image processing, such as autonomous vehicles and medical imaging. Its solution to cross-window integration without exorbitant computational cost may pave the way for further transformer architectures better suited to high-resolution imagery.

Future work may further optimize the spatial shuffle operation or integrate more advanced convolution strategies to strengthen connectivity. Broadening such architectures to multi-modal data could also clarify how far the presented methods extend in wider AI contexts.
