
Video Swin Transformer

Published 24 Jun 2021 in cs.CV, cs.AI, and cs.LG | (2106.13230v1)

Abstract: The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.

Citations (1,293)

Summary

  • The paper introduces Video Swin Transformer, which adapts the Swin Transformer for video by leveraging spatiotemporal locality to balance speed and accuracy.
  • It employs a hierarchical design with 3D patch partitioning and shifted window-based self-attention to optimize local computations.
  • Empirical results show state-of-the-art performance on benchmarks like Kinetics-400, Kinetics-600, and Something-Something v2 with reduced computational costs.

An Analysis of "Video Swin Transformer"

The paper "Video Swin Transformer" adapts the Swin Transformer, originally designed for image recognition, to video data. The authors introduce an inductive bias of locality into video Transformers, yielding a better speed-accuracy trade-off than prior architectures that compute self-attention globally, even those using spatial-temporal factorization.

Introduction and Motivation

The landscape of visual modeling has seen a significant shift from Convolutional Neural Networks (CNNs) to Transformer-based architectures. Pioneering models like Vision Transformer (ViT) demonstrated that Transformer architectures could outperform CNNs on image recognition tasks by globally modeling spatial relationships. This paper builds on this premise but recognizes that extending such global self-attention mechanisms naively to videos incurs prohibitive computation costs. Therefore, the authors advocate for an inductive bias of locality to efficiently scale Transformers to video tasks.

Architecture

The proposed Video Swin Transformer adapts the Swin Transformer for videos by leveraging the inherent spatiotemporal locality within video frames. The central idea is that pixels close in spatiotemporal distance have higher correlation, allowing for efficient local self-attention computations.
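
To make the efficiency argument concrete, the following back-of-the-envelope sketch compares the number of token pairs scored by global self-attention versus 3D-windowed self-attention. The grid and window sizes match the paper's defaults (a 16×56×56 token grid from 2×4×4 patch embedding, with 8×7×7 windows); the script itself is illustrative, not from the paper.

```python
# Token-pair counts: global vs. 3D-windowed self-attention.
T, H, W = 16, 56, 56          # temporal x spatial tokens after patch embedding
Wt, Wh, Ww = 8, 7, 7          # default 3D window size in the paper

tokens = T * H * W
global_pairs = tokens ** 2    # every token attends to every other token

win_tokens = Wt * Wh * Ww
num_windows = (T // Wt) * (H // Wh) * (W // Ww)
windowed_pairs = num_windows * win_tokens ** 2  # attention stays inside windows

print(f"global:   {global_pairs:,} token pairs")
print(f"windowed: {windowed_pairs:,} token pairs")
print(f"ratio:    {global_pairs / windowed_pairs:.0f}x fewer pairs")
```

Since `global_pairs = (num_windows * win_tokens)^2` while `windowed_pairs = num_windows * win_tokens^2`, the saving factor is exactly the number of windows (128 here), which is what makes attention over full video grids tractable.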

Key Architectural Components

  1. 3D Patch Partitioning: The video input is partitioned into non-overlapping 3D patches, which are then embedded into a higher-dimensional space.
  2. Hierarchical Structure: Following the original Swin Transformer, the video model employs a hierarchical architecture with 2× spatial downsampling via patch merging between stages; the temporal dimension is not downsampled.
  3. 3D Shifted Window Based Multi-Head Self-Attention (MSA): This mechanism introduces locality by computing self-attention within non-overlapping 3D windows. To introduce cross-window connections, the windows are periodically shifted, akin to Swin Transformer's approach for images but extended to the spatiotemporal domain.
  4. Relative Position Bias: A 3D relative position bias is incorporated into the self-attention mechanism to account for spatial and temporal relationships more effectively.
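
The window partitioning and shifting in components 1-3 can be sketched as below. This is a minimal, illustrative reimplementation in PyTorch, not the authors' code: it shows how a token grid is split into non-overlapping 3D windows and how `torch.roll` produces the shifted configuration, but it omits the attention masking that a real implementation applies to tokens wrapped across boundaries by the roll.

```python
import torch

def window_partition_3d(x, ws):
    """Split a (B, T, H, W, C) token grid into non-overlapping 3D windows."""
    B, T, H, W, C = x.shape
    wt, wh, ww = ws
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    # -> (num_windows * B, wt*wh*ww, C): each row group is one window's tokens,
    # and self-attention is computed independently within each group.
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)

# Successive layers alternate between regular windows and windows shifted by
# half the window size, so information flows across window borders.
x = torch.randn(1, 8, 14, 14, 96)      # toy token grid (illustrative sizes)
ws = (4, 7, 7)
shift = tuple(s // 2 for s in ws)

regular = window_partition_3d(x, ws)
shifted = window_partition_3d(
    torch.roll(x, shifts=[-s for s in shift], dims=(1, 2, 3)), ws
)
print(regular.shape)  # 8 windows of 4*7*7 = 196 tokens each
```

The 3D relative position bias of component 4 is then a learned table indexed by each token pair's relative (dt, dh, dw) offset within a window, added to the attention logits before the softmax.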

Variants and Initialization

The authors explore several variants of the architecture, designated as Swin-T, Swin-S, Swin-B, and Swin-L, varying in model size and computational complexity. The model benefits from strong initialization by leveraging weights pre-trained on large-scale image datasets like ImageNet-21K.
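
The two initialization strategies the paper compares for turning a pretrained 2D patch-embedding kernel into a 3D one can be sketched as follows. This is a hedged illustration of the general idea, not the authors' code; the kernel shapes mirror Swin's 4×4 image patches inflated to Video Swin's 2×4×4 video patches.

```python
import torch

def inflate_init(w2d, t):
    """'Inflate': tile the 2D kernel along time and rescale by 1/t, so the
    response to a temporally constant input matches the 2D model's output."""
    # w2d: (out_ch, in_ch, kh, kw) -> (out_ch, in_ch, t, kh, kw)
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

def center_init(w2d, t):
    """'Center': place the 2D kernel at one temporal slice, zeros elsewhere."""
    w3d = torch.zeros(w2d.shape[0], w2d.shape[1], t, *w2d.shape[2:])
    w3d[:, :, t // 2] = w2d
    return w3d

w2d = torch.randn(96, 3, 4, 4)   # e.g. a Swin-T 4x4 patch-embedding kernel
w3d = inflate_init(w2d, t=2)     # Video Swin embeds 2x4x4 patches
print(w3d.shape)                 # torch.Size([96, 3, 2, 4, 4])
```

Both schemes reproduce the pretrained image model's behavior on a single frame; they differ in whether temporal information is averaged across the patch or initially ignored.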

Empirical Results

The proposed Video Swin Transformer achieves state-of-the-art performance on benchmark video recognition datasets, including Kinetics-400 (K400), Kinetics-600 (K600), and Something-Something v2 (SSv2).

Key Results:

  • Kinetics-400: Achieves 84.9% top-1 accuracy, outperforming previous state-of-the-art models like ViViT-H with significantly less pre-training data and smaller model size.
  • Kinetics-600: Achieves 86.1% top-1 accuracy, again with far less pre-training data than prior state-of-the-art models.
  • Something-Something v2: Demonstrates strong temporal modeling capabilities with a top-1 accuracy of 69.6%.

Implications and Future Directions

The Video Swin Transformer demonstrates that incorporating locality in transformer architectures is beneficial for video tasks, leading to improvements in computational efficiency and model performance. The findings suggest several directions for future research:

  1. Scalability: Further investigation into scaling the temporal dimension for longer video sequences while maintaining computational efficiency.
  2. Initialization: Exploring advanced strategies for utilizing pre-trained image model weights, particularly focusing on the differences between inflate and center initialization methods.
  3. Temporal Dynamics: Enhanced modeling of complex temporal dynamics, possibly incorporating a more nuanced handling of temporal attention mechanisms.

Conclusion

The proposed Video Swin Transformer marks a significant advancement in video recognition. By capitalizing on the spatiotemporal locality, the model achieves a superior speed-accuracy trade-off, paving the way for more efficient and effective video Transformer models. The public availability of the code and models further ensures that this approach can be a foundation for future research and development in video AI.