
DeiT III: Revenge of the ViT

Published 14 Apr 2022 in cs.CV (arXiv:2204.07118v1)

Abstract: A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.

Citations (319)

Summary

  • The paper demonstrates that an optimized supervised training recipe, including a new 3-Augment data-augmentation strategy, substantially boosts Vision Transformer performance.
  • It adopts a simpler cropping scheme (Simple Random Crop, SRC) and a binary cross-entropy loss to improve training on large datasets.
  • The study challenges the reliance on self-supervised pre-training by showing that improved supervised training alone yields competitive results.

Overview of "DeiT III: Revenge of the ViT"

In the domain of computer vision, transformer-based architectures such as Vision Transformers (ViTs) have recently gained traction as a viable alternative to convolutional neural networks (CNNs). The study "DeiT III: Revenge of the ViT" revisits supervised training procedures for Vision Transformers, shedding light on how these can be optimized to outperform existing methods. This paper refines and simplifies prior methodologies, drawing comparisons primarily with self-supervised learning techniques to establish Vision Transformers as competitive models under supervised settings.

Core Contributions

The key contribution of the paper is an improved supervised training recipe for Vision Transformers. The method builds on and simplifies established training techniques, including data-augmentation practices typically used in self-supervised learning:

  • Simplified Data Augmentation: The authors introduce a streamlined augmentation strategy dubbed "3-Augment," consisting of three simple transformations: grayscale, solarization, and Gaussian blur (a minimal sketch follows this list). In several settings this proves more effective for ViTs than more complex schemes such as RandAugment.
  • Efficient Cropping Techniques: The recipe replaces the traditional Random Resized Crop (RRC) with Simple Random Crop (SRC), reducing the aspect-ratio distortion and the train/test mismatch in apparent object size that RRC introduces, which matters especially on larger datasets like ImageNet-21k.
  • Optimized Loss Functions: The study employs a binary cross-entropy (BCE) loss in place of the standard cross-entropy in some settings, which improves results when combined with target-mixing techniques like Mixup.
  • Regularization Enhancements: Stochastic depth and LayerScale further aid convergence and keep training stable across model depths (a LayerScale sketch follows the augmentation example below).
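
As a concrete illustration, here is a minimal PyTorch/torchvision sketch of the 3-Augment idea: one of the three transformations is chosen uniformly per image. The class name, blur kernel size, solarization threshold, and jitter strength are illustrative assumptions, not the paper's exact hyperparameters.

```python
import random

from torchvision import transforms

class ThreeAugment:
    """Pick one of {grayscale, solarization, Gaussian blur} uniformly per image.

    A minimal sketch of the 3-Augment idea; kernel size and solarization
    threshold are illustrative choices, not the authors' exact settings.
    """

    def __init__(self):
        self.choices = [
            transforms.Grayscale(num_output_channels=3),
            transforms.RandomSolarize(threshold=128, p=1.0),  # PIL inputs are in [0, 255]
            transforms.GaussianBlur(kernel_size=9),
        ]

    def __call__(self, img):
        return random.choice(self.choices)(img)

# Illustrative supervised pipeline around it:
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # or a simple random crop, per the paper's SRC
    transforms.RandomHorizontalFlip(),
    ThreeAugment(),
    transforms.ColorJitter(0.3, 0.3, 0.3),  # the recipe also keeps a simple color jitter
    transforms.ToTensor(),
])
```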

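The LayerScale mechanism referenced above (introduced in the CaiT paper and reused here) is simple enough to sketch directly: each residual branch's output is multiplied by a learnable per-channel vector initialized to a small constant, so a deep network starts close to an identity mapping. The init value below is an illustrative choice.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling of a residual branch (as in CaiT).

    dim: token embedding width; init_value: small constant so that each
    block initially contributes only a small residual update.
    """

    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x  # broadcasts over (batch, tokens, dim)

# Schematic placement inside a transformer block, combined with
# stochastic depth (drop_path):
#   x = x + drop_path(layer_scale_1(attn(norm1(x))))
#   x = x + drop_path(layer_scale_2(mlp(norm2(x))))
```
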
Results

  • Performance Benchmarks: The proposed procedure outperforms previous fully supervised training recipes for ViTs on ImageNet-1k and ImageNet-21k, reaching performance comparable to more recent state-of-the-art architectures.
  • Resource Efficiency: Even for larger models, the recipe reduces compute and peak memory by training at a resolution lower than the evaluation resolution; fewer input patches mean fewer tokens per image, an effect comparable to the token reduction in masked autoencoders (see the token-count sketch below).
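
The efficiency argument follows from how a ViT tokenizes its input: with patch size p, an r x r image produces (r/p)^2 tokens, and self-attention cost grows quadratically with the token count. A quick back-of-the-envelope check (the training resolutions here are illustrative, not the paper's exact choices):

```python
def vit_tokens(resolution: int, patch: int = 16) -> int:
    """Patch tokens for a square input (class token excluded)."""
    return (resolution // patch) ** 2

base = vit_tokens(224)  # 196 tokens at the usual evaluation resolution
for res in (160, 192, 224):
    n = vit_tokens(res)
    print(f"{res}x{res}: {n:>4} tokens, ~{n**2 / base**2:.2f}x the attention cost of 224")
```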

Implications and Future Directions

The relevance of this study extends beyond proposing a new training approach. It challenges the prevalent narrative that self-supervised learning is indispensable for transforming ViT architectures into competitive models. These findings underscore the potential to achieve competitive results by optimizing supervised training strategies alone, thus reinvigorating interest in exploring efficient supervised learning pathways for vision transformers.

The study opens avenues for further research into refining the training pipelines and loss functions of transformer-based architectures and how they can be harmonized with minimalistic yet effective data augmentation techniques. Additionally, it serves as a benchmark for evaluating future architectures or training paradigms in a supervised setting. As research into self-supervised methodologies intensifies, this paper positions the supervised training of ViTs as an area ripe for advancement and innovation.

In conclusion, the DeiT III study shows that, through careful adjustment of the training recipe, Vision Transformers can reach top-tier benchmarks without relying on built-in convolutional priors, making it a significant contribution to deep learning for computer vision.
