
TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Published 21 Jul 2022 in cs.CV (arXiv:2207.10666v1)

Abstract: The vision transformer (ViT) has recently drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling the small models to reap the benefits of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, at increased image resolutions, TinyViT can reach 86.5% accuracy, slightly better than Swin-L while using only 11% of its parameters. Last but not least, we demonstrate the good transferability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.

Citations (173)

Summary

  • The paper introduces a fast pretraining distillation framework that transfers knowledge from large models to small vision transformers.
  • The methodology leverages sparse teacher outputs and MBConvs to reduce computational load while achieving competitive accuracy on benchmarks like ImageNet-1k.
  • Implications include efficient model deployment on resource-constrained devices and improved transferability across various downstream tasks.

Overview of TinyViT: Fast Pretraining Distillation for Small Vision Transformers

The paper introduces TinyViT, a family of compact, efficient vision transformers designed to reduce the computational and storage demands of existing Vision Transformer (ViT) models, making them practical to deploy on resource-constrained devices such as mobile and IoT hardware. The cornerstone of the study is a pretraining framework that uses a fast distillation technique to transfer knowledge from large pretrained models to smaller, more compact ViT architectures.

Key Contributions and Methodology

This research is grounded in two primary contributions:

  1. Fast Pretraining Distillation Framework: A novel approach for pretraining small hierarchical vision transformers. The framework stores a sparse representation of the teacher model's outputs to save memory and accelerate computation: logits produced by large teacher models are precomputed, sparsified to their largest entries, and saved to disk, then distilled into tiny student models, which can thus benefit from massive pretraining data without excessive training overhead. Unlike traditional methods that apply distillation at later stages, distillation here takes place during the pretraining phase, and the stored teacher outputs obviate the need to run the teacher on the fly during student training.
  2. Tiny Vision Transformer Architectures: These are derived by contracting larger architectures under computation and parameter constraints. The design adopts practices such as MBConv blocks in the earlier stages to introduce convolutional inductive bias, conserving resources while maintaining robustness.
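The sparse-logit distillation in contribution 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the top-k value, the temperature, and the choice to reconstruct a dense target with zeros outside the top-k entries are all assumptions based on common distillation practice.

```python
import torch
import torch.nn.functional as F

def save_sparse_logits(teacher_logits, k=10):
    """Keep only the top-k teacher logits per image; the rest are dropped,
    cutting storage from num_classes floats per image down to k values + k indices."""
    values, indices = teacher_logits.topk(k, dim=-1)
    return values, indices  # in the paper's pipeline these are written to disk once

def sparse_distillation_loss(student_logits, values, indices, temperature=1.0):
    """Rebuild a soft-label distribution from the stored sparse logits and
    match the student to it with a soft cross-entropy loss."""
    teacher_probs = F.softmax(values / temperature, dim=-1)       # (B, k)
    # One simple reconstruction choice (an assumption here): place the top-k
    # probabilities at their class indices and leave the other classes at zero.
    dense = torch.zeros_like(student_logits).scatter_(-1, indices, teacher_probs)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return -(dense * log_student).sum(dim=-1).mean()

# Usage: precompute once with the teacher, then train the student for many epochs
# without ever loading the teacher network into memory.
teacher_logits = torch.randn(4, 1000)            # batch of 4, 1000 classes
values, indices = save_sparse_logits(teacher_logits, k=10)
student_logits = torch.randn(4, 1000, requires_grad=True)
loss = sparse_distillation_loss(student_logits, values, indices)
```

The storage saving is the point: for ImageNet-scale class counts, keeping k values and indices instead of the full logit vector shrinks the precomputed teacher outputs by roughly two orders of magnitude.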
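The model-contraction idea in contribution 2, shrinking a large hierarchical model along a few scaling knobs until it meets a budget, might look like the sketch below. The factor names and the base configuration values are illustrative assumptions, not the paper's exact search space.

```python
def contract_model(base_cfg, width_factor, depth_factor):
    """Scale down per-stage embedding widths and block depths of a
    hierarchical ViT configuration (illustrative sketch only)."""
    return {
        "embed_dims": [max(8, int(d * width_factor)) for d in base_cfg["embed_dims"]],
        "depths": [max(1, round(n * depth_factor)) for n in base_cfg["depths"]],
        "num_heads": base_cfg["num_heads"],  # kept fixed in this toy version
    }

# A Swin-like 4-stage base configuration (hypothetical values).
base = {"embed_dims": [96, 192, 384, 768],
        "depths": [2, 2, 6, 2],
        "num_heads": [3, 6, 12, 24]}
tiny = contract_model(base, width_factor=0.5, depth_factor=1.0)
```

In the paper the contraction is driven by explicit computation and parameter constraints; a real implementation would sweep such factors and keep the candidates that fit the budget.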

Experimental Validation

The paper presents comprehensive experiments to validate the efficacy of TinyViT. Here are some of the benchmark results:

  • ImageNet-1k Classification: TinyViT achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to larger models such as Swin-B, which uses 4.2 times more parameters. When evaluated at higher image resolutions, its accuracy scales up to 86.5%, marginally exceeding Swin-L with only 11% of the parameter count.
  • Efficiency: The fast pretraining distillation framework significantly reduces computational cost; pretraining with this method is up to 29.8% more efficient than conventional techniques, since precomputed teacher logits remove the need to repeatedly run images through large teacher networks during training.
  • Transferability to Downstream Tasks: Beyond classification, TinyViT shows strong transfer across a range of downstream tasks (e.g., object detection with Cascade R-CNN) compared to its contemporaries.

Theoretical and Practical Implications

Practically, TinyViT provides a pathway to deploy advanced vision models on devices with limited computational resources, potentially democratizing access to sophisticated AI technologies across various application domains. Theoretically, it underscores a shift towards knowledge transfer methodologies that maximize the utility of pretraining datasets without incurring the hefty resource demands typical of large-scale deep learning models.

Future Directions

The study opens up several avenues for future exploration:

  • Model Contraction Optimization: While model contraction proved effective, further refinement could enhance efficiency and accuracy trade-offs.
  • Exploration of Distillation Techniques: Integrating more complex augmentation strategies and optimizing logit storage methodologies could enhance distillation framework robustness.
  • Expanded Dataset Utilization: Extending pretraining to exploit even larger, more diverse datasets could further enhance TinyViT's generalization abilities.

In summary, this work presents an impactful stride in vision transformers by marrying computational efficiency with proficient knowledge transfer techniques, showing promise for a variety of real-world applications.
