
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Published 12 Jul 2022 in cs.CV (arXiv:2207.05501v4)

Abstract: Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far from satisfactory. To address this, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT

Citations (125)

Summary

  • The paper presents Next-ViT, an architecture that integrates novel convolution and transformer blocks to enhance both local and global feature extraction.
  • It employs a Next Hybrid Strategy that strategically combines convolutional and transformer modules across all network stages for robust feature capture.
  • Empirical results demonstrate that Next-ViT outperforms leading CNNs and ViTs on ImageNet, COCO, and ADE20K tasks while maintaining low inference latency.

Overview of "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios"

Vision Transformers (ViTs) have demonstrated significant success across a range of computer vision tasks, yet their complex architectures and demanding computational requirements have hindered efficient deployment in real industrial scenarios, particularly when compared to Convolutional Neural Networks (CNNs). This paper addresses this efficiency gap with the introduction of "Next-ViT," an innovative vision Transformer architecture that effectively balances the trade-off between latency and accuracy.

Key Innovations:

The paper introduces several core components aimed at enhancing both computational efficiency and performance:

  1. Next Convolution Block (NCB): Designed to capture local features efficiently, the NCB utilizes a novel Multi-Head Convolutional Attention (MHCA) mechanism. This design retains deployment-friendly operations while achieving performance comparable to traditional Transformer blocks.
  2. Next Transformer Block (NTB): The NTB complements the NCB by capturing global, long-range dependencies. It integrates Efficient Multi-Head Self Attention (E-MHSA) and MHCA to blend multi-frequency information, significantly boosting model capability without sacrificing efficiency.
  3. Next Hybrid Strategy (NHS): The NHS strategically combines NCBs and NTBs across all network stages, unlike conventional methods that only concentrate Transformer blocks in the deeper layers. This strategy ensures a robust capture of both local and global features throughout the network, enhancing performance in downstream tasks like segmentation and detection.

Empirical Results:

Extensive evaluations demonstrate that Next-ViT achieves superior latency/accuracy trade-offs on multiple hardware platforms including TensorRT and CoreML. Notably, it outperforms existing CNNs, ViTs, and hybrid models across various datasets and tasks:

  • On ImageNet-1K, Next-ViT outperforms several well-known models such as ResNet101 in terms of accuracy, while maintaining efficient inference speeds.
  • For COCO detection and ADE20K segmentation, Next-ViT achieves significant improvements in both mAP and mIoU metrics under similar latency conditions.
  • The model matches the performance of state-of-the-art Transformers such as CSWin, but with substantially reduced inference times, confirming its practical applicability.

Broader Implications:

This work carries both theoretical and practical implications:

  • Theoretically, the integration of multi-frequency signal processing in Transformers may inspire further research into hybrid architectures that can efficiently harness diverse feature representations.
  • Practically, Next-ViT's deployment efficiency on diverse hardware platforms makes it a promising candidate for widespread use in mobile and server-based applications, potentially accelerating the adoption of Transformers in industry-grade applications.

Future Directions:

The Next-ViT framework opens several avenues for future exploration. Continued optimization of the Next Hybrid Strategy could refine the balance between computational demand and model performance. Moreover, extending Next-ViT's principles to other domains, such as natural language processing or even more specialized vision applications, may yield additional insights.

In summary, Next-ViT represents a meaningful step forward in the design of efficient, high-performance vision Transformers, providing valuable insights and tools for both academia and industry to deploy more capable models in practical scenarios.
