- The paper presents a novel hybrid architecture that fuses convolution with self-attention, enhancing model performance across varied data sizes.
- It employs a vertically stacked design that leverages convolution for local feature extraction in early stages and self-attention for global context in deeper layers.
- The CoAtNet models achieve state-of-the-art results on ImageNet-1K under several pre-training regimes (ImageNet-21K, JFT), while requiring significantly less training data and compute than comparable Transformer models.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
The paper "CoAtNet: Marrying Convolution and Attention for All Data Sizes" introduces a family of hybrid neural network models that leverage both convolutional layers and self-attention mechanisms. Through a systematic study on the generalizability and model capacity of these hybrid structures, the authors—Zihang Dai, Hanxiao Liu, Quoc V. Le, and Mingxing Tan from Google Research—propose design principles that enhance performance across various data size regimes, achieving state-of-the-art (SOTA) accuracy on multiple image classification benchmarks.
Motivation and Background
The research is motivated by the distinct yet complementary advantages presented by Convolutional Neural Networks (ConvNets) and Transformer models. ConvNets have dominated the computer vision landscape since the advent of AlexNet, owing to their strong inductive biases favoring local spatial hierarchies. In contrast, Transformers, renowned for their success in natural language processing, have demonstrated superior model capacity and scalability, particularly exemplified through Vision Transformers (ViT). However, vanilla Transformers have exhibited shortcomings in generalization when faced with limited data, which motivates this investigation into a hybrid approach.
Key Insights
The authors anchor their model design on two fundamental insights:
- Unifying Depthwise Convolution and Self-Attention via Relative Attention: A depthwise convolution kernel depends only on the relative offset between positions, so it can be folded into the self-attention logits as a relative position bias. The resulting "relative attention" retains translation equivariance, a desirable property for generalization.
- Vertically Stacked Layer Architectures: By principled vertical stacking of convolutional layers and attention layers, CoAtNet harnesses both the strong inductive biases of ConvNets in early stages and the high-capacity, global receptive fields offered by Transformers in later stages.
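The unification in the first insight can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation of the pre-normalization relative attention described above, for a 1-D sequence with a single head; the function name and the scalar-per-offset bias `w_rel` are simplifications for exposition, not the paper's exact code.

```python
import numpy as np

def relative_attention(x, w_rel):
    """Self-attention with a static relative-position bias folded into the logits.

    x:     (L, d) sequence of features.
    w_rel: (2L-1,) learned scalar bias per relative offset i-j,
           playing the role of a depthwise convolution kernel.
    """
    L, d = x.shape
    # Input-dependent (attention) term.
    logits = x @ x.T / np.sqrt(d)
    # Input-independent (convolution) term: add w_{i-j} to each logit.
    idx = np.arange(L)
    rel = idx[:, None] - idx[None, :] + (L - 1)  # map offsets to [0, 2L-2]
    logits = logits + w_rel[rel]
    # Softmax over j, then weighted sum of values (here values = x).
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ x
```

With `w_rel` set to zero this reduces to plain dot-product self-attention; with the attention term suppressed it degenerates to a (normalized) depthwise convolution, which is exactly the unification the paper exploits.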
Architecture and Design
Merging Convolution and Self-Attention
The hybrid computational block integrates depthwise convolution into self-attention by adding a static, input-independent relative-position term (playing the role of the convolution kernel) to the input-dependent attention logits before the softmax. Each attention weight thus blends a translation-equivariant static kernel with dynamic, content-based attention, letting the block capture high-level relational interactions while retaining convolution's generalization benefits, particularly under data constraints.
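Concretely, the pre-normalization variant of relative attention used in the paper can be written as follows, where $x_i$ is the feature at position $i$ and $w_{i-j}$ is a learned scalar for each relative offset:

$$
y_i = \sum_{j} \frac{\exp\!\left(x_i^\top x_j + w_{i-j}\right)}{\sum_{k} \exp\!\left(x_i^\top x_k + w_{i-k}\right)}\, x_j
$$

Setting all $w_{i-j}=0$ recovers standard self-attention, while dropping the $x_i^\top x_j$ term yields a normalized depthwise convolution.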
Vertical Layout Design
The paper examines several vertical stacking strategies, culminating in the multi-stage CoAtNet network. The architecture spans five stages (S0 to S4): a convolutional stem (S0), two MBConv stages (S1, S2), and two Transformer stages with relative attention (S3, S4), with each stage halving the spatial resolution. Key findings indicate that convolution layers are better suited to processing local patterns in the early, high-resolution stages, while self-attention layers capture global context efficiently once the feature maps have been downsampled.
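The vertical layout above can be summarized in a small configuration sketch. The stage types follow the paper's "C-C-T-T" variant; the block counts shown are illustrative (they happen to match the smallest CoAtNet configuration), and the helper function is a hypothetical utility, not part of the paper's code.

```python
# "C-C-T-T" layout: convolution (MBConv) early, Transformer blocks late.
# Block counts are illustrative, not a definitive specification.
stages = [
    ("S0", "conv_stem",   2),  # stride-2 stem halves the input resolution
    ("S1", "mbconv",      2),  # local features, strong inductive bias
    ("S2", "mbconv",      3),
    ("S3", "transformer", 5),  # global receptive field at lower resolution
    ("S4", "transformer", 2),
]

def spatial_size(input_hw, num_downsamples):
    # Each of the five stages downsamples by 2, so self-attention in
    # S3/S4 runs on small feature grids, keeping its cost manageable.
    h, w = input_hw
    return h // 2**num_downsamples, w // 2**num_downsamples

print(spatial_size((224, 224), 5))  # -> (7, 7)
```

The design choice this illustrates: placing attention only after five 2x downsamplings means a 224x224 input reaches the global-attention stages at 7x7, where quadratic attention cost is cheap.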
Experimental Results
The constructed CoAtNet models demonstrate SOTA accuracy across different benchmarks:
- ImageNet-1K: Without extra data, CoAtNet achieves 86.0% top-1 accuracy, matching or surpassing prior ConvNet (NFNet) and ViT-based models.
- ImageNet-21K: Pre-trained on ImageNet-21K, CoAtNet attains 88.56% top-1 accuracy on ImageNet-1K, matching ViT-Huge pre-trained on the 23x larger JFT-300M dataset.
- JFT-300M and JFT-3B: Pre-trained on JFT-300M and JFT-3B respectively, CoAtNet models reach 89.77% and 90.88% top-1 accuracy, setting a new state of the art while using significantly less computation than ViT-G/14.
Implications and Future Directions
The results underscore the advantages of balancing convolutional inductive biases with the flexible capacity of transformers. This hybrid approach not only scales effectively with data but also generalizes robustly, indicating a promising direction for future AI research in neural network architecture.
Future work could extend CoAtNet’s application to broader domains such as object detection and semantic segmentation. Exploring further optimizations within each hybrid block and vertical stage can also unveil additional performance improvements, potentially setting new benchmarks in the field.
In conclusion, CoAtNet offers a compelling framework for integrating the best of both convolutional and transformer worlds, achieving high accuracy with enhanced efficiency, and setting a new paradigm for hybrid neural network architectures.