
ConvNets Match Vision Transformers at Scale

Published 25 Oct 2023 in cs.CV, cs.LG, and cs.NE | arXiv:2310.16764v1

Abstract: Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Summary

  • The paper demonstrates that ConvNets, when scaled with ample compute and data, can match the performance of Vision Transformers in vision tasks.
  • It employs NFNet architectures pre-trained on the JFT-4B dataset and evaluates Top-1 accuracy to benchmark ConvNet effectiveness.
  • The empirical results challenge the common bias towards transformers, urging a re-assessment of model choices for large-scale applications.

ConvNets Match Vision Transformers at Scale

The paper "ConvNets Match Vision Transformers at Scale," authored by Samuel L. Smith, Andrew Brock, Leonard Berrada, and Soham De from Google DeepMind, critically examines the prevalent assertion in computer vision that Vision Transformers (ViTs) surpass Convolutional Neural Networks (ConvNets) at scale. This study explores an extensive empirical comparison between a ConvNet architecture and ViTs, utilizing substantial computational resources for pre-training on a large-scale dataset.

Introduction

ConvNets have been foundational in the field of computer vision, achieving early successes and dominating benchmarks for nearly a decade. However, with the advent of ViT, there has been a shift towards transformer-based architectures for image recognition tasks. The current consensus suggests that ViTs exhibit superior scaling properties when trained on substantial datasets collected from the web. This paper challenges this prevailing view by rigorously evaluating the NFNet model family, a state-of-the-art ConvNet architecture, pre-trained on the JFT-4B dataset to compare performance against ViTs under equivalent computational budgets.

Methodology

The NFNet models, including various configurations from F0 to F7+, were pre-trained on the JFT-4B dataset, encompassing around 4 billion labeled images. The pre-training was carried out across various compute budgets extending from 0.4k to 110k TPU-v4 core hours. The training methodology adhered to established practices using SGD with Momentum, Adaptive Gradient Clipping (AGC), and distinct image resolutions during training and evaluation.
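Adaptive Gradient Clipping, one of the training techniques listed above, rescales each gradient when its norm grows too large relative to the norm of the parameter it updates. The sketch below is a simplified, tensor-level version (the NFNet papers apply it unit-wise, i.e. per output row); the function name and the default clipping threshold of 0.01 are illustrative choices, not values taken from this paper.

```python
import numpy as np

def adaptive_gradient_clip(param, grad, clip=0.01, eps=1e-3):
    """Simplified Adaptive Gradient Clipping: rescale `grad` so that
    ||grad|| <= clip * ||param||. Operates on the whole tensor rather
    than unit-wise rows, for clarity."""
    # Frobenius norms; the eps floor avoids problems for zero-initialised params.
    p_norm = max(np.linalg.norm(param), eps)
    g_norm = np.linalg.norm(grad)
    max_norm = clip * p_norm
    # Only rescale when the gradient norm exceeds the allowed ratio.
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)
    return grad

# Example: a gradient far larger than its parameter is scaled down.
w = np.ones((4, 4))
g = 100.0 * np.ones((4, 4))
g_clipped = adaptive_gradient_clip(w, g)
print(np.linalg.norm(g_clipped) / np.linalg.norm(w))  # ~0.01
```

Gradients already within the threshold pass through unchanged, so the clipping only intervenes on unstable update steps.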

A clear log-log scaling law emerged between held-out validation loss and compute budget, akin to the trends reported for transformer language models. The study also examined the optimal epoch budget and learning rate across model sizes, finding that both shift systematically as the compute budget grows, echoing earlier findings in language modeling.
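A log-log scaling law of this kind says the held-out loss follows a power law in compute, loss ≈ a·C^(-b), which becomes a straight line after taking logarithms of both axes. The sketch below fits such a law with a simple linear regression; the loss values are synthetic and illustrative only, not the paper's measurements (only the compute-budget endpoints of 0.4k and 110k TPU-v4 core hours come from the text).

```python
import numpy as np

# Synthetic (compute, loss) points obeying loss = a * C**(-b).
# The exponent and prefactor here are made up for illustration.
a_true, b_true = 5.0, 0.2
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])  # TPU-v4 core hours
loss = a_true * compute ** (-b_true)

# A power law is linear in log-log space:
#   log(loss) = log(a) - b * log(C)
# so a degree-1 polynomial fit recovers the parameters.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b_est, a_est = -slope, np.exp(intercept)
print(b_est, a_est)  # recovers b=0.2, a=5.0
```

On real data the points scatter around the line, and the fitted slope summarises how quickly extra compute buys lower loss.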

Results

Upon fine-tuning the pre-trained NFNets on ImageNet, the performance metrics revealed that NFNets were on par with ViT counterparts in terms of Top-1 accuracy. Notably, the NFNet-F7+ model achieved a Top-1 accuracy of 90.4% after fine-tuning with repeated augmentation, a significant improvement over previous NFNet results obtained without additional data. This parity held at comparable compute budgets, highlighting the efficacy of ConvNets at scale.
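Top-1 accuracy, the metric quoted throughout, is simply the fraction of examples whose highest-scoring class matches the ground-truth label. A minimal sketch (the function name and the tiny example batch are illustrative):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of rows whose argmax class equals the integer label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

# Toy batch of 3 examples over 2 classes: predictions are classes 1, 0, 1.
logits = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.3, 0.7]])
labels = np.array([1, 0, 0])
print(top1_accuracy(logits, labels))  # 2 of 3 correct -> 0.666...
```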

Analysis

The results underscore that, with sufficient computational resources and dataset sizes, ConvNets can match the performance of ViTs. This finding contests the prevailing assumption that ViTs inherently possess superior scaling properties. The linear trend observed in the log-log scaling between validation loss and compute budget prompts a re-evaluation of current biases towards transformer architectures.

Discussion

The study's implications are twofold. Practically, it suggests that researchers can still rely on ConvNets for competitive performance in large-scale vision tasks, provided adequate compute and data. Theoretically, it challenges the narrative favoring transformers, advocating for a more nuanced view that considers the critical role of compute and data irrespective of the model architecture. Future developments in AI will likely explore hybrid models, leveraging the strengths of both ConvNets and transformers.

Conclusion

In conclusion, the paper robustly demonstrates that ConvNets can indeed match the performance of ViTs at scale, challenging the orthodoxy in contemporary computer vision research. By providing rigorous empirical evidence, the study invites the research community to re-assess the comparative advantages of these architectures, potentially fostering a more balanced approach to developing future AI models.
