ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
Abstract: Modern computer vision offers practitioners a wide variety of models, and choosing among them for a specific application can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture the performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.
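One of the evaluation axes the abstract mentions is output calibration: whether a model's predicted confidence matches its actual accuracy. A standard way to quantify this is the Expected Calibration Error (ECE), which bins predictions by confidence and averages the confidence-accuracy gap per bin, weighted by bin population. The sketch below is a minimal, generic ECE implementation for illustration; it is not the paper's code, and the function name and 15-bin default are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected Calibration Error: population-weighted average gap
    between mean confidence and accuracy within confidence bins.

    confidences: per-sample max softmax probability.
    correct: per-sample 0/1 indicator that the top-1 prediction is right.
    (Generic illustration; not the paper's implementation.)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean confidence in this bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Toy example: two samples at 0.9 confidence (both correct) and two at
# 0.6 confidence (one correct), giving an ECE of 0.1.
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0]))
```

An ECE of 0 means confidences are perfectly calibrated; two models with identical ImageNet top-1 accuracy can still have very different ECE, which is exactly the kind of behavioral difference a single accuracy number hides.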