
ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Published 15 Nov 2023 in cs.CV and cs.LG | arXiv:2311.09215v3

Abstract: Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.


Summary

  • The paper challenges ImageNet-centric evaluation by demonstrating that similar accuracy metrics conceal substantial differences in error types, calibration, and transferability.
  • The study reveals that CLIP models, benefiting from broader training data, show enhanced robustness and shape bias compared to their supervised counterparts.
  • Using detailed comparisons of ConvNeXt and ViT architectures, the research advocates for richer evaluation metrics suited to diverse, application-specific scenarios.

Vishniakov et al. address the inadequacies of using ImageNet accuracy as the sole metric for evaluating computer vision models. The study critically analyzes ConvNet and Vision Transformer (ViT) architectures across supervised and CLIP training paradigms, revealing that models with similar ImageNet accuracies exhibit substantial differences in characteristics and behaviors. This research underscores the necessity for nuanced evaluation metrics to inform model selection tailored to specific application contexts.

The authors present a meticulous comparative study involving ConvNeXt and ViT models, each trained in supervised and CLIP settings. Although these models possess nearly identical ImageNet-1K validation accuracies, their behaviors diverge significantly across various dimensions, such as error types, calibration, transferability, and feature invariance. This diversity cannot be captured through traditional benchmarks alone, suggesting that relying solely on ImageNet may obscure critical differences relevant to practical implementation.

Key findings show that CLIP models are notably robust and transferable. Despite similar ImageNet accuracies, CLIP's vision encoder performs markedly better on diverse downstream datasets, an advantage attributed to the greater diversity of its training data; CLIP models also make fewer classification errors and exhibit a stronger shape bias than their supervised counterparts. Supervised models, however, are better calibrated and outperform CLIP models on robustness benchmarks closely tied to ImageNet.
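Shape bias is commonly measured on cue-conflict images (an image whose shape suggests one class and whose texture suggests another) as the fraction of shape-consistent decisions among all decisions that match either cue. A minimal sketch of that metric, with hypothetical label lists, not the authors' evaluation code:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions that follow shape rather than
    texture. Predictions matching neither cue are ignored."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Three cue-conflict images: two predictions follow the shape cue,
# one follows the texture cue, so shape bias is 2/3.
print(shape_bias(["cat", "dog", "elephant"],
                 ["cat", "dog", "bottle"],        # shape cues
                 ["clock", "knife", "elephant"])) # texture cues
```

A value near 1.0 indicates shape-driven decisions (closer to human vision), while a value near 0.0 indicates texture-driven decisions.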

ConvNeXt emerges as a strong contender in this study. The supervised ConvNeXt shows transferability closely rivaling that of CLIP models and performs well on robustness and synthetic benchmarks, while also exhibiting favorable calibration and invariance characteristics. These observations suggest that supervised ConvNeXt suits tasks whose distributions closely resemble ImageNet, whereas CLIP models offer advantages under broader domain shifts.
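Calibration here refers to how well a model's stated confidence matches its actual accuracy, commonly summarized by Expected Calibration Error (ECE): predictions are binned by confidence, and the per-bin gap between accuracy and mean confidence is averaged, weighted by bin size. A minimal sketch assuming per-sample top-1 confidences and correctness flags (not the paper's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# An overconfident model: high confidences, but one of four is wrong.
print(expected_calibration_error([0.95, 0.9, 0.8, 0.6], [1, 1, 1, 0]))
```

Lower ECE means better calibration; a model whose 80%-confident predictions are right 80% of the time scores near zero.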

This work sets a compelling precedent for future research, advocating evaluation paradigms that extend beyond ImageNet. The nuanced insights into model behavior underscore the need to rethink conventional benchmarks and to build datasets more representative of real-world applications. Future directions might explore scalable architectures with richer feature representations, further disambiguating model performance with context-specific metrics beyond classification accuracy alone.

In conclusion, Vishniakov et al.'s analysis challenges the status quo of ImageNet-centric evaluation, proposing a broader, context-aware approach to model assessment. Their findings invite researchers to make deliberate, application-tailored model choices, fostering innovation in computer vision and expanding the utility of these models across diverse applications.
