
ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

Published 15 Nov 2023 in cs.CV and cs.LG | arXiv:2311.09215v3

Abstract: Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.


Summary

  • The paper challenges ImageNet-centric evaluation by demonstrating that similar accuracy metrics conceal substantial differences in error types, calibration, and transferability.
  • The study reveals that CLIP models, benefiting from broader training data, show enhanced robustness and shape bias compared to their supervised counterparts.
  • Using detailed comparisons of ConvNeXt and ViT architectures, the research advocates for richer evaluation metrics suited to diverse, application-specific scenarios.

Vishniakov et al. address the inadequacies of using ImageNet accuracy as the sole metric for evaluating computer vision models. The study critically analyzes ConvNet and Vision Transformer (ViT) architectures across supervised and CLIP training paradigms, revealing that models with similar ImageNet accuracies exhibit substantial differences in characteristics and behaviors. This research underscores the necessity for nuanced evaluation metrics to inform model selection tailored to specific application contexts.

The authors present a meticulous comparative study involving ConvNeXt and ViT models, each trained in supervised and CLIP settings. Although these models possess nearly identical ImageNet-1K validation accuracies, their behaviors diverge significantly across various dimensions, such as error types, calibration, transferability, and feature invariance. This diversity cannot be captured through traditional benchmarks alone, suggesting that relying solely on ImageNet may obscure critical differences relevant to practical implementation.

Key findings show that CLIP models are notably robust and transferable. Despite similar ImageNet accuracies, CLIP's vision encoder performs markedly better on diverse downstream datasets, an advantage attributed to the greater diversity of its training data; CLIP models also make fewer classification errors and exhibit a stronger shape bias than their supervised counterparts. Supervised models, however, are better calibrated and outperform CLIP models on robustness benchmarks closely tied to ImageNet.
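Shape bias is commonly measured on cue-conflict images (an image whose shape suggests one class and whose texture suggests another) as the fraction of shape-consistent decisions among all decisions that match either cue. A minimal sketch of that metric, with hypothetical label lists, not the authors' evaluation code:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions that follow shape rather than
    texture. Predictions matching neither cue are ignored."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Three cue-conflict images: two predictions follow the shape cue,
# one follows the texture cue, so shape bias is 2/3.
print(shape_bias(["cat", "dog", "elephant"],
                 ["cat", "dog", "bottle"],        # shape cues
                 ["clock", "knife", "elephant"])) # texture cues
```

A value near 1.0 indicates shape-driven decisions (closer to human vision), while a value near 0.0 indicates texture-driven decisions.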

ConvNeXt emerges as a strong contender in this study. The supervised ConvNeXt shows transferability closely rivaling that of CLIP models and performs well on robustness and synthetic benchmarks, while also exhibiting favorable calibration and invariance characteristics. These observations suggest that supervised ConvNeXt suits tasks whose distributions closely resemble ImageNet, whereas CLIP models offer advantages under broader domain shifts.
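Calibration here refers to how well a model's stated confidence matches its actual accuracy, commonly summarized by Expected Calibration Error (ECE): predictions are binned by confidence, and the per-bin gap between accuracy and mean confidence is averaged, weighted by bin size. A minimal sketch assuming per-sample top-1 confidences and correctness flags (not the paper's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# An overconfident model: high confidences, but one of four is wrong.
print(expected_calibration_error([0.95, 0.9, 0.8, 0.6], [1, 1, 1, 0]))
```

Lower ECE means better calibration; a model whose 80%-confident predictions are right 80% of the time scores near zero.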

This work sets a compelling precedent for future research, advocating evaluation paradigms that extend beyond ImageNet. The nuanced insights into model behavior underscore the need to rethink conventional benchmarks and to build datasets more representative of real-world applications. Future directions might explore scalable architectures with richer feature representations, further disambiguating model performance with context-specific metrics beyond classification accuracy alone.

In conclusion, Vishniakov et al.'s analysis challenges the status quo of ImageNet-centric evaluation, proposing a broader, context-aware approach to model assessment. Their findings invite researchers to make deliberate, application-tailored model choices, fostering innovation in computer vision and expanding the utility of these models across diverse applications.
