When Does Contrastive Visual Representation Learning Work?

Published 12 May 2021 in cs.CV and cs.LG | (2105.05837v2)

Abstract: Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data quantity, data domain, data quality, and task granularity, we provide new insights into the necessary conditions for successful self-supervised learning. Our key findings include observations such as: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining, and (iv) contrastive learning lags far behind supervised learning on fine-grained visual classification tasks.

Summary

The paper shows that increasing pretraining data beyond 500k images provides only modest performance gains.
The paper demonstrates that pretraining on images from the same domain as the downstream task works best, and that pooling images from several domains does not improve on it.
The paper reveals that contrastive learning is sensitive to reduced image resolution and lags behind supervised learning on fine-grained classification tasks.

Insights into Contrastive Visual Representation Learning

The paper "When Does Contrastive Visual Representation Learning Work?" analyzes the conditions under which contrastive self-supervised learning (SSL) produces useful visual representations. Working with four diverse large-scale datasets, the authors vary data quantity, data domain, data quality, and task granularity to ask whether the success of SSL on ImageNet carries over to other datasets.

Key Findings

1. Data Quantity:

The research finds that for datasets with an ImageNet-like scale, using more than 500k images for pretraining yields only modest benefits in performance. Specifically, reducing the pretraining set size from 1M to 500k images results in only a 1-2% drop in classification performance.
When labeled data is limited, initializing from a self-supervised representation works better than training from scratch. When large amounts of labeled data are available, the performance gap between self-supervised and fully supervised models narrows.
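
A data-quantity comparison of this kind needs pretraining sets of different sizes drawn from the same pool. The sketch below is one hedged way to do that, not the authors' procedure: the helper name nested_subsets, the seed, and the choice to make each smaller set a subset of every larger one are all assumptions made for illustration. The pool is a toy stand-in scaled down by 1000x, so 500 ids play the role of 500k images.

```python
import random


def nested_subsets(image_ids, sizes, seed=0):
    """Return {size: ids}; every smaller subset is a prefix of the larger ones.

    Hypothetical helper: nesting keeps the comparison between sizes from
    being confounded by which images happened to be sampled.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)  # one fixed random order shared by all sizes
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}


# Toy pool: 1000 ids standing in for a 1M-image unlabeled dataset.
pool = [f"img_{i:04d}" for i in range(1000)]
subsets = nested_subsets(pool, sizes=[125, 250, 500, 1000])

for size in sorted(subsets):
    print(size, subsets[size][:2])
```

Each subset would then be used to pretrain a separate model, with the downstream evaluation held fixed across sizes.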

2. Domain Specificity:

The study reveals that contrastive learning benefits significantly from domain-specific pretraining. Models pretrained on data from the same domain as the downstream task perform notably better than those pretrained on a different domain.
Surprisingly, adding images from other domains to the pretraining set does not improve performance, which suggests that simply pooling datasets does not make the representations learned by current SSL methods more general.

3. Data Quality:

Pretraining on corrupted images can hurt SSL performance considerably, and reduced resolution (downsampling) is especially damaging. This sensitivity suggests a potential limitation in using low-quality datasets for SSL.
By contrast, high-frequency corruptions such as JPEG compression or salt-and-pepper noise have a relatively minor effect on representation learning.
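
To make the two kinds of corruption concrete, here is a minimal sketch on a tiny grayscale image stored as nested lists. It is an illustration under stated assumptions, not the paper's pipeline: the function names, the block-averaging downsampler, the nearest-neighbour upsampling, and the noise probability are all chosen here for clarity, and the summary above does not give the actual corruption settings.

```python
import random


def downsample_then_upsample(image, factor):
    """Average each factor x factor block, then copy the mean back to every
    pixel of that block (nearest-neighbour upsampling to the original size)."""
    h, w = len(image), len(image[0])
    assert h % factor == 0 and w % factor == 0, "size must divide evenly"
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            block = [image[y][x]
                     for y in range(by, by + factor)
                     for x in range(bx, bx + factor)]
            mean = sum(block) / len(block)
            for y in range(by, by + factor):
                for x in range(bx, bx + factor):
                    out[y][x] = mean
    return out


def salt_and_pepper(image, prob, seed=0):
    """Set each pixel to 0.0 or 1.0 with total probability `prob`."""
    rng = random.Random(seed)
    out = []
    for row in image:
        new_row = []
        for value in row:
            r = rng.random()
            if r < prob / 2:
                new_row.append(0.0)    # "pepper"
            elif r < prob:
                new_row.append(1.0)    # "salt"
            else:
                new_row.append(value)  # pixel left untouched
        out.append(new_row)
    return out


# A 4x4 checkerboard: the finest detail a pixel grid can hold.
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
blurred = downsample_then_upsample(checker, factor=2)
noisy = salt_and_pepper(checker, prob=0.1)
```

On the checkerboard, block averaging turns every pixel into the same gray value, so the pattern is erased entirely, while salt-and-pepper noise changes only a fraction of the pixels and leaves the rest exactly as they were.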

4. Task Granularity:

The research highlights a significant performance gap between self-supervised and supervised learning on fine-grained classification tasks. As the task becomes more fine-grained, SSL representations fall further behind supervised ones, which suggests that the contrastive loss may not capture the subtle features these tasks depend on.
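
For readers unfamiliar with the loss in question, the sketch below shows a simplified, one-directional InfoNCE-style objective, one common form of contrastive loss. It is an assumption that this form is representative: the summary above does not say which loss variant or temperature the paper uses, and the function names and the temperature of 0.1 are illustrative. Each anchor embedding should be most similar to the embedding of another view of the same image (its positive), with the other images in the batch acting as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i] and every
    other entry of `positives` serves as a negative for that anchor."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with target index i
    return total / len(a)


views_a = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(views_a, [[1.0, 0.0], [0.0, 1.0]])     # positives aligned
mismatched = info_nce(views_a, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(matched < mismatched)
```

Nothing in this objective uses class labels: every other image in the batch is treated as a negative, whether or not it belongs to the same category as the anchor.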

Implications and Future Directions

This study underlines several pathways for advancing self-supervised learning approaches:

Optimization of Pretraining Data: Given the diminishing returns beyond 500k images, future work should focus on the quality and domain-specificity of pretraining data rather than sheer quantity. This includes curating datasets that balance image diversity against relevance to downstream tasks.
Domain-Specific Augmentation Strategies: Current SSL methods were developed under assumptions that hold for datasets like ImageNet. Tailoring data augmentation strategies to other settings (such as fine-grained categories or low-quality images) could narrow the performance gap observed on non-standard tasks.
Robustness and Generalization Improvements: Techniques are needed that make SSL more robust to variations in image quality and that let models generalize across distinct domains without retraining.
Exploration of New SSL Frameworks: The observed limitations in task granularity suggest potential directions for innovating beyond contrastive frameworks, possibly by integrating additional learning signals or losses that encourage fine-grained feature learning.

In conclusion, the paper clarifies when contrastive SSL works across datasets and tasks and where it falls short. It challenges the research community to rethink current SSL strategies in favor of more adaptable and robust frameworks that extend to broader application domains.
