
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Published 19 Feb 2024 in cs.CV and cs.LG | arXiv:2402.11816v3

Abstract: Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly those requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning which inherently captures one single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are exclusively selected from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves the previously well-learned features by cross-stage representation integration, integrating features across all stages to form final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements up to threefold on specific attributes in the recently proposed MMVP benchmark.

References (30)
  1. The hidden uniform cluster prior in self-supervised learning. arXiv preprint arXiv:2210.07277, 2022.
  2. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  3. Reducing predictive feature suppression in resource-constrained contrastive image-caption retrieval. Transactions on Machine Learning Research, 2023.
  4. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  5. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp.  1597–1607. PMLR, 2020.
  6. Intriguing properties of contrastive losses. Advances in Neural Information Processing Systems, 34:11834–11845, 2021.
  7. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  15750–15758, 2021.
  8. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.
  9. Bootstrap your own latent: a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
  10. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  11. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  9729–9738, 2020.
  12. What shapes feature representations? exploring datasets, architectures, and training. Advances in Neural Information Processing Systems, 33:9995–10006, 2020.
  13. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
  14. Temperature schedules for self-supervised contrastive methods on long-tail data. arXiv preprint arXiv:2303.13664, 2023.
  15. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  16. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1):857–876, 2021.
  17. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  18. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  19. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.  8821–8831. PMLR, 2021.
  20. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  21. Can contrastive learning avoid shortcut solutions? Advances in neural information processing systems, 34:4974–4986, 2021.
  22. Everything at once: multi-modal fusion transformer for video retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 20020–20029, 2022.
  23. Feature dropout: Revisiting the role of augmentations in contrastive learning. arXiv preprint arXiv:2212.08378, 2022.
  24. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
  25. Unsupervised semantic segmentation by contrasting object mask proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10052–10062, 2021.
  26. What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659, 2020.
  27. Which features are learnt by contrastive learning? on the role of simplicity bias in class collapse and feature suppression. arXiv preprint arXiv:2305.16536, 2023.
  28. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
  29. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  8980–8987, 2022.
  30. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Summary

  • The paper introduces a Multistage Contrastive Learning (MCL) framework that mitigates feature suppression through feature-aware negative sampling and cross-stage representation integration.
  • It demonstrates significant accuracy improvements in unimodal tasks (e.g., CIFAR feature recognition from 0.10 to 0.87) and enhanced performance in multimodal models such as CLIP.
  • The method remains robust across hyperparameter settings (number of stages, number of clusters, and temperature), offering a practical strategy for recovering previously suppressed features.

Introduction

The paper "Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning" (arXiv:2402.11816) introduces a Multistage Contrastive Learning (MCL) framework to tackle feature suppression in contrastive learning models such as SimCLR and CLIP. Feature suppression refers to the tendency of contrastive models to capture only a limited subset of the input information, which degrades downstream task performance. MCL progressively learns previously unlearned features through feature-aware negative sampling across multiple stages, while preserving well-learned features.

Figure 1: Images from Trifeature with different shapes and textures have high similarity in the SimCLR space.

Multistage Contrastive Learning Framework

Feature Suppression Challenge

Contrastive learning maximizes the similarity between an anchor and its positive samples while increasing separation among dissimilar samples [jaiswal2020survey]. However, recent studies have identified that standard contrastive learning often misses substantial portions of input information, causing feature suppression [robinson2021can]. For example, in unimodal settings, the SimCLR model fails to differentiate between different shapes and textures, resulting in visually indistinguishable representations (Figure 1). In multimodal settings, CLIP encounters similar issues with orientation and direction, leading to systematic failures in distinguishing between semantically different images.
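The objective described above is typically an InfoNCE-style loss. The function below is a minimal NumPy sketch of that loss for a single anchor with one positive and a set of negatives; it illustrates the general SimCLR-family objective, not the paper's own implementation.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, tau=0.5):
    """InfoNCE loss for one anchor (minimal NumPy sketch).

    anchor, positive: (d,) embeddings; negatives: (n, d).
    Embeddings are L2-normalized so dot products are cosine similarities.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, neg = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate(([a @ p], neg @ a)) / tau  # positive pair first
    logits -= logits.max()                             # numerical stability
    # Cross-entropy with the positive pair as the target class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

The loss is low when the anchor is close to its positive and far from all negatives; feature suppression arises when the model finds one "shortcut" feature that already achieves this separation.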

Feature-aware Negative Sampling

To mitigate feature suppression, MCL introduces feature-aware negative sampling, where negative samples are selected from the clusters assigned in preceding stages (Figure 2). At each stage, the model learns features distinct from those already learned by ensuring that the anchor and its negative samples share identical feature-aware pseudo-labels. This forces the model to explore previously unlearned features, progressively enriching the learned representations.
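The sampling step above can be sketched in two parts: cluster the previous stage's representations to obtain pseudo-labels, then restrict each anchor's negatives to samples with the same pseudo-label. The code below is an illustrative NumPy sketch (function names and the tiny K-means are mine, not the paper's code; the paper's exact clustering setup may differ).

```python
import numpy as np

def kmeans_pseudo_labels(reps, k, n_iter=20, seed=0):
    """Cluster stage representations with a tiny NumPy K-means.

    Returns an (n,) array of cluster indices used as feature-aware
    pseudo-labels for the next MCL stage.
    """
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        dists = np.linalg.norm(reps[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned samples.
        for j in range(k):
            members = reps[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def feature_aware_negative_mask(pseudo_labels):
    """Boolean (n, n) mask: sample j is a valid negative for anchor i
    only when both carry the same pseudo-label from the preceding stage,
    so the new stage must find features the old ones cannot separate.
    Self-pairs are never negatives."""
    labels = np.asarray(pseudo_labels)
    mask = labels[:, None] == labels[None, :]
    np.fill_diagonal(mask, False)
    return mask
```

Restricting negatives to same-cluster samples is what makes already-learned features useless for the new stage's objective: within a cluster, those features are (by construction) nearly constant, so the loss can only be reduced by discovering new discriminative features.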

Figure 2: Overview of the Multistage Contrastive Learning (MCL) Framework. The final representations are the concatenation of representations from each stage.

Cross-stage Representation Integration

After multistage training, MCL employs cross-stage representation integration, where representations from all stages are concatenated to preserve well-learned features. This concatenated representation ensures comprehensive feature retention and improves downstream task performance by maintaining information from each learned stage.
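The integration step is essentially a concatenation along the feature axis. A minimal sketch, assuming per-stage embeddings as NumPy arrays (the per-stage L2 normalization is a plausible choice of mine, not necessarily the paper's):

```python
import numpy as np

def integrate_stages(stage_reps):
    """Concatenate per-stage representations into the final embedding.

    stage_reps: list of (n, d_s) arrays, one per MCL stage. Each stage's
    embedding is L2-normalized first so that no single stage dominates
    by scale. Returns an (n, sum_s d_s) array.
    """
    normed = [r / np.linalg.norm(r, axis=1, keepdims=True) for r in stage_reps]
    return np.concatenate(normed, axis=1)
```

Because every stage's representation survives intact in the concatenation, features learned early (and potentially overwritten under naive continued training) remain available to downstream tasks.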

Experiments and Results

Unimodal Settings

In experiments on controlled datasets designed to expose feature suppression, such as Trifeature and CIFAR-MNIST, MCL demonstrated significant improvements over baseline models (Table 1). For instance, on CIFAR-MNIST, MCL raised CIFAR feature recognition accuracy from 0.10 to 0.87, showcasing its efficacy in reducing feature suppression.

Multimodal Settings

Testing MCL-enhanced CLIP models on the MMVP benchmark revealed improved performance across multiple image-text pairing tasks and attributes (Table 2). MCL raised average accuracy from 20.0 to 32.6 with ViT backbones, validating its effectiveness in multimodal contrastive learning.

Figure 3: The top-3 samples most similar to the anchor at each stage, demonstrating the model's shifting focus from color to texture to shape across MCL stages.

Dynamic Aspect Analysis

In analyzing the impact of the number of training stages (N), the number of clusters (K) in K-means, and the temperature (τ), MCL consistently showed adaptability and robustness across diverse configurations (Table 3). This robustness underscores its broad applicability and its potential to integrate into existing contrastive learning frameworks without substantial computational changes.
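To see why the temperature matters, consider a temperature-scaled softmax over similarity scores, the core operation inside InfoNCE-style losses. This is a generic sketch, not tied to the paper's exact loss:

```python
import numpy as np

def softmax_with_temperature(sims, tau):
    """Temperature-scaled softmax over similarity scores.

    A lower tau sharpens the distribution, concentrating the contrastive
    gradient on the hardest negatives; a higher tau flattens it toward
    uniform weighting over all negatives.
    """
    z = np.asarray(sims, dtype=float) / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Sweeping τ therefore trades off hard-negative emphasis against gradient spread, which is one reason sensitivity to τ is worth reporting across stages.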

Figure 4: Linear evaluation accuracy for each stage individually before cross-stage representation integration.

Conclusion

The proposed Multistage Contrastive Learning framework successfully addresses feature suppression in contrastive learning models by facilitating the acquisition of previously unlearned features through feature-aware negative sampling and cross-stage representation integration. MCL demonstrates robust performance improvements across both unimodal and multimodal settings, presenting a promising direction for enhancing contrastive learning methodologies and their downstream applications in AI. Future work may explore advanced cross-stage integration techniques and extend MCL's applicability to larger-scale models like CLIP and other architectures within the AI domain.
