Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift

Published 8 Oct 2023 in cs.LG (arXiv:2310.04971v2)

Abstract: Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite this empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with high variance, and \emph{inter-class feature sharing}, where annotated details in one class help the model learn other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of rich captions for robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO/Conceptual Captions and evaluating them on shifted versions of ImageNet.
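The abstract refers to CLIP-style multimodal contrastive training, in which paired image and caption embeddings are pulled together while all other pairs in the batch serve as negatives. As a point of reference, below is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective such models optimize; the function name, embedding dimension, and temperature value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# Dimensions and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matching pairs share the same row index; all other rows in the batch act
    as negatives, which is where the intra-class contrasting analyzed in the
    paper arises when multiple images of the same class appear in a batch.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) cosine-similarity logits between every image and caption.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    caps = torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, caps).item())
```

At inference time, zero-shot classification under distribution shift (as evaluated on the shifted ImageNet variants) is performed by embedding one caption per class name and assigning each image to the class whose caption embedding is most similar.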
