Contrastive Representation Learning: A Framework and Review
Abstract: Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain. However, the origins of Contrastive Learning date back as far as the 1990s, and its development has spanned many fields and domains, including Metric Learning and natural language processing. In this paper we provide a comprehensive literature review and propose a general Contrastive Representation Learning framework that simplifies and unifies many different contrastive learning methods. We also provide a taxonomy for each component of contrastive learning in order to summarise it and distinguish it from other forms of machine learning. We then discuss the inductive biases present in any contrastive learning system and analyse our framework from the perspectives of various sub-fields of Machine Learning. Examples of how contrastive learning has been applied in computer vision, natural language processing, audio processing, and others, as well as in Reinforcement Learning, are also presented. Finally, we discuss the challenges and some of the most promising future research directions ahead.
Glossary
- Abstraction and Invariant: Properties of representations that capture abstract concepts and remain unchanged under small local input variations. "* Abstraction and Invariant: Good representations can capture more abstract concepts that are invariant to small and local changes in input data;"
- AutoEncoder: A neural network trained to encode inputs and reconstruct them, often used for generative representation learning. "* AutoEncoder"
- Back-translation: A data augmentation technique for text that translates a sentence to another language and back to create a semantically equivalent variant. "Fang et al. [29] transform a sentence using a back-translation method to create a slightly different sentence that has the same semantic meaning as the original one to form a positive pair."
- Bayes' rule: A probabilistic principle used to compute conditional distributions from generative models. "Evaluating the conditional distribution p(y|x) for some discriminative tasks on variable y can then be obtained by using Bayes' rule."
- Contrastive Learning: Learning representations by comparing pairs of inputs, pulling similar ones together and pushing dissimilar ones apart. "Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain."
- Contrastive loss: An objective that increases similarity of positive pairs and decreases similarity of negative pairs in the embedding space. "Definition 6 (Contrastive Loss): A contrastive loss function operates on a set of metric embedding pairs {(z, z+), (z, z−)} of the query, positive and negative keys."
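As an illustrative sketch (not the paper's exact formulation), the classic margin-based form of this loss can be written in a few lines of plain Python; the function name and default margin here are chosen for illustration:

```python
import math

def pairwise_contrastive_loss(z_q, z_k, is_positive, margin=1.0):
    """Margin-based contrastive loss on one embedding pair:
    positives are pulled together via squared distance, negatives
    are pushed apart until at least `margin` away."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_q, z_k)))
    if is_positive:
        return d ** 2                     # smaller distance -> smaller loss
    return max(0.0, margin - d) ** 2      # no penalty once far enough apart
```

A negative pair already separated by more than the margin contributes zero loss, so the objective concentrates its gradient on hard negatives.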
- Contrastive Predictive Coding (CPC): A method that contrasts future representations with past context to learn predictive features. "In Contrastive Predictive Coding (CPC) [77], context features are constructed as a summary of past input segments, and then contrasted with local features from a future time step."
- Contrastive Representation Learning (CRL): A general framework for contrastive methods that unifies diverse approaches across domains. "In this paper, we formulate and discuss a Contrastive Representation Learning (CRL) framework, which potentially represents another paradigm shift from architecture-engineering to data-engineering."
- Context-instance relationship: A design principle where local parts are contrasted against global context to learn meaningful features. "Another approach to extracting similar views of the same scene is by exploiting the context-instance relationship from a sample representation."
- Deep InfoMax (DIM): A contrastive approach that maximizes agreement between global and local features of images. "Fig. 7a describes the approach taken in Deep InfoMax (DIM) [46], where an image is encoded into a global feature vector and also into a feature map corresponding to spatial patches of pixels in the original image."
- Disentangled representation: Representations where independent factors of variation are separated across components. "* Disentangled representation: While a good representation should capture as many factors and discard as little data as possible, each factor should be as disentangled as possible."
- Distributed representations: Expressive, compact encodings that can represent many configurations relative to their size. "* Distributed: Representations that are expressive and can represent an exponential amount of configuration for their size."
- Dissimilarity distribution: The distribution defining how to sample negative keys that should be dissimilar to a given query. "A key is considered positive k+ for a query q if it is sampled from this similarity distribution and is considered negative k− if it is sampled from the dissimilarity distribution p−(q, k−)."
- Encoder: The model component that maps input views to representation vectors. "Definition 4 (Encoder): The features encoder e(x; θe) : X → V with parameters θe learns a mapping from the input views x ∈ X to a representation vector v ∈ R^d."
- End-to-end encoders: Encoders updated directly via backpropagation through the contrastive loss, often sharing weights between query and key. "End-to-end encoders represent the most simple method both conceptually and technically, where the encoders for the queries and keys are updated directly using gradients back-propagated with respect to the contrastive loss function."
- Generative model: A model that learns the data distribution and can generate samples, often used for representation learning. "Generative approaches learn represen- tations by modelling the data distribution p(x), for example: all the pixels in an image."
- InfoNCE: A popular contrastive loss variant used to distinguish a positive from many negatives. "The non-parametric classification loss [110] and its variants, such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
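A minimal sketch of the InfoNCE objective for a single query, one positive key, and K negative keys, using dot-product scores and a temperature τ (pure Python; the function name and default temperature are chosen for illustration):

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE as (K+1)-way softmax cross entropy: the query must
    pick out its positive key among K negative keys."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    log_denom = math.log(sum(math.exp(s) for s in logits))
    return log_denom - logits[0]  # negative log-softmax of the positive
```

When the query aligns with its positive key and not with the negatives, the loss approaches zero; when it aligns with a negative instead, the loss grows roughly linearly in the mismatched logit.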
- Instance Discrimination: A self-supervised task that treats each instance as its own class to learn separable representations. "Instance Discrimination [110] is a popular self-supervised method to learn a visual representation"
- Latent variables: Hidden variables that represent underlying factors explaining data variations. "extracting representations, or inferring latent variables from a probabilistic view of a dataset, is often called inference."
- Manifold: A lower-dimensional space in which high-dimensional data representations reside. "the encoded representations reside in a manifold of a much lower dimensionality."
- Memory bank: A mechanism that stores representations to provide many negatives without large batch sizes. "decoupled the batch size from the number of negative pairs by storing a detached copy of representations of the entire dataset into a separate memory bank."
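MoCo-style dynamic queues replace the full memory bank with a fixed-size FIFO of recently encoded keys; a minimal sketch (the class name is illustrative, not from the paper):

```python
from collections import deque

class NegativeQueue:
    """Fixed-size FIFO of key embeddings: newly encoded keys are
    enqueued and the oldest fall off, decoupling the number of
    available negatives from the mini-batch size."""
    def __init__(self, size):
        self._queue = deque(maxlen=size)

    def enqueue(self, keys):
        self._queue.extend(keys)  # deque evicts the oldest automatically

    def negatives(self):
        return list(self._queue)
```

Usage: after each training step the current batch of key embeddings is enqueued, and the queue's contents serve as negatives for the next step.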
- Metric embedding: A transformed representation where distances or similarities are computed for contrastive objectives. "to obtain a metric embedding z = h(v), where z ∈ R^d′"
- Metric Learning: A field focused on learning distance functions or embeddings where semantic similarity is reflected in distances. "spanned across many fields and domains including Metric Learning and natural language processing."
- Momentum Contrast (MoCo): A contrastive method using a momentum-updated encoder and a queue to build large sets of negatives. "Momentum Contrast (MoCo) [43] further reduces the need to store an offline representation of the entire dataset in the memory bank through the use of a dynamic memory queue."
- Momentum encoder: An encoder updated as a moving average of the online encoder to provide stable keys. "The offline momentum encoder is a copy of the online encoder,"
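The momentum update is a per-parameter exponential moving average, θk ← m·θk + (1 − m)·θq; a sketch over flat parameter lists (the function name is illustrative):

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update of the key (momentum) encoder parameters towards
    the online query encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(theta_k, theta_q)]
```

With m close to 1, the key encoder changes slowly, which keeps previously queued keys consistent with freshly encoded ones.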
- Multi-layer perceptron (MLP): A small feedforward network often used as a projection head in contrastive setups. "comprised of a small multi-layer perceptron (MLP) to obtain a metric embedding z = h(v)"
- Non-parametric classification loss: A contrastive objective formulated as a classification over one positive and many negatives without learned class parameters. "The non-parametric classification loss [110] and its variants, such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
- NT-Xent: The normalized temperature-scaled cross entropy loss used in SimCLR-style contrastive learning. "such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
- Predictive coding theory: A theory where the brain (or model) predicts future inputs; CPC instantiates it via contrastive learning. "can be thought of as an instantiation of the predictive coding theory."
- Projection head: A network module that maps representations to a space suited for the contrastive loss. "Each representation v is then fed into a projection head h(·) comprised of a small multi-layer perceptron (MLP) to obtain a metric embedding z = h(v)"
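A projection head is typically a small two-layer MLP; a dependency-free sketch with explicit weight matrices (all names and shapes here are illustrative):

```python
def projection_head(v, W1, b1, W2, b2):
    """Two-layer MLP h(v): linear -> ReLU -> linear, mapping a
    representation v to the metric embedding z fed to the loss."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```

The head is usually discarded after pre-training: downstream tasks consume the representation v, not the metric embedding z.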
- Prototypical Contrastive Learning (PCL): A method combining contrastive learning with clustering via prototypes. "such as Prototypical Contrastive Learning (PCL) [63], or Swapping Assignment between multiple views (SwAV) [14]."
- Query and key: Paired views used in contrastive learning; queries are matched to positive keys and contrasted with negative keys. "Definition 1 (Query, Key): Query and key refer to a particular view of an input sample x ∈ X."
- ResNet: A residual network architecture commonly used as the encoder for images. "a ResNet [42] model is usually used for image data because of its simplicity."
- Self-supervised learning: Learning representations without human labels by leveraging structure or pretext tasks in data. "Since a self-supervised discriminative model does not have labels corresponding to the inputs like its supervised counterparts, the success of self-supervised methods comes from the elegant design of the pretext tasks"
- Sequential coherence and consistency: The assumption that important features change slowly across sequences, used to define positive/negative pairs. "exploiting the spatial or temporal coherence and consistency in a sequence of observations is another approach to defining similarity in contrastive learning."
- Similarity distribution: A joint distribution over pairs that formalizes which inputs should be treated as similar. "Definition 2 (Similarity Distribution): A similarity distribution p+(q, k+) is a joint distribution over a pair of input samples that formalises the notion of similarity (and dissimilarity) in the contrastive learning task."
- Softmax classification: A probabilistic classification formulation used to interpret InfoNCE as (K+1)-way softmax. "a non-parametric version of (K + 1)-way softmax classification [110]"
- Swapping Assignment between multiple views (SwAV): A clustering-augmented contrastive method using swapped assignments between augmented views. "such as Prototypical Contrastive Learning (PCL) [63], or Swapping Assignment between multiple views (SwAV) [14]."
- Temporal coherence: A property of video sequences where adjacent frames are similar, enabling contrastive positives. "the temporal coherence of video frames can also provide a natural source of data transformations."
- Time-Contrastive Network (TCN): A framework that contrasts time-adjacent frames or viewpoints to learn temporally coherent features. "Rather than using simultaneous videos with multiple view- points as in Time-Contrastive Network (TCN) [27], [91] uses a multi-frame TCN"
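Under the temporal-coherence assumption, positive and negative pairs can be derived directly from frame indices; a toy sketch (the function name and default window are illustrative):

```python
def temporal_pairs(frames, window=1):
    """Label every ordered frame pair: frames within `window` time
    steps of each other are positives, all others are negatives."""
    pairs = []
    for i, q in enumerate(frames):
        for j, k in enumerate(frames):
            if i != j:
                pairs.append((q, k, abs(i - j) <= window))
    return pairs
```

This is the core of time-contrastive setups: nearby frames supply positives "for free", while temporally distant frames serve as negatives.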
- Transfer learning: Reusing learned representations for downstream tasks in new datasets or domains. "learn- ing useful representations that achieve state-of-the-art results in transfer learning for some downstream computer vision tasks"
- Transform head: A module that converts representations into embeddings (possibly aggregating multiple inputs) for contrastive comparison. "Definition 5 (Transform Head): Transform heads h(v; θh) : V → Z parameterised by θh, are modules that transform the feature embedding v ∈ V into a metric embedding z ∈ R^d′."