Contrastive Representation Learning: A Framework and Review
Abstract: Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain. However, the origins of Contrastive Learning date back as far as the 1990s, and its development has spanned many fields and domains, including Metric Learning and natural language processing. In this paper we provide a comprehensive literature review and propose a general Contrastive Representation Learning framework that simplifies and unifies many different contrastive learning methods. We also provide a taxonomy for each component of contrastive learning in order to summarise it and distinguish it from other forms of machine learning. We then discuss the inductive biases present in any contrastive learning system and analyse our framework from the perspectives of various sub-fields of Machine Learning. Examples of how contrastive learning has been applied in computer vision, natural language processing, audio processing, and others, as well as in Reinforcement Learning, are also presented. Finally, we discuss the challenges and some of the most promising future research directions ahead.
Glossary
- Abstraction and Invariant: Properties of representations that capture abstract concepts and remain unchanged under small local input variations. "* Abstraction and Invariant: Good representations can capture more abstract concepts that are invariant to small and local changes in input data;"
- AutoEncoder: A neural network trained to encode inputs and reconstruct them, often used for generative representation learning. "* AutoEncoder"
- Back-translation: A data augmentation technique for text that translates a sentence to another language and back to create a semantically equivalent variant. "Fang et al. [29] transform a sentence using a back-translation method to create a slightly different sentence that has the same semantic meaning as the original one to form a positive pair."
- Bayes' rule: A probabilistic principle used to compute conditional distributions from generative models. "Evaluating the conditional distribution p(y|x) for some discriminative tasks on variable y can then be obtained by using Bayes' rule."
- Contrastive Learning: Learning representations by comparing pairs of inputs, pulling similar ones together and pushing dissimilar ones apart. "Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain."
- Contrastive loss: An objective that increases similarity of positive pairs and decreases similarity of negative pairs in the embedding space. "Definition 6 (Contrastive Loss): A contrastive loss function operates on a set of metric embedding pairs {(z, z+), (z, z−)} of the query, positive and negative keys."
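As an illustrative sketch (not the paper's exact formulation), the classic margin-based form of this loss can be written in a few lines of plain Python; the function name and default margin here are chosen for illustration:

```python
import math

def pairwise_contrastive_loss(z_q, z_k, is_positive, margin=1.0):
    """Margin-based contrastive loss on one embedding pair:
    positives are pulled together via squared distance, negatives
    are pushed apart until at least `margin` away."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_q, z_k)))
    if is_positive:
        return d ** 2                     # smaller distance -> smaller loss
    return max(0.0, margin - d) ** 2      # no penalty once far enough apart
```

A negative pair already separated by more than the margin contributes zero loss, so the objective concentrates its gradient on hard negatives.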
- Contrastive Predictive Coding (CPC): A method that contrasts future representations with past context to learn predictive features. "In Contrastive Predictive Coding (CPC) [77], context features are constructed as a summary of past input segments, and then contrasted with local features from a future time step."
- Contrastive Representation Learning (CRL): A general framework for contrastive methods that unifies diverse approaches across domains. "In this paper, we formulate and discuss a Contrastive Representation Learning (CRL) framework, which potentially represents another paradigm shift from architecture-engineering to data-engineering."
- Context-instance relationship: A design principle where local parts are contrasted against global context to learn meaningful features. "Another approach to extracting similar views of the same scene is by exploiting the context-instance relationship from a sample representation."
- Deep InfoMax (DIM): A contrastive approach that maximizes agreement between global and local features of images. "Fig. 7a describes the approach taken in Deep InfoMax (DIM) [46], where an image is encoded into a global feature vector and also into a feature map corresponding to spatial patches of pixels in the original image."
- Disentangled representation: Representations where independent factors of variation are separated across components. "* Disentangled representation: While a good representation should capture as many factors and discard as little data as possible, each factor should be as disentangled as possible."
- Distributed representations: Expressive, compact encodings that can represent many configurations relative to their size. "* Distributed: Representations that are expressive and can represent an exponential amount of configuration for their size."
- Dissimilarity distribution: The distribution defining how to sample negative keys that should be dissimilar to a given query. "A key is considered positive k+ for a query q if it is sampled from this similarity distribution and is considered negative k− if it is sampled from the dissimilarity distribution p−(q, k−)."
- Encoder: The model component that maps input views to representation vectors. "Definition 4 (Encoder): The features encoder e(x; θe) : X → V with parameters θe learns a mapping from the input views x ∈ X to a representation vector v ∈ R^d."
- End-to-end encoders: Encoders updated directly via backpropagation through the contrastive loss, often sharing weights between query and key. "End-to-end encoders represent the most simple method both conceptually and technically, where the encoders for the queries and keys are updated directly using gradients back-propagated with respect to the contrastive loss function."
- Generative model: A model that learns the data distribution and can generate samples, often used for representation learning. "Generative approaches learn represen- tations by modelling the data distribution p(x), for example: all the pixels in an image."
- InfoNCE: A popular contrastive loss variant used to distinguish a positive from many negatives. "The non-parametric classification loss [110] and its variants, such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
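A minimal sketch of the InfoNCE objective for a single query, one positive key, and K negative keys, using dot-product scores and a temperature τ (pure Python; the function name and default temperature are chosen for illustration):

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE as (K+1)-way softmax cross entropy: the query must
    pick out its positive key among K negative keys."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    log_denom = math.log(sum(math.exp(s) for s in logits))
    return log_denom - logits[0]  # negative log-softmax of the positive
```

When the query aligns with its positive key and not with the negatives, the loss approaches zero; when it aligns with a negative instead, the loss grows roughly linearly in the mismatched logit.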
- Instance Discrimination: A self-supervised task that treats each instance as its own class to learn separable representations. "Instance Discrimination [110] is a popular self-supervised method to learn a visual representation"
- Latent variables: Hidden variables that represent underlying factors explaining data variations. "extracting representations, or inferring latent variables from a probabilistic view of a dataset, is often called inference."
- Manifold: A lower-dimensional space in which high-dimensional data representations reside. "the encoded representations reside in a manifold of a much lower dimensionality."
- Memory bank: A mechanism that stores representations to provide many negatives without large batch sizes. "decoupled the batch size from the number of negative pairs by storing a detached copy of representations of the entire dataset into a separate memory bank."
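MoCo-style dynamic queues replace the full memory bank with a fixed-size FIFO of recently encoded keys; a minimal sketch (the class name is illustrative, not from the paper):

```python
from collections import deque

class NegativeQueue:
    """Fixed-size FIFO of key embeddings: newly encoded keys are
    enqueued and the oldest fall off, decoupling the number of
    available negatives from the mini-batch size."""
    def __init__(self, size):
        self._queue = deque(maxlen=size)

    def enqueue(self, keys):
        self._queue.extend(keys)  # deque evicts the oldest automatically

    def negatives(self):
        return list(self._queue)
```

Usage: after each training step the current batch of key embeddings is enqueued, and the queue's contents serve as negatives for the next step.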
- Metric embedding: A transformed representation where distances or similarities are computed for contrastive objectives. "to obtain a metric embedding z = h(v), where z ∈ R^d′"
- Metric Learning: A field focused on learning distance functions or embeddings where semantic similarity is reflected in distances. "spanned across many fields and domains including Metric Learning and natural language processing."
- Momentum Contrast (MoCo): A contrastive method using a momentum-updated encoder and a queue to build large sets of negatives. "Momentum Contrast (MoCo) [43] further reduces the need to store an offline representation of the entire dataset in the memory bank through the use of a dynamic memory queue."
- Momentum encoder: An encoder updated as a moving average of the online encoder to provide stable keys. "The offline momentum encoder is a copy of the online encoder,"
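The momentum update is a per-parameter exponential moving average, θk ← m·θk + (1 − m)·θq; a sketch over flat parameter lists (the function name is illustrative):

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update of the key (momentum) encoder parameters towards
    the online query encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(theta_k, theta_q)]
```

With m close to 1, the key encoder changes slowly, which keeps previously queued keys consistent with freshly encoded ones.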
- Multi-layer perceptron (MLP): A small feedforward network often used as a projection head in contrastive setups. "comprised of a small multi-layer perceptron (MLP) to obtain a metric embedding z = h(v)"
- Non-parametric classification loss: A contrastive objective formulated as a classification over one positive and many negatives without learned class parameters. "The non-parametric classification loss [110] and its variants, such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
- NT-Xent: The normalized temperature-scaled cross entropy loss used in SimCLR-style contrastive learning. "such as InfoNCE [77] and NT-Xent [16], are a popular choice for the contrastive loss function"
- Predictive coding theory: A theory where the brain (or model) predicts future inputs; CPC instantiates it via contrastive learning. "can be thought of as an instantiation of the predictive coding theory."
- Projection head: A network module that maps representations to a space suited for the contrastive loss. "Each representation v is then fed into a projection head h(·) comprised of a small multi-layer perceptron (MLP) to obtain a metric embedding z = h(v)"
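A projection head is typically a small two-layer MLP; a dependency-free sketch with explicit weight matrices (all names and shapes here are illustrative):

```python
def projection_head(v, W1, b1, W2, b2):
    """Two-layer MLP h(v): linear -> ReLU -> linear, mapping a
    representation v to the metric embedding z fed to the loss."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```

The head is usually discarded after pre-training: downstream tasks consume the representation v, not the metric embedding z.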
- Prototypical Contrastive Learning (PCL): A method combining contrastive learning with clustering via prototypes. "such as Prototypical Contrastive Learning (PCL) [63], or Swapping Assignment between multiple views (SwAV) [14]."
- Query and key: Paired views used in contrastive learning; queries are matched to positive keys and contrasted with negative keys. "Definition 1 (Query, Key): Query and key refer to a particular view of an input sample x ∈ X."
- ResNet: A residual network architecture commonly used as the encoder for images. "a ResNet [42] model is usually used for image data because of its simplicity."
- Self-supervised learning: Learning representations without human labels by leveraging structure or pretext tasks in data. "Since a self-supervised discriminative model does not have labels corresponding to the inputs like its supervised counterparts, the success of self-supervised methods comes from the elegant design of the pretext tasks"
- Sequential coherence and consistency: The assumption that important features change slowly across sequences, used to define positive/negative pairs. "exploiting the spatial or temporal coherence and consistency in a sequence of observations is another approach to defining similarity in contrastive learning."
- Similarity distribution: A joint distribution over pairs that formalizes which inputs should be treated as similar. "Definition 2 (Similarity Distribution): A similarity distribution p+(q, k+) is a joint distribution over a pair of input samples that formalises the notion of similarity (and dissimilarity) in the contrastive learning task."
- Softmax classification: A probabilistic classification formulation used to interpret InfoNCE as (K+1)-way softmax. "a non-parametric version of (K + 1)-way softmax classification [110]"
- Swapping Assignment between multiple views (SwAV): A clustering-augmented contrastive method using swapped assignments between augmented views. "such as Prototypical Contrastive Learning (PCL) [63], or Swapping Assignment between multiple views (SwAV) [14]."
- Temporal coherence: A property of video sequences where adjacent frames are similar, enabling contrastive positives. "the temporal coherence of video frames can also provide a natural source of data transformations."
- Time-Contrastive Network (TCN): A framework that contrasts time-adjacent frames or viewpoints to learn temporally coherent features. "Rather than using simultaneous videos with multiple view- points as in Time-Contrastive Network (TCN) [27], [91] uses a multi-frame TCN"
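Under the temporal-coherence assumption, positive and negative pairs can be derived directly from frame indices; a toy sketch (the function name and default window are illustrative):

```python
def temporal_pairs(frames, window=1):
    """Label every ordered frame pair: frames within `window` time
    steps of each other are positives, all others are negatives."""
    pairs = []
    for i, q in enumerate(frames):
        for j, k in enumerate(frames):
            if i != j:
                pairs.append((q, k, abs(i - j) <= window))
    return pairs
```

This is the core of time-contrastive setups: nearby frames supply positives "for free", while temporally distant frames serve as negatives.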
- Transfer learning: Reusing learned representations for downstream tasks in new datasets or domains. "learn- ing useful representations that achieve state-of-the-art results in transfer learning for some downstream computer vision tasks"
- Transform head: A module that converts representations into embeddings (possibly aggregating multiple inputs) for contrastive comparison. "Definition 5 (Transform Head): Transform heads h(v; θh) : V → Z parameterised by θh, are modules that transform the feature embedding v ∈ V into a metric embedding z ∈ R^d′."