- The paper introduces I-Con as a unifying framework that connects supervised, unsupervised, contrastive, and clustering methods by minimizing the KL divergence between conditional distributions.
- It formalizes over 23 existing algorithms as special cases of one loss function, showing how techniques such as t-SNE, SimCLR, and K-Means emerge from specific choices within the framework.
- Experimental results reveal that Debiased InfoNCE Clustering achieves 67.52% Hungarian accuracy on ImageNet-1K, outperforming previous state-of-the-art methods through effective debiasing strategies.
This paper introduces Information Contrastive Learning (I-Con), a unifying framework for representation learning based on a single information-theoretic objective. The core idea is that many disparate representation learning methods—spanning supervised, unsupervised, contrastive, clustering, spectral, and dimensionality reduction approaches—can be understood as minimizing the average Kullback-Leibler (KL) divergence between two conditional probability distributions that define relationships (or "neighborhoods") between data points.
The central I-Con loss function is:
$$\mathcal{L}(\theta,\phi)=\int_{i\in\mathcal{X}} D_{\mathrm{KL}}\!\big(p_\theta(\cdot\mid i)\,\big\|\,q_\phi(\cdot\mid i)\big)\,di=\int_{i\in\mathcal{X}}\int_{j\in\mathcal{X}} p_\theta(j\mid i)\log\frac{p_\theta(j\mid i)}{q_\phi(j\mid i)}\,dj\,di$$
Here, pθ(j∣i) represents the probability of transitioning from data point i to data point j according to a supervisory or target relationship, while qϕ(j∣i) represents the transition probability based on a learned representation (e.g., embeddings, cluster assignments). Typically, p is fixed (defining the desired structure), and q is learned by optimizing the parameters ϕ of a model (like a neural network) to make qϕ mimic pθ.
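To make the objective concrete, here is a minimal NumPy sketch of the discrete case, where p and q are N×N row-stochastic matrices (the function name `i_con_loss` and the `eps` clipping are illustrative conveniences, not from the paper):

```python
import numpy as np

def i_con_loss(p, q, eps=1e-12):
    """Average KL divergence between matched rows of p and q.

    p, q: (N, N) row-stochastic matrices; row i is a conditional
    distribution over neighbors j given point i.
    """
    p = np.clip(p, eps, None)  # avoid log(0); harmless for a sketch
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

The loss is zero exactly when q matches p row by row, and positive otherwise, so minimizing it drives the learned neighborhood distribution q toward the supervisory one p.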
The power of I-Con lies in its ability to generalize numerous existing methods by selecting specific forms for pθ and qϕ. The paper provides proofs (15 theorems in total, detailed in the appendix) showing how over 23 methods emerge as special cases. Examples include:
- SNE/t-SNE: p is a Gaussian/t-distribution based on distances in the original high-dimensional space, and q is a Gaussian/t-distribution based on distances in the learned low-dimensional embedding space ϕ.
- SimCLR/InfoNCE: p is a uniform distribution over positive pairs (e.g., augmentations of the same image), and q is a softmax distribution over cosine similarities between learned features fϕ(x).
- K-Means: p is a Gaussian kernel over distances between data points, while q represents the probability that two points i and j belong to the same cluster based on learned assignments ϕ. Minimizing the I-Con loss in this case relates to minimizing the K-Means objective plus an entropy term on the cluster assignments.
- Supervised Cross-Entropy: p is an indicator function for the correct class label, and q is the softmax output of a classifier over class prototypes ϕ.
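To make one of these special cases concrete, here is a hedged NumPy sketch of the SimCLR/InfoNCE choice of p and q (function names, the temperature value, and the pair format are illustrative assumptions):

```python
import numpy as np

def softmax_cosine_q(z, temperature=0.5):
    """q(j|i): softmax over cosine similarities of learned features z (N, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)              # a point is not its own neighbor
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)

def positive_pair_p(pairs, n):
    """p(j|i): uniform over each point's positives (e.g. its augmentation)."""
    p = np.zeros((n, n))
    for i, j in pairs:
        p[i, j] = 1.0
        p[j, i] = 1.0
    return p / p.sum(axis=1, keepdims=True)
```

With these choices, the average KL divergence between the rows of p and q reduces to the InfoNCE loss up to additive constants, since p is one-hot (or uniform over a few positives) per row.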
Table 1 in the paper provides a comprehensive overview of how different choices for pθ(j∣i) and qϕ(j∣i) recover various well-known algorithms.
Beyond unification, I-Con serves as a principled framework for developing new representation learning methods by transferring techniques across domains. The authors demonstrate this by enhancing unsupervised clustering. They identify limitations in existing methods and borrow the concept of "debiasing" from contrastive learning (Chuang et al., 2020) to improve clustering performance. Two debiasing strategies applied to the supervisory distribution p(j∣i) are proposed:
- Debiasing through Uniform Distribution: A simple mixture model is used:
$$\tilde{p}(j\mid i) = (1-\alpha)\,p(j\mid i) + \frac{\alpha}{N}$$
This adds a small uniform probability mass (α/N) to all potential neighbors, reducing overconfidence and mitigating the impact of potential "false negative" pairings inherent in the original p(j∣i). This is analogous to label smoothing in supervised learning.
- Debiasing through Neighbor Propagation: Instead of just using immediate neighbors (like KNNs) to define p, the neighborhood is expanded by considering points reachable within multiple steps (walks) on the neighbor graph. This creates a denser, potentially more robust supervisory signal.
$$\tilde{P} \propto P + P^2 + \cdots + P^k$$
or a uniform version over reachable nodes.
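A minimal sketch of this propagation, assuming P is a row-stochastic transition matrix over the KNN graph (the function name and renormalization are illustrative choices):

```python
import numpy as np

def propagate_neighbors(P, k):
    """Accumulate walk probabilities up to k steps: P + P^2 + ... + P^k.

    P: (N, N) row-stochastic float transition matrix; returns a
    renormalized, denser supervisory distribution.
    """
    acc = np.zeros_like(P)
    step = np.eye(P.shape[0])
    for _ in range(k):
        step = step @ P        # step now holds the t-step walk probabilities
        acc += step
    return acc / acc.sum(axis=1, keepdims=True)
```

Points unreachable in one hop but reachable in two or more now receive nonzero supervisory mass, which is what makes the resulting signal denser.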
These debiasing techniques, inspired by contrastive learning but implemented by modifying p within the I-Con framework, are applied to improve clustering.
The experiments focus on unsupervised image classification (clustering) on ImageNet-1K using features from pre-trained DINO ViT models (Caron et al., 2021). The primary evaluation metric is Hungarian accuracy. The authors propose a new method called "Debiased InfoNCE Clustering," derived using I-Con principles:
- Supervisory p(j∣i): Defined using a combination of augmentations, K-Nearest Neighbors (KNNs, k=3), 1-step walks on the KNN graph, and uniform debiasing (α>0).
- Learned qϕ(j∣i): Uses the "shared cluster likelihood" kernel derived from the probabilistic K-Means analysis (see Table 1 and Appendix Section D).
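A simplified NumPy sketch of such a shared-cluster-likelihood kernel, assuming soft cluster assignments as input (the exact normalization in the paper, e.g. any weighting by cluster size, may differ from this illustration):

```python
import numpy as np

def shared_cluster_q(assign):
    """q(j|i) proportional to the probability that i and j share a cluster.

    assign: (N, C) soft cluster assignments (rows sum to 1). Each row of
    the result must have nonzero off-diagonal mass to be normalizable.
    """
    s = assign @ assign.T        # Pr(i and j fall in the same cluster)
    np.fill_diagonal(s, 0.0)     # exclude self-pairs
    return s / s.sum(axis=1, keepdims=True)
```

Points whose assignment vectors overlap heavily become each other's likeliest neighbors under q, so matching q to the debiased p pulls same-neighborhood points into the same cluster.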
Key Results:
- Debiased InfoNCE Clustering significantly outperforms prior state-of-the-art methods such as TEMI (Adaloglou et al., 2023) on ImageNet-1K clustering. Using DINO ViT-L/14 features, it achieves 67.52% Hungarian accuracy, more than +8% above the best reported ViT-L baseline (SCAN (Gansbeke et al., 2020) at 60.15%); TEMI did not report ViT-L results, but at ViT-B the gain over TEMI was +6.13%.
- Ablation studies confirm the benefits of both debiasing strategies:
- Uniform debiasing (mixing with α/N) improves performance and stability (Figure 1), with optimal α often between 0.4 and 0.8 (Figure 2). Applying debiasing to both p and q (implicitly, by adjusting α in the target p) yields further gains.
- Neighbor propagation (using KNNs and 1 or 2-step walks) boosts accuracy compared to using only augmentations (Table 4).
- Additional experiments in the appendix show that the proposed uniform debiasing strategy also improves standard feature learning (not just clustering) on CIFAR and STL-10 benchmarks, outperforming both SimCLR (Chen et al., 2020) and the original Debiased Contrastive Learning (DCL) (Chuang et al., 2020), especially when using a Student's t-distribution for qϕ.
In summary, the paper presents I-Con as a powerful theoretical lens that reveals underlying connections between diverse representation learning algorithms. Its practical value is demonstrated by using the framework to systematically transfer the idea of debiasing from contrastive learning to unsupervised clustering, resulting in a new state-of-the-art method for unsupervised ImageNet classification. The framework encourages principled design of new loss functions by combining components (choices of p and q) proven successful in different contexts.