- The paper introduces I-Con as a unifying framework that connects supervised, unsupervised, contrastive, and clustering methods by minimizing the KL divergence between conditional distributions.
- It formalizes over 23 existing algorithms as special cases of one loss function, showing how techniques such as t-SNE, SimCLR, and K-Means emerge from specific choices within the framework.
- Experimental results reveal that Debiased InfoNCE Clustering achieves 67.52% Hungarian accuracy on ImageNet-1K, outperforming previous state-of-the-art methods through effective debiasing strategies.
This paper introduces Information Contrastive Learning (I-Con), a unifying framework for representation learning based on a single information-theoretic objective. The core idea is that many disparate representation learning methods—spanning supervised, unsupervised, contrastive, clustering, spectral, and dimensionality reduction approaches—can be understood as minimizing the average Kullback-Leibler (KL) divergence between two conditional probability distributions that define relationships (or "neighborhoods") between data points.
The central I-Con loss function is:
$$\mathcal{L}(\theta,\phi)=\int_{i\in\mathcal{X}} D_{\mathrm{KL}}\!\big(p_\theta(\cdot\mid i)\,\big\|\,q_\phi(\cdot\mid i)\big)\,di=\int_{i\in\mathcal{X}}\int_{j\in\mathcal{X}} p_\theta(j\mid i)\log\frac{p_\theta(j\mid i)}{q_\phi(j\mid i)}\,dj\,di$$
Here, pθ(j∣i) represents the probability of transitioning from data point i to data point j according to a supervisory or target relationship, while qϕ(j∣i) represents the transition probability based on a learned representation (e.g., embeddings, cluster assignments). Typically, p is fixed (defining the desired structure), and q is learned by optimizing the parameters ϕ of a model (like a neural network) to make qϕ mimic pθ.
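To make the objective concrete, here is a minimal NumPy sketch of the discrete case, where p and q are N×N row-stochastic matrices (the function name `i_con_loss` and the `eps` clipping are illustrative conveniences, not from the paper):

```python
import numpy as np

def i_con_loss(p, q, eps=1e-12):
    """Average KL divergence between matched rows of p and q.

    p, q: (N, N) row-stochastic matrices; row i is a conditional
    distribution over neighbors j given point i.
    """
    p = np.clip(p, eps, None)  # avoid log(0); harmless for a sketch
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

The loss is zero exactly when q matches p row by row, and positive otherwise, so minimizing it drives the learned neighborhood distribution q toward the supervisory one p.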
The power of I-Con lies in its ability to generalize numerous existing methods by selecting specific forms for pθ and qϕ. The paper provides proofs (15 theorems in total, detailed in the appendix) showing how over 23 methods emerge as special cases. Examples include:
- SNE/t-SNE: p is a Gaussian/t-distribution based on distances in the original high-dimensional space, and q is a Gaussian/t-distribution based on distances in the learned low-dimensional embedding space ϕ.
- SimCLR/InfoNCE: p is a uniform distribution over positive pairs (e.g., augmentations of the same image), and q is a softmax distribution over cosine similarities between learned features fϕ(x).
- K-Means: p is a Gaussian kernel over distances between data points, while q represents the probability that two points i and j belong to the same cluster based on learned assignments ϕ. Minimizing the I-Con loss in this case relates to minimizing the K-Means objective plus an entropy term on the cluster assignments.
- Supervised Cross-Entropy: p is an indicator function for the correct class label, and q is the softmax output of a classifier over class prototypes ϕ.
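To make one of these special cases concrete, here is a hedged NumPy sketch of the SimCLR/InfoNCE choice of p and q (function names, the temperature value, and the pair format are illustrative assumptions):

```python
import numpy as np

def softmax_cosine_q(z, temperature=0.5):
    """q(j|i): softmax over cosine similarities of learned features z (N, d)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)              # a point is not its own neighbor
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)

def positive_pair_p(pairs, n):
    """p(j|i): uniform over each point's positives (e.g. its augmentation)."""
    p = np.zeros((n, n))
    for i, j in pairs:
        p[i, j] = 1.0
        p[j, i] = 1.0
    return p / p.sum(axis=1, keepdims=True)
```

With these choices, the average KL divergence between the rows of p and q reduces to the InfoNCE loss up to additive constants, since p is one-hot (or uniform over a few positives) per row.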
Table 1 in the paper provides a comprehensive overview of how different choices for pθ(j∣i) and qϕ(j∣i) recover various well-known algorithms.
Beyond unification, I-Con serves as a principled framework for developing new representation learning methods by transferring techniques across domains. The authors demonstrate this by enhancing unsupervised clustering. They identify limitations in existing methods and borrow the concept of "debiasing" from contrastive learning (Chuang et al., 2020) to improve clustering performance. Two debiasing strategies applied to the supervisory distribution p(j∣i) are proposed:
- Debiasing through Uniform Distribution: A simple mixture model is used:
$$\tilde{p}(j\mid i) = (1-\alpha)\,p(j\mid i) + \frac{\alpha}{N}$$
This adds a small uniform probability mass (α/N) to all potential neighbors, reducing overconfidence and mitigating the impact of potential "false negative" pairings inherent in the original p(j∣i). This is analogous to label smoothing in supervised learning.
- Debiasing through Neighbor Propagation: Instead of just using immediate neighbors (like KNNs) to define p, the neighborhood is expanded by considering points reachable within multiple steps (walks) on the neighbor graph. This creates a denser, potentially more robust supervisory signal.
$$\tilde{P} \propto P + P^2 + \cdots + P^k$$
or a uniform version over reachable nodes.
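A minimal sketch of this propagation, assuming P is a row-stochastic transition matrix over the KNN graph (the function name and renormalization are illustrative choices):

```python
import numpy as np

def propagate_neighbors(P, k):
    """Accumulate walk probabilities up to k steps: P + P^2 + ... + P^k.

    P: (N, N) row-stochastic float transition matrix; returns a
    renormalized, denser supervisory distribution.
    """
    acc = np.zeros_like(P)
    step = np.eye(P.shape[0])
    for _ in range(k):
        step = step @ P        # step now holds the t-step walk probabilities
        acc += step
    return acc / acc.sum(axis=1, keepdims=True)
```

Points unreachable in one hop but reachable in two or more now receive nonzero supervisory mass, which is what makes the resulting signal denser.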
These debiasing techniques, inspired by contrastive learning but implemented by modifying p within the I-Con framework, are applied to improve clustering.
The experiments focus on unsupervised image classification (clustering) on ImageNet-1K using features from pre-trained DINO ViT models (Caron et al., 2021). The primary evaluation metric is Hungarian accuracy. The authors propose a new method called "Debiased InfoNCE Clustering," derived using I-Con principles:
- Supervisory p(j∣i): Defined using a combination of augmentations, K-Nearest Neighbors (KNNs, k=3), 1-step walks on the KNN graph, and uniform debiasing (α>0).
- Learned qϕ(j∣i): Uses the "shared cluster likelihood" kernel derived from the probabilistic K-Means analysis (see Table 1 and Appendix Section D).
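A simplified NumPy sketch of such a shared-cluster-likelihood kernel, assuming soft cluster assignments as input (the exact normalization in the paper, e.g. any weighting by cluster size, may differ from this illustration):

```python
import numpy as np

def shared_cluster_q(assign):
    """q(j|i) proportional to the probability that i and j share a cluster.

    assign: (N, C) soft cluster assignments (rows sum to 1). Each row of
    the result must have nonzero off-diagonal mass to be normalizable.
    """
    s = assign @ assign.T        # Pr(i and j fall in the same cluster)
    np.fill_diagonal(s, 0.0)     # exclude self-pairs
    return s / s.sum(axis=1, keepdims=True)
```

Points whose assignment vectors overlap heavily become each other's likeliest neighbors under q, so matching q to the debiased p pulls same-neighborhood points into the same cluster.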
Key Results:
- Debiased InfoNCE Clustering significantly outperforms prior state-of-the-art methods such as TEMI (Adaloglou et al., 2023) on ImageNet-1K clustering. Using DINO ViT-L/14 features, it achieves 67.52% Hungarian accuracy, more than +8% above the best reported ViT-L baseline (SCAN (Gansbeke et al., 2020) at 60.15%); TEMI did not report ViT-L results, but at ViT-B the gain over TEMI was +6.13%.
- Ablation studies confirm the benefits of both debiasing strategies:
- Uniform debiasing (mixing with α/N) improves performance and stability (Figure 1), with optimal α often between 0.4 and 0.8 (Figure 2). Applying debiasing to both p and q (implicitly, by adjusting α in the target p) yields further gains.
- Neighbor propagation (using KNNs and 1 or 2-step walks) boosts accuracy compared to using only augmentations (Table 4).
- Additional experiments in the appendix show that the proposed uniform debiasing strategy also improves standard feature learning (not just clustering) on CIFAR and STL-10 benchmarks, outperforming both SimCLR (Chen et al., 2020) and the original Debiased Contrastive Learning (DCL) (Chuang et al., 2020), especially when using a Student's t-distribution for qϕ.
In summary, the paper presents I-Con as a powerful theoretical lens that reveals underlying connections between diverse representation learning algorithms. Its practical value is demonstrated by using the framework to systematically transfer the idea of debiasing from contrastive learning to unsupervised clustering, resulting in a new state-of-the-art method for unsupervised ImageNet classification. The framework encourages principled design of new loss functions by combining components (choices of p and q) proven successful in different contexts.