
Debiased InfoNCE for Robust Mutual Information Estimation

Updated 16 January 2026
  • Debiased InfoNCE is a modified contrastive loss that corrects inherent negative sampling bias to achieve faithful density-ratio and mutual information estimation.
  • It employs corrections such as auxiliary anchor classes, false-negative subtraction, and positive-unlabeled mining to ensure unbiased and consistent representation learning.
  • Empirical results show improved performance and fairness across recommendation systems, graph contrastive learning, and supervised metric tasks.

Debiased InfoNCE refers to a suite of principled modifications to the classic InfoNCE loss, targeting the systematic biases that arise from negative sampling, dataset confounders, or density-ratio indeterminacy in contrastive learning frameworks. These debiasing strategies span mutual information estimation, graph contrastive learning, supervised metric learning, and recommendation systems. While standard InfoNCE excels in learning structured density ratios, it incurs bias due to its inherent loss formulation and sampling procedures. Debiased variants are designed to achieve Fisher-consistent density-ratio estimation, unbiased mutual-information estimation, or robustness to dataset or sampling artifacts, with empirical benefits documented across several modalities.

1. Formal Definition and Bias in InfoNCE

InfoNCE is a contrastive loss originally formulated to lower-bound the mutual information $I(X;Y)$ of random variables $(X,Y)$ by discriminating a single positive sample among $K$ candidates:

$$L_{\mathrm{InfoNCE}}(\theta) = -\mathbb{E} \left[ \log \frac{e^{c_\theta(x_1,y)}}{\frac{1}{K}\sum_{j=1}^K e^{c_\theta(x_j, y)}} \right]$$

where the critic $c_\theta(x,y)$ scores compatibility, $(x_1,y)$ denotes a positive pair drawn from $p(x,y)$, and $(x_2, \ldots, x_K)$ are negatives drawn from $p(x)$. When $c_\theta(x,y) = \log r_\theta(x,y)$, one can interpret the negative loss as a form of $K$-way Jensen-Shannon divergence.

For any finite $K$, InfoNCE is a lower bound on $I(X;Y)$, with bias

$$\mathrm{Bias}_{\mathrm{InfoNCE}} = I(X;Y) - I_{\mathrm{InfoNCE}} = D(p(x|y) \,\|\, p(x)) - D_{K\text{-JS}}\big(p(x|y), p(x)\big)$$

which remains strictly positive for all finite $K$ (Ryu et al., 29 Oct 2025). As such, InfoNCE systematically underestimates mutual information.
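For concreteness, the bound can be estimated directly from a batch of critic scores. Below is a minimal NumPy sketch; the batching convention (positive pair in column 0, negatives in the remaining columns) is an assumption made for illustration, not a prescription from the literature.

```python
import numpy as np

def infonce_bound(scores):
    """Monte Carlo estimate of the InfoNCE lower bound on I(X;Y).

    scores: (B, K) array of critic values c_theta(x_j, y); by the
    convention assumed here, column 0 holds the positive pair (x_1, y)
    and columns 1..K-1 hold the negatives.
    """
    B, K = scores.shape
    # Numerically stable log-sum-exp over the K candidates per row.
    m = scores.max(axis=1, keepdims=True)
    logsumexp = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()
    # -L_InfoNCE = E[ c(x_1, y) - log( (1/K) * sum_j exp(c(x_j, y)) ) ]
    return float(np.mean(scores[:, 0] - logsumexp + np.log(K)))
```

Note that the estimate can never exceed $\log K$, which is one way to see why InfoNCE must underestimate large mutual information for any finite $K$.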

2. Debiasing via Auxiliary Classes: The InfoNCE-Anchor Approach

To eliminate the indeterminacy in learned density ratios, InfoNCE-anchor introduces an auxiliary anchor class in the underlying tensorized classification problem. Specifically, for two densities $q_1(x)$ (positive) and $q_0(x)$ (noise), $K+1$ classes are defined:

  • Class 0 (anchor): $q_0(x_1)\cdots q_0(x_K)$
  • Class $i \in \{1,\dots,K\}$: $q_1(x_i)\prod_{j \neq i} q_0(x_j)$

Class priors $p(0) = v/(K+v)$ and $p(i) = 1/(K+v)$ for $i \ge 1$ (with $v > 0$) are assigned. The posterior is modeled as

$$p(z \mid x_{1:K}) = \begin{cases} \dfrac{v}{v + \sum_{j=1}^K r^*(x_j)}, & z = 0 \\[6pt] \dfrac{r^*(x_z)}{v + \sum_{j=1}^K r^*(x_j)}, & 1 \le z \le K \end{cases}$$

where $r^*(x) = q_1(x)/q_0(x)$. Optimization of the InfoNCE-anchor objective (cross-entropy loss over classes) is Fisher-consistent, yielding $r_{\theta^*}(x) = r^*(x)$ (Theorem 3), removing the indeterminacy and enabling consistent density-ratio estimation (Ryu et al., 29 Oct 2025).
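The anchor posterior is straightforward to compute from ratio estimates. A minimal NumPy sketch follows; the function name and the default value of $v$ are illustrative choices, not fixed by the method.

```python
import numpy as np

def anchor_posterior(log_r, v=1.0):
    """Posterior over the K+1 classes in the InfoNCE-anchor construction.

    log_r: length-K array of log density-ratio estimates log r_theta(x_j).
    v:     anchor prior weight, so that p(0) = v / (K + v).
    Returns a length-(K+1) array [p(z=0), p(z=1), ..., p(z=K)].
    """
    r = np.exp(log_r)
    denom = v + r.sum()
    # Class 0 gets mass v; class z >= 1 gets mass r(x_z).
    return np.concatenate([[v / denom], r / denom])
```

With all ratios equal to one and $v = 1$, the $K+1$ classes are equiprobable, matching the uniform prior structure of the construction.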

3. Debiased InfoNCE in Recommendation and Pointwise Losses

In recommendation, negative sampling from the marginal $p_u$ often contaminates the denominator with false negatives, especially when positives (items with observed user interactions) are not completely observed. Debiased InfoNCE (Jin et al., 2023, Li et al., 2023) corrects this by analytically subtracting the expected contribution of false negatives. For user $u$, with positive fraction $\tau_u^+$ and negative fraction $\tau_u^-$, the empirical debiased denominator is

$$f_{\mathrm{debias},u} = \max \left\{ \frac{1}{\tau_u^-} \left( \frac{1}{N} \sum_{n=1}^N e^{\hat{y}_{uj_n}/\tau} - \tau_u^+\, \frac{1}{M} \sum_{m=1}^M e^{\hat{y}_{uk_m}/\tau} \right),\ e^{-1/\tau} \right\}$$

Debiased InfoNCE thus becomes

$$L_{\mathrm{InfoNCE}}^{\mathrm{debiased}} = -\mathbb{E}_u\, \mathbb{E}_{i \sim p_u^+} \left[ \log \frac{e^{\hat{y}_{ui}/\tau}}{e^{\hat{y}_{ui}/\tau} + \lambda f_{\mathrm{debias},u}} \right]$$

Unbiasedness is theoretically guaranteed by construction; empirical gains in recommendation (Recall@20, NDCG@20) consistently confirm the advantage of the debiased variant (1.7% improvement over InfoNCE; MINE+ up to 11.5%) (Jin et al., 2023, Li et al., 2023).
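The corrected denominator can be sketched for a single user as below. The argument names are illustrative, and the assumption $\tau_u^- = 1 - \tau_u^+$ (class fractions summing to one) is made here for simplicity.

```python
import numpy as np

def debiased_denominator(neg_scores, pos_scores, tau, tau_pos):
    """f_debias for one user: subtract the expected false-negative
    contribution from the sampled-negative average, clamped below by
    exp(-1/tau) to keep the denominator positive.

    neg_scores: scores y_hat_{u j_n} for N sampled "negatives"
    pos_scores: scores y_hat_{u k_m} for M known positives
    tau:        softmax temperature
    tau_pos:    tau_u^+, the positive-class fraction for user u
    """
    tau_neg = 1.0 - tau_pos  # assumes fractions sum to one
    neg_mean = np.mean(np.exp(neg_scores / tau))
    pos_mean = np.mean(np.exp(pos_scores / tau))
    corrected = (neg_mean - tau_pos * pos_mean) / tau_neg
    return max(corrected, np.exp(-1.0 / tau))
```

The clamp matters in practice: when the sampled negatives score far below the known positives, the analytic subtraction can drive the corrected average negative, and the floor $e^{-1/\tau}$ keeps the loss well-defined.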

4. Positive-Unlabeled Correction in Graph Contrastive Learning

In GCL, InfoNCE suffers from semantic bias when treating all non-augmented pairs as negatives, ignoring that some may be true positives (semantically similar by graph structure or attributes). Wang et al. reinterpret GCL as a Positive-Unlabeled (PU) learning problem and prove that InfoNCE scores $s_\theta(n, n')$ rank pairs by their probability of positivity (the “free lunch” theorem). After warm-up, pseudo-positive pairs among unlabeled negatives are extracted by thresholding $s_\theta$; the corrected likelihood objective then maximizes the probability of both labeled and mined positives, weighted by confidence $\hat{s}_\theta$ and a factor $\beta < 1$:

$$L_{n,n'}^{\mathrm{corr}} = -\log \left[ P_{n,n'} \prod_{(n,\, n'') \in D_U^+} (P_{n,n''})^{\beta\, \hat{s}_\theta(n,n'')} \right]$$

Empirical gains in node classification accuracy, especially in out-of-domain scenarios, support the value of semantically guided debiasing (up to +9.05 pp on GOODCBAS). Synergy with LLM-based features further enhances hidden-positive recovery (Wang et al., 7 May 2025).
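The structure of the corrected objective can be sketched numerically. The representation below, where each mined pseudo-positive is a `(probability, confidence)` pair, is a hypothetical simplification for illustration; in the actual method these quantities come from the model's softmax and the thresholded scores $\hat{s}_\theta$.

```python
import numpy as np

def pu_corrected_loss(p_labeled, mined_pairs, beta=0.5):
    """Corrected negative log-likelihood for one anchor pair.

    p_labeled:   model probability P_{n,n'} of the labeled positive pair.
    mined_pairs: list of (P_{n,n''}, s_hat) tuples for mined
                 pseudo-positives in D_U^+ (hypothetical representation).
    beta:        down-weighting factor (< 1) for mined pairs.
    """
    log_lik = np.log(p_labeled)
    for p, s_hat in mined_pairs:
        # Each mined pair contributes with exponent beta * s_hat.
        log_lik += beta * s_hat * np.log(p)
    return -log_lik
```

With no mined pairs this reduces to the ordinary per-pair InfoNCE-style negative log-likelihood, so the correction is a strict generalization.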

5. Debiased Losses in Supervised Contrastive Learning

Barbano et al. expose how dataset bias (e.g., spurious correlations) can undermine InfoNCE and SupCon, with positive samples grouped by bias rather than true class. They frame debiased contrastive learning as enforcing an $\varepsilon$-margin between positives and negatives,

$$\varepsilon\text{-SupInfoNCE} = -\sum_i \log \frac{\exp(s_i^+)}{\exp(s_i^+ - \varepsilon) + \sum_j \exp(s_j^-)}$$

with $\varepsilon > 0$ enforcing a minimal gap. The FairKL regularizer matches anchor-to-positive and anchor-to-negative distance distributions across bias-aligned and bias-conflicting sets, ensuring that learned representations are robust and minimize bias. Combined, $\varepsilon$-SupInfoNCE and FairKL achieve state-of-the-art debiasing on synthetic and realistic benchmarks (Biased-MNIST, Corrupted-CIFAR10, bFFHQ), with unbiased test accuracy up to $\sim 90.5\%$ (Barbano et al., 2022).
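The per-anchor margin loss can be sketched as follows; the score-array interface and function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def eps_sup_infonce(pos_scores, neg_scores, eps):
    """Per-anchor epsilon-SupInfoNCE: each positive score s+ competes
    against its own margin-shifted copy exp(s+ - eps) plus all negatives,
    summed over the anchor's positives.
    """
    neg_sum = np.sum(np.exp(neg_scores))
    return float(sum(
        -(s - np.log(np.exp(s - eps) + neg_sum))  # -log softmax with margin
        for s in pos_scores
    ))
```

When the negatives are pushed far below the positives, each term approaches $-\varepsilon$, so the loss rewards exceeding the margin rather than merely ranking positives first.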

6. Unified Decision-Theoretic Framework and Implications

The consistent pattern across domains is that debiased InfoNCE is enabled by explicit correction mechanisms—anchor classes, analytical expectation subtraction, positive mining, or margin regularization. Theoretical properties center on Fisher consistency, unbiased mutual information estimation, and robust density-ratio recovery. Under a decision-theoretic framework, these corrections generalize beyond InfoNCE to chi-squared ($\chi^2$) plug-in estimators, $f$-divergence estimators, and more, via selection of proper scoring rules (strictly convex generating functions). InfoNCE-anchor, for example, is a cross-entropy (log-score) proper scoring rule, while other losses can be derived using alternative scoring functions (Ryu et al., 29 Oct 2025).

A plausible implication is that accurate MI estimation is neither necessary nor sufficient for superior representation learning performance; contrastive methods benefit predominantly from learning structured density ratios, not the exact $I(X;Y)$. Debiased InfoNCE is thus most crucial for tasks requiring valid mutual information measurement, statistical decision theory consistency, or fairness/robustness guarantees rather than representation utility per se.

7. Summary Table: Debiased InfoNCE Variants Across Modalities

| Modality | Debiasing Mechanism | Key Theoretical Property |
|---|---|---|
| Mutual Information Est. | Anchor class (InfoNCE-anchor) | Fisher-consistent, unbiased MI estimate (Ryu et al., 29 Oct 2025) |
| Recommender Systems | Analytic false-negative subtraction | Unbiased empirical loss for positives/negatives (Jin et al., 2023, Li et al., 2023) |
| Graph Contrastive | PU mining, score thresholding | Density-ratio recovery; semantic pair correction (Wang et al., 7 May 2025) |
| Metric/Supervised Vision | $\varepsilon$-margin, FairKL | Robustness to dataset confounders (Barbano et al., 2022) |

Taken together, debiased InfoNCE unifies contrastive and classification-based objectives under a principled framework, substantiating empirical and theoretical advances across information-theoretic, graph, supervised, and recommendation contexts.
