- The paper demonstrates that full-batch contrastive learning achieves perfect positive alignment and uniform negative separation, while certain loss formulations can lead to excessive separation.
- The paper reveals that mini-batch training introduces higher variance in negative-pair similarities, with smaller batches exacerbating non-uniform separation.
- The paper proposes an auxiliary VRNS (variance reduction of negative-pair similarities) loss to regularize mini-batch negative pairs, thereby enhancing representation consistency and downstream performance.
Contrastive learning (CL) is a powerful technique for representation learning that trains models to pull embeddings of positive pairs (different views of the same instance) closer together and push embeddings of negative pairs (views from different instances) farther apart. This paper provides a unified framework for understanding CL objectives by analyzing the cosine similarity between embedding pairs in both full-batch and mini-batch training settings. It reveals limitations of existing methods, particularly in mini-batch scenarios, and proposes a practical auxiliary loss to mitigate these issues.
The paper considers two general forms of contrastive loss functions that encompass many existing methods, including InfoNCE (Oord et al., 2018), SimCLR (Chen et al., 2020), DCL (Yeh et al., 2022), DHEL (Koromilas et al., 2024), SigLIP (Zhai et al., 2023), and Spectral CL (HaoChen et al., 2021). The analysis centers on the "positive-pair similarity" (cosine similarity between embeddings of a positive pair) and the "negative-pair similarity" (cosine similarity between embeddings of a negative pair).
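As a concrete reference point for these two quantities, the sketch below computes the positive- and negative-pair cosine similarities for a batch of paired embeddings; the function name and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pair_similarities(u, v):
    """Split paired embeddings into positive- and negative-pair cosine similarities.

    u, v: tensors of shape (m, d); row i of u and row i of v are two views of instance i.
    """
    u = F.normalize(u, dim=1)  # unit-norm rows, so dot products equal cosine similarities
    v = F.normalize(v, dim=1)
    sim = u @ v.T                                                      # (m, m) cross-view similarity matrix
    pos = sim.diagonal()                                               # positive pairs (u_i, v_i)
    neg = sim[~torch.eye(len(u), dtype=torch.bool, device=u.device)]   # negative pairs (u_i, v_j), i != j
    return pos, neg

# Example with random (untrained) embeddings
u, v = torch.randn(8, 32), torch.randn(8, 32)
pos, neg = pair_similarities(u, v)
print(pos.mean().item(), neg.mean().item(), neg.var().item())
```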
Behavior of Learned Embeddings
Full-Batch Contrastive Learning:
In the full-batch setting, where all n instances are used simultaneously to compute the loss, the paper shows (Theorem 1 (2506.09781)) that for a broad class of CL objectives the optimal encoder f⋆ achieves perfect alignment for positive pairs, meaning their cosine similarity is 1, and uniform separation for negative pairs, whose cosine similarity equals −1/(n−1). This generalizes previous findings on the optimal structure (e.g., the Simplex Equiangular Tight Frame) (Fontanela et al., 2020, Zhao et al., 2024, Heimbach et al., 2024).
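The −1/(n−1) value can be checked numerically: the snippet below builds a standard simplex configuration of n unit vectors (centered one-hot vectors, an illustrative construction rather than the paper's) and confirms that every pairwise cosine similarity equals −1/(n−1).

```python
import torch
import torch.nn.functional as F

n = 8
# Centered and row-normalized one-hot vectors form a regular simplex on the unit sphere.
simplex = F.normalize(torch.eye(n) - 1.0 / n, dim=1)    # shape (n, n)

cos = simplex @ simplex.T                                # pairwise cosine similarities
off_diag = cos[~torch.eye(n, dtype=torch.bool)]          # negative-pair entries only

print(off_diag.min().item(), off_diag.max().item())      # both approximately -1/(n-1)
print(-1.0 / (n - 1))                                    # -0.142857...
```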
However, the paper identifies a potential issue in full-batch CL called "excessive separation." Theorem 2 (2506.09781) shows a trade-off relationship: the expected positive-pair similarity satisfies E[sim_pos(f)] ≤ 1 + (E[sim_neg(f)] + 1/(n−1)). This implies that if the average negative-pair similarity drops below the optimal threshold −1/(n−1), perfect alignment of positive pairs (E[sim_pos(f)] = 1) becomes unattainable. Theorem 3 (2506.09781) proves that this excessive separation can occur for certain losses, such as the independently additive form (Definition 2) with only cross-view negatives (c₁ = 1, c₂ = 0), when the loss prioritizes pushing negatives farther apart over maintaining positive alignment.
The SigLIP loss (Zhai et al., 2023) is given as an example where, depending on hyperparameters (specifically, when the bias term b is sufficiently small), the optimal solution can suffer from excessive separation: positive pairs are not perfectly aligned, and negative pairs are pushed below −1/(n−1) similarity.
Practical Implication for Full-Batch: To avoid excessive separation in full-batch settings, especially for independently additive losses, the paper suggests including "within-view" negative pairs (c₂ = 1) in the loss computation, as sketched below. This structural change to the loss helps maintain the desired embedding properties without complex hyperparameter tuning.
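Concretely, within-view negatives add the off-diagonal entries of the same-view similarity matrices (U·Uᵀ and V·Vᵀ) alongside the cross-view ones. The helper below is a minimal illustration of how such pairs could be gathered, not the paper's exact loss.

```python
import torch

def gather_negative_sims(u, v):
    """Collect cross-view and within-view negative-pair similarities.

    u, v: unit-normalized embeddings of shape (m, d) for the two views.
    """
    m = u.shape[0]
    off_diag = ~torch.eye(m, dtype=torch.bool, device=u.device)

    cross_view = (u @ v.T)[off_diag]       # (u_i, v_j) with i != j  (c1 = 1)
    within_view = torch.cat([              # (u_i, u_j) and (v_i, v_j) with i != j  (c2 = 1)
        (u @ u.T)[off_diag],
        (v @ v.T)[off_diag],
    ])
    return cross_view, within_view
```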
Mini-Batch Contrastive Learning:
In practical mini-batch training (with batch size m < n), samples are grouped into batches, and the loss is computed over pairs within each batch. Theorem 4 (2506.09781) shows that for optimal embeddings minimizing the sum of per-batch losses, positive pairs still achieve perfect alignment, and the expected negative-pair similarity across the entire dataset is still −1/(n−1). However, unlike the full-batch case, the variance of negative-pair similarities is strictly positive when m < n. This means negative pairs are not uniformly separated: some are closer and some are farther apart than the theoretical optimum. The variance is proven to be lower-bounded by (n−m)/((m−1)(n−1)²) and upper-bounded by n(n−m)/((m−1)(n−1)²). This variance is a monotonically decreasing function of the batch size m, indicating that smaller batches exacerbate the non-uniform separation.
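A quick way to get a feel for these bounds is to evaluate them for a fixed dataset size across batch sizes; the snippet below simply plugs values into the two expressions as reconstructed above, with n chosen arbitrarily for illustration.

```python
def variance_bounds(n, m):
    """Lower/upper bounds on the variance of negative-pair similarities (per Theorem 4)."""
    lower = (n - m) / ((m - 1) * (n - 1) ** 2)
    upper = n * (n - m) / ((m - 1) * (n - 1) ** 2)
    return lower, upper

n = 50_000  # illustrative dataset size (roughly CIFAR-scale)
for m in (32, 64, 128, 256, 512):
    lo, up = variance_bounds(n, m)
    print(f"m={m:4d}  lower={lo:.3e}  upper={up:.3e}")  # both bounds shrink as m grows
```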
Figure 1 visually illustrates this: while the average negative similarity might be the same across full-batch and mini-batch training, the mini-batch case shows higher variance, leading to some negative pairs being much closer together (undesirable) and others much farther apart.
Theorem 5 (2506.09781) provides further insight by analyzing the gradient of the InfoNCE loss with respect to negative pair similarities. It shows that the magnitude of this gradient, which drives negative pairs apart, is larger for smaller batch sizes. This confirms that smaller batches lead to stronger separation among within-batch negative pairs.
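This effect can be reproduced with a few lines of autograd: for a single anchor with one positive and m−1 equally similar negatives, the gradient of the InfoNCE loss with respect to each negative-pair similarity grows as the batch shrinks. The setup below is a simplified illustration, not the paper's exact analysis, and the similarity and temperature values are arbitrary.

```python
import torch

def infonce_neg_grad(batch_size, pos_sim=0.9, neg_sim=0.0, temperature=0.5):
    """Gradient of a single-anchor InfoNCE loss w.r.t. one negative-pair similarity."""
    neg = torch.full((batch_size - 1,), neg_sim, requires_grad=True)
    pos = torch.tensor(pos_sim)
    logits = torch.cat([pos.unsqueeze(0), neg]) / temperature
    loss = -torch.log_softmax(logits, dim=0)[0]   # InfoNCE loss for this anchor
    loss.backward()
    return neg.grad[0].item()                     # gradient on each (identical) negative

for m in (32, 64, 128, 256, 512):
    print(m, infonce_neg_grad(m))                 # gradient magnitude shrinks as m grows
```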
Practical Solution for Mini-Batch:
To address the increased variance of negative-pair similarities in mini-batch training, the paper proposes an auxiliary loss term (Definition 4 (2506.09781)):
L_VRNS(U_[m], V_[m]) := (1 / (m(m−1))) · Σ_{i≠j ∈ [m]} (u_i⊤ v_j + 1/(n−1))²
This term penalizes deviations of the cross-view negative-pair similarities (u_i⊤ v_j for i ≠ j) from the target value −1/(n−1). It can be added to any standard mini-batch contrastive loss: L_total = L_baseline + λ · L_VRNS, where λ > 0 is a hyperparameter.
Implementation of L_VRNS:
The L_VRNS term can be implemented efficiently using matrix operations. Given mini-batch embeddings U and V of shape (m, d):
- Calculate the similarity matrix S = U V⊤, which has shape (m, m).
- The off-diagonal elements S_ij = u_i⊤ v_j for i ≠ j are the similarities of the cross-view negative pairs within the batch.
- Calculate the squared difference from the target similarity: (S_ij − (−1/(n−1)))² for all i ≠ j.
- Sum these squared differences and divide by the number of such pairs, m(m−1).
- Multiply by the hyperparameter λ.
Here is a conceptual Python-like pseudocode snippet:
```python
import torch

def calculate_vrns_loss(embeddings_u, embeddings_v, total_dataset_size, lambda_reg):
    """
    Calculates the Variance Reduction of Negative-Pair Similarities (VRNS) loss.

    Args:
        embeddings_u: Tensor of shape (batch_size, embedding_dim) for view U.
        embeddings_v: Tensor of shape (batch_size, embedding_dim) for view V.
        total_dataset_size: Total number of instances (n).
        lambda_reg: Regularization weight (lambda).

    Returns:
        The weighted VRNS loss term.
    """
    batch_size = embeddings_u.shape[0]
    if batch_size < 2:
        # No cross-view negative pairs can be formed from a single instance.
        return torch.tensor(0.0, device=embeddings_u.device)

    optimal_neg_sim = -1.0 / (total_dataset_size - 1)  # target similarity -1/(n-1)

    # All pairwise cross-view similarities within the batch: shape (batch_size, batch_size)
    sim_matrix = torch.matmul(embeddings_u, embeddings_v.transpose(0, 1))

    # Mask the diagonal to keep only negative pairs (u_i, v_j) with i != j
    identity_mask = torch.eye(batch_size, device=embeddings_u.device).bool()
    negative_sims = sim_matrix[~identity_mask]  # shape (batch_size * (batch_size - 1),)

    # Squared deviation from the optimal negative similarity, averaged over all such pairs
    squared_diffs = (negative_sims - optimal_neg_sim) ** 2
    mean_squared_diff = torch.mean(squared_diffs)

    return lambda_reg * mean_squared_diff
```
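A typical way to wire this into a training step is to add the regularizer onto whatever baseline contrastive loss is already in use. The NT-Xent-style baseline below is only a placeholder for illustration, and the variable names and dataset size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(u, v, temperature=0.5):
    """Minimal cross-view InfoNCE/NT-Xent placeholder (u_i matched against all v_j)."""
    logits = (u @ v.T) / temperature
    targets = torch.arange(u.shape[0], device=u.device)
    return F.cross_entropy(logits, targets)

# One (synthetic) training step: embeddings_u / embeddings_v come from the two augmented views
embeddings_u = F.normalize(torch.randn(128, 256), dim=1)
embeddings_v = F.normalize(torch.randn(128, 256), dim=1)

baseline = nt_xent_loss(embeddings_u, embeddings_v)
vrns = calculate_vrns_loss(embeddings_u, embeddings_v, total_dataset_size=50_000, lambda_reg=1.0)
total_loss = baseline + vrns  # L_total = L_baseline + lambda * L_VRNS (lambda applied inside)
# total_loss.backward() in an actual training loop
```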
Empirical Validation:
The paper validates its findings and the effectiveness of the proposed auxiliary loss through experiments on CIFAR-10, CIFAR-100, and ImageNet.
- Non-Uniform Separation Observed: Table 1 (2506.09781) shows that the variance of negative-pair similarities in SimCLR models trained on CIFAR-100 increases as the batch size decreases (from 512 down to 32), confirming the theoretical result of Theorem 4. Training with the proposed L_VRNS loss significantly reduces this variance across all batch sizes.
- Performance Gains: Figure 2 (2506.09781) demonstrates that incorporating the L_VRNS term into SimCLR improves downstream classification accuracy on the CIFAR datasets and makes performance less sensitive to the temperature hyperparameter. Figure 3 (2506.09781) shows consistent gains when L_VRNS is combined with various baseline methods (SimCLR, DCL, DHEL) on the CIFAR datasets, with more pronounced improvements at smaller batch sizes. Additional results in Appendix C confirm gains on ImageNet-100 and the full ImageNet dataset.
Implementation Considerations:
- Computational Cost: Adding the L_VRNS term requires computing all pairwise cross-view similarities within a mini-batch, which adds a small overhead relative to baselines that do not explicitly sum over all negative pairs (though many InfoNCE-based losses compute this similarity matrix anyway).
- Hyperparameter Tuning: The weighting factor λ for the L_VRNS term needs to be tuned (the paper explores λ ∈ {0.1, 0.3, 1, 3, 10, 30, 100}).
- Potential Limitations: As discussed in Appendix D (2506.09781), the L_VRNS loss pushes the variance of negative-pair similarities toward a uniform target. While this mitigates mini-batch effects, it could also suppress variance that captures meaningful semantic structure in the data. Its impact is smaller at very large batch sizes, where the variance issue is naturally less severe.
In summary, the paper provides a valuable theoretical perspective on CL by focusing on embedding similarities. It formally proves that mini-batch training inherently introduces non-uniformity (higher variance) in negative-pair similarities, especially at small batch sizes, which can degrade performance. The proposed auxiliary L_VRNS loss offers a practical way to counteract this by regularizing negative-pair similarities toward the theoretically optimal uniform separation, leading to improved and more robust representations in small-batch regimes.