- The paper demonstrates that full-batch contrastive learning achieves perfect positive alignment and uniform negative separation, while certain loss formulations can lead to excessive separation.
- The paper reveals that mini-batch training introduces higher variance in negative-pair similarities, with smaller batches exacerbating non-uniform separation.
- The paper proposes an auxiliary VRNS (variance reduction of negative-pair similarities) loss to regularize mini-batch negative pairs, thereby enhancing representation consistency and downstream performance.
Contrastive learning (CL) is a powerful technique for representation learning that trains models to pull embeddings of positive pairs (different views of the same instance) closer together and push embeddings of negative pairs (views from different instances) farther apart. This paper provides a unified framework for understanding CL objectives by analyzing the cosine similarity between embedding pairs in both full-batch and mini-batch training settings. It reveals limitations of existing methods, particularly in mini-batch scenarios, and proposes a practical auxiliary loss to mitigate these issues.
The paper considers two general forms of contrastive loss functions that encompass many existing methods, including InfoNCE (Oord et al., 2018), SimCLR (Chen et al., 2020), DCL (Yeh et al., 2022), DHEL (Koromilas et al., 2024), SigLIP (Zhai et al., 2023), and Spectral CL (HaoChen et al., 2021). The analysis centers on the "positive-pair similarity" (cosine similarity between embeddings of a positive pair) and the "negative-pair similarity" (cosine similarity between embeddings of a negative pair).
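As a concrete reference point for these two quantities, the sketch below computes the positive- and negative-pair cosine similarities for a batch of paired embeddings; the function name and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pair_similarities(u, v):
    """Split paired embeddings into positive- and negative-pair cosine similarities.

    u, v: tensors of shape (m, d); row i of u and row i of v are two views of instance i.
    """
    u = F.normalize(u, dim=1)  # unit-norm rows, so dot products equal cosine similarities
    v = F.normalize(v, dim=1)
    sim = u @ v.T                                                      # (m, m) cross-view similarity matrix
    pos = sim.diagonal()                                               # positive pairs (u_i, v_i)
    neg = sim[~torch.eye(len(u), dtype=torch.bool, device=u.device)]   # negative pairs (u_i, v_j), i != j
    return pos, neg

# Example with random (untrained) embeddings
u, v = torch.randn(8, 32), torch.randn(8, 32)
pos, neg = pair_similarities(u, v)
print(pos.mean().item(), neg.mean().item(), neg.var().item())
```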
Behavior of Learned Embeddings
Full-Batch Contrastive Learning:
In the full-batch setting, where all n instances are used simultaneously to compute the loss, the paper shows (Theorem 1 (2506.09781)) that for a broad class of CL objectives the optimal encoder f⋆ achieves perfect alignment for positive pairs, meaning their cosine similarity is 1, and uniform separation for negative pairs, whose cosine similarity equals −1/(n−1). This generalizes previous findings on the optimal structure (e.g., the Simplex Equiangular Tight Frame) (Fontanela et al., 2020, Zhao et al., 2024, Heimbach et al., 2024).
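The −1/(n−1) value can be checked numerically: the snippet below builds a standard simplex configuration of n unit vectors (centered one-hot vectors, an illustrative construction rather than the paper's) and confirms that every pairwise cosine similarity equals −1/(n−1).

```python
import torch
import torch.nn.functional as F

n = 8
# Centered and row-normalized one-hot vectors form a regular simplex on the unit sphere.
simplex = F.normalize(torch.eye(n) - 1.0 / n, dim=1)    # shape (n, n)

cos = simplex @ simplex.T                                # pairwise cosine similarities
off_diag = cos[~torch.eye(n, dtype=torch.bool)]          # negative-pair entries only

print(off_diag.min().item(), off_diag.max().item())      # both approximately -1/(n-1)
print(-1.0 / (n - 1))                                    # -0.142857...
```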
However, the paper identifies a potential issue in full-batch CL called "excessive separation." Theorem 2 (2506.09781) shows a trade-off relationship: the expected positive-pair similarity satisfies E[sim_pos(f)] ≤ 1 + (E[sim_neg(f)] + 1/(n−1)). This implies that if the average negative-pair similarity drops below the optimal threshold −1/(n−1), perfect alignment of positive pairs (E[sim_pos(f)] = 1) becomes unattainable. Theorem 3 (2506.09781) proves that this excessive separation can occur for certain losses, such as the independently additive form (Definition 2) with only cross-view negatives (c₁ = 1, c₂ = 0), when the loss prioritizes pushing negatives farther apart over maintaining positive alignment.
The SigLIP loss (Zhai et al., 2023) is given as an example where, depending on hyperparameters (specifically, when the bias term b is sufficiently small), the optimal solution can suffer from excessive separation: positive pairs are not perfectly aligned, and negative pairs are pushed below −1/(n−1) similarity.
Practical Implication for Full-Batch: To avoid excessive separation in full-batch settings, especially for independently additive losses, the paper suggests including "within-view" negative pairs (c₂ = 1) in the loss computation, as sketched below. This structural change to the loss helps maintain the desired embedding properties without complex hyperparameter tuning.
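Concretely, within-view negatives add the off-diagonal entries of the same-view similarity matrices (U·Uᵀ and V·Vᵀ) alongside the cross-view ones. The helper below is a minimal illustration of how such pairs could be gathered, not the paper's exact loss.

```python
import torch

def gather_negative_sims(u, v):
    """Collect cross-view and within-view negative-pair similarities.

    u, v: unit-normalized embeddings of shape (m, d) for the two views.
    """
    m = u.shape[0]
    off_diag = ~torch.eye(m, dtype=torch.bool, device=u.device)

    cross_view = (u @ v.T)[off_diag]       # (u_i, v_j) with i != j  (c1 = 1)
    within_view = torch.cat([              # (u_i, u_j) and (v_i, v_j) with i != j  (c2 = 1)
        (u @ u.T)[off_diag],
        (v @ v.T)[off_diag],
    ])
    return cross_view, within_view
```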
Mini-Batch Contrastive Learning:
In practical mini-batch training (with batch size m < n), samples are grouped into batches, and the loss is computed over pairs within each batch. Theorem 4 (2506.09781) shows that for optimal embeddings minimizing the sum of per-batch losses, positive pairs still achieve perfect alignment, and the expected negative-pair similarity across the entire dataset is still −1/(n−1). However, unlike the full-batch case, the variance of negative-pair similarities is strictly positive when m < n. This means negative pairs are not uniformly separated: some are closer and some are farther apart than the theoretical optimum. The variance is proven to be lower-bounded by (n−m)/((m−1)(n−1)²) and upper-bounded by n(n−m)/((m−1)(n−1)²). This variance is a monotonically decreasing function of the batch size m, indicating that smaller batches exacerbate the non-uniform separation.
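A quick way to get a feel for these bounds is to evaluate them for a fixed dataset size across batch sizes; the snippet below simply plugs values into the two expressions as reconstructed above, with n chosen arbitrarily for illustration.

```python
def variance_bounds(n, m):
    """Lower/upper bounds on the variance of negative-pair similarities (per Theorem 4)."""
    lower = (n - m) / ((m - 1) * (n - 1) ** 2)
    upper = n * (n - m) / ((m - 1) * (n - 1) ** 2)
    return lower, upper

n = 50_000  # illustrative dataset size (roughly CIFAR-scale)
for m in (32, 64, 128, 256, 512):
    lo, up = variance_bounds(n, m)
    print(f"m={m:4d}  lower={lo:.3e}  upper={up:.3e}")  # both bounds shrink as m grows
```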
Figure 1 visually illustrates this: while the average negative similarity might be the same across full-batch and mini-batch training, the mini-batch case shows higher variance, leading to some negative pairs being much closer together (undesirable) and others much farther apart.
Theorem 5 (2506.09781) provides further insight by analyzing the gradient of the InfoNCE loss with respect to negative pair similarities. It shows that the magnitude of this gradient, which drives negative pairs apart, is larger for smaller batch sizes. This confirms that smaller batches lead to stronger separation among within-batch negative pairs.
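This effect can be reproduced with a few lines of autograd: for a single anchor with one positive and m−1 equally similar negatives, the gradient of the InfoNCE loss with respect to each negative-pair similarity grows as the batch shrinks. The setup below is a simplified illustration, not the paper's exact analysis, and the similarity and temperature values are arbitrary.

```python
import torch

def infonce_neg_grad(batch_size, pos_sim=0.9, neg_sim=0.0, temperature=0.5):
    """Gradient of a single-anchor InfoNCE loss w.r.t. one negative-pair similarity."""
    neg = torch.full((batch_size - 1,), neg_sim, requires_grad=True)
    pos = torch.tensor(pos_sim)
    logits = torch.cat([pos.unsqueeze(0), neg]) / temperature
    loss = -torch.log_softmax(logits, dim=0)[0]   # InfoNCE loss for this anchor
    loss.backward()
    return neg.grad[0].item()                     # gradient on each (identical) negative

for m in (32, 64, 128, 256, 512):
    print(m, infonce_neg_grad(m))                 # gradient magnitude shrinks as m grows
```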
Practical Solution for Mini-Batch:
To address the increased variance of negative-pair similarities in mini-batch training, the paper proposes an auxiliary loss term (Definition 4 (2506.09781)):
L_VRNS(U_[m], V_[m]) := (1 / (m(m−1))) · Σ_{i≠j ∈ [m]} (u_i⊤ v_j + 1/(n−1))²
This term penalizes deviations of the cross-view negative-pair similarities (u_i⊤ v_j for i ≠ j) from the target value −1/(n−1). It can be added to any standard mini-batch contrastive loss: L_total = L_baseline + λ · L_VRNS, where λ > 0 is a hyperparameter.
Implementation of L_VRNS:
The L_VRNS term can be implemented efficiently using matrix operations. Given mini-batch embeddings U and V of shape (m, d):
- Calculate the similarity matrix S = U V⊤, which has shape (m, m).
- The off-diagonal elements S_ij = u_i⊤ v_j for i ≠ j are the similarities of the cross-view negative pairs within the batch.
- Calculate the squared difference from the target similarity: (S_ij − (−1/(n−1)))² for all i ≠ j.
- Sum these squared differences and divide by the number of such pairs, m(m−1).
- Multiply by the hyperparameter λ.
Here is a conceptual Python-like pseudocode snippet:
```python
import torch

def calculate_vrns_loss(embeddings_u, embeddings_v, total_dataset_size, lambda_reg):
    """
    Calculates the Variance Reduction of Negative-Pair Similarities (VRNS) loss.

    Args:
        embeddings_u: Tensor of shape (batch_size, embedding_dim) for view U.
        embeddings_v: Tensor of shape (batch_size, embedding_dim) for view V.
        total_dataset_size: Total number of instances (n).
        lambda_reg: Regularization weight (lambda).

    Returns:
        The weighted VRNS loss term.
    """
    batch_size = embeddings_u.shape[0]
    if batch_size < 2:
        # No cross-view negative pairs can be formed from a single instance.
        return torch.tensor(0.0, device=embeddings_u.device)

    optimal_neg_sim = -1.0 / (total_dataset_size - 1)  # target similarity -1/(n-1)

    # All pairwise cross-view similarities within the batch: shape (batch_size, batch_size)
    sim_matrix = torch.matmul(embeddings_u, embeddings_v.transpose(0, 1))

    # Mask the diagonal to keep only negative pairs (u_i, v_j) with i != j
    identity_mask = torch.eye(batch_size, device=embeddings_u.device).bool()
    negative_sims = sim_matrix[~identity_mask]  # shape (batch_size * (batch_size - 1),)

    # Squared deviation from the optimal negative similarity, averaged over all such pairs
    squared_diffs = (negative_sims - optimal_neg_sim) ** 2
    mean_squared_diff = torch.mean(squared_diffs)

    return lambda_reg * mean_squared_diff
```
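A typical way to wire this into a training step is to add the regularizer onto whatever baseline contrastive loss is already in use. The NT-Xent-style baseline below is only a placeholder for illustration, and the variable names and dataset size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(u, v, temperature=0.5):
    """Minimal cross-view InfoNCE/NT-Xent placeholder (u_i matched against all v_j)."""
    logits = (u @ v.T) / temperature
    targets = torch.arange(u.shape[0], device=u.device)
    return F.cross_entropy(logits, targets)

# One (synthetic) training step: embeddings_u / embeddings_v come from the two augmented views
embeddings_u = F.normalize(torch.randn(128, 256), dim=1)
embeddings_v = F.normalize(torch.randn(128, 256), dim=1)

baseline = nt_xent_loss(embeddings_u, embeddings_v)
vrns = calculate_vrns_loss(embeddings_u, embeddings_v, total_dataset_size=50_000, lambda_reg=1.0)
total_loss = baseline + vrns  # L_total = L_baseline + lambda * L_VRNS (lambda applied inside)
# total_loss.backward() in an actual training loop
```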
Empirical Validation:
The paper validates its findings and the effectiveness of the proposed auxiliary loss through experiments on CIFAR-10, CIFAR-100, and ImageNet.
- Non-Uniform Separation Observed: Table 1 (2506.09781) shows that the variance of negative-pair similarities in SimCLR models trained on CIFAR-100 increases as the batch size decreases (from 512 down to 32), confirming the theoretical result of Theorem 4. Training with the proposed L_VRNS loss significantly reduces this variance across all batch sizes.
- Performance Gains: Figure 2 (2506.09781) demonstrates that incorporating the L_VRNS term into SimCLR improves downstream classification accuracy on the CIFAR datasets and makes performance less sensitive to the temperature hyperparameter. Figure 3 (2506.09781) shows consistent gains when L_VRNS is combined with various baseline methods (SimCLR, DCL, DHEL) on the CIFAR datasets, with more pronounced improvements at smaller batch sizes. Additional results in Appendix C confirm gains on ImageNet-100 and the full ImageNet dataset.
Implementation Considerations:
- Computational Cost: Adding the L_VRNS term requires computing all pairwise cross-view similarities within a mini-batch, which adds a small overhead relative to baselines that do not explicitly sum over all negative pairs (though many InfoNCE-based losses compute this similarity matrix anyway).
- Hyperparameter Tuning: The weighting factor λ for the L_VRNS term needs to be tuned (the paper explores λ ∈ {0.1, 0.3, 1, 3, 10, 30, 100}).
- Potential Limitations: As discussed in Appendix D (2506.09781), the L_VRNS loss pushes the variance of negative-pair similarities toward a uniform target. While this mitigates mini-batch effects, it could also suppress variance that captures meaningful semantic structure in the data. Its impact is smaller at very large batch sizes, where the variance issue is naturally less severe.
In summary, the paper provides a valuable theoretical perspective on CL by focusing on embedding similarities. It formally proves that mini-batch training inherently introduces non-uniformity (higher variance) in negative-pair similarities, especially at small batch sizes, which can degrade performance. The proposed auxiliary L_VRNS loss offers a practical way to counteract this by regularizing negative-pair similarities toward the theoretically optimal uniform separation, leading to improved and more robust representations in small-batch regimes.