Effect of sum vs. mean reduction in supervised contrastive loss with variable instance sizes

Determine whether the "sum" reduction in the supervised contrastive loss captures variation in instance sizes across batches, and whether this explains the observed stability and performance gains over the "mean" reduction when training the multilingual Siamese two-tower embedding retriever on three-level e-commerce query–item relevance labels.

Background

The model is trained with a supervised contrastive (SupCon) objective adapted to three-level relevance labels for e-commerce query–item pairs. During training, the authors compared reduction strategies for aggregating loss across examples.

They report that using reduction=sum consistently yields more stable convergence and better performance than reduction=mean, and they explicitly conjecture that this is because reduction=sum better accounts for varied instance sizes across batches in their training setup.
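The intuition behind the conjecture can be made concrete with a minimal NumPy sketch. This is not the paper's implementation: the `supcon_loss` function, the equal-label positive rule, and the temperature value are illustrative assumptions. The key point it demonstrates is that with `reduction="sum"` the batch loss (and hence the gradient magnitude) scales with the number of valid anchors in the batch, whereas `reduction="mean"` normalizes that variation away:

```python
import numpy as np

def supcon_loss(sim, labels, temperature=0.1, reduction="sum"):
    """Toy supervised contrastive loss over a precomputed similarity matrix.

    sim:    (N, N) pairwise similarities between batch embeddings.
    labels: (N,) integer labels; same label => positive pair (a hypothetical
            simplification of the paper's three-level relevance setup).
    """
    N = sim.shape[0]
    logits = sim / temperature
    not_self = ~np.eye(N, dtype=bool)          # exclude self-pairs
    per_anchor = []
    for i in range(N):
        pos = (labels == labels[i]) & not_self[i]
        if not pos.any():
            continue                            # anchor with no positives: no loss term
        candidates = logits[i][not_self[i]]     # softmax over all non-self candidates
        log_denom = np.log(np.exp(candidates).sum())
        log_probs = logits[i][pos] - log_denom  # log p(positive | anchor)
        per_anchor.append(-log_probs.mean())
    per_anchor = np.array(per_anchor)
    # "sum" grows with the number of valid anchors (instance size);
    # "mean" divides it out, so every batch contributes equally.
    return per_anchor.sum() if reduction == "sum" else per_anchor.mean()
```

Under this sketch, a batch with twice as many valid anchors produces roughly twice the summed loss, so larger instances drive proportionally larger updates; with "mean", a batch containing only a few valid anchors receives the same total gradient weight as a full one, which is one plausible mechanism for the instability the authors report.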

References

From Section 4.3 of the paper: "Additionally, we observed consistent findings, in line with that reduction = sum has more stable convergence and better performance than reduction = mean and our conjecture is that reduction = sum captures the varied instance sizes across batches."

Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval  (2602.17654 - Xi et al., 19 Feb 2026) in Section 4.3 (Training: SupCon)