EuCLIP: Euclidean Contrastive Pre-training
- EuCLIP is a contrastive language-image framework that leverages Euclidean geometry by discarding L2 normalization, enabling simpler models and hierarchical entailment via Euclidean cones.
- It replaces cosine similarity with negative squared Euclidean distance, which stabilizes training and yields performance comparable to or better than traditional CLIP and hyperbolic methods.
- Experimental results on ViT architectures demonstrate higher accuracy and retrieval metrics across diverse benchmarks, reinforcing its practical advantages in high-dimensional embedding spaces.
EuCLIP (Euclidean CLIP) is a contrastive language-image pre-training framework that uses Euclidean embedding geometry in place of the L2-normalized, cosine-similarity embedding space of the original CLIP architecture. By discarding the final L2 normalization and using negative (squared) Euclidean distance as the similarity metric, EuCLIP achieves performance comparable to or better than both CLIP and hyperbolic variants, while supporting hierarchical structure through Euclidean entailment cones. This approach simplifies the architecture and, in high-dimensional settings, matches or exceeds the hierarchy-capturing power of more complex geometries (Chou et al., 2024).
1. Foundational Framework and InfoNCE Loss
Language-image contrastive pre-training, as popularized by CLIP, is defined over a mini-batch of $N$ paired texts and images. Encoders $f_T$ (text) and $f_I$ (image) output embeddings $t_i, v_i \in \mathbb{R}^d$. The classic InfoNCE loss is symmetrized (image→text and text→image) and formulated as:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\beta\, s(v_i, t_i))}{\sum_{j} \exp(\beta\, s(v_i, t_j))} + \log \frac{\exp(\beta\, s(v_i, t_i))}{\sum_{j} \exp(\beta\, s(v_j, t_i))} \right]$$

where $\beta$ is the learned logit scale and $s(\cdot, \cdot)$ the similarity function. CLIP employs L2 normalization for $v$ and $t$, setting $\hat{v} = v / \lVert v \rVert$ and $\hat{t} = t / \lVert t \rVert$ so that $s(v, t) = \hat{v} \cdot \hat{t}$ (cosine similarity). EuCLIP eliminates the normalization constraint and replaces the similarity function.
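The symmetric loss can be sketched in plain Python with a pluggable similarity function (an illustrative sketch, not the paper's implementation; the function names are mine):

```python
import math

def nll_rows(logits):
    """Mean negative log-likelihood of the diagonal under a row-wise softmax."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)

def symmetric_infonce(image_emb, text_emb, sim, beta):
    """Average of the image->text and text->image InfoNCE terms.

    logits[i][j] = beta * sim(v_i, t_j); matched pairs sit on the diagonal.
    """
    logits = [[beta * sim(v, t) for t in text_emb] for v in image_emb]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (nll_rows(logits) + nll_rows(transposed))

def cosine(v, t):
    """CLIP's choice of similarity: cosine of the angle between v and t."""
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return sum(a * b for a, b in zip(v, t)) / (norm(v) * norm(t))
```

Because the similarity is a pluggable argument, swapping cosine for a Euclidean-distance logit leaves the rest of the training objective untouched.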
2. EuCLIP Embedding Geometry and Similarity Function
EuCLIP operates in $\mathbb{R}^d$ without L2 normalization, using the following similarity/'logit' alternatives for contrastive learning:
- Negative Euclidean distance: $s(v, t) = -\lVert v - t \rVert$
- Negative squared Euclidean distance: $s(v, t) = -\lVert v - t \rVert^2$

Embeddings are first divided by $\sqrt{d}$ so that the expected magnitude does not grow with the embedding dimension. The squared-distance variant was found to stabilize training further and to introduce implicit regularization on positive pairs. Otherwise, the optimization and loss remain structurally identical to CLIP.
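A minimal sketch of such a logit in plain Python, assuming embeddings are pre-divided by the square root of their dimension (my reading of the pre-scaling; illustrative only):

```python
import math

def euclip_logit(v, t, squared=True):
    """Negative (squared) Euclidean distance between unnormalized embeddings.

    Both vectors are divided by sqrt(d) first so the expected distance does
    not scale with the embedding dimension d (assumed pre-scaling).
    """
    d = len(v)
    s = 1.0 / math.sqrt(d)
    sq = sum((s * a - s * b) ** 2 for a, b in zip(v, t))
    return -sq if squared else -math.sqrt(sq)
```

Identical embeddings give a logit of zero, the maximum possible value, so matched pairs are pulled together exactly as with cosine similarity, but without any norm constraint.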
3. Rationale for Euclidean Versus Alternative Geometries
The motivation for adopting Euclidean geometry stems from the limitations of both spherical (cosine) and hyperbolic alternatives:
- Spherical/cosine: Spherical geometry constrains all norms to unity, forming an elliptic manifold. The conventional “cosine distance” fails the triangle inequality, and norm invariance can be overly restrictive for hierarchical or modality-stratified representations.
- Hyperbolic (e.g., MERU): Hyperbolic approaches leverage the Lorentz hyperboloid, using negative Lorentzian distance logits and an explicit entailment loss for capturing hierarchy. However, the exponential volume growth that motivates hyperbolic space matters only in low dimensions; as the embedding dimension grows, hyperbolicity loses practical benefit.
In contrast, high-dimensional Euclidean spaces support simple and expressive representation learning. Hierarchy is achieved through Euclidean entailment cones. For a generic concept $u$, the cone half-aperture is

$$\psi(u) = \arcsin\left(\frac{K}{\lVert u \rVert}\right)$$

for a fixed aperture constant $K$; the cone opens from $u$ away from the origin. The exterior angle from $u$ to a specific $v$ is

$$\operatorname{ext}(u, v) = \arccos\left(\frac{\langle u,\, v - u \rangle}{\lVert u \rVert\, \lVert v - u \rVert}\right)$$

A penalty of $\max(0, \operatorname{ext}(u, v) - \psi(u))$ enforces entailment. This transitive loss matches or surpasses hyperbolic alternatives in capturing hierarchy.
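The cone test reduces to a few lines of arithmetic. A sketch assuming a half-aperture of $\arcsin(K / \lVert u \rVert)$ and the exterior angle measured at $u$ between the direction of $u$ and the segment $u \to v$ (the default value of `K` here is illustrative, not the paper's):

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def entailment_penalty(u, v, K=0.1):
    """Hinge penalty max(0, ext(u, v) - psi(u)) for a Euclidean entailment cone.

    u: generic concept, v: specific concept. The cone at u opens away from
    the origin; v incurs zero penalty when it lies inside the cone.
    """
    diff = [y - x for x, y in zip(u, v)]
    psi = math.asin(min(1.0, K / _norm(u)))          # half-aperture
    cos_ext = _dot(u, diff) / (_norm(u) * _norm(diff))
    ext = math.acos(max(-1.0, min(1.0, cos_ext)))    # exterior angle at u
    return max(0.0, ext - psi)
```

A point further from the origin along the same ray sits inside the cone (zero penalty); a point closer to the origin violates the generic-to-specific ordering and is penalized.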
4. Experimental Protocol and Implementation
Experiments utilized PyTorch 2.0+ and a fork of OpenCLIP. The vision backbones were ViT-B/32 and ViT-B/16, with the final layer normalization removed (Pre-LN transformer without final LN) as a critical architectural change. DataComp filtering track provided two dataset scales:
- Medium DataComp (~12M examples, 128M total samples seen): filtered by CLIP score (top 30% under ViT-L/14) combined with image-based clustering.
- Small DataComp (~1M examples): Used solely for hyperparameter tuning.
Training configuration included:
- Batch size: 4096 (via gradient accumulation)
- Optimizer: AdamW
- Learning rate schedule: cosine decay with a 500-step warm-up
- Logit scale $\beta$: learned in log-space
- Entailment loss: Euclidean cone penalty with a fixed aperture constant $K$ and a scalar loss weight
Zero-shot evaluation covered ImageNet, ImageNet-C/R shifts, 19 VTAB tasks, and image-text retrieval, totaling 38 tasks.
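The warm-up-then-cosine-decay schedule can be written directly; `peak_lr` and `total_steps` are placeholders, not values from the paper:

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=500):
    """Linear warm-up for warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```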
5. Quantitative Results and Comparative Analysis
In head-to-head comparisons with CLIP and MERU for ViT-B/16 at the medium data scale, EuCLIP (squared-distance logit, no final LN) attained superior results on all evaluated metrics:
| Metric | EuCLIP | CLIP | MERU |
|---|---|---|---|
| ImageNet top-1 (%) | 35.17 | 34.73 | 33.84 |
| ImageNet shifts (%) | 27.7 | 27.2 | 26.2 |
| VTAB average (%) | 37.0 | 35.7 | 35.6 |
| Retrieval recall@1 (%) | 26.3 | 25.7 | 25.6 |
| 38-task average (%) | 35.8 | 34.9 | 34.2 |
This trend persisted at the ViT-B/32 scale, where MERU's learned curvature collapsed to the minimum clamp, consistent with the finding that pronounced hyperbolicity does not emerge in high-dimensional contrastive learning.
6. Ablation Studies and Norm Distribution Analysis
Key ablations established the essential role of architectural and loss-function choices:
- Final LayerNorm: Its removal is critical. Reintroducing the final layer normalization in EuCLIP (ViT-B/16) reduced ImageNet top-1 from 35.17% to 29.48%. The affine LN and projection restrict embedding power by confining outputs to a fixed hyper-ellipsoid, eliminating the norm degree of freedom exploited by EuCLIP.
- Squared-distance vs. distance: The squared-distance logit $-\lVert v - t \rVert^2$ produced more stable optimization and marginally higher accuracy than the plain-distance logit, attributed to implicit regularization on positive pairs.
- Entailment loss: Including the entailment loss caused text embeddings to cluster near the origin and image embeddings to move outward; no spontaneous "modality gap" appeared without it. It provided slight improvements in zero-shot classification but not retrieval.
- Embedding-norm distributions: As visualized in the paper, concept specificity is stratified along embedding norm, supporting both generic and specific relationships without degrading alignment.
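Why removing the final LayerNorm matters can be seen numerically: LayerNorm without its affine parameters maps every input onto a sphere of radius roughly $\sqrt{d}$, and the affine transform then fixes a hyper-ellipsoid, destroying the norm degree of freedom EuCLIP relies on. A small illustration (not the paper's code):

```python
import math

def layernorm(x, eps=1e-5):
    """Plain LayerNorm without affine (gain/bias) parameters."""
    d = len(x)
    mean = sum(x) / d
    var = sum((a - mean) ** 2 for a in x) / d
    return [(a - mean) / math.sqrt(var + eps) for a in x]

# Two very different inputs land at the same norm (= sqrt(d) = 2 for d = 4):
a = layernorm([1.0, 2.0, 3.0, 400.0])
b = layernorm([-5.0, 0.0, 5.0, 10.0])
norm = lambda x: math.sqrt(sum(v * v for v in x))
```

Whatever information the encoder tried to express through magnitude is erased before it reaches the contrastive loss.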
7. Implications and Prospective Directions
EuCLIP demonstrates that raw Euclidean embeddings—foregoing L2 normalization and using negative (squared) Euclidean logits—provide a straightforward, principled, and performant foundation for contrastive multimodal pre-training. The introduction of Euclidean entailment cones enables hierarchy modeling without complex curved-space embeddings, and high-dimensional Euclidean spaces are shown to avoid the limitations of hyperbolic geometry in contrastive applications. Additionally, the compatibility of Euclidean distances with optimized nearest-neighbor libraries (e.g., FAISS) underscores further practical advantage.
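The FAISS point can be made concrete: under the squared-distance logit, zero-shot retrieval is exactly L2 nearest-neighbor search, the operation `faiss.IndexFlatL2` accelerates. A brute-force equivalent in plain Python (illustrative):

```python
def l2_nearest(query, database):
    """Index of the database vector at minimum squared Euclidean distance,
    i.e. the argmax of the negative squared-distance logit."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(database)), key=lambda i: sq_dist(query, database[i]))
```

No renormalization or inner-product trick is needed at inference time; the training-time similarity and the deployed search metric coincide.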
A plausible implication is that reevaluating long-assumed defaults—final layer normalization, embedding normalization, logit choice—may yield further simplification, scalability, and interpretability in large-scale multimodal architectures. The evidence presented suggests a trend toward adopting EuCLIP-style approaches for future developments in contrastive representation learning (Chou et al., 2024).