
EuCLIP: Euclidean Contrastive Pre-training

Updated 3 February 2026
  • EuCLIP is a contrastive language-image framework that leverages Euclidean geometry by discarding L2 normalization, enabling simpler models and hierarchical entailment via Euclidean cones.
  • It replaces cosine similarity with negative squared Euclidean distance, which stabilizes training and yields performance comparable to or better than traditional CLIP and hyperbolic methods.
  • Experimental results on ViT architectures demonstrate higher accuracy and retrieval metrics across diverse benchmarks, reinforcing its practical advantages in high-dimensional embedding spaces.

EuCLIP (Euclidean CLIP) is a contrastive language-image pre-training framework that utilizes a Euclidean embedding geometry rather than the normalized cosine similarity employed in the original CLIP architecture. By discarding the final L2 normalization and using negative (squared) Euclidean distance as the similarity metric, EuCLIP achieves improved or comparable performance to both CLIP and hyperbolic variants, while supporting hierarchical structures through Euclidean entailment cones. This approach leads to architectural simplification and, in high-dimensional settings, matches or exceeds the hierarchy-capturing power of more complex geometries (Chou et al., 2024).

1. Foundational Framework and InfoNCE Loss

Language-image contrastive pre-training, as popularized by CLIP, is defined over a mini-batch $\mathcal{B} = \{(T_i, I_i)\}_{i=1}^{|\mathcal{B}|}$ of paired texts and images. Encoders $f$ (text) and $g$ (image) output embeddings $u_i = f(T_i)$ and $v_i = g(I_i)$. The classic InfoNCE loss is symmetrized (image→text and text→image) and formulated as:

$$\mathcal{L}_{\mathrm{cont}} = -\frac{1}{2|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \left[ \log \frac{\exp\left(\beta\,\mathrm{sim}(u_i, v_i)\right)}{\sum_{j=1}^{|\mathcal{B}|} \exp\left(\beta\,\mathrm{sim}(u_j, v_i)\right)} + \log \frac{\exp\left(\beta\,\mathrm{sim}(u_i, v_i)\right)}{\sum_{j=1}^{|\mathcal{B}|} \exp\left(\beta\,\mathrm{sim}(u_i, v_j)\right)} \right]$$

CLIP applies L2 normalization to $u$ and $v$, setting $\|u\| = \|v\| = 1$ and $\mathrm{sim}_{\cos}(u,v) = \frac{u \cdot v}{\|u\|\,\|v\|}$. EuCLIP eliminates the normalization constraint and replaces the similarity function.
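The symmetrized objective can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code; the `sim` argument defaults to the pre-normalization dot product, and EuCLIP simply swaps in a negative (squared) Euclidean distance.

```python
import numpy as np

def symmetric_infonce(u, v, beta, sim=None):
    """Symmetrized InfoNCE loss over a batch of paired embeddings.

    u, v: (B, n) arrays of text and image embeddings; beta: logit scale.
    sim(u, v) must return the (B, B) matrix of pairwise similarities.
    """
    if sim is None:
        sim = lambda a, b: a @ b.T          # CLIP-style dot-product logit
    logits = beta * sim(u, v)               # logits[i, j] = beta * sim(u_i, v_j)

    def mean_xent(m):
        # Cross-entropy with positives on the diagonal of the logit matrix.
        m = m - m.max(axis=1, keepdims=True)  # subtract row max for stability
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # One direction normalizes over images, the transpose over texts.
    return 0.5 * (mean_xent(logits) + mean_xent(logits.T))
```

With perfectly matched pairs and a large logit scale, the loss approaches zero; shuffling one modality against the other increases it.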

2. EuCLIP Embedding Geometry and Similarity Function

EuCLIP operates in $\mathbb{R}^n$ without L2 normalization, using one of the following similarity ("logit") functions for contrastive learning:

  • Negative Euclidean distance:

$$\mathrm{sim}_{d}(u, v) = -\frac{1}{\sqrt{n}} \|u-v\|$$

  • Negative squared Euclidean distance:

$$\mathrm{sim}_{d^2}(u, v) = -\frac{1}{n} \|u-v\|^2$$

Embeddings are first divided by $\sqrt{n}$ to stabilize their expected magnitude. The squared-distance variant was found to stabilize training further and to introduce implicit $L_2$ regularization on positive pairs. Otherwise, the optimization and loss remain structurally identical to CLIP.
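Both logits can be written directly in NumPy. This is a sketch under the conventions above (batch-by-batch pairwise matrices; function names are illustrative); note that the $1/n$ scaling of the squared variant is exactly what dividing embeddings by $\sqrt{n}$ produces.

```python
import numpy as np

def sim_d(u, v):
    """Negative Euclidean distance between all pairs, scaled by 1/sqrt(n)."""
    n = u.shape[-1]
    diff = u[:, None, :] - v[None, :, :]   # (B, B, n) pairwise differences
    return -np.linalg.norm(diff, axis=-1) / np.sqrt(n)

def sim_d2(u, v):
    """Negative squared Euclidean distance, scaled by 1/n; equivalent to
    dividing u and v by sqrt(n) first and taking -||u - v||^2."""
    n = u.shape[-1]
    diff = u[:, None, :] - v[None, :, :]
    return -np.sum(diff ** 2, axis=-1) / n
```

Both are non-positive, vanish only for identical embeddings, and satisfy $\mathrm{sim}_{d^2} = -\mathrm{sim}_d^2$.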

3. Rationale for Euclidean Versus Alternative Geometries

The motivation for adopting Euclidean geometry stems from the limitations of both spherical (cosine) and hyperbolic alternatives:

  • Spherical/cosine: Spherical geometry constrains all norms to unity, confining embeddings to the sphere $S^{n-1}$. The conventional "cosine distance" $1-\cos$ fails the triangle inequality, and norm invariance can be overly restrictive for hierarchical or modality-stratified representations.
  • Hyperbolic (e.g., MERU): Hyperbolic approaches embed on the Lorentz hyperboloid, using negative Lorentzian distance logits and an explicit entailment loss to capture hierarchy. However, the exponential volume growth of hyperbolic space confers benefits mainly in low dimensions; as the embedding dimension grows (e.g., $n \approx 512$), hyperbolicity loses its practical advantage.

In contrast, high-dimensional Euclidean spaces support simple and expressive representation learning. Hierarchy is achieved through Euclidean entailment cones. For a generic concept $x$, the cone half-aperture is

$$\mathrm{aper}(x) = \sin^{-1}\left(\min\{1,\, K/\|x\|\}\right)$$

The exterior angle from $x$ to a more specific concept $y$ is

$$\mathrm{ext}(x, y) = \pi - \cos^{-1}\left( \frac{(y-x) \cdot x}{\|y-x\|\;\|x\|} \right)$$

A penalty of $\max(0,\, \mathrm{ext}(x, y) - \mathrm{aper}(x))$ enforces entailment. This transitive loss matches or surpasses hyperbolic alternatives in capturing hierarchy.
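The cone test can be sketched directly from the formulas as stated above (a NumPy illustration, not the paper's implementation; $K = 0.3$ matches the reported setting, and the clipping guards against floating-point drift outside $[-1, 1]$):

```python
import numpy as np

K = 0.3  # aperture constant, as in the paper's reported setting

def aperture(x):
    """Half-aperture of the entailment cone at a generic concept x:
    wide near the origin, narrowing as ||x|| grows."""
    return np.arcsin(min(1.0, K / np.linalg.norm(x)))

def exterior_angle(x, y):
    """Exterior angle from x toward a more specific concept y,
    following the formula stated above."""
    cos_arg = np.dot(y - x, x) / (np.linalg.norm(y - x) * np.linalg.norm(x))
    return np.pi - np.arccos(np.clip(cos_arg, -1.0, 1.0))

def entailment_penalty(x, y):
    """Zero when the exterior angle fits inside the cone; positive otherwise."""
    return max(0.0, exterior_angle(x, y) - aperture(x))
```

The aperture shrinks with distance from the origin, so generic (low-norm) concepts cast wide cones over specific (high-norm) ones.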

4. Experimental Protocol and Implementation

Experiments utilized PyTorch 2.0+ and a fork of OpenCLIP. The vision backbones were ViT-B/32 and ViT-B/16, with the final layer normalization removed (a Pre-LN transformer without final LN) as a critical architectural change. The DataComp filtering track provided two dataset scales:

  • Medium DataComp (~12M examples, 128M total samples seen): filtered by CLIP score (ViT-L/14, top 30%) and image clustering.
  • Small DataComp (~1M examples): Used solely for hyperparameter tuning.

Training configuration included:

  • Batch size: 4096 (accumulated)
  • Optimizer: AdamW ($\beta_2 = 0.98$)
  • Learning rate schedule: cosine decay from $5 \times 10^{-4}$ with 500-step warm-up
  • Logit scale $\beta$: learned in log-space, initialized at $\log(1/0.07)$
  • Entailment loss settings: $K = 0.3$, $\lambda = 0.1$
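Collected as a plain dictionary for reference, the configuration above looks as follows. This is a hedged sketch: the key names are illustrative and are not actual OpenCLIP command-line flags.

```python
import math

# Training hyperparameters as described above; key names are illustrative.
config = {
    "batch_size": 4096,                      # accumulated
    "optimizer": "AdamW",
    "adamw_beta2": 0.98,
    "peak_lr": 5e-4,                         # cosine decay from this value
    "lr_schedule": "cosine",
    "warmup_steps": 500,
    "logit_scale_init": math.log(1 / 0.07),  # beta, learned in log-space
    "entailment_K": 0.3,
    "entailment_lambda": 0.1,
}
```

The $\log(1/0.07)$ initialization mirrors CLIP's temperature convention (an initial temperature of 0.07).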

Zero-shot evaluation covered ImageNet, ImageNet-C/R shifts, 19 VTAB tasks, and image-text retrieval, totaling 38 tasks.

5. Quantitative Results and Comparative Analysis

In head-to-head comparisons with CLIP and MERU for ViT-B/16 at the medium data scale, EuCLIP (squared-distance logit, no final LN, $K = 0.3$, $\lambda = 0.1$) attained superior results on all evaluated metrics:

| Metric                 | EuCLIP | CLIP  | MERU  |
|------------------------|--------|-------|-------|
| ImageNet top-1 (%)     | 35.17  | 34.73 | 33.84 |
| ImageNet shifts (%)    | 27.7   | 27.2  | 26.2  |
| VTAB average (%)       | 37.0   | 35.7  | 35.6  |
| Retrieval recall@1 (%) | 26.3   | 25.7  | 25.6  |
| 38-task average (%)    | 35.8   | 34.9  | 34.2  |

This trend persisted at the ViT-B/32 scale, where MERU's learned curvature $c$ collapsed to its minimum clamp value, consistent with the finding that pronounced hyperbolicity does not emerge in high-dimensional contrastive learning.

6. Ablation Studies and Norm Distribution Analysis

Key ablations established the essential role of architectural and loss-function choices:

  • Final LayerNorm: Its removal is critical. Reintroducing the final layer normalization in EuCLIP (ViT-B/16) reduced ImageNet top-1 from 35.17% to 29.48%. The affine LN and projection restrict embedding power by confining outputs to a fixed hyper-ellipsoid, eliminating the norm degree of freedom exploited by EuCLIP.
  • Squared-distance vs. distance: The $d^2$ logit produced more stable optimization and marginally higher accuracy, attributed to implicit $L_2$ regularization.
  • Entailment loss: Including $\mathcal{L}_{\mathrm{entail}}$ caused text embeddings to cluster near the origin and image embeddings to move outward. Without this loss, no spontaneous "modality gap" appeared. It provided slight improvements in zero-shot classification but not retrieval.
  • Embedding-norm distributions: As visualized in the paper, concept specificity is stratified along embedding norm, supporting both generic and specific relationships without degrading alignment.

7. Implications and Prospective Directions

EuCLIP demonstrates that raw Euclidean embeddings, which forgo L2 normalization and use negative (squared) Euclidean logits, provide a straightforward, principled, and performant foundation for contrastive multimodal pre-training. The introduction of Euclidean entailment cones enables hierarchy modeling without complex curved-space embeddings, and high-dimensional Euclidean spaces are shown to avoid the limitations of hyperbolic geometry in contrastive applications. Additionally, the compatibility of Euclidean distances with optimized nearest-neighbor libraries (e.g., FAISS) is a further practical advantage.
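Because the squared-distance logit is a monotone function of L2 distance, unnormalized EuCLIP embeddings can be indexed for exact nearest-neighbor retrieval as-is. The brute-force NumPy sketch below performs the same computation that FAISS's `IndexFlatL2` accelerates (function name and shapes are illustrative):

```python
import numpy as np

def l2_nearest(queries, database, k=5):
    """Exact k-nearest-neighbor search under squared L2 distance,
    the metric used by FAISS's IndexFlatL2. Returns (indices, distances)."""
    # ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2, computed for all pairs at once
    d2 = (np.sum(queries ** 2, axis=1, keepdims=True)
          - 2.0 * queries @ database.T
          + np.sum(database ** 2, axis=1))
    idx = np.argsort(d2, axis=1)[:, :k]         # k smallest distances per query
    return idx, np.take_along_axis(d2, idx, axis=1)
```

Ranking by this distance is equivalent to ranking by the EuCLIP logit, so no re-normalization step is needed at retrieval time.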

A plausible implication is that reevaluating long-assumed defaults—final layer normalization, embedding normalization, logit choice—may yield further simplification, scalability, and interpretability in large-scale multimodal architectures. The evidence presented suggests a trend toward adopting EuCLIP-style approaches for future developments in contrastive representation learning (Chou et al., 2024).
