
CLIP Image & Text Embeddings

Updated 30 January 2026
  • CLIP embeddings are a high-dimensional shared space computed by dual transformer encoders with L2 normalization to align image and text modalities.
  • They utilize a symmetric InfoNCE contrastive loss and cosine similarity to map images and descriptions for effective zero-shot classification and retrieval.
  • Enhancements like SToRI and LABCLIP enable fine-grained token control and improved compositional binding, boosting interpretability and benchmark performance.

CLIP (Contrastive Language–Image Pretraining) image and text embeddings constitute the core representational interface for vision–language models that align visual and textual modalities in a shared latent space. These embeddings enable zero-shot classification, retrieval, and image generation by mapping images and natural language descriptions into a high-dimensional joint embedding space using separate vision and text encoder networks trained jointly under a contrastive objective. Both modalities are processed by transformer-based architectures, followed by L2 normalization, with similarity assessed primarily via cosine similarity. The CLIP embedding paradigm has laid the foundation for subsequent research on interpretable, controllable, and compositional multimodal representation, as well as large-scale open-domain applications across domains and languages.

1. CLIP Embedding Architecture and Mathematical Formulation

CLIP employs two independent neural encoders, an image encoder $f_\mathrm{img}: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^d$ (e.g., a Vision Transformer or ResNet) and a transformer text encoder $f_\mathrm{txt}: \mathrm{BPE}(T) \to \mathbb{R}^d$, to produce fixed-length embeddings for images and texts, respectively (Schuhmann et al., 2021, Tzelepi et al., 2024, Schall et al., 2024). Preprocessing pipelines standardize image size, color normalization, and tokenization (usually BPE, up to 77 tokens for standard models).

The outputs are:

  • Image embedding: $\mathbf{x} = f_\mathrm{img}(I) \in \mathbb{R}^d$
  • Text embedding: $\mathbf{t} = f_\mathrm{txt}(T) \in \mathbb{R}^d$

Both are L2-normalized:

$$\hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}, \qquad \hat{\mathbf{t}} = \frac{\mathbf{t}}{\|\mathbf{t}\|_2}$$

Joint similarity between an image and a text is evaluated as

$$\operatorname{sim}(I, T) = \hat{\mathbf{x}}^\top \hat{\mathbf{t}} \in [-1, 1]$$

The core CLIP training objective is a symmetric InfoNCE contrastive loss:

$$\mathcal{L}_\mathrm{CLIP} = -\frac{1}{2N} \sum_{i=1}^N \left[ \log\frac{\exp(\operatorname{sim}(\mathbf{x}_i, \mathbf{t}_i)/\tau)}{\sum_{j=1}^N \exp(\operatorname{sim}(\mathbf{x}_i, \mathbf{t}_j)/\tau)} + \log\frac{\exp(\operatorname{sim}(\mathbf{x}_i, \mathbf{t}_i)/\tau)}{\sum_{j=1}^N \exp(\operatorname{sim}(\mathbf{x}_j, \mathbf{t}_i)/\tau)} \right]$$

where $\tau$ is a learnable temperature parameter (Schuhmann et al., 2021, Tzelepi et al., 2024, Schall et al., 2024, Koukounas et al., 2024). This encourages matched pairs to have high cosine similarity and misaligned pairs to have low similarity.

Embedding dimensionality is typically 512 (ViT-B/32) or up to 1024 (jina-clip-v2), with architectures designed for efficient batched computation.
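The normalization, cosine-similarity, and symmetric InfoNCE computations above can be sketched in a few lines of NumPy. This is a toy sketch: the temperature is fixed rather than learned, the batch size and dimensionality are illustrative, and in the real model the inputs would be encoder outputs inside a training loop.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of N paired embeddings.

    img_emb, txt_emb: (N, d) raw encoder outputs (hypothetical inputs);
    tau: temperature, fixed here for illustration (learnable in CLIP itself).
    """
    # L2-normalize each row so that dot products equal cosine similarities.
    x = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = x @ t.T / tau                   # (N, N) scaled similarity matrix
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal

    def xent(l):
        # Row-wise cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Because both embedding matrices are row-normalized first, the logits matrix holds cosine similarities, and the two cross-entropy terms are the two sums inside the bracketed loss above.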

2. Semantic Structure and Geometry of the Embedding Space

Recent work has established that the unnormalized CLIP embeddings for images and texts occupy linearly separable, high-dimensional ellipsoidal shells not centered at the origin, with each modality cluster exhibiting a characteristic mean and covariance (Levi et al., 2024). The quadratic form of the ellipsoids is

$$(x - m)^\top \Sigma^{-1} (x - m) = R^2$$

where $m$ is the empirical mean and $\Sigma$ the covariance for either modality. Linear separability is evident: only a handful of dimensions are sufficient to distinguish images from texts. Conformity, defined as the average cosine similarity of an embedding to all others in the dataset, $C(v) = \langle \cos(v, v_k) \rangle_k$, admits efficient estimation as $\cos(m, v)$, with the Pearson correlation between the exact and estimated values reported at 0.9998 on MS-COCO. The global modality gap (the offset between the mean image and mean text embedding) optimally matches the conformity distributions of the two modalities, achieving a statistically efficient tradeoff between alignment and uniformity in the NT-Xent loss (Levi et al., 2024).
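The conformity estimate described above can be sketched as follows. This is a toy NumPy version with illustrative shapes: the exact definition averages cosine similarity over all other embeddings, while the fast estimate needs only the empirical mean.

```python
import numpy as np

def conformity_exact(V):
    """Mean cosine similarity of each embedding to all *other* embeddings."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = Vn @ Vn.T                         # pairwise cosine similarities
    n = len(V)
    return (S.sum(axis=1) - 1.0) / (n - 1)   # drop the self-similarity term

def conformity_fast(V):
    """Approximate conformity as cos(m, v), with m the empirical mean."""
    m = V.mean(axis=0)
    mn = m / np.linalg.norm(m)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Vn @ mn
```

On clustered data of the kind described above, the two measures track each other almost perfectly, which is what makes the single-dot-product estimate useful at scale.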

3. Fine-Grained, Interpretable, and Controllable Embedding Manipulations

3.1. Token-Level Reweighting

Semantic Token Reweighting (SToRI) augments the standard CLIP text encoder by injecting context-dependent, non-negative per-token weights $w_i$ directly into the transformer self-attention computation. Reweighted attention at layer $l$ becomes

$$\hat{A}_{mn} = \frac{w_n \exp(q_m \cdot k_n)}{\sum_{j} w_j \exp(q_m \cdot k_j)}$$

where $w_n$ reflects the context- or user-defined importance of token $n$ (Kim et al., 2024). This controllable emphasis is injected from a specified transformer block onward and can be set either by optimization (data-driven, via backpropagation from a downstream loss) or by explicit user input. Only the log-weights $\alpha_i = \log w_i$ are learned, with no architectural modification beyond the per-token scalar.
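A minimal sketch of this reweighted attention, assuming single-head toy shapes and treating the per-token log-weights as given (in SToRI they are learned or user-specified):

```python
import numpy as np

def reweighted_attention(Q, K, V, log_w):
    """Softmax attention with per-token weights w_n = exp(log_w[n]) folded in.

    Q, K, V: (n, d) query/key/value matrices for one head (toy shapes);
    log_w: (n,) log-weights; log_w = 0 recovers standard attention.
    """
    scores = Q @ K.T                        # (n, n) raw attention logits
    # Multiplying each softmax numerator by w_n is equivalent to adding
    # log w_n to the corresponding logit column before the softmax.
    scores = scores + log_w[None, :]
    scores = scores - scores.max(axis=1, keepdims=True)   # stability
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)    # each row sums to 1
    return A @ V
```

Setting all log-weights to zero reproduces standard softmax attention; a large weight on one token makes every query attend almost exclusively to it, which is the emphasis mechanism the method exploits.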

SToRI delivers:

  • SOTA few-shot classification when only a few thousand $\alpha_i$ are optimized.
  • Direct user control in attribute-specific retrieval tasks (as on CelebA and CUB), yielding a 2–3% AP improvement when tokens are up- or down-weighted.
  • Interpretable per-token importance scores, with learned weights tracing to human-identifiable discriminative features.
  • No observed out-of-distribution shift in image–text alignment; embedding distributions remain well-calibrated.

3.2. Token-Level Classification Signals

SuperCLIP supplements the CLIP vision backbone with a lightweight linear classification head targeting all subword tokens in the caption, with inverse document-frequency weighting:

$$\mathcal{L}_\mathrm{Class} = -\sum_{c=1}^V \hat y_c \log p_c$$

where $p_c$ are softmax probabilities over the token vocabulary and $\hat y_c$ are IDF-normalized K-hot targets from the caption (Zhao et al., 16 Dec 2025). This addition imposes fine-grained, dense textual supervision, resulting in enhanced visual–textual alignment, boosts in zero-shot accuracy, and greater robustness to small batch sizes, with negligible computational overhead (0.077% additional FLOPs for ViT-L/16). SuperCLIP achieves consistent +2 to +5% absolute accuracy/retrieval improvements on ImageNet and COCO/Flickr.
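This token-level loss can be sketched for a single image as follows. The head outputs, caption token ids, and IDF vector here are hypothetical inputs; the actual model computes this per sample over a large subword vocabulary.

```python
import numpy as np

def token_classification_loss(logits, token_ids, idf):
    """IDF-weighted multi-label token loss over a vocabulary (a sketch).

    logits: (V,) classification-head outputs for one image;
    token_ids: ids of subword tokens present in the paired caption;
    idf: (V,) inverse document frequencies (assumed precomputed).
    """
    # Softmax over the full token vocabulary.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # K-hot target over caption tokens, weighted by IDF and normalized
    # to sum to 1 so the loss is a proper cross-entropy.
    y = np.zeros_like(logits)
    y[token_ids] = idf[token_ids]
    y = y / y.sum()
    return -(y * np.log(p + 1e-12)).sum()
```

With uniform logits the loss equals $\log V$; concentrating probability mass on the caption's tokens drives it down, which is the dense supervision signal the method adds.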

4. Alignment, Compositionality, and Binding in CLIP Embeddings

Standard cross-modal alignment in CLIP relies on cosine similarity between the aggregated representations of images and texts. Recent analysis demonstrates that this mechanism treats each modality almost as a bag of concepts, effectively flattening compositional binding and structural relations (Koishigarina et al., 5 Feb 2025):

  • Uni-modal (image-only or text-only) linear probes recover attribute–object bindings at near-perfect accuracy (>95%);
  • Cross-modal matching via cosine similarity fails to distinguish correct from permuted attribute–object bindings, leading to the bag-of-words phenomenon.

LABCLIP addresses this by introducing a learnable $D \times D$ linear transformation $W$ on the text side:

$$S_\mathrm{LAB}(x, y) = \cos(f_I(x), W f_T(y))$$

and optimizing $W$ with a contrastive loss using both correct and binding-permuted negative captions. This technique nearly entirely closes the compositional binding gap (recall@1 on synthetic datasets rises to 0.90–0.97 vs. a CLIP baseline of 0.06–0.36) and consistently improves performance on real-world structured retrieval benchmarks (Koishigarina et al., 5 Feb 2025).
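A minimal sketch of the LABCLIP similarity, with $W$ assumed already trained; setting $W$ to the identity recovers vanilla CLIP cosine similarity:

```python
import numpy as np

def lab_similarity(x_img, t_txt, W):
    """Cosine similarity after a learned linear map W on the text side.

    x_img, t_txt: (d,) image and text embeddings (toy inputs);
    W: (d, d) learned matrix; W = I gives the plain CLIP similarity.
    """
    u = x_img / np.linalg.norm(x_img)
    v = W @ t_txt
    v = v / np.linalg.norm(v)      # renormalize after the linear map
    return float(u @ v)
```

In training, $W$ would be optimized contrastively against both correct captions and binding-permuted negatives, as described above; only the similarity computation is shown here.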

5. Practical Applications and Extensibility

5.1. Retrieval and Classification

CLIP embeddings have enabled large-scale, zero/few-shot retrieval and classification across modalities. Their scalable deployment is facilitated by publicly released datasets such as LAION-400M, which provides 400M CLIP-filtered image/text pairs and their 512D unit-normalized embeddings (Schuhmann et al., 2021). Practical recommendations include consistent normalization across queries and database entries, and using FAISS-based approximate nearest neighbor indices for efficient retrieval.
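The normalization and retrieval recommendations above can be sketched as follows. The brute-force scan here stands in for the approximate nearest-neighbor index (e.g., a FAISS inner-product index) that would be used at LAION scale; names and shapes are illustrative.

```python
import numpy as np

def build_index(db_emb):
    """Normalize once at indexing time so inner product == cosine similarity."""
    return db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)

def search(index, query_emb, k=5):
    """Exact top-k retrieval over unit-normalized embeddings.

    At LAION scale this exhaustive scan would be replaced by an ANN
    structure; the key point is consistent normalization on both sides.
    """
    q = query_emb / np.linalg.norm(query_emb)
    sims = index @ q                      # cosine similarities to all entries
    top = np.argsort(-sims)[:k]           # indices of the k best matches
    return top, sims[top]
```

Normalizing both the database and the query means the inner-product ranking is exactly the cosine ranking, which is why mixed normalized/unnormalized pipelines silently degrade retrieval quality.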

Recent work demonstrates that sequential fine-tuning strategies (2-Step Fine-Tuning, MCIP) can tailor CLIP models to either optimize image–image retrieval performance or preserve joint image–text alignment, balancing k-NN classification, zero-shot, and text-based retrieval capabilities (Schall et al., 2024).

5.2. Controlled and Interpretable Vision–Language Pipelines

  • Few-shot and user-preference-driven classification/retrieval: SToRI provides a mechanism to elicit and visualize which tokens dominate the text–image alignment process and to update retrieval priorities interactively (Kim et al., 2024).
  • Generalization to multilingual/multi-domain settings: Models such as jina-clip-v2 demonstrate flexible adaptation to 30+ languages, variable embedding dimensionality (truncated to 256D with <1% cross-modal degradation), and improved performance on visually-rich document retrieval and text-only tasks (Koukounas et al., 2024).
  • Medical multimodal alignment: M³Bind introduces a framework to align multiple medical image modalities (X-ray, CT, retina, ECG, pathology) via a shared text space, leveraging contrastive and MSE alignment losses and shared encoder distillation. This achieves robust zero/few-shot and cross-modal retrieval with minimal explicit paired data (Liu et al., 22 Jun 2025).
  • Text–tag semantic segmentation and bias mitigation: TTD (Text-Tag Self-Distillation) further refines text–image pixel-wise alignment at the tag level, addressing "single-tag bias" and improving open-vocabulary segmentation mIoU and interpretability (Jo et al., 2024).
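The embedding truncation mentioned for jina-clip-v2 can be sketched as a simple slice-and-renormalize step. This is a toy version: it assumes, as with Matryoshka-style training, that the leading dimensions carry most of the signal; naive truncation of an ordinary embedding would discard information indiscriminately.

```python
import numpy as np

def truncate_embedding(emb, d_out=256):
    """Keep the first d_out dimensions and renormalize to unit length.

    emb: (d,) full-size embedding; d_out: target dimensionality.
    """
    t = emb[:d_out]
    return t / np.linalg.norm(t)
```

Renormalizing after truncation keeps downstream cosine-similarity code unchanged, since all embeddings remain unit vectors.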

5.3. Generation and Manipulation

unCLIP (two-stage generative pipeline) leverages the CLIP image–text space as an intermediate semantic representation for text-conditional image synthesis. Image generation proceeds by sampling an image embedding conditioned on a text embedding, then decoding into an image, thus retaining semantic diversity and fidelity (Ramesh et al., 2022).

6. Known Limitations and Open Directions

  • Embedding compositionality: Cosine similarity-based cross-modal matching disregards higher-order relationships, flattening structural bindings, though linear alignment methods (LABCLIP) can counteract this effect (Koishigarina et al., 5 Feb 2025).
  • Token importance expressivity: Reweighting effectiveness (SToRI) is limited for prompts lacking rich, discriminative semantic tokens (Kim et al., 2024).
  • Bias and calibration: Controlling token weights cannot remove all biases of CLIP's original pretraining.
  • Domain adaptation: Standard CLIP models underperform on "commentative" social media captions, but domain- and style-adapted dual-encoder variants (C-CLIP) can close the descriptive–commentative performance gap (Theisen et al., 2023).
  • Small batch training: Classic CLIP contrastive loss is vulnerable to effectiveness drops at reduced batch sizes, a challenge remediated by classification-augmented variants such as SuperCLIP (Zhao et al., 16 Dec 2025).

Open research directions highlighted include:

  • Region-level reweighting on the image side for spatially focused retrieval.
  • Integrating multi-modalities via unified or pivot representations.
  • Combining user-driven and data-driven embedding manipulations.
  • Explicitly modeling compositional structure in the contrastive objective.
  • Expanding coverage to new languages, modalities, and data types.

7. Quantitative Overview and Benchmarks

Recent publications provide detailed metrics for core and extended tasks:

| Model / Method | Benchmark / Dataset | Metric / Value | Reference |
|---|---|---|---|
| CLIP (ViT-L/14) | ImageNet-1K | Zero-shot acc. 66.1% (SuperCLIP: 70.1%) | (Zhao et al., 16 Dec 2025) |
| CLIP (ViT-L/14) | COCO | R@1 32.7% (SuperCLIP: 35.9%) | (Zhao et al., 16 Dec 2025) |
| SToRI | ImageNet, SUN397 | 1-shot accuracy > TaskRes; matches 2–16 shots | (Kim et al., 2024) |
| LABCLIP | CLEVR 2-object | R@1 0.36 (baseline) → 0.93 (LABCLIP-HNB) | (Koishigarina et al., 5 Feb 2025) |
| jina-clip-v2 | COCO multilingual | I→T R@5: 86.03%; T→I R@5: 84.87% | (Koukounas et al., 2024) |
| M³Bind | CheXpert | 1-shot acc. 88.07% (prev. SOTA: 86.16%) | (Liu et al., 22 Jun 2025) |
| TTD (w/ TCL) | COCO-Object | mIoU 55.0% → 61.1% with TTD tagging | (Jo et al., 2024) |

Embedding space analysis confirms the existence of thin, non-centered ellipsoidal modality shells with easily separable directions and conformity-optimal centering (Levi et al., 2024).


References:

(Kim et al., 2024) SToRI: Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
(Tzelepi et al., 2024) LMM-Regularized CLIP Embeddings for Image Classification
(Ramesh et al., 2022) Hierarchical Text-Conditional Image Generation with CLIP Latents
(Koishigarina et al., 5 Feb 2025) CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
(Schuhmann et al., 2021) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
(Schall et al., 2024) Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment
(Zhao et al., 16 Dec 2025) SuperCLIP: CLIP with Simple Classification Supervision
(Theisen et al., 2023) C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap
(Liu et al., 22 Jun 2025) Multimodal Medical Image Binding via Shared Text Embeddings
(Levi et al., 2024) The Double-Ellipsoid Geometry of CLIP
(Koukounas et al., 2024) jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
(Jo et al., 2024) TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
(Ilyankou et al., 13 Jun 2025) CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
