
Semantic-Enhanced CLIP (SeeCLIP)

Updated 15 January 2026
  • The paper presents semantic-enhanced CLIP that employs prompt enhancement, semantic tokenization, and margin-based duplex contrastive learning to improve open-set recognition (e.g., +5% H-score boost).
  • It leverages adaptive knowledge transfer with textual priors and semantic-guided diffusion to bridge modality gaps and mitigate catastrophic forgetting in continual learning.
  • Additionally, in zero-shot semantic communication, the approach reports a 55–92× bandwidth reduction at matched accuracy, together with higher zero-shot accuracy than baseline CLIP models on noisy channels.

Semantic-Enhanced CLIP (SeeCLIP) refers to a family of methods and frameworks that augment and adapt the original Contrastive Language-Image Pretraining (CLIP) model to endow it with deeper semantic understanding, finer-grained alignment, structural robustness, and broader task flexibility. These enhancements target limitations of vanilla CLIP in settings such as open-set domain generalization, continual learning, semantic communication, and dense prediction. Typical techniques involve prompt engineering, semantic tokenization, diffusion-based hard negative synthesis, adaptive channel coding, and textual-informed prototype refinement. Recent work anchors SeeCLIP as a flexible paradigm for bridging granular semantics, cross-domain transfer, and multi-modal communication in CLIP-based systems (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).

1. Motivation and Scope

CLIP provides powerful cross-modal representations by aligning images and texts in a joint embedding space. However, several recognized challenges motivate semantic enhancement:

  • Structural risk–open-space risk trade-offs in open-set recognition, where CLIP tends to overfit to known classes yet misclassifies unknowns.
  • Modality gaps: the generalizability of text-only CLIP classifiers is offset by limited plasticity, while naïve visual classifiers lack semantic richness.
  • Task-specific inflexibility, such as fixed class vocabularies, and sensitivity to domain shift or channel noise.
  • Underperformance in fine-grained or dense tasks where patch- or token-level semantic structures are not explicitly modeled.

Semantic-Enhanced CLIP approaches systematically address these deficits by introducing modules for prompt enrichment, semantic token decomposition, joint source-channel coding, and knowledge transfer mechanisms (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).

2. Semantic-Aware Prompt Enhancement and Tokenization

A core element of SeeCLIP is the explicit extraction and utilization of fine-grained semantic tokens from CLIP’s vision encoder. Instead of relying solely on global embeddings, SeeCLIP decomposes image representations via learnable queries applied to patch features from CLIP’s ViT backbone:

$$\omega_i^{(k)} = \frac{\exp\left(q^{(k)} \cdot f_i\right)}{\sum_{j=1}^N \exp\left(q^{(k)} \cdot f_j\right)}$$

$$v_\text{sem}^{(k)} = \sum_{i=1}^N \omega_i^{(k)} f_i$$

where $f_i$ are the $N$ patch embeddings, $q^{(k)}$ are the $K$ learnable queries, and $v_\text{sem}^{(k)}$ are the resulting semantic tokens. For known classes, prompts integrate these tokens along with class names and domain-average features:

$$p_c = \left[\, \Phi(v_\text{dom}),\ \Psi_1(v_\text{sem}^{(1)}),\ \dots,\ \Psi_K(v_\text{sem}^{(K)}),\ \texttt{[classname]}\, \right]$$

Unknown-class prompts omit the class name and use class-agnostic vectors and an [unknown] token. This prompt enrichment enables far more nuanced vision-language alignment, especially important under open-set or domain-shifted conditions (Wang et al., 21 Nov 2025).
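The attention-pooling step that turns patch embeddings into semantic tokens can be sketched in a few lines. This is a minimal illustration in plain Python, assuming toy two-dimensional vectors; the function names `semantic_tokens`, `softmax`, and `dot` are hypothetical and not taken from the SeeCLIP code, and the projections $\Phi$ and $\Psi_k$ are left out.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_tokens(patches, queries):
    """Pool N patch embeddings f_i into K semantic tokens v_sem^(k).

    weights[k][i] corresponds to omega_i^(k); tokens[k] is the
    omega-weighted sum of the patch embeddings.
    """
    dim = len(patches[0])
    tokens, weights = [], []
    for q in queries:
        w = softmax([dot(q, f) for f in patches])
        token = [sum(w[i] * patches[i][d] for i in range(len(patches)))
                 for d in range(dim)]
        weights.append(w)
        tokens.append(token)
    return tokens, weights

# Toy inputs (illustrative values only): 3 patches, 2 queries.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 0.0]]
tokens, weights = semantic_tokens(patches, queries)
```

With the all-zero query every patch receives the same weight, so the second token is simply the mean patch embedding; the first query concentrates its weight on the patches with a large first coordinate.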

3. Margin-Based Duplex Contrastive Learning and Diffusion Hard Negatives

To explicitly control the spatial arrangement of known and unknown classes in the embedding space, SeeCLIP frameworks introduce duplex contrastive learning:

  • Repulsion loss ensures the unknown prompt is sufficiently distant from known class samples:

$$\mathcal{L}_\text{rep} = \sum_{c=1}^C \max\left(0,\ \delta - \text{sim}(F_t(p_\text{unk}), F_v(X_c))\right)$$

  • Cohesion loss prevents the unknown prompt from drifting into semantically irrelevant regions, by anchoring it near the centroid of known prompts:

$$\mathcal{L}_\text{coh} = \left\| F_t(p_\text{unk}) - \frac{1}{C}\sum_{c=1}^{C} F_t(p_c) \right\|_2^2$$

Complementary to these objectives, a semantic-guided diffusion module synthesizes pseudo-unknown images by perturbing semantic tokens and re-rendering images with a latent diffusion model. This exposes the model to “hard unknowns”, that is, samples that resemble known classes but represent semantic edge cases, which is a critical mechanism for fine-tuning open-space risk boundaries (Wang et al., 21 Nov 2025).
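The two duplex objectives can be written down directly from their formulas. The sketch below is an assumption-laden illustration rather than the authors' implementation: each class $c$ is represented by a single visual feature vector standing in for $F_v(X_c)$, and because the text does not define `sim`, the scoring function is passed in as a parameter. The hinge $\max(0, \delta - s)$ is active whenever the score $s$ falls below the margin, so the example uses a hypothetical cosine distance as the score; all names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; a hypothetical choice for the score `sim`."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def repulsion_loss(unk_text, known_visual, delta, score):
    """Sum over classes of max(0, delta - score(unknown prompt, class feature))."""
    return sum(max(0.0, delta - score(unk_text, v)) for v in known_visual)

def cohesion_loss(unk_text, known_text):
    """Squared L2 distance between the unknown-prompt embedding and the
    centroid of the known-class prompt embeddings."""
    c = len(known_text)
    centroid = [sum(t[d] for t in known_text) / c for d in range(len(unk_text))]
    return sum((u - m) ** 2 for u, m in zip(unk_text, centroid))

# Toy embeddings (illustrative values only).
unk_text = [1.0, 0.0]                     # stands in for F_t(p_unk)
known_visual = [[1.0, 0.0], [0.0, 1.0]]   # one vector per known class
known_text = [[1.0, 0.0], [0.0, 1.0]]     # stands in for F_t(p_c)

rep = repulsion_loss(unk_text, known_visual, delta=0.5, score=cosine_distance)
coh = cohesion_loss(unk_text, known_text)
```

In this toy case only the first class contributes to the repulsion term, because the unknown prompt coincides with its feature and therefore violates the margin, while the cohesion term measures how far the unknown prompt sits from the centroid `[0.5, 0.5]`.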

4. Semantic-Guided Adaptive Knowledge Transfer in Continual Learning

Semantic enhancement in continual learning leverages textual priors to alleviate catastrophic forgetting and support incremental class injection. The Semantic-Enriched Continual Adaptation (SECA) framework provides a cross-modal, selective distillation approach:

  • A pool of historical visual adapters encodes past knowledge.
  • For each incoming sample, textual embeddings assess the semantic alignment to each adapter using projectors $W_S$ and $W_V$. Instance-relevant adapters are assigned higher aggregation weights:

$$\alpha_x^{(p)} = \frac{1}{s} \sum_{i=1}^s \left[\phi(S_y^{(i)}) W_S\right]^\top \left[\phi(V_x^{(p)}) W_V\right]$$

$$w_p = \frac{\exp(\lambda \alpha_x^{(p)})}{\sum_{q} \exp(\lambda \alpha_x^{(q)})}$$

$$V_x^\text{agg} = \sum_p w_p V_x^{(p)}$$

  • The student model is then trained to mimic only the relevant, textually-attended ensemble of past features, reducing cross-task semantic interference.
  • This mechanism is complemented by Semantic-Enhanced Visual Prototype Refinement (SE-VPR), where visual prototypes are updated using inter-class semantic affinities computed in the text embedding space, bridging the visual–text gap (He et al., 3 Aug 2025).
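The text-guided weighting of historical adapters described above reduces to an averaged inner product followed by a temperature-scaled softmax. The following sketch assumes the projected vectors $\phi(S_y^{(i)}) W_S$ and $\phi(V_x^{(p)}) W_V$ have already been computed; the function and argument names are hypothetical, and the numbers are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_adapters(text_proj, vis_proj, vis_feats, lam=1.0):
    """Weight each adapter's features by its alignment with the text priors.

    text_proj: s projected text embeddings for the sample's class.
    vis_proj:  one projected visual feature per historical adapter p.
    vis_feats: the unprojected adapter features V_x^(p) to be aggregated.
    """
    # alpha^(p): mean inner product between text and visual projections.
    alphas = [sum(dot(t, v) for t in text_proj) / len(text_proj)
              for v in vis_proj]
    # w_p: softmax over adapters with temperature-like factor lambda.
    weights = softmax([lam * a for a in alphas])
    dim = len(vis_feats[0])
    agg = [sum(w * f[d] for w, f in zip(weights, vis_feats))
           for d in range(dim)]
    return alphas, weights, agg

# Toy example: two text embeddings, two historical adapters.
text_proj = [[1.0, 0.0], [1.0, 0.0]]
vis_proj = [[2.0, 0.0], [0.0, 2.0]]
vis_feats = [[1.0, 1.0], [3.0, 3.0]]
alphas, weights, agg = aggregate_adapters(text_proj, vis_proj, vis_feats)
```

Here the first adapter aligns with the text embeddings and the second does not, so the aggregated feature is dominated by the first adapter's output. Larger values of `lam` sharpen this selection toward a single adapter.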

5. Transmission-Aware Prompt Learning and Channel-Robust SemCom

Semantic-Enhanced CLIP has also been instantiated for zero-shot semantic communication (“SemCom”) over noisy channels. The SemCLIP framework replaces raw-image transmission with CLIP-generated tokens, dramatically increasing bandwidth efficiency. The pipeline includes:

  • SNR-adaptive Deep Joint Source-Channel Coding (DeepJSCC) encodes CLIP image tokens into channel codes.
  • The receiver includes a Transmission-Aware Prompt Learning (TAPL) module that generates conditional prompt vectors $\pi_i = P_\phi(\hat{s}_i)$, which are combined with context and class-name vectors before CLIP text encoding.
  • Losses are staged for token reconstruction ($\mathcal{L}_\text{JSCC}$) and contrastive image–text alignment ($\mathcal{L}_\text{TAPL}$).
  • This architecture achieves an absolute zero-shot performance improvement of $\approx 41\%$ over baseline CLIP-FT at $-5$ dB SNR, and a $55$–$92\times$ bandwidth reduction at matched accuracy (Hu et al., 25 Feb 2025).
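To give a sense of how harsh a $-5$ dB operating point is, the sketch below simulates a channel at a chosen SNR. It assumes an additive white Gaussian noise (AWGN) channel and real-valued unit-power symbols; the channel model used in the SemCLIP experiments is not specified in this summary, and the function names are hypothetical.

```python
import math
import random

def noise_to_signal_ratio(snr_db):
    """Noise power divided by signal power for an SNR given in decibels."""
    return 10 ** (-snr_db / 10)

def awgn_channel(symbols, snr_db, rng):
    """Add zero-mean Gaussian noise scaled to the requested SNR.

    The AWGN model is an assumption made for illustration only.
    """
    signal_power = sum(s * s for s in symbols) / len(symbols)
    sigma = math.sqrt(signal_power * noise_to_signal_ratio(snr_db))
    return [s + rng.gauss(0.0, sigma) for s in symbols]

rng = random.Random(0)
# Unit-power toy symbols standing in for encoded CLIP tokens.
symbols = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(20000)]
received = awgn_channel(symbols, snr_db=-5.0, rng=rng)

# Empirical noise power; at -5 dB it should be close to sqrt(10) ~ 3.16
# times the signal power.
noise_power = sum((r - s) ** 2 for r, s in zip(received, symbols)) / len(symbols)
```

At $-5$ dB the noise carries roughly 3.16 times the power of the signal, which is why a receiver-side module that conditions on the received tokens matters in this regime.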

6. Architectural, Textual, and Image-level Enhancements

Other facets of SeeCLIP involve surgery at the model and data levels, as in ITACLIP’s training-free pipeline for open-vocabulary segmentation:

  • Self-self attention in the final ViT layer prioritizes diagonal (self-) and local attentions, sharpening localization.
  • Removal of the final FFN in ViT preserves fine-grained patch correlations.
  • Multi-layer attention fusion averages maps from the final and mid-deep layers for spatial/semantic complementarity.
  • Ensemble-based image augmentation (blur, grayscale, flips) is used to generate more robust per-patch representations.
  • Textual features are augmented using synonyms or definitions generated by LLMs, and combined via a weighted sum with original class embeddings (Aydın et al., 2024).

Such architectural and data-driven enhancements are orthogonal and can be integrated with the prompt- and token-based advances described above.
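The weighted combination of class-name embeddings with embeddings of LLM-generated synonyms or definitions can be sketched as follows. The convex mixing with a single `weight` parameter and the final L2 normalization are assumptions for illustration; the summary only states that a weighted sum is used, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse_text_embeddings(class_emb, aux_embs, weight):
    """Mix the class-name embedding with the mean auxiliary-text embedding.

    weight = 0 keeps only the class-name embedding; weight = 1 keeps only
    the auxiliary (synonym or definition) embeddings.
    """
    aux_mean = [sum(e[d] for e in aux_embs) / len(aux_embs)
                for d in range(len(class_emb))]
    mixed = [(1.0 - weight) * c + weight * a
             for c, a in zip(class_emb, aux_mean)]
    return l2_normalize(mixed)

# Toy embeddings (illustrative values only).
class_emb = [1.0, 0.0]
aux_embs = [[0.0, 1.0], [0.0, 1.0]]
fused = fuse_text_embeddings(class_emb, aux_embs, weight=0.5)
```

Normalizing after mixing keeps the fused vector on the unit sphere, so cosine-similarity scores against patch features remain comparable across classes.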

7. Benchmarks and Empirical Outcomes

Across open-set recognition, continual learning, zero-shot communication, and segmentation, Semantic-Enhanced CLIP frameworks are reported to outperform vanilla CLIP and recent PEFT or prompt-based baselines.

  • In open-set domain generalization, SeeCLIP surpasses the next-best CLIP-based method by an average of $+3.0\%$ closed-set accuracy and $+5.0\%$ H-score across five benchmarks. Component-wise ablations confirm marked gains from semantic-aware prompt enhancement, diffusion-based hard negative synthesis, and duplex contrastive learning (Wang et al., 21 Nov 2025).
  • In class-incremental learning, SECA improves over hybrid adapter+prompt and knowledge distillation baselines by $4$–$5\%$ in “Last” accuracy on ImageNet-A, and bridges the modality gap by bringing refined prototypes closer to textual centroids (He et al., 3 Aug 2025).
  • In zero-shot semantic communication, SemCLIP achieves $85\%$ zero-shot accuracy at a $0.0015$ bandwidth ratio, outperforming DeepJSCC-IR and BT-IR by more than $50\times$ in bandwidth savings (Hu et al., 25 Feb 2025).
  • For open-vocabulary segmentation, ITACLIP contributes improvements of $1$ to $2$ mIoU over prior state-of-the-art via orthogonal semantic and architectural modifications (Aydın et al., 2024).
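The H-score cited for open-set results is not defined in this summary. Assuming the definition commonly used in the open-set literature, namely the harmonic mean of known-class accuracy and unknown-class accuracy, it can be computed as below; the function name and the example accuracies are illustrative, not reported results.

```python
def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class and unknown-class accuracy.

    Assumes the common open-set definition; returns 0.0 when both
    accuracies are zero to avoid division by zero.
    """
    total = known_acc + unknown_acc
    if total == 0:
        return 0.0
    return 2.0 * known_acc * unknown_acc / total

# Illustrative values only.
balanced = h_score(0.8, 0.6)
collapsed = h_score(0.9, 0.0)
```

Because the harmonic mean is dominated by the smaller term, a model that classifies known classes well but never rejects unknowns scores zero, which is why the metric is reported alongside closed-set accuracy.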

The modularity of SeeCLIP techniques allows their deployment across foundational models (including video–text and audio–text) and cooperative multi-agent scenarios.

