Semantic-Enhanced CLIP (SeeCLIP)

Updated 15 January 2026

The paper presents semantic-enhanced CLIP that employs prompt enhancement, semantic tokenization, and margin-based duplex contrastive learning to improve open-set recognition (e.g., +5% H-score boost).
It leverages adaptive knowledge transfer with textual priors and semantic-guided diffusion to bridge modality gaps and mitigate catastrophic forgetting in continual learning.
Additionally, SeeCLIP makes zero-shot semantic communication markedly more efficient, achieving up to 55–92× bandwidth reduction while maintaining high accuracy relative to baseline CLIP models.

Semantic-Enhanced CLIP (SeeCLIP) refers to a family of methods and frameworks that augment and adapt the original Contrastive Language-Image Pretraining (CLIP) model to endow it with deeper semantic understanding, finer-grained alignment, structural robustness, and broader task flexibility. These enhancements target limitations of vanilla CLIP in settings such as open-set domain generalization, continual learning, semantic communication, and dense prediction. Typical techniques involve prompt engineering, semantic tokenization, diffusion-based hard negative synthesis, adaptive channel coding, and textual-informed prototype refinement. Recent work anchors SeeCLIP as a flexible paradigm for bridging granular semantics, cross-domain transfer, and multi-modal communication in CLIP-based systems (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).

1. Motivation and Scope

CLIP provides powerful cross-modal representations by aligning images and texts in a joint embedding space. However, several recognized challenges motivate semantic enhancement:

Structural risk–open-space risk trade-offs in open-set recognition, where CLIP tends to overfit to known classes yet misclassifies unknowns.
Modality gaps: the generalizability of text-only CLIP classifiers is offset by limited plasticity, while naïve visual classifiers lack semantic richness.
Task-specific inflexibility, such as fixed class vocabularies, and sensitivity to domain shift or channel noise.
Underperformance in fine-grained or dense tasks where patch- or token-level semantic structures are not explicitly modeled.

Semantic-Enhanced CLIP approaches systematically address these deficits by introducing modules for prompt enrichment, semantic token decomposition, joint source-channel coding, and knowledge transfer mechanisms (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).

2. Semantic-Aware Prompt Enhancement and Tokenization

A core element of SeeCLIP is the explicit extraction and utilization of fine-grained semantic tokens from CLIP’s vision encoder. Instead of relying solely on global embeddings, SeeCLIP decomposes image representations via learnable queries applied to patch features from CLIP’s ViT backbone:

$\omega_i^{(k)} = \frac{\exp\left(q^{(k)} \cdot f_i\right)}{\sum_{j=1}^N \exp\left(q^{(k)} \cdot f_j\right)}$

$v_\text{sem}^{(k)} = \sum_{i=1}^N \omega_i^{(k)} f_i$

where $f_i$ are patch embeddings, $q^{(k)}$ are learnable queries, and $v_\text{sem}^{(k)}$ are the resulting semantic tokens. For known classes, prompts integrate these tokens along with class names and domain-average features:

$p_c = \left[\, \Phi(v_\text{dom}), \Psi_1(v_\text{sem}^{(1)}), \dots, \Psi_K(v_\text{sem}^{(K)}),\, \texttt{[classname]}\, \right]$

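Here $v_\text{dom}$ is the domain-average feature, and $\Phi$ and $\Psi_k$ map it and the semantic tokens into prompt tokens. Unknown-class prompts omit the class name and use class-agnostic vectors and an $\texttt{[unknown]}$ token. This prompt enrichment enables more nuanced vision-language alignment, which is especially important under open-set or domain-shifted conditions (Wang et al., 21 Nov 2025).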
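The two tokenization equations above amount to softmax attention pooling over patch embeddings, one pooling per learnable query. The following is a minimal sketch in plain Python, assuming toy 2-D vectors; the function names and the example values are illustrative and are not taken from the cited paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def semantic_tokens(patches, queries):
    """Pool N patch embeddings f_i into K semantic tokens v_sem^(k).

    patches: list of N vectors (the f_i).
    queries: list of K vectors (the q^(k)); learnable in the real model,
             fixed toy values here.
    """
    dim = len(patches[0])
    tokens = []
    for q in queries:
        # omega_i^(k): softmax over patches of q^(k) . f_i
        weights = softmax([dot(q, f) for f in patches])
        # v_sem^(k): attention-weighted sum of the patch embeddings
        token = [sum(w * f[d] for w, f in zip(weights, patches)) for d in range(dim)]
        tokens.append(token)
    return tokens

# Toy input (hypothetical): 3 patches in 2-D and 2 queries.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.0, 0.0], [10.0, -10.0]]
tokens = semantic_tokens(patches, queries)
```

A zero query weights all patches equally, so its token is the plain mean of the patches, while a strongly directional query concentrates almost all weight on the best-matching patch.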

3. Margin-Based Duplex Contrastive Learning and Diffusion Hard Negatives

To explicitly control the spatial arrangement of known and unknown classes in the embedding space, SeeCLIP frameworks introduce duplex contrastive learning:

Repulsion loss ensures the unknown prompt is sufficiently distant from known class samples:

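$\mathcal{L}_\text{rep} = \sum_{c=1}^C \max\left(0, \text{sim}(F_t(p_\text{unk}), F_v(X_c)) - \delta\right)$

where $\delta$ is the margin, $F_t$ and $F_v$ are the text and vision encoders, and $X_c$ denotes samples of known class $c$; the loss is nonzero only while the unknown prompt is more similar than $\delta$ to a known-class sample.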

Cohesion loss prevents the unknown prompt from drifting into semantically irrelevant regions, by anchoring it near the centroid of known prompts:

$\mathcal{L}_\text{coh} = \left\| F_t(p_\text{unk}) - \frac{1}{C}\sum_{c=1}^{C} F_t(p_c) \right\|_2^2$

Complementary to these objectives, a semantic-guided diffusion module synthesizes pseudo-unknown images by perturbing semantic tokens and re-rendering images with a latent diffusion model. This exposes the model to “hard unknowns”, samples that resemble known classes but manifest semantic edge cases, which is critical for calibrating open-space risk boundaries (Wang et al., 21 Nov 2025).
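The two duplex objectives can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: similarity is taken to be cosine similarity, and the hinge penalizes similarity above the margin, which is the direction consistent with keeping the unknown prompt distant from known-class samples. The function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def repulsion_loss(unk_text, known_image_feats, margin):
    """Hinge penalty for every known-class image feature whose similarity
    to the unknown-prompt embedding exceeds the margin (assumed direction)."""
    return sum(max(0.0, cosine(unk_text, x) - margin) for x in known_image_feats)

def cohesion_loss(unk_text, known_text_feats):
    """Squared L2 distance between the unknown-prompt embedding and the
    centroid of the known-class prompt embeddings."""
    count = len(known_text_feats)
    centroid = [sum(t[d] for t in known_text_feats) / count for d in range(len(unk_text))]
    return sum((u - c) ** 2 for u, c in zip(unk_text, centroid))

# Toy embeddings (hypothetical values).
unk = [1.0, 0.0]
known_images = [[1.0, 0.0], [0.0, 1.0]]   # one feature per known class
known_prompts = [[1.0, 0.0], [0.0, 1.0]]

l_rep = repulsion_loss(unk, known_images, margin=0.5)
l_coh = cohesion_loss(unk, known_prompts)
```

In this toy case only the first known-class feature violates the margin, so it alone contributes to the repulsion term, while the cohesion term measures how far the unknown prompt sits from the midpoint of the two known prompts.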

4. Semantic-Guided Adaptive Knowledge Transfer in Continual Learning

Semantic enhancement in continual learning leverages textual priors to alleviate catastrophic forgetting and support incremental class injection. The Semantic-Enriched Continual Adaptation (SECA) framework provides a cross-modal, selective distillation approach:

A pool of historical visual adapters encodes past knowledge.
For each incoming sample, textual embeddings assess the semantic alignment to each adapter using projectors $W_S$ and $W_V$. Instance-relevant adapters are assigned higher aggregation weights.
The student model is then trained to mimic only the relevant, textually-attended ensemble of past features, reducing cross-task semantic interference.
This mechanism is complemented by Semantic-Enhanced Visual Prototype Refinement (SE-VPR), where visual prototypes are updated using inter-class semantic affinities computed in the text embedding space, bridging the visual–text gap (He et al., 3 Aug 2025).
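One plausible way to realize text-guided adapter weighting is sketched below. Everything specific in it is an assumption made for illustration: the relevance score is taken to be an inner product between the projected text embedding and each projected adapter feature, the weights are a softmax over those scores, and the distillation target is the weighted sum of adapter features. The exact scoring and aggregation used by SECA may differ.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_adapters(text_emb, adapter_feats, W_S, W_V, temperature=1.0):
    """Weight historical adapter features by their (assumed) text relevance.

    text_emb:      text embedding associated with the incoming sample.
    adapter_feats: one visual feature per historical adapter.
    W_S, W_V:      projectors for the text and visual sides.
    Returns (weights, teacher_feature).
    """
    s = matvec(W_S, text_emb)
    scores = [dot(s, matvec(W_V, a)) / temperature for a in adapter_feats]
    weights = softmax(scores)
    dim = len(adapter_feats[0])
    teacher = [sum(w * a[d] for w, a in zip(weights, adapter_feats)) for d in range(dim)]
    return weights, teacher

def distill_loss(student_feat, teacher_feat):
    """Mean squared error between student and aggregated teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

# Toy setup (hypothetical): identity projectors, two historical adapters.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, teacher = aggregate_adapters([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, identity)
loss = distill_loss([1.0, 0.0], teacher)
```

The adapter whose feature aligns with the text embedding receives the larger weight, so the student is pulled mostly toward that adapter's output rather than toward an unweighted average of all past adapters.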

5. Transmission-Aware Prompt Learning and Channel-Robust SemCom

Semantic-Enhanced CLIP has also been instantiated for zero-shot semantic communication (“SemCom”) over noisy channels. The SemCLIP framework replaces raw-image transmission with CLIP-generated tokens, dramatically increasing bandwidth efficiency. The pipeline includes:

SNR-adaptive Deep Joint Source-Channel Coding (DeepJSCC) encodes CLIP image tokens into channel codes.
The receiver includes a Transmission-Aware Prompt Learning (TAPL) module that generates conditional prompt vectors, which are combined with context and class-name vectors before CLIP text encoding.
Losses are staged for token reconstruction and contrastive image–text alignment.
This architecture improves absolute zero-shot performance over the CLIP-FT baseline under noisy-channel conditions and achieves up to 55–92× bandwidth reduction at matched accuracy (Hu et al., 25 Feb 2025).
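To make the channel side of this pipeline concrete, the sketch below simulates sending a feature vector over an additive white Gaussian noise (AWGN) channel at a chosen SNR and computes a bandwidth ratio. It is a simplification under stated assumptions: the learned DeepJSCC encoder and decoder are replaced by plain power normalization, the channel is real-valued AWGN, and the token and image sizes are illustrative numbers rather than values from the cited paper.

```python
import math
import random

def power_normalize(x):
    """Scale x so that its average power per symbol equals 1."""
    power = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(power) for v in x]

def awgn(x, snr_db, rng):
    """Add Gaussian noise to a unit-power signal at the given SNR in dB."""
    noise_std = math.sqrt(10.0 ** (-snr_db / 10.0))
    return [v + rng.gauss(0.0, noise_std) for v in x]

def cosine(a, b):
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den

def bandwidth_ratio(num_channel_symbols, num_source_values):
    """Channel symbols used per source value (lower means less bandwidth)."""
    return num_channel_symbols / num_source_values

rng = random.Random(0)                      # fixed seed for repeatability
token = power_normalize([0.5, -1.0, 2.0, 0.25, -0.75, 1.5, -2.0, 1.0])
received_clean = awgn(token, 60.0, rng)     # nearly noiseless channel
received_noisy = awgn(token, 0.0, rng)      # noise power equals signal power

# Illustrative sizes only: a 512-value token versus a 224x224 RGB image.
ratio = bandwidth_ratio(512, 224 * 224 * 3)
```

Comparing `cosine(token, received_clean)` with `cosine(token, received_noisy)` gives a rough sense of how much semantic content survives at each SNR, and `ratio` shows why transmitting compact tokens instead of raw pixels can cut bandwidth by orders of magnitude.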

6. Architectural, Textual, and Image-level Enhancements

Other facets of SeeCLIP involve surgery at the model and data levels, as in ITACLIP’s training-free pipeline for open-vocabulary segmentation:

Self-self attention in the final ViT layer prioritizes diagonal (self-) and local attentions, sharpening localization.
Removal of the final FFN in ViT preserves fine-grained patch correlations.
Multi-layer attention fusion averages maps from the final and mid-deep layers for spatial/semantic complementarity.
Ensemble-based image augmentation (blur, grayscale, flips) is used to generate more robust per-patch representations.
Textual features are augmented using synonyms or definitions generated by LLMs, and combined via a weighted sum with original class embeddings (Aydın et al., 2024).

Such architectural and data-driven enhancements are orthogonal and can be integrated with the prompt- and token-based advances described above.
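Two of the training-free steps listed above, multi-layer attention fusion and the weighted combination of class embeddings with auxiliary text embeddings, reduce to simple averaging operations. The sketch below shows them in plain Python. The mixing weight, the use of an unweighted mean over auxiliary embeddings, and the final renormalization are assumptions for illustration and are not confirmed details of ITACLIP.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def augment_text_feature(class_emb, aux_embs, weight=0.2):
    """Blend a class-name embedding with the mean of auxiliary embeddings
    (for example, synonyms or definitions), then renormalize.

    weight is a hypothetical mixing coefficient in [0, 1].
    """
    dim = len(class_emb)
    aux_mean = [sum(a[d] for a in aux_embs) / len(aux_embs) for d in range(dim)]
    mixed = [(1.0 - weight) * c + weight * a for c, a in zip(class_emb, aux_mean)]
    return l2_normalize(mixed)

def fuse_attention_maps(maps):
    """Element-wise average of equally shaped 2-D attention maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]

# Toy values (hypothetical).
text_feature = augment_text_feature([1.0, 0.0], [[0.0, 1.0]], weight=0.5)
fused = fuse_attention_maps([
    [[1.0, 0.0], [0.0, 1.0]],   # e.g. a final-layer map
    [[0.0, 1.0], [1.0, 0.0]],   # e.g. a mid-layer map
])
```

Renormalizing after mixing keeps the blended text feature on the unit sphere, so cosine-similarity scores against patch features remain comparable across classes.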

7. Benchmarks and Empirical Outcomes

Across diverse domains—open-set recognition, continual learning, zero-shot communication, and segmentation—Semantic-Enhanced CLIP frameworks consistently eclipse vanilla CLIP and recent PEFT or prompt-based baselines.

In open-set domain generalization, SeeCLIP surpasses the next-best CLIP-based method in both average closed-set accuracy and H-score across five benchmarks, with an H-score gain of about 5%. Component-wise ablations confirm marked gains from semantic-aware prompt enhancement, diffusion-based hard negative synthesis, and duplex contrastive learning (Wang et al., 21 Nov 2025).
In class-incremental learning, SECA improves “Last” accuracy on ImageNet-A over hybrid adapter+prompt and knowledge distillation baselines, and bridges the modality gap by bringing refined prototypes closer to textual centroids (He et al., 3 Aug 2025).
In zero-shot semantic communication, SemCLIP sustains zero-shot accuracy at a reduced bandwidth ratio and outperforms DeepJSCC-IR and BT-IR in bandwidth savings (Hu et al., 25 Feb 2025).
For open-vocabulary segmentation, ITACLIP improves mIoU over the prior state of the art via orthogonal semantic and architectural modifications (Aydın et al., 2024).
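The H-score referred to in the open-set results is commonly defined in the open-set literature as the harmonic mean of known-class accuracy and unknown-class accuracy. Assuming that standard definition applies here, a short sketch shows why it rewards balanced behavior; the accuracy values are made up for illustration.

```python
def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class and unknown-class accuracy.

    Both inputs are fractions in [0, 1]. Returns 0.0 when both are zero.
    """
    total = known_acc + unknown_acc
    if total == 0.0:
        return 0.0
    return 2.0 * known_acc * unknown_acc / total

balanced = h_score(0.9, 0.6)     # reasonable on both known and unknown classes
collapsed = h_score(0.9, 0.0)    # never rejects unknowns
```

A model that classifies known classes well but never flags unknowns scores zero, which is why H-score is reported alongside closed-set accuracy rather than in place of it.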

The modularity of SeeCLIP techniques allows their deployment across foundational models (including video–text and audio–text) and cooperative multi-agent scenarios.

References

(Aydın et al., 2024) ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
(Hu et al., 25 Feb 2025) Zero-Shot Semantic Communication with Multimodal Foundation Models
(He et al., 3 Aug 2025) Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
(Wang et al., 21 Nov 2025) The Finer the Better: Towards Granular-aware Open-set Domain Generalization