Semantic-Enhanced CLIP (SeeCLIP)
- The paper presents semantic-enhanced CLIP that employs prompt enhancement, semantic tokenization, and margin-based duplex contrastive learning to improve open-set recognition (e.g., +5% H-score boost).
- It leverages adaptive knowledge transfer with textual priors and semantic-guided diffusion to bridge modality gaps and mitigate catastrophic forgetting in continual learning.
- Additionally, SeeCLIP achieves significant efficiency in zero-shot semantic communication, reducing bandwidth by 55–92× while keeping accuracy high relative to baseline CLIP models.
Semantic-Enhanced CLIP (SeeCLIP) refers to a family of methods and frameworks that augment and adapt the original Contrastive Language-Image Pretraining (CLIP) model to endow it with deeper semantic understanding, finer-grained alignment, structural robustness, and broader task flexibility. These enhancements target limitations of vanilla CLIP in settings such as open-set domain generalization, continual learning, semantic communication, and dense prediction. Typical techniques involve prompt engineering, semantic tokenization, diffusion-based hard negative synthesis, adaptive channel coding, and textual-informed prototype refinement. Recent work anchors SeeCLIP as a flexible paradigm for bridging granular semantics, cross-domain transfer, and multi-modal communication in CLIP-based systems (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).
1. Motivation and Scope
CLIP provides powerful cross-modal representations by aligning images and texts in a joint embedding space. However, several recognized challenges motivate semantic enhancement:
- Structural risk–open-space risk trade-offs in open-set recognition, where CLIP tends to overfit to known classes yet misclassifies unknowns.
- Modality gaps: the generalizability of text-only CLIP classifiers is offset by limited plasticity, while naïve visual classifiers lack semantic richness.
- Task-specific inflexibility, such as fixed class vocabularies, and sensitivity to domain shift or channel noise.
- Underperformance in fine-grained or dense tasks where patch- or token-level semantic structures are not explicitly modeled.
Semantic-Enhanced CLIP approaches systematically address these deficits by introducing modules for prompt enrichment, semantic token decomposition, joint source-channel coding, and knowledge transfer mechanisms (Wang et al., 21 Nov 2025, He et al., 3 Aug 2025, Hu et al., 25 Feb 2025).
2. Semantic-Aware Prompt Enhancement and Tokenization
A core element of SeeCLIP is the explicit extraction and use of fine-grained semantic tokens from CLIP’s vision encoder. Instead of relying solely on global embeddings, SeeCLIP decomposes image representations by applying learnable queries to the patch features of CLIP’s ViT backbone: the patch embeddings are the inputs, the learnable queries are applied to them, and the outputs are the resulting semantic tokens. For known classes, prompts integrate these tokens along with class names and domain-average features.
Unknown-class prompts omit the class name and use class-agnostic vectors and an [unknown] token. This prompt enrichment enables far more nuanced vision-language alignment, especially important under open-set or domain-shifted conditions (Wang et al., 21 Nov 2025).
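The tokenization step can be illustrated with a minimal sketch. It assumes the learnable queries pool patch features through scaled dot-product attention, which the text above does not confirm; the function names, dimensions, and random inputs are hypothetical stand-ins for real ViT features and trained queries.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_tokens(patches, queries):
    """Pool patch embeddings into one semantic token per query.

    patches: list of N vectors of length d (stand-ins for ViT patch features).
    queries: list of K vectors of length d (learnable in a real model).
    Assumption: scaled dot-product attention; the source does not specify it.
    """
    dim = len(patches[0])
    tokens = []
    for query in queries:
        # Similarity of this query to every patch, scaled by sqrt(d).
        scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim)
                  for patch in patches]
        weights = softmax(scores)
        # Each token is a convex combination of the patch embeddings.
        token = [sum(w * patch[j] for w, patch in zip(weights, patches))
                 for j in range(dim)]
        tokens.append(token)
    return tokens


rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 patches, d = 8
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]   # 4 queries
tokens = semantic_tokens(patches, queries)
print(len(tokens), len(tokens[0]))  # one token per query, each of length d
```

In a prompt-enhancement pipeline, tokens like these would be placed next to the class-name embedding (or next to an [unknown] token for the unknown-class prompt) before text encoding.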
3. Margin-Based Duplex Contrastive Learning and Diffusion Hard Negatives
To explicitly control the spatial arrangement of known and unknown classes in the embedding space, SeeCLIP frameworks introduce duplex contrastive learning:
- Repulsion loss ensures the unknown prompt stays sufficiently distant, by a margin, from known-class samples.
- Cohesion loss prevents the unknown prompt from drifting into semantically irrelevant regions by anchoring it near the centroid of the known prompts.
Complementary to these objectives, a semantic-guided diffusion module synthesizes pseudo-unknown images by perturbing semantic tokens and re-rendering images with a latent diffusion model. This exposes the model to “hard unknowns”, that is, samples that resemble known classes but sit at semantic edge cases, which is a critical mechanism for tightening open-space risk boundaries (Wang et al., 21 Nov 2025).
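The two objectives can be sketched as follows. The exact loss forms are not given in the text, so this assumes a hinge on cosine distance for repulsion and a plain cosine distance to the centroid for cohesion; the margin value and the toy 2-D vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def repulsion_loss(unknown_prompt, known_samples, margin=0.5):
    """Hinge penalty whenever a known sample is closer than `margin`."""
    penalties = [max(0.0, margin - cosine_distance(unknown_prompt, s))
                 for s in known_samples]
    return sum(penalties) / len(penalties)


def cohesion_loss(unknown_prompt, known_prompts):
    """Distance from the unknown prompt to the centroid of known prompts."""
    dim = len(known_prompts[0])
    centroid = [sum(p[j] for p in known_prompts) / len(known_prompts)
                for j in range(dim)]
    return cosine_distance(unknown_prompt, centroid)


# Toy 2-D embeddings (hypothetical values).
known_prompts = [[1.0, 0.0], [0.0, 1.0]]
known_samples = [[1.0, 0.0], [1.0, 1.0]]
unknown_prompt = [1.0, 1.0]

rep = repulsion_loss(unknown_prompt, known_samples)
coh = cohesion_loss(unknown_prompt, known_prompts)
print(rep > 0.0, coh < 1e-9)  # too close to known samples, but at the centroid
```

The two terms pull in opposite directions: repulsion alone would push the unknown prompt arbitrarily far away, while cohesion keeps it in the neighbourhood of the known-class prompts.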
4. Semantic-Guided Adaptive Knowledge Transfer in Continual Learning
Semantic enhancement in continual learning leverages textual priors to alleviate catastrophic forgetting and support incremental class injection. The Semantic-Enriched Continual Adaptation (SECA) framework provides a cross-modal, selective distillation approach:
- A pool of historical visual adapters encodes past knowledge.
- For each incoming sample, textual embeddings assess the semantic alignment to each adapter through two projectors. Instance-relevant adapters are assigned higher aggregation weights.
- The student model is then trained to mimic only the relevant, textually-attended ensemble of past features, reducing cross-task semantic interference.
- This mechanism is complemented by Semantic-Enhanced Visual Prototype Refinement (SE-VPR), where visual prototypes are updated using inter-class semantic affinities computed in the text embedding space, bridging the visual–text gap (He et al., 3 Aug 2025).
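The text-guided weighting of historical adapters can be sketched as a softmax over text-to-adapter affinities. This is an illustration under assumptions, not SECA's actual formulation: the embeddings are taken as already projected into a shared space, and `adapter_keys`, `temperature`, and the toy vectors are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapter_features(text_embedding, adapter_keys,
                               adapter_features, temperature=0.1):
    """Weight each historical adapter's output by its affinity to the text.

    text_embedding: projected text feature for the current sample.
    adapter_keys: one projected key vector per historical adapter.
    adapter_features: one visual feature vector per historical adapter.
    """
    scores = [sum(t * k for t, k in zip(text_embedding, key))
              for key in adapter_keys]
    weights = softmax(scores, temperature)
    dim = len(adapter_features[0])
    # Distillation target: the text-attended mixture of past features.
    fused = [sum(w * f[j] for w, f in zip(weights, adapter_features))
             for j in range(dim)]
    return weights, fused


text_embedding = [1.0, 0.0]
adapter_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
adapter_features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
weights, fused = aggregate_adapter_features(text_embedding, adapter_keys,
                                            adapter_features)
print(weights.index(max(weights)))  # index of the most relevant adapter
```

A low temperature makes the mixture close to a hard selection of the best-aligned adapter, which is one way to limit interference from unrelated past tasks.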
5. Transmission-Aware Prompt Learning and Channel-Robust SemCom
Semantic-Enhanced CLIP has also been instantiated for zero-shot semantic communication (“SemCom”) over noisy channels. The SemCLIP framework replaces raw-image transmission with CLIP-generated tokens, dramatically increasing bandwidth efficiency. The pipeline includes:
- SNR-adaptive Deep Joint Source-Channel Coding (DeepJSCC) encodes CLIP image tokens into channel codes.
- The receiver includes a Transmission-Aware Prompt Learning (TAPL) module, which generates conditional prompt vectors that are combined with context and class-name vectors before CLIP text encoding.
- Losses are staged for token reconstruction and contrastive image–text alignment.
- This architecture improves absolute zero-shot accuracy over the CLIP-FT baseline at the evaluated SNR and yields a 55–92× bandwidth reduction at matched accuracy (Hu et al., 25 Feb 2025).
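The channel side of this pipeline can be made concrete with a small sketch of an additive white Gaussian noise (AWGN) channel at a target SNR, plus a bandwidth-ratio helper. It assumes real-valued symbols and the common DeepJSCC convention of channel uses per source dimension; the function names, the 224×224×3 source size, and the symbol counts are hypothetical and not taken from SemCLIP.

```python
import math
import random


def awgn_channel(symbols, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR (in dB)."""
    signal_power = sum(s * s for s in symbols) / len(symbols)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in symbols]


def bandwidth_ratio(channel_symbols, source_dimension):
    """Channel uses per source dimension (lower means less bandwidth)."""
    return channel_symbols / source_dimension


rng = random.Random(0)
# Stand-in for encoded token symbols: random unit-power values.
symbols = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(20000)]
received = awgn_channel(symbols, snr_db=5.0, rng=rng)

# Check the empirical SNR against the target.
noise = [r - s for r, s in zip(received, symbols)]
signal_power = sum(s * s for s in symbols) / len(symbols)
noise_power = sum(n * n for n in noise) / len(noise)
measured_snr_db = 10 * math.log10(signal_power / noise_power)
print(round(measured_snr_db, 1))

# Hypothetical sizing: 226 channel symbols for a 224x224 RGB image.
ratio = bandwidth_ratio(226, 3 * 224 * 224)
print(round(ratio, 4))
```

Sending a few hundred token-derived symbols instead of a code sized to the pixel grid is what drives the ratio down to the order of $10^{-3}$.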
6. Architectural, Textual, and Image-level Enhancements
Other facets of SeeCLIP involve surgery at the model and data levels, as in ITACLIP’s training-free pipeline for open-vocabulary segmentation:
- Self-self attention in the final ViT layer prioritizes diagonal (self-) and local attentions, sharpening localization.
- Removal of the final FFN in ViT preserves fine-grained patch correlations.
- Multi-layer attention fusion averages maps from the final and mid-deep layers for spatial/semantic complementarity.
- Ensemble-based image augmentation (blur, grayscale, flips) is used to generate more robust per-patch representations.
- Textual features are augmented using synonyms or definitions generated by LLMs, and combined via a weighted sum with original class embeddings (Aydın et al., 2024).
Such architectural and data-driven enhancements are orthogonal and can be integrated with the prompt- and token-based advances described above.
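The textual enhancement in the list above amounts to mixing the class-name embedding with embeddings of LLM-generated synonyms or definitions. A minimal sketch, assuming a simple convex combination followed by L2 normalization; the weight `alpha` and the toy vectors are hypothetical, and ITACLIP's actual weighting may differ.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def blend_text_features(class_embedding, auxiliary_embeddings, alpha=0.8):
    """Weighted sum of the class embedding and the mean auxiliary embedding.

    alpha weights the original class-name embedding; (1 - alpha) weights
    the average of the synonym or definition embeddings.
    """
    dim = len(class_embedding)
    count = len(auxiliary_embeddings)
    aux_mean = [sum(a[j] for a in auxiliary_embeddings) / count
                for j in range(dim)]
    mixed = [alpha * c + (1.0 - alpha) * a
             for c, a in zip(class_embedding, aux_mean)]
    return l2_normalize(mixed)


class_embedding = [1.0, 0.0]             # stand-in for the class-name feature
auxiliary = [[0.0, 1.0], [0.0, 1.0]]     # stand-ins for synonym/definition features
blended = blend_text_features(class_embedding, auxiliary, alpha=0.8)
print([round(x, 3) for x in blended])
```

Keeping `alpha` close to 1 preserves the original class embedding as the dominant direction while still letting the auxiliary text shift it slightly.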
7. Benchmarks and Empirical Outcomes
Across diverse domains (open-set recognition, continual learning, zero-shot communication, and segmentation), Semantic-Enhanced CLIP frameworks are reported to outperform vanilla CLIP and recent PEFT or prompt-based baselines.
- In open-set domain generalization, SeeCLIP surpasses the next-best CLIP-based method in average closed-set accuracy and H-score across five benchmarks, with an H-score gain of about 5%. Component-wise ablations confirm marked gains from semantic-aware prompt enhancement, diffusion-based hard negative synthesis, and duplex contrastive learning (Wang et al., 21 Nov 2025).
- In class-incremental learning, SECA improves “Last” accuracy on ImageNet-A over hybrid adapter+prompt and knowledge distillation baselines, and narrows the modality gap by bringing refined prototypes closer to textual centroids (He et al., 3 Aug 2025).
- In zero-shot semantic communication, SemCLIP operates at a bandwidth ratio of $0.0015$ and outperforms DeepJSCC-IR and BT-IR in bandwidth savings (Hu et al., 25 Feb 2025).
- For open-vocabulary segmentation, ITACLIP contributes improvements of 1–2 mIoU over prior state-of-the-art through orthogonal semantic and architectural modifications (Aydın et al., 2024).
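For readers unfamiliar with the H-score used in the open-set results above, the following sketch assumes its usual definition in the open-set literature: the harmonic mean of known-class accuracy and unknown-class accuracy. The accuracy values are made up for illustration and are not results from any of the cited papers.

```python
def h_score(known_accuracy, unknown_accuracy):
    """Harmonic mean of known-class and unknown-class accuracy."""
    total = known_accuracy + unknown_accuracy
    if total == 0:
        return 0.0  # both accuracies are zero
    return 2.0 * known_accuracy * unknown_accuracy / total


balanced = h_score(0.9, 0.6)    # hypothetical accuracies
lopsided = h_score(0.99, 0.10)  # strong on known classes, weak on unknowns
print(round(balanced, 2), round(lopsided, 2))
```

Because the harmonic mean is dominated by the smaller term, a model that ignores unknowns scores poorly even with near-perfect closed-set accuracy, which is why H-score is reported alongside accuracy.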
The modularity of SeeCLIP techniques allows their deployment across foundational models (including video–text and audio–text) and cooperative multi-agent scenarios.
References
- (Aydın et al., 2024) ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
- (Hu et al., 25 Feb 2025) Zero-Shot Semantic Communication with Multimodal Foundation Models
- (He et al., 3 Aug 2025) Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
- (Wang et al., 21 Nov 2025) The Finer the Better: Towards Granular-aware Open-set Domain Generalization