
Chinese CLIP: Vision-Language Pretraining for Chinese

Updated 18 February 2026
  • Chinese CLIP is a family of contrastive vision-language models designed for Chinese text and images, enabling effective retrieval, zero-shot classification, and scene text recognition.
  • These models leverage dual-tower architectures with ViT-based image encoders and Chinese-specific transformer text encoders, utilizing strategies like two-stage pretraining and knowledge distillation.
  • Applications range from cultural heritage retrieval to fine-grained region-level tasks, while ongoing work focuses on model compression and improved semantic alignment.

Chinese CLIP refers to a family of contrastive vision-language pretraining models and associated resources expressly designed for high-performance cross-modal understanding in Chinese. These models extend and adapt the CLIP (Contrastive Language–Image Pretraining) paradigm to Chinese, addressing challenges such as data scarcity, linguistic structure, large vocabulary, and script-specific considerations. Chinese CLIP models are fundamental to a broad range of applications, including retrieval, zero-shot classification, scene text recognition, and generative modeling in the Chinese language context, as well as forming a cornerstone of multilingual and bilingual multimodal systems.

1. Foundations and Dataset Construction

Chinese CLIP models require large-scale, high-quality Chinese image–text pairs. Early advances centered on aggregating and filtering Chinese-labeled data from LAION-5B (~108M pairs), Wukong (~72M), translated English corpora (e.g., Visual Genome, MSCOCO captions), and internal collections, resulting in datasets of up to 200M image–text pairs spanning both general domains and Chinese-native sources such as e-commerce and news (Yang et al., 2022).

Subsequent efforts have further focused on data curation rigor and semantic coverage. For instance, DanQing compiles 100M pairs from the 2024–2025 Common Crawl, using a rigorous, multi-stage filtration process including content safety, lexical and structural analysis, information density metrics, redundancy elimination, and explicit Chinese–image alignment validation. DanQing achieves a notable balance: 100% URL validity, strict NSFW filtering, and a higher semantic–visual uniformity than previous datasets, supporting models trained for state-of-the-art performance in Chinese tasks (Shen et al., 15 Jan 2026).
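A multi-stage filtration pipeline of this kind can be sketched as a sequence of cheap, independent checks. The sketch below is illustrative only: the field names, thresholds, and stage ordering are assumptions for exposition, not DanQing's actual implementation, and real pipelines add perceptual deduplication and learned alignment scoring.

```python
# Illustrative multi-stage filter for image-text pairs, loosely following the
# stages described above (validity, content safety, information density,
# redundancy elimination). All field names and thresholds are assumptions.

def filter_pairs(pairs, min_len=4, max_len=128):
    seen_texts = set()
    kept = []
    for p in pairs:
        # Stage 1: basic validity -- the record must carry a URL and a caption.
        if not p.get("url") or not p.get("text"):
            continue
        # Stage 2: content safety -- drop pairs flagged by an upstream NSFW check.
        if p.get("nsfw", False):
            continue
        # Stage 3: information density -- reject captions that are too short or long.
        if not (min_len <= len(p["text"]) <= max_len):
            continue
        # Stage 4: redundancy elimination -- exact-duplicate captions are dropped.
        if p["text"] in seen_texts:
            continue
        seen_texts.add(p["text"])
        kept.append(p)
    return kept
```

In production each stage would typically run as a separate distributed pass so that the expensive checks (e.g., Chinese–image alignment validation with a pretrained model) only see pairs that survive the cheap ones.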

| Dataset | Size (M) | Year(s) | Filtering | Domain coverage |
|---------|----------|---------|-----------|-----------------|
| Wukong | 100 | 2022 | Moderate, manual | Generic, e-commerce, news |
| DanQing | 100 | 2024–25 | Deep multi-stage | Contemporary, balanced |
| TaiSu | 166 | 2022 | Strict | Generic |
| Zero | 250 | 2022 | Variable | General (CCMB) |

2. Model Architectures and Pretraining Paradigms

Chinese CLIP models typically employ a dual-tower architecture: an image encoder (ResNet-50 or Vision Transformer — ViT) and a Chinese-specific text encoder (BERT, RoBERTa, XLM-R, or custom transformers) (Yang et al., 2022, Chen et al., 2022, Yuan et al., 16 May 2025). All architectures implement projection heads mapping modality-specific outputs into a shared embedding space of 512–1024 dimensions.
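The dual-tower design with projection heads can be sketched in a few lines: each tower's output is linearly projected into the shared space and L2-normalized so that similarity reduces to a dot product. The hidden sizes and the random features below are placeholders standing in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # shared embedding dimension, as in the Base-size models

def project(features, weight):
    """Linear projection head followed by L2 normalization."""
    z = features @ weight
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Hypothetical tower outputs for a batch of 4 (hidden size 768 for both towers
# here, purely for illustration -- real towers differ in width).
img_features = rng.normal(size=(4, 768))   # [batch, vit_hidden]
txt_features = rng.normal(size=(4, 768))   # [batch, bert_hidden]

W_img = rng.normal(size=(768, EMBED_DIM)) * 0.02
W_txt = rng.normal(size=(768, EMBED_DIM)) * 0.02

img_emb = project(img_features, W_img)     # [batch, 512], unit norm
txt_emb = project(txt_features, W_txt)     # [batch, 512], unit norm
similarity = img_emb @ txt_emb.T           # cosine similarity matrix [batch, batch]
```

Because both embeddings are unit-normalized, the similarity matrix feeds directly into the contrastive objectives discussed in Section 3.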

Several variants have been introduced:

  • Chinese CLIP (Yang et al., 2022): Five sizes from RN50 (77M) to ViT-H/14 (958M), text encoders based on Chinese RoBERTa (Base/Large), matched with ViT variants. The largest uses ViT-H/14 (32L, 1280-D) and RoBERTa-wwm-Large (24L, 1024-D).
  • AltCLIP (Chen et al., 2022): Replaces CLIP’s text encoder with XLM-R (Large) for multilingual support, leveraging a projection head to maintain compatibility with the ViT-L/14 visual backbone.
  • EfficientCLIP (Wang et al., 2021): Introduces a Chinese transformer text encoder distilled from English CLIP and employs iterative confident-noise filtering with additional masked language modeling.
  • DanQing + SigLIP2 (Shen et al., 15 Jan 2026): Employs a ViT visual encoder and 12-layer BERT-style text encoder, with improved projection modules and training objectives.
| Variant | Image Encoder | Text Encoder | Params (M) | Embed Dim |
|---------|---------------|--------------|------------|-----------|
| CN-CLIP ViT-B/16 | ViT-B/16 | RoBERTa-wwm-Base (12L) | 188 | 512 |
| CN-CLIP ViT-H/14 | ViT-H/14 | RoBERTa-wwm-Large (24L) | 958 | 1024 |
| AltCLIP | ViT-L/14 | XLM-R Large (24L) | 3,220 | 512 |
| EfficientCLIP | ViT-B/32 | CN-transformer (12–32L) | 151–225 | 512 |
| SigLIP2 (DanQing) | ViT-L/16 | BERT-style (12L) | 1,000 | 1,024 |

3. Training Strategies and Objectives

Pretraining in Chinese CLIP systems consistently applies symmetric contrastive (InfoNCE) loss over minibatches of paired images and texts. Several distinctive approaches optimize data and model efficiency:

  • Two-Stage Pretraining (Yang et al., 2022, Chen et al., 2022): First lock the visual encoder and tune only the text encoder (Locked-Image Tuning, LiT, or teacher distillation); then jointly optimize all parameters. This regime is especially effective on cross-lingual and translated data.
  • Knowledge Distillation (Wang et al., 2021, Chen et al., 2022, Zhang et al., 2024): Aligns a Chinese text encoder to a pretrained CLIP (English) encoder using large-scale parallel data before contrastive image–text training.
  • Data-Centric Noise Handling: EfficientCLIP employs Ensemble Confident Learning (EnCL) to filter noisy pairs during training, selecting high-confidence samples to maintain retrieval accuracy and textual generalizability (Wang et al., 2021).
  • Loss Variants and Augmentation: Major variants extend CLIP’s contrastive objective: the SigLIP loss (binary classification per pair), patch-level and region-wise discriminative objectives (as in FG-CLIP 2 (Xie et al., 13 Oct 2025)), and masked language modeling (for text-tower pretraining).
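The two main objectives above can be contrasted on a toy similarity matrix for a batch of N paired embeddings, where matched pairs lie on the diagonal. This is a pedagogical numpy sketch, not any paper's exact training code; the temperature and SigLIP bias values are common defaults, not prescriptions.

```python
import numpy as np

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE: softmax cross-entropy over rows (image-to-text)
    and columns (text-to-image), averaged."""
    logits = sim / temperature
    n = logits.shape[0]
    # Log-softmax over rows and over columns.
    row_ls = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    col_ls = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(n)
    loss_i2t = -row_ls[diag, diag].mean()
    loss_t2i = -col_ls[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def siglip(sim, t=10.0, b=-10.0):
    """SigLIP-style loss: independent binary classification per (image, text)
    pair, label +1 on the diagonal and -1 elsewhere."""
    n = sim.shape[0]
    labels = 2 * np.eye(n) - 1
    # log(1 + exp(-y * z)) is the per-pair sigmoid cross-entropy.
    return np.mean(np.log1p(np.exp(-labels * (t * sim + b))))

# A well-aligned batch: diagonal similarities dominate off-diagonal ones.
sim = np.eye(4) * 0.9 + 0.1
```

The key practical difference is that InfoNCE normalizes over the whole batch (so its gradient couples all pairs, favoring very large batches), while the sigmoid loss scores each pair independently, which is friendlier to sharded computation.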

Hyperparameters such as batch size (up to 32,768 for larger models), cosine learning schedules, and mixed-precision optimizers (e.g., AdamW, 8-bit Lion) are commonly used. Extreme scale is supported by gradient checkpointing and hardware accelerators (A800/H100 GPUs).

4. Downstream Tasks and Evaluation Protocols

Chinese CLIP evaluation follows established CLIP benchmarks adapted to Chinese:

  • Cross-Modal Retrieval: Standard datasets include MUGE (e-commerce), COCO-CN, Flickr30K-CN, and recent Chinese fine-grained benchmarks such as DCI-CN, LIT-CN, BoxClass-CN (Yang et al., 2022, Shen et al., 15 Jan 2026, Xie et al., 13 Oct 2025).
  • Zero-Shot Image Classification: Using ELEVATER and similar protocols, labels and prompts are translated to Chinese. DanQing (SigLIP2-B/32) achieves 65.4% average accuracy across 12 tasks, outperforming prior Chinese-pretrained baselines (Shen et al., 15 Jan 2026).
  • Scene Text and Character Recognition: Chinese CLIP-like models align images to Ideographic Description Sequences (IDS), enabling genuine zero-shot and few-shot recognition for novel characters and complex layouts (Yu et al., 2023, Li et al., 5 Jun 2025).
  • Fine-Grained and Region-Level Tasks: FG-CLIP 2 introduces region–text alignment, long-caption retrieval, and bounding box classification specific to Chinese fine-grained semantics (Xie et al., 13 Oct 2025).
  • Cultural Heritage and Local Matching: Domain-adapted versions apply local-alignment at inference, as in LACLIP on the CulTi dataset, achieving strong motif-to-text retrieval in ancient silk and mural datasets (Yuan et al., 16 May 2025).
| Model | MUGE MR (ZS/FT) | COCO-CN R@1 | ImageNet-CN Top-1 | DCI-CN R@1 | LIT-CN R@1 |
|-------|-----------------|-------------|-------------------|------------|------------|
| CN-CLIP-L/14 | 74.1 / 80.1 | — | — | — | 45.7 |
| AltCLIP | — | 63.9 | 59.6 | — | — |
| FG-CLIP 2 | — | — | — | 74.3 | 82.4 |
| DanQing/SigLIP2-B/32 | — | — | — | — | — |
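The zero-shot classification protocol itself is simple: class labels are wrapped in Chinese prompt templates, each prompt is embedded by the text tower, per-class template embeddings are averaged, and an image is assigned to the class with the highest cosine similarity. In the sketch below, `encode_text` is a deterministic stub standing in for a real text tower, and the prompt templates are illustrative, not a benchmark's official set.

```python
import numpy as np

# Hypothetical Chinese prompt templates ("a photo of {}", etc.).
TEMPLATES = ["äø€å¼ {}ēš„ē…§ē‰‡", "{}ēš„ē…§ē‰‡", "äø€å¼ åŒ…å«{}ēš„ē…§ē‰‡"]

def encode_text(text, dim=512):
    """Stub embedding: a seeded pseudo-random unit vector per string.
    A real system calls the model's text tower here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def classify(image_emb, class_names):
    """Return the class whose averaged prompt embedding best matches the image."""
    class_embs = []
    for name in class_names:
        embs = np.stack([encode_text(t.format(name)) for t in TEMPLATES])
        mean = embs.mean(axis=0)
        class_embs.append(mean / np.linalg.norm(mean))
    sims = np.stack(class_embs) @ image_emb   # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

Prompt ensembling (averaging over templates) is what makes the prompt-sensitivity issue noted in Section 6 somewhat less severe, since no single phrasing dominates the class embedding.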

5. Specializations and Practical Developments

Several specialized Chinese CLIP variants and frameworks have addressed domain-specific needs:

  • Scene Text Retrieval and Layout Robustness: CSTR-CLIP augments Chinese CLIP with text-region convolution and global feature fusion, excelling at vertical, cross-line, and partial-scene text layouts (e.g., DL-CSVTR), outperforming prior region-cropping paradigms in both accuracy (+18.8% mAP) and speed (Li et al., 5 Jun 2025).
  • Cultural Heritage: By fine-tuning CN-CLIP and employing patch-wise similarity aggregation (LACLIP), cross-modal retrieval over small, motif-focused datasets (e.g., CulTi) achieves state-of-the-art mean recall—critical for the fine granularity needed in arts and cultural studies (Yuan et al., 16 May 2025).
  • Lightweight Multilingual Compression: DC-CLIP distills large multilingual vision-language models into compact (≈300M parameter) models with only moderate accuracy loss, suitable for edge deployment in both Chinese and English (Zhang et al., 2024).
  • Cross-lingual Generative Transfer: Approaches such as IAP align a Chinese CLIP text encoder with its English counterpart in Stable Diffusion, facilitating robust Chinese prompt-based image synthesis with 5–10% of the data required by monolingual diffusion models (Hu et al., 2023).
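Patch-wise (local) similarity aggregation of the kind used for motif-level matching can be sketched as scoring an image–text pair by its best-matching image patch, optionally blended with the global embedding. The shapes and the blending weight below are assumptions for illustration, not LACLIP's exact formulation.

```python
import numpy as np

def local_score(patch_embs, text_emb, alpha=0.5, global_emb=None):
    """Score an image-text pair by its best-matching patch.

    patch_embs: [num_patches, dim] unit vectors (patch-level image features)
    text_emb:   [dim] unit vector (text embedding)
    global_emb: optional [dim] unit vector (global image embedding); if given,
                blend global and local similarity with weight alpha.
    """
    patch_sims = patch_embs @ text_emb   # cosine similarity per patch
    local = patch_sims.max()             # best-matching patch wins
    if global_emb is None:
        return local
    return alpha * (global_emb @ text_emb) + (1 - alpha) * local
```

The max over patches lets a small motif (a single embroidered figure, a line of scene text) drive the match even when the global embedding is dominated by background content.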

6. Limitations and Open Challenges

Despite their successes, Chinese CLIP approaches are constrained by several factors:

  • Proper Noun and Domain-Specific Gaps: Even large-scale models show difficulty with rare proper names, new entities, and colloquial or dialectal phrases (Yang et al., 2022, Chen et al., 2022).
  • Reliance on Machine Translation: Many pretraining pipelines use machine-translated English datasets for Chinese, potentially propagating translation artifacts and errors (Chen et al., 2022, Visheratin, 2023).
  • Scaling and Data Plateaus: Some legacy datasets (e.g., Wukong, Zero) exhibit performance plateaus as data scale increases beyond ~30M; newer pipelines like DanQing mitigate this by maintaining higher semantic diversity (Shen et al., 15 Jan 2026).
  • Prompt Sensitivity and Negation Failures: Classification and retrieval performance can be highly sensitive to prompt design; certain negations drastically reduce accuracy (e.g., "no car" misclassified as "other") (Yang et al., 2022).
  • Resource Constraints for Large Models: The highest retrieval and classification scores are delivered by models pushing 1B–3B parameters, incurring substantial memory and hardware requirements (Yang et al., 2022, Xie et al., 13 Oct 2025). Recent works on model compression and efficient training aim to address deployment in practical and edge scenarios (Wang et al., 2021, Zhang et al., 2024).

7. Resources, Benchmarks, and Future Directions

Chinese CLIP models and datasets are publicly available, e.g., via Chinese CLIP GitHub, AltCLIP, and ModelScope model hubs. The DanQing dataset, with up-to-date web coverage and open licensing, is positioned to become a new standard for Chinese vision-language pretraining (Shen et al., 15 Jan 2026).

Emerging directions include:

  • Bilingual and Multilingual Fine-Grained Alignment: Models like FG-CLIP 2 demonstrate state-of-the-art Chinese and English fine-grained retrieval using novel intra-text and cross-region objectives (Xie et al., 13 Oct 2025).
  • Scene and Region-Specific Adaptation: Ongoing integration of segmentation maps, patch-level loss, and multi-granularity matching address the challenges of Chinese typography, dense layouts, and local referencing (Li et al., 5 Jun 2025).
  • Knowledge Distillation and Model Compression: Techniques for compressing large CLIP variants into efficient, on-device deployable models are gaining traction, especially in multilingual contexts (Zhang et al., 2024, Wang et al., 2021).
  • Up-to-date Concept Coverage: Incorporating datasets with recent concepts and adaptivity to linguistic drift (e.g., DanQing) is critical for robust deployment in fast-evolving Chinese media contexts (Shen et al., 15 Jan 2026).

Chinese CLIP research exemplifies ongoing efforts to harmonize data quality, architectural scalability, cross-lingual knowledge transfer, and deployment efficiency in non-English vision–language intelligence. It remains a rapidly advancing foundation for multimodal Chinese understanding and applications.
