Chinese Vision-Language Pre-training Advances

Updated 18 January 2026
  • Chinese Vision-Language Pre-training is a research domain that combines extensive Chinese image–text datasets with dual-encoder architectures to enable robust multimodal understanding.
  • It employs advanced methodologies like contrastive and generative learning, instruction tuning, and efficient distributed algorithms to optimize performance across tasks.
  • The field emphasizes scalable dataset construction and rigorous evaluation protocols, leading to state-of-the-art results in image classification, retrieval, captioning, and video analysis.

Chinese Vision-Language Pre-training (VLP) refers to the family of methodologies and large-scale models developed for multimodal learning across visual data (images, frames, video) and Chinese natural language. The domain encompasses dataset construction, model architectures, training objectives, distributed algorithms, and benchmark protocols specialized for Chinese or bilingual (Chinese/English) settings. The field has accelerated since 2021 with the emergence of Chinese CLIP, Wukong, WenLan, comprehensive bilingual models, decoder-based generative models, and instruction-tuned multimodal LLMs. This progress has been driven by the release of increasingly massive, higher-quality web-crawled corpora and by the adaptation of state-of-the-art contrastive, generative, and retrieval-centric learning paradigms to the peculiarities of the Chinese language and digital ecosystem.

1. Large-scale Chinese Vision-Language Datasets

The development of high-quality Chinese VLP models is intrinsically tied to the availability of large, sophisticated image/video–text datasets. Early efforts such as WenLan’s RUC-CAS-WenLan (30 M image–text pairs) (Huo et al., 2021) and AIC-ICC provided initial resources. Wukong introduced a 100 M web-scraped Chinese benchmark, filtered for relevance and diversity using a combination of language and visual heuristics, and hosted a human-verified evaluation split (Gu et al., 2022). Chinese CLIP expanded the pool to 200 M pairs by aggregating LAION-5B (zh-tag), Wukong, and English–Chinese translations and adding rigorous filtering by semantically aligned cosine scores (Yang et al., 2022).

Recent years have shifted toward ultra-scale and rigorously quality-controlled datasets. DanQing (2026) delivers 100 M Chinese image–text pairs mined from 2024–2025 Common Crawl, using multi-stage linguistic, semantic, and cross-modal filtering, culminating in an up-to-date resource with improved alignment and semantic density over predecessors (Shen et al., 15 Jan 2026). BM-6B aggregates 3.0 B Chinese and 3.0 B English pairs through deduplicated and multi-source cleaning, supporting true bilingual training at scales comparable to LAION (Guo et al., 2024).

Video–language datasets have seen similar expansion. Alivol-10M (Victor) (Lei et al., 2021) and Youku-mPLUG (10 M pairs) (Xu et al., 2023) cover the Chinese e-commerce and open-video web, with strict safety, topic, and alignment checks, forming the backbone for Chinese video-text pre-training and benchmarking.

2. Model Architectures and Contrastive Frameworks

The foundational design pattern in Chinese VLP is the dual-encoder (two-tower) architecture, as exemplified by Chinese CLIP, Wukong, and WenLan-BriVL. Each modality is encoded independently (ViT/Swin-transformer for images or video frames; Chinese BERT/Roberta for text), with cross-modal similarity measured in a shared embedding space. Early approaches adopted batchwise InfoNCE contrastive losses, requiring large batches for sufficient negatives (Yang et al., 2022, Gu et al., 2022), or momentum-based MoCo negatives as in WenLan's BriVL (Huo et al., 2021), which uses queues to scale negative sampling efficiently under weak-alignment assumptions.
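The batchwise InfoNCE objective described above can be sketched as follows. This is a minimal single-process illustration; function and variable names are assumptions, not identifiers from any of the cited codebases:

```python
# Sketch of a symmetric batchwise InfoNCE loss for a dual-encoder model.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) arrays; pair i is the positive for
    row i, and all other rows in the batch serve as in-batch negatives."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    labels = np.arange(len(logits))              # diagonal entries are positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the negatives come from the batch itself, larger batches give harder and more diverse negatives, which is why large effective batch sizes (or MoCo-style queues) matter at this scale.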

Architectural innovation has included:

  • Locked-Image Text Tuning (LiT): Stage-wise training that freezes image encoders in early rounds, adapting textual modules to pre-trained vision backbones for efficient convergence and improved downstream robustness (Yang et al., 2022, Gu et al., 2022).
  • Token-wise similarity metrics (FILIP-style): Fine-grained cross-modal alignment through attention over patch-level (visual) and token-level (textual) embeddings, proven critical for phrase localization and retrieval (Gu et al., 2022).
  • Reduced-token interaction: Summarizes visual features to reduce the computational burden of token matching, striking a trade-off between compute and retrieval accuracy (Gu et al., 2022).
  • Generative pre-training: Unified autoregressive transformers (ERNIE-ViLG) that accommodate bidirectional generation (text↔image), employing discrete VQGAN encoding and UniLM-style attention masking (Zhang et al., 2021).
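The FILIP-style token-wise similarity in the list above can be sketched as a late-interaction score: each text token is matched to its most similar image patch (and vice versa), and the matches are averaged. This is an illustration assuming pre-normalized embeddings, not the papers' actual code:

```python
# Sketch of FILIP-style fine-grained (token-wise) cross-modal similarity.
import numpy as np

def token_wise_similarity(patch_emb, token_emb):
    """patch_emb: (num_patches, dim), token_emb: (num_tokens, dim);
    both are assumed to be L2-normalized."""
    sim = token_emb @ patch_emb.T            # (num_tokens, num_patches)
    txt_to_img = sim.max(axis=1).mean()      # best-matching patch per text token
    img_to_txt = sim.max(axis=0).mean()      # best-matching token per image patch
    return 0.5 * (txt_to_img + img_to_txt)
```

The max-over-patches step is what makes this metric sensitive to phrase-level grounding, at the cost of keeping all token embeddings rather than a single pooled vector per modality.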

3. Distributed Pre-training and Data Efficiency

Scale has necessitated significant advances in distributed optimization and memory management for Chinese VLP. The M2-Encoder introduces grouped aggregation ITC loss (GBA-ITC), partitioning GPUs into groups for local contrastive computation, thereby dramatically reducing communication and memory overhead and enabling effective batch sizes over 100k on 256 GPUs—achieving ~1.6× throughput improvement over standard All-Gather ITC (Guo et al., 2024). PaddlePaddle’s hybrid parallelism, optimizer sharding, and FP16 training are widely adopted for practical scaling of 10+ billion parameter architectures (Zhang et al., 2021, Guo et al., 2024).
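The grouped-aggregation idea behind GBA-ITC can be simulated in a single process as below: instead of gathering negatives across the entire global batch, each group of ranks contrasts only within its own sub-batch. Real implementations replace the loop with per-group collective operations across GPUs; this sketch only shows the loss structure:

```python
# Single-process simulation of grouped-aggregation contrastive (ITC) loss.
import numpy as np

def grouped_itc_loss(img_emb, txt_emb, num_groups, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    losses = []
    # each "group" stands in for a subset of GPUs sharing negatives
    for group in np.array_split(np.arange(len(img)), num_groups):
        logits = img[group] @ txt[group].T / temperature
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        diag = np.arange(len(group))
        losses.append(-log_probs[diag, diag].mean())  # in-group positives
    return float(np.mean(losses))
```

With `num_groups=1` this reduces to standard All-Gather ITC; larger group counts shrink the similarity matrix each group must materialize, which is the source of the reported memory and communication savings.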

DC-CLIP exemplifies further data- and compute-efficiency for multilingual vision–language deployment via two-stage knowledge distillation and contrastive alignment. Compact student models inherit feature spaces from AltCLIP teachers through Smooth-L1 regression and InfoNCE, resulting in 300 M parameter models with minimal latency suitable for edge devices (Zhang et al., 2024).
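The two DC-CLIP-style distillation terms, Smooth-L1 feature regression toward the teacher plus InfoNCE contrastive alignment, can be combined as in this sketch; the weighting `alpha`, temperature, and `beta` are illustrative assumptions rather than the paper's settings:

```python
# Sketch of a combined feature-distillation + contrastive objective.
import numpy as np

def smooth_l1(student, teacher, beta=1.0):
    """Huber-style regression loss between student and teacher features."""
    diff = np.abs(student - teacher)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()

def distill_loss(student_emb, teacher_emb, txt_emb, alpha=0.5, temperature=0.07):
    # regression term: pull student embeddings toward the frozen teacher space
    reg = smooth_l1(student_emb, teacher_emb)
    # contrastive term: align the student with paired text embeddings
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(logits))
    nce = -log_probs[diag, diag].mean()
    return alpha * reg + (1 - alpha) * nce
```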

4. Generative and Instruction-tuned Multimodal Models

Generative pre-training for Chinese VLP was advanced by ERNIE-ViLG, which unified text-to-image and image-to-text synthesis, enabling bidirectional decoders with strong semantic alignment; pre-training utilized 145 M Chinese image–text pairs and yielded state-of-the-art FID (7.9) on MS-COCO (Zhang et al., 2021). Instruction tuning emerged as a paradigm with Ziya-Visual (Lu et al., 2023): a BLIP-2–style Q-Former compression bridges frozen ViT encoders and a large bilingual LLM. Three-stage training (vision–language contrastive/generative loss, broad instruction-tuning, scene-aware instruction-tuning) leverages both GPT-4 translated and in-context generated Chinese instruction–response pairs to enable multi-turn multi-modal reasoning. LoRA adapters are selectively trained for efficient bilingual adaptation of both ViT and LLM subcomponents.
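The LoRA adapters mentioned above follow the usual low-rank-update recipe: the frozen weight W is augmented with a trainable update B @ A scaled by alpha / r, with B zero-initialized so training starts from the base model's behavior. The sketch below illustrates the mechanism and is not Ziya-Visual's actual implementation:

```python
# Minimal sketch of a LoRA-style low-rank adapter on a linear layer.
import numpy as np

class LoRALinear:
    def __init__(self, W, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                        # frozen base weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                     # zero-init: no-op at start
        self.scale = alpha / r

    def __call__(self, x):
        # frozen base path plus scaled low-rank residual update
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Only A and B (a few percent of the base parameters) are updated, which is what makes selective bilingual adaptation of both the ViT and LLM subcomponents cheap.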

In the video–language domain, VICTOR extends contrastive multimodal pre-training with complex proxy tasks (masked sentence order, masked frame order, intra- and inter-masked frame modeling) on Alivol-10M, integrating temporal/shuffling tasks to enhance sequential modeling for narrative video (Lei et al., 2021). Youku-mPLUG introduces modular LLM-based decoders (e.g., mPLUG-video frozen Bloomz) with minimal trainable parameters for video understanding, achieving SOTA on video categorization and captioning metrics (Xu et al., 2023).

5. Downstream Benchmarks and Evaluation Protocols

Chinese VLP models are systematically benchmarked on several public tasks:

  • Zero-shot image classification: Using translated CLIP-style prompts, models are evaluated on up to 20 Chinese-translated vision tasks (ImageNet, CIFAR, Caltech, EuroSAT, etc.) via top-1 accuracy (Yang et al., 2022, Gu et al., 2022, Shen et al., 15 Jan 2026).
  • Cross-modal retrieval: Datasets such as Flickr30K-CN, COCO-CN, MUGE, and AIC-ICC measure R@1/5/10 and mean recall in both text-to-image and image-to-text directions (Gu et al., 2022, Yang et al., 2022, Guo et al., 2024). Token-wise or reduced-token interaction techniques are directly reflected in retrieval gains.
  • Image captioning: Metrics include BLEU, METEOR, ROUGE-L, and CIDEr on COCO-CN, Flickr30K-CNA, and AIC-ICC (Yang et al., 2022, Zhang et al., 2021).
  • Video-language: Youku-mPLUG provides annotated splits for video classification, captioning, and retrieval with specific accuracy and CIDEr benchmarks (Xu et al., 2023).
  • Multimodal reasoning: LLM-based evaluation protocols plug vision encoders into Chinese LLMs (e.g., Qwen2; LLaVA-style evaluation) for advanced tasks such as VQA and conversation, with average accuracy as the metric (Shen et al., 15 Jan 2026, Lu et al., 2023).
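The R@1/5/10 and mean-recall metrics used in the retrieval benchmarks above can be computed from a similarity matrix as in this sketch, under the common convention that query i's ground-truth match is candidate i:

```python
# Sketch of retrieval metrics: Recall@K and mean recall over both directions.
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (num_queries, num_candidates); returns {k: R@k}."""
    order = np.argsort(-sim, axis=1)          # candidates ranked by similarity
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth index per query
    ranks = np.argmax(order == gt, axis=1)    # rank of the true match (0 = top)
    return {k: float((ranks < k).mean()) for k in ks}

def mean_recall(sim_i2t, sim_t2i, ks=(1, 5, 10)):
    """Average of R@1/5/10 across image-to-text and text-to-image retrieval."""
    scores = [recall_at_k(sim_i2t, ks), recall_at_k(sim_t2i, ks)]
    return float(np.mean([s[k] for s in scores for k in ks]))
```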

Zero-shot, fine-tuned, and instruction-response evaluations consistently show that data quality and semantic diversity, model scale, and proxy-task design each yield measurable improvements; direct SOTA comparisons are available across multiple public datasets.

6. Key Empirical Insights and Ablations

Empirical studies demonstrate that:

  • Scaling dataset size (Wukong 12M → 100M; BM-6B: 3B → 6B pairs) monotonically improves retrieval and classification, although saturation and noise sensitivity set in beyond 100–250M pairs unless filtering is aggressive (DanQing) (Shen et al., 15 Jan 2026, Guo et al., 2024).
  • Character-level Chinese tokenization is essential, outperforming word-level approaches by up to 3 percentage points in zero-shot retrieval/classification (Gu et al., 2022).
  • Two-stage LiT→contrastive tuning yields higher downstream scores and more stable convergence than fully end-to-end or single-stage training (Yang et al., 2022).
  • Localized token recovery (as in CMLM/CMIM in M2-Encoder) and sequential reordering tasks (as in VICTOR's MSOM/MFOM) are critical for fine-grained alignment and event-ordered video modeling (Guo et al., 2024, Lei et al., 2021).
  • Modular, frozen LLM decoders (e.g., mPLUG-video/Bloomz) enable efficient scaling (1.7% trainable parameters) at minimal cost to SOTA scores in video understanding (Xu et al., 2023).

Ablation experiments further substantiate that cross-modal contrastive alignment, sentence/frame order tasks, high negative-sample diversity, instruction-tuning, and LoRA/fine-tuning strategies all contribute distinct and additive benefits to Chinese vision-language performance (Lei et al., 2021, Lu et al., 2023).

7. Open Problems, Limitations, and Future Directions

Despite rapid progress, major challenges persist:

  • Dataset construction remains labor- and compute-intensive; keeping corpora up-to-date for evolving Chinese semantics demands frequent, scalable curation (e.g., DanQing’s focus on 2024–2025 content) (Shen et al., 15 Jan 2026).
  • Existing pre-training remains focused on image/video paired with one short textual description; broader modalities (audio, multi-sentence narrative, extended dialogues) are not integrated in current SOTA models (Lei et al., 2021).
  • Cultural and conceptual biases within the Chinese digital ecosystem require further diversification to ensure global generalization (Xu et al., 2023).
  • Balancing compute, throughput, and accuracy remains difficult: grouped-aggregation algorithms and efficient student–teacher distillation (as in DC-CLIP) address memory and deployment constraints, but further innovation is needed for sub-100M-parameter edge deployment (Zhang et al., 2024).
  • Opportunities for multilingual alignment using parallel corpora (M2-Encoder, Ziya-Visual) and exploring annotation-efficient instruction response generation (e.g., GPT-4 iterative prompting, bilingual CLIP filtering) remain open.

Prospective research directions include: extending to additional modalities (audio, speech transcripts), parameter-efficient architectures, continuous incremental learning for evolving semantics, unified retrieval-generation LLM pre-training, and multilingual corpus curation with aligned domain coverage for cross-lingual/Chinese-centric vision–LLMs (Lei et al., 2021, Lu et al., 2023, Shen et al., 15 Jan 2026, Guo et al., 2024, Xu et al., 2023).
