CogVideoX-5B: Open Text-to-Video Model
- CogVideoX-5B is a large-scale text-to-video diffusion model with 5B parameters that converts text prompts into temporally coherent videos.
- It employs an expert transformer architecture with modality-specific adaptive layer normalization and a 3D VAE for effective spatio-temporal compression.
- Extensions such as Direct Preference Optimization and block-sparse attention further improve preference alignment, training stability, and inference efficiency for high-quality video generation.
CogVideoX-5B is an open large-scale text-to-video diffusion model with approximately 5 billion parameters, designed for high-fidelity, temporally coherent video generation from open-ended text prompts. Architected around an expert transformer with a spatio-temporal 3D variational autoencoder and advanced data curation pipelines, CogVideoX-5B establishes new benchmarks for semantic alignment, motion continuity, and computational efficiency among generative video models. Its extensible architecture supports integration with preference optimization and block-sparse acceleration frameworks, as demonstrated by recent work on Direct Preference Optimization and efficient step distillation.
1. Architecture and Model Design
CogVideoX-5B is constructed on a transformer-based denoising diffusion backbone. The backbone comprises roughly 42 transformer blocks, each with 48 attention heads of dimension 64, for a model dimension of 3072 (48 × 64). Text and video tokens are processed jointly: text embeddings (from a T5 encoder in the 5B variant) are concatenated with video latent tokens produced by a 3D variational autoencoder (VAE).
A core architectural innovation is the “Expert Transformer” structure utilizing expert-adaptive LayerNorm (AdaLN). For each transformer block, separate layer normalization parameters are dynamically adapted for vision and text modalities via a two-layer MLP conditioned on the diffusion timestep. This enables context-dependent, modality-specific fusion throughout the network without explicit routing or gating, improving text-video alignment and generative fidelity.
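The modality-specific AdaLN mechanism can be illustrated with a minimal sketch; the function names (`expert_adaln`, `timestep_mlp`), the weight shapes, and the scale/shift parameterization are illustrative assumptions, not the exact CogVideoX implementation:

```python
import math

def layernorm(x, eps=1e-5):
    """Plain LayerNorm over one feature vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def timestep_mlp(t_emb, w1, w2):
    """Two-layer MLP (SiLU activation) mapping a diffusion-timestep
    embedding to modulation parameters, as in AdaLN conditioning."""
    h = [sum(w * t for w, t in zip(row, t_emb)) for row in w1]
    h = [v / (1.0 + math.exp(-v)) for v in h]  # SiLU(v) = v * sigmoid(v)
    return [sum(w * u for w, u in zip(row, h)) for row in w2]

def expert_adaln(tokens, modality, t_emb, params):
    """Apply modality-specific scale/shift: each modality ('text' or
    'video') owns its own MLP weights, so normalization is modulated
    per modality without any routing or gating network."""
    w1, w2 = params[modality]
    mod = timestep_mlp(t_emb, w1, w2)
    d = len(tokens[0])
    scale, shift = mod[:d], mod[d:2 * d]
    return [[(1.0 + s) * v + b
             for v, s, b in zip(layernorm(tok), scale, shift)]
            for tok in tokens]
```

Because each modality simply indexes its own MLP weights, the timestep embedding alone determines the modulation, which is why no explicit router is needed.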
Video latents in CogVideoX-5B are patchified into tokens such that a 480×720 clip at 8 fps yields a very long token sequence (on the order of 10^4 tokens), presenting significant memory and compute demands for self-attention. All transformer attention and feedforward layers process the concatenated textual and video tokens, facilitating deep cross-modal interaction.
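To make the sequence-length pressure concrete, here is a back-of-the-envelope token count under assumed compression factors (8×8 spatial, 4× temporal, 2×2 patchify; the 256× figure in the next section is consistent with 4×8×8). The function and the causal first-frame convention are illustrative assumptions:

```python
def latent_token_count(frames, height, width,
                       t_down=4, s_down=8, patch=2):
    """Estimate the video-token sequence length after 3D-VAE
    compression (t_down x s_down x s_down) and 2D patchification.
    Assumed convention: the causal VAE keeps the first frame and
    temporally downsamples the remainder."""
    latent_frames = 1 + (frames - 1) // t_down
    lat_h, lat_w = height // s_down, width // s_down
    return latent_frames * (lat_h // patch) * (lat_w // patch)

# Hypothetical example: a 49-frame 480x720 clip (about 6 s at 8 fps)
n = latent_token_count(frames=49, height=480, width=720)
```

Even after 256× latent compression, the example clip still maps to well over ten thousand tokens, which is what motivates the block-sparse attention discussed later.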
2. Spatio-Temporal Compression with 3D VAE
CogVideoX-5B employs a 3D causal VAE to encode raw RGB videos into compressed latents, achieving roughly 256× spatio-temporal compression (4× temporal, 8×8 spatial). The VAE is trained with:
- Pixel-space and LPIPS reconstruction losses,
- A spatio-temporal GAN loss with a 3D discriminator,
- KL regularization toward a unit Gaussian.
All convolutions are temporally causal (past-only padding), supporting autoregressive generation. Training is staged: the VAE is pre-trained on short, low-resolution clips and refined on longer, higher-resolution samples using context-parallel 3D convolutions.
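The past-only padding can be sketched in one dimension; this toy version uses zero padding and a scalar per-frame signal, which is an assumption for illustration rather than the model's 3D implementation:

```python
def causal_conv_time(frames, kernel):
    """Convolve a per-frame signal along time with past-only (causal)
    padding: output t depends only on frames <= t, so no information
    leaks backward from future frames."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # pad only the past side
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]
```

The kernel's last tap is aligned with the current frame, which is what makes frame-by-frame autoregressive decoding possible.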
3. Training Paradigms and Data Pipeline
The diffusion process follows a DDPM formulation with v-prediction: the loss minimizes the error between the model-predicted and ground-truth velocity targets. A zero-terminal-SNR (signal-to-noise ratio) noise schedule is used, so the final timestep corresponds to pure noise.
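A sketch of the v-prediction target and the standard zero-terminal-SNR rescaling (shift and scale the sqrt of the cumulative alphas so the last step is pure noise). This mirrors the common recipe, not CogVideoX's exact training code:

```python
import math

def v_target(x0, eps, alpha_t, sigma_t):
    """v-prediction target: v_t = alpha_t * eps - sigma_t * x0."""
    return [alpha_t * e - sigma_t * x for x, e in zip(x0, eps)]

def rescale_zero_terminal_snr(alphas_cumprod):
    """Rescale sqrt(alpha_bar) so the terminal timestep has zero SNR
    while the first timestep's value is preserved."""
    sqrt_ab = [math.sqrt(a) for a in alphas_cumprod]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    shifted = [s - last for s in sqrt_ab]           # terminal value -> 0
    scaled = [s * first / (first - last) for s in shifted]  # keep first value
    return [s * s for s in scaled]
```

With `sigma_t = 0` the target reduces to the noise itself, and with `alpha_t = 0` it reduces to the (negated) clean sample, which is why v-prediction stays well-conditioned at both ends of the schedule.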
Progressive, multi-resolution “frame-pack” training batches videos of diverse lengths by padding clipped sequences, allowing simultaneous training on 1–32-frame videos. Resolution progresses in three stages: low-res (360×360, 4 fps), high-res (720×480, 8 fps), and fine-tuning on a 20% subset of the cleanest data. The video captioning and semantic filtering pipeline includes multi-label Video-LLaMA classifiers for negative tag filtering, dense optical flow and aesthetics-based filtering, and hybrid captioning via both CogVLM and Llama-2/3 variants.
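The frame-pack idea (batching clips of different lengths by clipping and padding, with a mask marking real frames) can be sketched as follows; `frame_pack` and its signature are illustrative assumptions:

```python
def frame_pack(batch, bucket_len=32, pad_frame=None):
    """Pad (or clip) variable-length clips to a common frame count so
    clips of 1 to bucket_len frames can share one training batch.
    Returns the packed clips plus a 0/1 mask marking real frames."""
    packed, masks = [], []
    for clip in batch:
        clip = list(clip)[:bucket_len]
        pad = bucket_len - len(clip)
        packed.append(clip + [pad_frame] * pad)
        masks.append([1] * len(clip) + [0] * pad)
    return packed, masks
```

The mask lets the loss ignore padded positions, so short and long clips contribute gradients only from their real frames.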
Dense video-caption pairs are created using short initial captions (Panda-70M), dense framewise recaps (CogVLM), GPT-4 summarization, and large-scale imitation learning to produce coherent, instructional prompt–video pairs.
4. Preference Optimization and Preference Alignment
Preference learning for CogVideoX-5B leverages Direct Preference Optimization with Clipping & Masking (DPO-C&M) and Importance-Sampled Direct Preference Optimization (SDPO) (2505.21893). Optimization operates over human-ranked video pairs, incorporating timestep-aware importance weights:
- DPO-C&M introduces per-timestep importance weighting with symmetric clipping,
- SDPO applies off-policy correction via per-timestep importance ratios, with stepwise inversion and clipping, to correct the drift between the model and the data-collection policy.
Key findings:
- Preference signal is strongest at intermediate timesteps; early (high-noise) steps are automatically down-weighted by SDPO.
- SDPO achieves the highest VBench total (82.28), quality (83.37), and semantic (77.91) scores for CogVideoX-5B, outperforming both Diffusion-DPO and DPO-C&M.
- The SDPO framework is stable across training hyperparameters, aligns robustly with preferences, and mitigates overfitting to noisy offline preference labels. No dedicated 5B human-preference study was conducted, but results on the 2B variant are reported to generalize.
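The clipped, importance-weighted structure shared by DPO-C&M and SDPO can be sketched schematically; the exact SDPO objective is in the cited paper, and this sketch (with hypothetical names `clipped_dpo_loss`, `delta_win`, `delta_lose`) only shows the clipped-ratio-times-margin pattern:

```python
import math

def clipped_dpo_loss(delta_win, delta_lose, ratio, beta=0.1, clip=2.0):
    """Schematic preference loss: the DPO logistic loss on the
    win/lose log-probability margins, scaled by a per-timestep
    importance ratio that is symmetrically clipped to [1/clip, clip]
    so off-policy corrections stay bounded."""
    w = min(max(ratio, 1.0 / clip), clip)
    margin = beta * w * (delta_win - delta_lose)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The clipping is what keeps high-noise timesteps (where importance ratios can explode) from dominating the gradient.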
5. Block-Sparse Acceleration and Efficient Distillation
To address the quadratic cost of self-attention over long video token sequences, CogVideoX-5B supports efficient acceleration via Adaptive Block-Sparse Attention (ASA) and step distillation within the Video-BLADE framework (Gu et al., 14 Aug 2025). ASA dynamically prunes uninformative attention blocks by:
- Reordering tokens along a 3D Gilbert curve,
- Partitioning the reordered tokens into fixed-size blocks,
- Generating a per-layer binary attention mask via two-stage sampling and threshold-based block selection.
Content-aware masks are constructed per-attention layer. For inference, ASA can be directly applied to pre-trained weights, providing a 4–6× speedup; when combined with step distillation via Trajectory Distribution Matching (TDM), the model achieves:
- 82% sparsity (i.e., only 18% of block pairs are attended),
- 8.89× inference acceleration (teacher: 50 steps, student: 8 steps),
- VBench-2.0 quality improvement from 0.534 (baseline) to 0.569 (ASA_GT),
- A human preference win rate (16/50 vs. 10/50) that matches or slightly exceeds the dense teacher,
- Physics subscore improves (0.618 vs. 0.539).
No video or paired data is needed for step distillation: only text prompts and teacher outputs are required. All weights except the self-attention layers remain frozen.
| Metric | Dense baseline (50-step teacher) | ASA + TDM (8-step BLADE student) |
|---|---|---|
| VBench-2.0 total score | 0.534 | 0.569 |
| Physics subscore | 0.539 | 0.618 |
| Human preference wins (of 50) | 10 | 16 |
| Inference speedup | 1× | 8.89× |
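The block-selection step of ASA can be sketched as a top-k over block-pair scores; Gilbert-curve reordering and the two-stage sampling are omitted, and `block_sparse_mask` with its precomputed score matrix is an illustrative assumption:

```python
def block_sparse_mask(block_scores, keep_ratio=0.18):
    """Build a binary attention mask over block pairs: rank
    query-key block pairs by a cheap relevance score and keep only
    the top fraction (e.g. 18% attended -> 82% sparsity)."""
    n = len(block_scores)
    flat = [(block_scores[i][j], i, j) for i in range(n) for j in range(n)]
    flat.sort(reverse=True)  # highest-scoring block pairs first
    keep = max(1, int(round(keep_ratio * n * n)))
    mask = [[0] * n for _ in range(n)]
    for _, i, j in flat[:keep]:
        mask[i][j] = 1
    return mask
```

Because the mask is recomputed per layer from content-dependent scores, sparsity adapts to each prompt rather than following a fixed pattern.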
6. Empirical Performance and Evaluation
On standard automated and human benchmarks CogVideoX-5B demonstrates state-of-the-art performance for open text-to-video models (Yang et al., 2024):
- Automated VBench metrics (held-out prompts): dynamic quality, semantic alignment, and multi-object scene handling metrics are highest among open models.
- Human evaluation (100 prompts, 5-point scale, vs. Kling): highest total rating (2.74 vs. 2.17), with notable advantages in sensory quality, instruction following, and physical simulation.
- VBench submetrics (dense baseline, not the sparse-distilled variant): HumanAct: 96.8, Scene: 55.44, DynamicDegree: 62.22, MultiObjects: 70.95, Appear.Style: 24.44, DynamicQuality: 69.5, GPT4o-MTScore: 3.36.
7. Limitations and Prospects
CogVideoX-5B’s limitations include its computational overhead in dense mode, reliance on high-quality, dense preference pairs for optimal SDPO performance, and the need for high-volume, high-quality text-captioned video data. The SDPO framework does not perfectly account for model drift in non-stationary settings, and block-sparse distillation, while data-free, introduces complexity into deployment scheduling.
Future research directions include adaptive clipping in SDPO, online preference integration, hybrid supervised/RL fine-tuning, more efficient sparsity mask selection, and extending SDPO and BLADE to audio-video or multimodal settings.
References
- "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" (Yang et al., 2024)
- "SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training" (2505.21893)
- "Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation" (Gu et al., 14 Aug 2025)