
Text-to-Video Generation Models

Updated 9 February 2026
  • Text-to-video generation models are deep generative systems that create semantically faithful and temporally coherent videos from natural language input.
  • They leverage diverse architectures such as GANs, VAEs, transformers, and diffusion frameworks to balance spatial fidelity and motion dynamics.
  • Evaluation combines automated metrics like FVD and CLIP-Similarity with human studies, while ongoing challenges include long-range consistency and computational efficiency.

Text-to-video (T2V) generation models are deep generative systems designed to synthesize temporally coherent and semantically faithful videos conditioned on natural language input. These models integrate advancements from text-to-image generation, video diffusion, and LLMs, and represent a central area in multimodal AI research. Modern T2V frameworks address a combination of spatial fidelity, temporal consistency, compositionality, motion dynamics, and alignment with user prompts. This article surveys the state-of-the-art in T2V generation, with emphasis on architectural paradigms, algorithmic strategies, evaluation, and the principal technical challenges.

1. Architectural Paradigms and Evolution

The development of T2V models has evolved through multiple computational architectures:

  • GAN-Based Models: Early approaches (e.g., MoCoGAN, NÜWA) combine adversarial video discriminators with conditional generative networks to enforce realism in both spatial and temporal domains. GANs operate by mapping text embeddings and sampled noise vectors to video frames using RNNs or 3D convolutions, with discriminators enforcing semantic and temporal consistency. GAN losses typically combine frame-level and clip-level discrimination, but these models are limited by mode collapse and suboptimal temporal coherence (Kumar et al., 6 Oct 2025).
  • VAE and Transformer Autoregressive Models: VQ-VAE compression is leveraged in systems such as VideoGPT and GODIVA, enabling autoregressive transformers to operate on discrete video tokens. Transformers predict sequences of frame or patch codes conditioned on text, yielding strong temporal coherence at the cost of slow inference and quantization artifacts (Kumar et al., 6 Oct 2025, Singh, 2023).
  • Diffusion-Based Frameworks: State-of-the-art T2V generation has shifted toward diffusion models. These systems extend text-to-image diffusion backbones by introducing temporal modules—inflated convolutions, temporal self-attention, or explicit spatio-temporal factorizations. Cascaded architectures generate low-resolution, short video segments first, followed by spatial and temporal super-resolution refinements (e.g., Make-A-Video, Imagen Video) (Kumar et al., 6 Oct 2025, Singh, 2023, Li et al., 2023).
  • Hybrid and Modular Designs: Recent models compose pre-trained text-to-image generators with temporal video modules (e.g., EVS, I4VGen, Factorized Video Generation), decoupling content creation from motion synthesis. These factorized pipelines enable high-fidelity frame synthesis while relegating motion modeling to downstream stages (Su et al., 18 Jul 2025, Guo et al., 2024, Girdhar et al., 2023, Hassan et al., 18 Dec 2025).
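The factorized spatio-temporal designs described above can be sketched in a few lines. Below is a minimal, illustrative NumPy sketch (single head, identity projections; the function names are my own, not from any cited model) of the spatial-then-temporal attention ordering used when inflating an image diffusion backbone for video:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head self-attention with identity Q/K/V projections (illustration only).
    # x: (..., seq, dim)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_spatiotemporal_attention(video):
    # video: (T, S, D) -- T frames, S spatial tokens per frame, D channels.
    # 1) Spatial attention: each frame attends over its own tokens.
    x = attention(video)          # (T, S, D)
    # 2) Temporal attention: each spatial location attends across frames.
    x = np.swapaxes(x, 0, 1)      # (S, T, D)
    x = attention(x)
    return np.swapaxes(x, 0, 1)   # back to (T, S, D)

T, S, D = 4, 16, 8
out = factorized_spatiotemporal_attention(np.random.default_rng(0).normal(size=(T, S, D)))
print(out.shape)  # (4, 16, 8)
```

Factorizing attention this way costs O(T·S²) + O(S·T²) rather than O((T·S)²) for full spatio-temporal attention, which is what makes inflation of image backbones tractable.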

2. Temporal Modeling, Conditioning, and Compositionality

Effective temporal modeling is critical for realistic video generation and is approached through diverse mechanisms:

  • Temporal Module Design: Inflated pseudo-3D convolutions and temporal attention layers are used to inject temporal priors into latent diffusion U-Nets or transformers (Wu et al., 2022, Tian et al., 2024, Li et al., 2023). Sparse attention schemes (as in Tune-A-Video) reduce computational cost while enabling both long-range consistency and local smoothness.
  • Compositional Attention and Scene Parsing: Models such as MOVAI and VideoTetris explicitly decompose prompts into scene graphs or spatially-masked sub-prompts, enabling compositional diffusion where the network attends to discrete objects, regions, and actions with fine spatial and temporal granularity (Patel, 30 Oct 2025, Tian et al., 2024). Reference Frame Attention and dynamic scene syntax from LLMs further regularize identity and motion over long sequences.
  • Image Conditioning and Anchoring: Several frameworks (Emu Video, Factorized Video Generation, VideoGen, I4VGen) begin with an explicit anchor image, generated either directly from a rewritten prompt or as a first frame, which the subsequent video model is conditioned on. This decoupling allows for high-quality initial scene layout and more precise motion synthesis (Girdhar et al., 2023, Hassan et al., 18 Dec 2025, Guo et al., 2024, Li et al., 2023).
  • Prompt Engineering and Optimization: The gap between training and deployment text (dense captions vs. user queries) motivates prompt optimization techniques, such as the VPO framework. Here, multi-stage supervised fine-tuning and preference optimization enforce harmlessness, accuracy, and helpfulness, enhancing both safety and alignment with end-user intent (Cheng et al., 26 Mar 2025).
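The image-anchoring idea can be made concrete. One common conditioning mechanism (a hypothetical sketch, not the exact interface of Emu Video or any cited system) is to broadcast the clean anchor-frame latent across time and concatenate it channel-wise with the noisy latents fed to the video denoiser:

```python
import numpy as np

def anchor_conditioned_inputs(noisy_latents, anchor_latent):
    """Concatenate a clean anchor-frame latent onto every noisy frame latent.

    noisy_latents: (T, C, H, W) noisy video latents at some diffusion step.
    anchor_latent: (C, H, W) clean latent of the generated first frame.
    Returns (T, 2C, H, W): the denoiser sees the anchor at every timestep.
    """
    T = noisy_latents.shape[0]
    anchor = np.broadcast_to(anchor_latent[None], (T, *anchor_latent.shape))
    return np.concatenate([noisy_latents, anchor], axis=1)

cond = anchor_conditioned_inputs(np.random.default_rng(0).normal(size=(8, 4, 16, 16)),
                                 np.random.default_rng(1).normal(size=(4, 16, 16)))
print(cond.shape)  # (8, 8, 16, 16)
```

Because the anchor is identical at every timestep, the denoiser can devote its capacity to motion rather than re-deriving scene layout, which is the decoupling these factorized pipelines exploit.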

3. Training Protocols, Continual and Zero-Shot Learning

Training regimes are adapted to the rapidly evolving landscape of data availability and generative needs:

  • Large-Scale Data and Multi-Stage Training: Leading models are pre-trained on tens to hundreds of millions of text–video pairs (WebVid-10M, HD-VILA-100M) or utilize synthetic video sources. Curricula progress from low spatial and temporal resolution to high-resolution, high-fidelity outputs (Girdhar et al., 2023, Li et al., 2023).
  • Continual Learning and Generative Replay: Instead of retraining from scratch upon new data, continual learning models (VidCLearn) employ student–teacher distillation and temporal consistency losses, with structural priors retrieved from previously seen videos guiding inference (Zanchetta et al., 21 Sep 2025).
  • Zero-Shot and Training-Free Methods: Models such as Text2Video-Zero and FlowZero leverage pre-trained text-to-image diffusion backbones, repurposed by injecting cross-frame/motion mechanics and LLM-driven semantic layouts, to achieve end-to-end zero-shot video generation without additional training (Khachatryan et al., 2023, Lu et al., 2023). Latent-based compositional and reference-guided methods (EVS, I4VGen, MEVG) further extend plug-and-play capabilities (Su et al., 18 Jul 2025, Guo et al., 2024, Oh et al., 2023).
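The cross-frame mechanics behind these zero-shot methods amount to a one-line change to self-attention: every frame queries the keys and values of the first frame, anchoring appearance across time without any video training. A minimal NumPy sketch (single head, no learned projections; in Text2Video-Zero a variant of this replaces self-attention inside a pretrained image diffusion model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    # q, k, v: (T, S, D) per-frame queries, keys, and values.
    # Zero-shot trick: all frames attend to the FIRST frame's keys/values,
    # so object appearance stays consistent across the clip.
    k0 = np.broadcast_to(k[:1], k.shape)
    v0 = np.broadcast_to(v[:1], v.shape)
    scores = q @ np.swapaxes(k0, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v0

T, S, D = 6, 16, 8
rng = np.random.default_rng(0)
out = cross_frame_attention(rng.normal(size=(T, S, D)),
                            rng.normal(size=(T, S, D)),
                            rng.normal(size=(T, S, D)))
print(out.shape)  # (6, 16, 8)
```

Every output token is a convex combination of the first frame's value vectors, which is precisely why identity drifts less than with independent per-frame self-attention.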

4. Quantitative and Qualitative Evaluation

Evaluation of T2V models encompasses several axes:

  • Automated Metrics: Fréchet Video Distance (FVD) measures the distributional realism of generated clips, while CLIP-Similarity scores semantic alignment between prompts and generated frames (Kumar et al., 6 Oct 2025).
  • Human Studies: Perceptual quality, temporal smoothness, and prompt faithfulness are assessed through side-by-side preference judgments, complementing automated metrics where they diverge from human perception.
  • Benchmarks: Multidimensional suites such as VBench and MonetBench combine domain-specific metrics with human judgment for robust model comparison (Cheng et al., 26 Mar 2025, Kumar et al., 6 Oct 2025).
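FVD, the standard automated metric, is the Fréchet distance between Gaussians fitted to features of real and generated videos (in practice, features from a pretrained video encoder such as I3D). A self-contained NumPy sketch of the distance itself, assuming features have already been extracted:

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    # feats_*: (N, D) arrays of per-video features from a pretrained encoder.
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)

    def psd_sqrt(m):
        # Square root of a symmetric PSD matrix via eigendecomposition.
        w, v = np.linalg.eigh(m)
        return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

    # tr(sqrt(s1 s2)) computed via the symmetric form sqrt(s1^{1/2} s2 s1^{1/2}).
    s1h = psd_sqrt(s1)
    tr_covmean = np.trace(psd_sqrt(s1h @ s2 @ s1h))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 8))
fvd_same = frechet_distance(real, real)        # ~0 for identical feature sets
fvd_diff = frechet_distance(real, real + 2.0)  # grows with the mean shift
print(fvd_same, fvd_diff)
```

Lower is better; because the score depends entirely on the feature extractor, FVD comparisons are only meaningful between models evaluated with the same encoder and sample count.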

5. Computational Efficiency, Scaling, and Limitations

State-of-the-art T2V frameworks employ several strategies to balance computational tractability and video length/quality:

  • Grid and Latent Compression: Grid Diffusion Models represent longer videos as spatial grids, reducing temporal memory cost to that of a single image (Lee et al., 2024). Latent-space diffusion further sidesteps pixel-space computation, facilitating scalable inference for high-definition outputs (Li et al., 2023).
  • Plug-and-Play and Modular Composition: Encapsulated approaches (EVS, I4VGen) and control modules (e.g., ControlNet) allow flexible recombination of image and video-specific priors at inference, enabling rapid adaptation to new styles, conditions, and guidance signals without retraining (Su et al., 18 Jul 2025, Guo et al., 2024, Khachatryan et al., 2023).
  • Sampling Efficiency and Anchoring: Visual anchoring (FVG, Emu Video) substantially reduces the number of diffusion steps needed for high-quality samples; FVG remains robust to a 70% reduction in steps owing to the increased sampling stability provided by a clean initial frame (Hassan et al., 18 Dec 2025, Girdhar et al., 2023).
  • Limitations: Despite progress, challenges remain in generating long videos with consistent identity/background, computational demands for multi-object compositionality, and sensitivity to the quality of prompt-to-scene parsing (Kumar et al., 6 Oct 2025, Hassan et al., 18 Dec 2025, Tian et al., 2024, Zanchetta et al., 21 Sep 2025, Oh et al., 2023).
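The grid-compression trick above can be made concrete: T frames are tiled into one large image so an image diffusion model processes the whole clip at the memory cost of a single picture. A minimal round-trip sketch (layout and function names are illustrative, not the Grid Diffusion Models API):

```python
import numpy as np

def frames_to_grid(video, rows, cols):
    # video: (T, H, W, C) with T == rows * cols -> one (rows*H, cols*W, C) image.
    T, H, W, C = video.shape
    assert T == rows * cols
    return (video.reshape(rows, cols, H, W, C)
                 .transpose(0, 2, 1, 3, 4)      # (rows, H, cols, W, C)
                 .reshape(rows * H, cols * W, C))

def grid_to_frames(grid, rows, cols):
    # Inverse: split the tiled image back into (rows*cols, H, W, C) frames.
    GH, GW, C = grid.shape
    H, W = GH // rows, GW // cols
    return (grid.reshape(rows, H, cols, W, C)
                .transpose(0, 2, 1, 3, 4)       # (rows, cols, H, W, C)
                .reshape(rows * cols, H, W, C))

video = np.random.default_rng(0).normal(size=(4, 8, 8, 3))
grid = frames_to_grid(video, 2, 2)              # (16, 16, 3)
restored = grid_to_frames(grid, 2, 2)
print(np.allclose(restored, video))             # True
```

The round trip is lossless; the modeling cost is that temporal structure must now be captured implicitly through spatial attention across grid cells.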

6. Open Challenges and Research Directions

Key issues persist and inform the next generation of T2V systems:

  • Long-Range Temporal Consistency: Overcoming inconsistencies in object tracking and motion artifacts remains central. Hierarchical, memory-efficient, and grounded temporal modeling are areas of active research (Kumar et al., 6 Oct 2025, Tian et al., 2024).
  • Scalability and Efficient Training: Building on synthetic video datasets, leveraging game-engine data, and distilling spatio-temporal diffusion into lightweight modules are expected to further narrow the compute gap (Kumar et al., 6 Oct 2025, Guo et al., 2024).
  • Compositionality and Control: Finer disentanglement of object, motion, and style codes, explicit scene graph conditioning (CSP, compositional diffusion), and plug-and-play editing interfaces are critical to practical deployment in creative workflows (Patel, 30 Oct 2025, Tian et al., 2024).
  • Evaluation and Benchmarking: The field is migrating toward multidimensional, perception-aligned benchmarks such as VBench and MonetBench, combining domain-specific metrics with human judgment for robust model comparison (Cheng et al., 26 Mar 2025, Kumar et al., 6 Oct 2025).
  • Safety, Alignment, and Prompt Optimization: Models like VPO underscore the need for integrated prompt filtering, descriptiveness, and harmful-content mitigation, alongside automated feedback from both text and generated video (Cheng et al., 26 Mar 2025).
  • Multimodal and Continual Learning: Video generation will increasingly integrate audio, depth, and other modalities, with continual learning strategies mitigating catastrophic forgetting when updating with new content (Zanchetta et al., 21 Sep 2025).

Advancements in text-to-video generation models are rapidly expanding their applicability and realism, with diffusion-based and compositional paradigms currently leading the field in terms of perceptual quality, temporal fidelity, and alignment with user intent (Hassan et al., 18 Dec 2025, Patel, 30 Oct 2025, Li et al., 2023, Tian et al., 2024, Kumar et al., 6 Oct 2025).
