
Diffusion-Based Video Generation

Updated 30 January 2026
  • Diffusion-based video generation is a probabilistic method that extends denoising diffusion models from images to create high-fidelity, temporally coherent videos.
  • It employs a forward noise-adding process and a learned reverse denoising procedure using neural networks like 3D U-Nets and transformers to capture spatial and temporal dynamics.
  • Innovative conditioning and architectural enhancements enable fine-grained control for applications in text-to-video, video editing, and high-resolution interpolation.

Diffusion-based video generation comprises a class of probabilistic generative models that extend denoising diffusion processes, originally proposed for images, to the synthesis of temporally coherent, high-fidelity videos. Such models learn to reverse a gradual noising process in a high-dimensional video space (pixel or latent), yielding samples from data distributions that capture both spatial structure and rich temporal dynamics. This paradigm has produced state-of-the-art results across unconditional video synthesis, conditional generation (e.g., text-to-video, image-to-video), spatiotemporal interpolation, and controllable long-form video creation.

1. Foundational Principles and Mathematical Formulation

Diffusion-based video generation is grounded in denoising diffusion probabilistic models (DDPMs), which define a forward Markov chain that perturbs data by incrementally adding Gaussian noise, and a learned reverse process that denoises in finitely many steps. Given data $\mathbf{x}_0$ (a video clip), the forward process is

$$q(\mathbf{x}_{1:T}\mid\mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t\mid\mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) = \mathcal{N}\!\left(\sqrt{\alpha_t}\,\mathbf{x}_{t-1},\,(1-\alpha_t)\mathbf{I}\right),$$

with $\alpha_t$ a predefined noise schedule. The reverse process is parameterized by neural networks (typically U-Nets or transformers) that predict either the noise or the posterior mean at each timestep. Training uses the simplified denoising loss

$$\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\right)\right\|^2\right],$$

where $\epsilon_\theta$ predicts the injected noise and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, so that the learned transition densities match the true reverse process (Wang et al., 22 Apr 2025).
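The closed-form forward sample and the simplified loss can be sketched in a few lines of NumPy. The linear beta schedule and the toy tensor shapes below are illustrative assumptions, not values taken from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (an assumed example); alpha_t = 1 - beta_t,
# abar_t = prod_{s<=t} alpha_s.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def simplified_loss(eps_pred, eps):
    """Simplified denoising objective: MSE between true and predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

# A tiny stand-in "video": 4 frames of 8x8 grayscale.
x0 = rng.standard_normal((4, 8, 8))
t = 500
eps = rng.standard_normal(x0.shape)
xt = q_sample(x0, t, eps)

# A perfect noise predictor drives the loss to zero.
assert simplified_loss(eps, eps) == 0.0
```

Note that `alpha_bars` decreases monotonically toward zero, so `q_sample` interpolates from clean data at small `t` to nearly pure noise at `t = T - 1`.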

Extensions to videos introduce additional challenges: temporal coherence, memory and compute scaling with sequence length, and modeling high-dimensional spatiotemporal statistics. Solutions include 3D convolutions, temporal attention, structured state-spaces, and latent-space diffusion (Oshima et al., 2024, Hu et al., 2023).

2. Architectural and Algorithmic Innovations

Three main algorithmic strategies have become dominant:

a. Spatiotemporal and Latent Backbones:

Early approaches adapted image U-Nets to video by adding 3D convolutions or interleaved spatial (2D conv/attention) and temporal (1D conv, temporal attention, state-space models) modules (Wang et al., 22 Apr 2025, Oshima et al., 2024). Architectures include 3D U-Nets for dense modeling (VDM (Yang et al., 2022)), factorized temporal blocks (e.g., Video-LDM), or diffusion transformers (DiT) in latent space (e.g., CogVideo, Sora) (Yuan et al., 16 Apr 2025). Latent diffusion further encodes each frame (or entire clips) via VAEs/VQGANs, reducing compute without sacrificing spatial detail (Hu et al., 2023, Lang et al., 2024).
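The factorized spatial/temporal pattern above can be illustrated with a minimal sketch: spatial attention runs within each frame, then temporal attention runs across frames at each spatial location. This toy version uses a single head, no learned projections, and assumed shapes, so it only demonstrates the reshaping logic, not a real backbone:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention over the token axis (no projections)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_spatiotemporal_attention(video):
    """Spatial attention per frame, then temporal attention per pixel."""
    B, F, H, W, C = video.shape
    # Spatial: fold frames into the batch, attend over H*W tokens per frame.
    x = video.reshape(B * F, H * W, C)
    x = self_attention(x).reshape(B, F, H, W, C)
    # Temporal: fold spatial locations into the batch, attend over F tokens.
    x = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, F, C)
    x = self_attention(x).reshape(B, H, W, F, C).transpose(0, 3, 1, 2, 4)
    return x

video = np.random.default_rng(0).standard_normal((2, 4, 8, 8, 16))
out = factorized_spatiotemporal_attention(video)
assert out.shape == video.shape
```

The appeal of this factorization is cost: attention is quadratic in token count, so splitting a joint pass over F·H·W tokens into separate H·W and F passes is far cheaper for long clips.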

b. Decomposition of Content and Motion:

Recent developments emphasize explicit separation of temporally static ("common") signals and frame-specific ("unique") or dynamic content. COMUNI (Sun et al., 2024) performs self-supervised decomposition via a CU-VAE into common and unique latents, then samples videos by holding the common code fixed and applying diffusion to the unique ones. Similar latent factorization schemes are present in LaMD (Hu et al., 2023), where content is encoded with a 2D U-Net and only motion is processed through diffusion over a bottleneck latent.

c. Advanced Conditioning and Control:

Methods such as DreaMoving (Feng et al., 2023) and DaS (Gu et al., 7 Jan 2025) provide fine-grained video control via pose, depth, identity, or even 3D trajectory input, integrating these as side-branches or ControlNet modules to restrict the diffusion process to physically plausible or user-intended motions. S2DM (Lang et al., 2024) introduces sector-shaped diffusion, jointly conditioning all frames on a shared stochastic "apex" while allowing distinct temporal variations guided by per-frame optical flow features.
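The ControlNet-style side branch mentioned above can be sketched as a zero-initialized residual projection. The shapes, names, and the dense-projection simplification are assumptions for illustration; actual ControlNet modules are convolutional copies of backbone blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def control_branch(features, control_signal, W_zero):
    """ControlNet-style side branch: the encoded control signal (pose, depth,
    trajectory maps, ...) is projected through a zero-initialized layer and
    added residually, so at the start of fine-tuning the branch leaves the
    pretrained backbone's behavior unchanged."""
    return features + control_signal @ W_zero

C = 16
features = rng.standard_normal((10, C))  # backbone activations
control = rng.standard_normal((10, C))   # encoded control signal
W_zero = np.zeros((C, C))                # zero-initialized projection

# With zero-initialized weights the output equals the unmodified features.
assert np.array_equal(control_branch(features, control, W_zero), features)
```

As `W_zero` is trained away from zero, the control signal gradually steers the denoising trajectory without destabilizing the pretrained weights.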

3. Application Domains and Conditioning Modalities

Diffusion-based video generation now covers a spectrum of conditional and unconditional settings, including:

  • Text-to-video and Image-to-video: Text embeddings (e.g., CLIP, T5) or reference images provide semantic anchors, with conditioning applied via cross-attention throughout the diffusion backbone (Wang et al., 22 Apr 2025, Kim et al., 5 Feb 2025, Liang et al., 2023).
  • Physical/3D-Aware Generation: Diffusion4D (Liang et al., 2024) migrates temporal consistency to full 4D generation by incorporating explicit spatial and temporal motion embeddings and classifier-free guidance.
  • Control, Editing, and Manipulation: VideoControlNet (Hu et al., 2023) integrates motion information, I/P/B frame decomposition, and diffusion-driven inpainting for precise, temporally consistent video-to-video translation. DaS (Gu et al., 7 Jan 2025) uses 3D tracking signals for mesh-to-video, camera control, and object manipulation.
  • Interpolation, Super-resolution, and High-FPS: VIDIM (Jain et al., 2024) and DiffuseSlide (Hwang et al., 2 Jun 2025) apply diffusion for high-quality video interpolation and frame-rate upscaling, utilizing cascaded models and training-free sliding-window denoising.
  • Representation Learning: Divot (Ge et al., 2024) leverages diffusion as both a feature tokenizer and detokenizer for unified video comprehension and generation within LLM frameworks.
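The cross-attention conditioning used in text-to-video and image-to-video settings can be sketched as follows: queries come from the video tokens, keys and values from the conditioning sequence. Token counts and dimensions below are arbitrary assumptions, and learned projections are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, text_tokens):
    """Video tokens attend to conditioning tokens: queries from the video,
    keys/values from the text-embedding sequence (e.g., CLIP or T5 outputs)."""
    d = video_tokens.shape[-1]
    scores = video_tokens @ text_tokens.T / np.sqrt(d)  # (Nv, Nt)
    return softmax(scores) @ text_tokens                # (Nv, d)

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((64, 32))  # flattened spatiotemporal tokens
text_tokens = rng.standard_normal((8, 32))    # conditioning token embeddings
out = cross_attention(video_tokens, text_tokens)
assert out.shape == (64, 32)
```

Because every spatiotemporal token attends to the same conditioning sequence, semantic guidance is applied uniformly across frames, which helps keep the generated content consistent with the prompt over time.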

4. Optimization, Sampling, and Computational Efficiency

Efficiency in both training and sampling is a fundamental concern. Notable advances include:

  • Classifier-Free Guidance (CFG): Used pervasively, combining unconditional and conditional models to improve conditioning fidelity at generation time (Wang et al., 22 Apr 2025).
  • Accelerated Samplers: DDIM, DPM-Solver, and PLMS reduce steps by solving a deterministic probability-flow ODE; DiffuseSlide introduces sliding-window latent denoising and noise re-injection for training-free high FPS generation (Hwang et al., 2 Jun 2025).
  • Distillation and Step Reduction: AVDM2 distills multi-step diffusion teachers into few-step student models, using adversarial (GAN) and score distribution matching losses for high-quality four-step video synthesis (Zhu et al., 2024).
  • Memory/Latency Tactics: State-space models (SSM) (Oshima et al., 2024) provide linear scaling for long-term video, while token merging, dynamic loading, and on-device methods (e.g., On-Device Sora (Kim et al., 5 Feb 2025)) enable practical video diffusion on commodity hardware.
  • Temporal Non-uniformity: VGDFR (Yuan et al., 16 Apr 2025) exploits frame-wise motion variability by dynamically reducing latent sequence length in low-motion intervals, optimizing inference speed without perceptible loss.
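Two of the mechanisms above, classifier-free guidance and a deterministic DDIM update, can be sketched together. The schedule, guidance scale, and dummy noise predictions are illustrative assumptions; a real sampler would query a trained network for `eps`:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(xt, eps, t, t_prev):
    """Deterministic DDIM update (eta = 0): estimate x0 from the noise
    prediction, then re-noise it to the earlier timestep t_prev."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (xt - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))  # toy latent "video"
# One large deterministic jump (t=999 -> t=749) with dummy predictions.
eps_u, eps_c = rng.standard_normal(x.shape), rng.standard_normal(x.shape)
eps_hat = cfg(eps_u, eps_c, guidance_scale=7.5)
x = ddim_step(x, eps_hat, t=999, t_prev=749)
assert x.shape == (4, 8, 8)
```

With `guidance_scale = 1` CFG reduces to the plain conditional prediction; larger scales trade diversity for prompt fidelity. The large timestep jump is what lets DDIM-style samplers cut hundreds of denoising steps down to a few dozen.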

5. Quantitative Evaluation and Benchmarks

Standard metrics in diffusion-based video models include:

  • Fréchet Video Distance (FVD): Assesses joint spatial and temporal realism via statistics on deep video embeddings (Wang et al., 22 Apr 2025, Hu et al., 2023, Liang et al., 2023).
  • LPIPS / CLIPSim / tFID: Evaluate perceptual similarity, prompt–video alignment, or temporally pooled frame quality, often needed to supplement FVD due to its sensitivity to sample size and embedding space.
  • User Studies and Preference Rates: Human raters frequently prefer diffusion-generated outputs for temporal coherence and structural realism (e.g., 74.7% for VideoControlNet over baselines (Hu et al., 2023); 54% for Diffusion4D in 4D asset overall preference (Liang et al., 2024)).
  • Specialized Scores: CRPS for probabilistic prediction, motion-fidelity metrics (e.g., warping errors), and domain-specific evaluations (geometry consistency, aesthetic, flicker, or frame rate).
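The core computation behind FVD is a Fréchet distance between Gaussian statistics of real and generated video embeddings. The sketch below uses a diagonal-covariance simplification and random stand-in embeddings for illustration; actual FVD fits full covariance matrices to embeddings from a pretrained video network (typically I3D):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances:
    FD = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

def stats(x):
    return x.mean(axis=0), x.var(axis=0)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 32))        # stand-in embeddings of real videos
fake = rng.standard_normal((500, 32)) + 0.5  # generated embeddings, shifted mean

fd_same = frechet_distance_diag(*stats(real), *stats(real))
fd_diff = frechet_distance_diag(*stats(real), *stats(fake))
assert fd_same < 1e-6 and fd_diff > fd_same
```

Because the score depends on sample means and covariances, it is sensitive to the number of videos evaluated and to the choice of embedding network, which is why FVD is usually reported alongside complementary metrics.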

A summary of selected model performance on established benchmarks is shown below:

Model        Dataset             FVD ↓    CLIPSim ↑   PSNR ↑   SSIM ↑
COMUNI       FaceForensics       6032     -           -        -
Diffusion4D  WebVid / Image→4D   560      0.75        16.7     0.83
LaMD         BAIR (16f)          57.0     -           -        -
S2DM         MHAD (16f)          99.0     -           -        -
VIDM         UCF101 (16f)        294.7    -           -        -

6. Challenges, Limitations, and Future Research Directions

Persistent technical obstacles include:

  • Motion Consistency and Flicker: Spatiotemporal drift and object identity switching remain, especially for long-form or fast-motion sequences. Approaches incorporating 3D/optical flow guidance, temporal filtering, or SSM temporal modules provide partial remedies (Sun et al., 2024, Oshima et al., 2024, Liang et al., 2023).
  • Computational Scaling: Linear or sublinear memory models (e.g., bidirectional SSM, token merging, frame-skipping) are increasingly preferred for scalable training and inference (Yuan et al., 16 Apr 2025, Kim et al., 5 Feb 2025).
  • Physical/3D and Semantic Priors: Fully 3D- or 4D-aware models such as Diffusion4D (Liang et al., 2024) and DaS (Gu et al., 7 Jan 2025) enable spatial-temporal control, but require high-quality geometric priors and large annotated datasets.
  • Unified Representation and Control: Divot (Ge et al., 2024) demonstrates a viable path towards video comprehension and generation in unified LLM architectures.
  • Benchmarking and Evaluation: FVD and similar metrics are sensitive to embedding choices; there is active work on comprehensive, robust benchmarking pipelines for both unconditional video realism and conditional prompt adherence.

7. Cross-Domain Applications and Impact

Diffusion-based video models extend well beyond generic video synthesis. Notable applications include:

  • Low-Level Vision: Denoising, deblurring, and super-resolution leverage strong priors from learned diffusion generators (Wang et al., 22 Apr 2025).
  • Content Manipulation and Editing: Control over appearance, motion, viewpoint, and semantic content, with application to personalized content creation, film post-processing, and virtual/augmented reality (Feng et al., 2023, Hu et al., 2023, Gu et al., 7 Jan 2025).
  • Representation Learning and Video-LLMs: Progress in self-supervised tokenizers such as Divot (Ge et al., 2024) integrates video with LLM frameworks for downstream video QA, captioning, and multimodal content generation.

Diffusion-based video generation is undergoing rapid, multi-faceted development, characterized by architectural diversification, increasing control and efficiency, and broadening application scope. The ongoing synthesis of spatiotemporal priors, scalable inference, and controllable semantics positions diffusion models as a central technology for next-generation video understanding and creation (Wang et al., 22 Apr 2025).
