
2D Video Diffusion Model

Updated 28 January 2026
  • 2D video diffusion models are generative frameworks that extend image diffusion techniques to video by integrating temporal conditioning and 2D latent representations.
  • They employ architectures like 2D UNet backbones combined with temporal attention and efficient latent mapping to enhance video synthesis and restoration.
  • Their design tackles challenges in temporal consistency and computational efficiency, achieving significant speed-ups and reduced memory usage while maintaining visual quality.

A 2D video diffusion model is a generative framework that extends the principles and architectural strengths of 2D image diffusion models to the temporally structured domain of video, typically by leveraging 2D or pseudo-2D latent spaces, temporal conditioning modules, and advanced attention mechanisms. Such models decouple the modeling of per-frame visual fidelity from temporally coherent synthesis, enabling scalable and controllable generation, restoration, and modification of video content.

1. Mathematical Formulation and Latent Diffusion Process

2D video diffusion models generalize the denoising diffusion probabilistic model (DDPM) to video sequences either in pixel space or lower-dimensional 2D-structured latent spaces. Given a video $x_0 \in \mathbb{R}^{T \times H \times W \times C}$, the forward noising process applies a Markov chain independently or jointly over the temporal sequence of frames:

$$q(x^t_i \mid x^{t-1}_i) = \mathcal{N}\!\left(x^t_i;\ \sqrt{1-\beta_t}\, x^{t-1}_i,\ \beta_t I\right),$$

with closed form

$$x^t_i = \sqrt{\bar\alpha_t}\, x^0_i + \sqrt{1-\bar\alpha_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ (Li et al., 17 Jan 2025, Yu et al., 2023, Blattmann et al., 2023).
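The closed-form forward sample above can be sketched in a few lines of NumPy; the linear beta schedule below is illustrative rather than taken from any particular paper.

```python
import numpy as np

# Illustrative linear noise schedule over 1000 diffusion steps
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)  # bar-alpha_t

def forward_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) using the closed form."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)               # epsilon ~ N(0, I)
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return xt, eps
```

At large $t$, $\bar\alpha_t$ approaches zero and $x_t$ approaches pure Gaussian noise, which is exactly the property the learned reverse process inverts.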

The learned reverse process is

$$p_\theta(x^{t-1}_i \mid x^t_i, \text{cond}) = \mathcal{N}\!\left(x^{t-1}_i;\ \mu_\theta(x^t_i, t, \text{cond}),\ \sigma_t^2 I\right),$$

with conditioning vectors including spatial features, temporal context, or external priors (Li et al., 17 Jan 2025). In latent-space models, the same formulation applies to the latent representations $z_0 = E(x_0)$ (e.g., spatial 2D latents per frame, triplane projections, or compact motion latents in content–motion decompositions) (Yu et al., 2024, Yu et al., 2023).
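A single ancestral step of this reverse process can be sketched as follows; `eps_theta` is a hypothetical placeholder for any trained denoiser network, and the schedule arrays match the forward-process definitions.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)          # illustrative schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def reverse_step(eps_theta, zt, t, cond, rng):
    """One ancestral sampling step of p_theta(z_{t-1} | z_t, cond)."""
    beta, abar = betas[t], alphas_cumprod[t]
    eps_hat = eps_theta(zt, t, cond)           # predicted noise
    # Posterior mean under the epsilon-parameterization of mu_theta
    mean = (zt - beta / np.sqrt(1.0 - abar) * eps_hat) / np.sqrt(1.0 - beta)
    if t == 0:                                 # final step is deterministic
        return mean
    return mean + np.sqrt(beta) * rng.standard_normal(zt.shape)
```

Iterating this step from pure noise down to $t = 0$ yields a sample; samplers such as DDIM replace the stochastic term with a deterministic update.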

The loss is a simple denoising objective,

$$\mathcal{L}_\text{simple} = \mathbb{E}_{t, i, \varepsilon}\left[\,\|\varepsilon - \varepsilon_\theta(z^t_i, t, \text{cond})\|^2\,\right],$$

optionally augmented by perceptual, temporal, or adversarial terms depending on the application (Li et al., 17 Jan 2025, Zhu et al., 2024).
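Training on this simple objective amounts to regressing the injected noise. A minimal sketch, where the `eps_theta` callable and noise schedule are illustrative placeholders:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)          # illustrative schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def simple_loss(eps_theta, z0, t, cond, rng):
    """L_simple: MSE between injected and predicted noise on latents."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps
    return float(np.mean((eps - eps_theta(zt, t, cond)) ** 2))
```

A perfect denoiser drives this loss to zero; in practice the expectation is approximated by sampling random timesteps, frames, and noise per minibatch.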

2. Network Architectures and Temporal Modeling

The core architectures in 2D video diffusion include:

  • 2D UNet Backbone: A spatial UNet (often from large-scale image models such as Stable Diffusion v1.5/2.1 (Blattmann et al., 2023)) is retained as the backbone for denoising. Temporal information is either injected via architectural augmentations or by concatenating frame sequences as input channels (Li et al., 17 Jan 2025, Long et al., 2024, Mei et al., 2022).
  • Temporal Modules: Motion and temporal coherence are handled by dedicated temporal components, most commonly temporal attention layers interleaved with the spatial blocks of the 2D backbone (Li et al., 17 Jan 2025, Long et al., 2024).
  • Latent and Factorized Representations: To manage memory and computation, videos are mapped to efficient latent spaces:
    • Per-frame or triplane 2D latents (PVDM, CMD) (Yu et al., 2023, Yu et al., 2024).
    • Content–motion decomposition (CMD), in which a video is represented as a single content frame with motion-specific latents recovered by a lightweight, motion-only diffusion process (Yu et al., 2024).
  • Conditional and Adaptive Control: Advanced models (e.g., OmniVDiff (Xi et al., 15 Apr 2025)) admit arbitrary modality roles and multi-modal conditioning, leveraging learned embeddings that determine whether each input channel (RGB, depth, segmentation, edge) is generated or supplied as context.
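A common pattern behind the temporal modules above is to run self-attention along the frame axis independently at each spatial location, leaving spatial modeling to the 2D backbone. A minimal, parameter-free NumPy sketch of that reshape-and-attend step (illustrative, not any specific paper's layer):

```python
import numpy as np

def temporal_self_attention(x):
    """Self-attention along the frame axis of (T, C, H, W) features.

    Each spatial location attends over its own T-length sequence of
    feature vectors; spatial structure is untouched.
    """
    T, C, H, W = x.shape
    seq = x.transpose(2, 3, 0, 1).reshape(H * W, T, C)   # (HW, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)   # (HW, T, T)
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ seq                                     # (HW, T, C)
    return out.reshape(H, W, T, C).transpose(2, 3, 0, 1)
```

Real layers add learned query/key/value projections and residual connections, but the reshape is the key trick: it lets a spatial 2D network acquire temporal context at modest cost.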

3. Temporal Consistency, Conditioning, and Long-Range Generation

Temporal consistency remains a central challenge, addressed through several designs:

  • Temporal Smoothing and Overlapping Windowing: Models such as DiffuEraser employ staggered denoising (alternating between even/odd frame anchors), averaging predictions at window overlaps to smooth cross-clip boundaries (Li et al., 17 Jan 2025).
  • Prior Injection and Weak Conditioning: Initialization with priors, often in the form of DDIM-inverted latents from separate lightweight models, provides global spatial context and prevents hallucinations or diffusion noise artifacts (Li et al., 17 Jan 2025).
  • Expanded Receptive Field via Pre-inference: Applying preliminary inference over sparsely sampled sequences and fusing context increases the model's effective temporal receptive field, enabling long-range dependencies despite local windowing of the main network (Li et al., 17 Jan 2025).
  • Sector-shaped and Ray-shaped Diffusion: S2DM provides a framework where all frames are reverse-sampled from a shared noise seed, maintaining semantic and stochastic consistency via sector-shaped regions in latent space (Lang et al., 2024).
  • Equivariance and GP-based Warping: Warped Diffusion propagates spatially correlated noise aligned by optical-flow-derived warpings, enforcing equivariance via test-time gradient corrections to ensure temporally smooth and spatially coherent generations even when using independent image diffusion models (Daras et al., 2024).
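The overlapping-window blending used for cross-clip smoothing can be sketched as follows; the window and stride values are illustrative, and `preds` would come from per-window denoising runs.

```python
import numpy as np

def window_spans(num_frames, window, stride):
    """(start, end) spans covering the sequence with overlap."""
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:      # make sure the tail is covered
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

def blend_windows(preds, num_frames):
    """Average per-window frame predictions where spans overlap.

    preds: list of ((start, end), array of shape (window, ...)) pairs.
    """
    shape = (num_frames,) + preds[0][1].shape[1:]
    total = np.zeros(shape)
    count = np.zeros((num_frames,) + (1,) * (len(shape) - 1))
    for (s, e), p in preds:
        total[s:e] += p
        count[s:e] += 1
    return total / count
```

Averaging at the overlaps suppresses seams between independently denoised clips, at the cost of re-denoising each overlapping frame more than once.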

4. Efficiency, Memory, and Scalability

2D video diffusion models are notable for their computational and memory efficiency:

  • Projected Latent Spaces: PVDM encodes the cubic $T \times H \times W$ video into three triplane latents $\{z^s, z^h, z^w\}$, reducing both compute and storage from $O(THW)$ for a traditional approach to $O(HW + TW + TH)$, enabling high-resolution and long-sequence training on limited hardware (Yu et al., 2023).
  • Content–Motion Decomposition: CMD's factorization of video into a single high-fidelity frame and small motion latents offers a 7–11× speed-up and major FLOP/memory savings over conventional latent video diffusion models, with no loss in visual quality (Yu et al., 2024).
  • Efficient Inference: Distribution-matched distilled models (AVDM2) can generate full videos in four sampling steps, compared to the 25+ steps typical of teacher models, through combined adversarial and 2D score-matching losses (Zhu et al., 2024).
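The triplane saving noted above is easy to quantify for concrete latent sizes (the 16×64×64 grid below is illustrative; per-plane channel counts are omitted):

```python
# Illustrative latent-grid sizes for one video
T, H, W = 16, 64, 64
full_volume = T * H * W                   # O(THW) voxel latent
triplane = H * W + T * W + T * H          # O(HW + TW + TH) triplane latent
print(full_volume, triplane, full_volume / triplane)
```

At these sizes the triplane representation stores roughly a tenth as many latent elements as the full volume, and the gap widens as any single dimension grows.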

5. Applications and Empirical Performance

2D video diffusion models are applied broadly in generation, manipulation, and restoration:

  • Video Inpainting: DiffuEraser achieves state-of-the-art video inpainting performance, delivering sharper details and enhanced temporal consistency in large masked regions (+1.2 dB PSNR, ~15% tOF reduction, ~25% FVD reduction relative to strong baselines) (Li et al., 17 Jan 2025).
  • Video Deblurring: DIVD leverages windowed temporal self-attention to outstrip previous approaches on perceptual metrics such as FID, LPIPS, and NIQE, recovering fine details without explicit flow or alignment (Long et al., 2024).
  • Text-to-Video Generation: Stable Video Diffusion, MoVideo, CMD, and S2DM support competitive text-conditional video synthesis, multi-view 3D priors, and motion fidelity, with models such as SVD achieving FVD = 242.0 on UCF-101 and CMD attaining a 7× acceleration over prior approaches (Blattmann et al., 2023, Liang et al., 2023, Yu et al., 2024, Lang et al., 2024).
  • Multi-modal Video Understanding and Synthesis: OmniVDiff enables joint RGB, depth, segmentation, and edge video generation or prediction in a single diffusion model, supporting both cross-modal generation and informed understanding (Xi et al., 15 Apr 2025).
  • Controllable and 4D-Guided Video Rendering: Techniques such as Generative Rendering inject 3D-to-2D correspondence (UV maps, depth) to control content and motion in zero-shot stylized video synthesis using standard 2D image diffusion backbones, achieving leading frame consistency and prompt fidelity (Cai et al., 2023).
  • Refinement of 3D-Based Generative Pipelines: ScenDi demonstrates that a 2D video diffusion stage can substantially enhance appearance fidelity and temporal consistency over purely 3D diffusion models in complex urban scene generation (Guo et al., 21 Jan 2026).

6. Quantitative Benchmarks and Ablation Insights

Extensive evaluation demonstrates the practical effectiveness and design tradeoffs of 2D video diffusion models:

  • FVD and Perceptual Metrics: State-of-the-art FVD values for text-to-video and video restoration tasks underscore the strength of 2D video diffusion—e.g., FVD = 639.7 (PVDM-L, UCF-101 128f) (Yu et al., 2023); FVD = 242.0 (SVD) (Blattmann et al., 2023); FID = 2.17 vs. prior of 19.36 on deblurring (Long et al., 2024).
  • Ablations: Studies confirm the necessity of curated pretraining data (Blattmann et al., 2023), shared-noise strategies (Lang et al., 2024), and denoised conditioning (Lapid et al., 2023) for optimum performance. Shared-noise sector diffusion achieves 33% lower FVD than comparable non-shared approaches (Lang et al., 2024).
  • Representation Learning: Video-diffusion-trained representations outperform image-trained counterparts in action recognition, depth estimation, tracking, and fine-grained classification due to the capture of spatiotemporal features (Vélez et al., 10 Feb 2025).
  • Efficiency: Factors such as triplane latents, motion-latent factorization, and distilled few-step samplers contribute to major improvements in speed and scalability, with CMD running 7.7× faster than prior art while using <40% of the memory/compute (Yu et al., 2024), and AVDM2 matching multi-step quality in only four steps (Zhu et al., 2024).

7. Limitations, Extensions, and Future Directions

Despite substantial advances, challenges remain:

  • Domain Generalization: Residual domain gap (e.g., between real and generated depth) can cause quality degradation that must be addressed by denoised or domain-adaptive conditioning (Lapid et al., 2023).
  • Perceptual/Temporal Tradeoffs: Per-frame fidelity and temporal consistency are often in tension, particularly in settings lacking explicit temporal conditioning or sector/ray construction (Yu et al., 2024, Lang et al., 2024).
  • Explicit Temporal Control: Many models require explicit or separately generated temporal codes (optical flow, depth), and end-to-end learning of both semantic and motion priors remains an open research direction (Lang et al., 2024, Liang et al., 2023).
  • Scalability: Large-model inference costs and the need for accelerated sampling strategies persist as practical limitations for long/higher-resolution sequences (Xi et al., 15 Apr 2025, Blattmann et al., 2023).

Potential extensions include learned multi-modal priors, autoregressive or continuous-time score-based models, joint depth/RGB/segmentation pipelines, and further integration with 3D and 4D generation frameworks.

Key references: (Li et al., 17 Jan 2025, Yu et al., 2023, Blattmann et al., 2023, Yu et al., 2024, Zhu et al., 2024, Long et al., 2024, Liang et al., 2023, Mei et al., 2022, Lang et al., 2024, Xi et al., 15 Apr 2025, Guo et al., 21 Jan 2026, Cai et al., 2023, Lapid et al., 2023, Daras et al., 2024, Vélez et al., 10 Feb 2025, Parthasarathy et al., 2024).
