Autoregressive Video Diffusion Models (AR-VDMs)

Updated 20 December 2025
  • Autoregressive Video Diffusion Models (AR-VDMs) are generative frameworks that produce video frames sequentially with causal conditioning and diffusion denoising.
  • They enable scalable long-range video synthesis, efficient streaming, and modular control through advanced memory and temporal factorization techniques.
  • Key challenges include error accumulation, memory bottlenecks, and balancing compression efficiency with high-fidelity generation.

Autoregressive Video Diffusion Models (AR-VDMs) are a subclass of generative video models in which the video sequence is produced framewise or chunkwise via conditional denoising diffusion, subject to strict temporal causality: each frame or latent group is generated conditioned only on past observations and/or auxiliary inputs. This formalism enables scalable long-range synthesis, efficient streaming, and modular control, while presenting unique architectural and theoretical challenges concerning error accumulation, memory capacity, compression, and generation efficiency. AR-VDMs subsume a range of precursor designs, including teacher-forced bidirectional VDMs, causal transformers, streaming residual diffusions, diffusion-token AR-LMs, and masked AR-planning pipelines.

1. Model Architectures and Temporal Factorization

Autoregressive video diffusion models employ distinct architectural mechanisms to ensure causal generation. The common factorization is

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, c_t),$$

where $x_t$ denotes the latent (or pixel) state of frame $t$ and $c_t$ can encode text, trajectory, reference frames, or domain priors (Li et al., 2024, Yang et al., 2022, Gao et al., 2024, Zhao et al., 9 Oct 2025). Within each generative step, the model applies a multi-step or few-step diffusion denoising chain over $S$ discrete timesteps, starting from noise and iteratively reconstructing the clean frame conditioned on prior context.

Several variants exist:

  • Continuous Token AR-LMs: DiCoDe (Li et al., 2024) introduces “diffusion-compressed deep tokens” trained with a diffusion decoder; AR-LMs (GPT, Llama) autoregressively model these tokens, yielding ∼1000× sequence compression versus conventional VQ methods.
  • Causal Attention Transformers: ViD-GPT (Gao et al., 2024) and Ca²-VDM (Gao et al., 2024) employ strictly lower-triangular attention masks, ensuring each token or frame attends only to earlier positions, transforming per-clip bidirectional VDMs into scalable causal generators (see the mask sketch after this list).
  • Residual Correction Decomposition: Compress-RNN or U-Net predictors synthesize a deterministic next-frame guess, augmented by stochastic diffusion-generated residuals (Yang et al., 2022).
  • Masked AR Planning: MarDini (Liu et al., 2024) and ARLON (Li et al., 2024) decouple coarse temporal planning (via AR-transformers over VQ codes) from high-res spatial generation (diffusion denoising conditioned on planning signals).
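
A minimal sketch of the frame-level causal masking such transformers rely on, in PyTorch. The block-causal variant below lets tokens within a frame attend to each other while forbidding attention to future frames; with one token per frame it reduces to the standard lower-triangular LM mask. Function names and shapes are illustrative, not any paper's code.

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (L, L) mask, True where attention is allowed: each token may
    attend only to tokens of its own frame or of earlier frames."""
    frame_of = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_of.unsqueeze(1) >= frame_of.unsqueeze(0)

def causal_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed positions set to -inf."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```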

Hybrid architectures incorporate stateful memory: RAD (Chen et al., 17 Nov 2025) and VideoSSM (Yu et al., 4 Dec 2025) fuse windowed local caches with global recurrent (LSTM, SSM) or compressed representations to maintain long-range consistency.
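
A schematic of this windowed-cache-plus-recurrent-state pattern follows; it is a hedged sketch rather than the actual RAD or VideoSSM architecture (the GRU stands in for their LSTM/SSM state, and the fusion into a token sequence is an assumption):

```python
import torch
import torch.nn as nn
from collections import deque

class HybridMemory(nn.Module):
    """Sliding window of recent frame latents plus a recurrent global summary."""
    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.cache = deque(maxlen=window)   # local context: last `window` frames
        self.rnn = nn.GRUCell(dim, dim)     # global context: compressed history
        self.state = None

    def update(self, frame_latent: torch.Tensor) -> None:
        """frame_latent: (B, dim) pooled latent of the newest frame."""
        self.cache.append(frame_latent)
        if self.state is None:
            self.state = torch.zeros_like(frame_latent)
        self.state = self.rnn(frame_latent, self.state)

    def context(self) -> torch.Tensor:
        """Local window tokens plus one global summary token: (n+1, B, dim)."""
        return torch.stack(list(self.cache) + [self.state], dim=0)
```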

2. Diffusion Frameworks and Training Regimes

The diffusion backbone is typically a discretized stochastic process, Markovian in timestep $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\bigr),$$

with the closed-form marginalization

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$$

(Li et al., 2024, Zhao et al., 9 Oct 2025, Yang et al., 2022).
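
The closed-form marginal translates directly into the standard noising routine used at training time; a minimal sketch with an illustrative linear β-schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear beta schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with per-sample t."""
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over C, H, W
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```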

Reverse denoising uses either $\epsilon$-prediction or $x_0$-estimation, optionally parameterized for each frame $i$ by past context and the current timestep embedding. Training objectives include standard MSE losses, variational lower bounds, or score matching (Yang et al., 2022).
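
Continuing the sketch above, a standard ε-prediction training step for a single AR frame might look as follows; the `denoiser` signature (noisy frame, timestep, past context) is an assumption, not a fixed API:

```python
import torch

def training_step(denoiser, x0, context, optimizer):
    """One MSE epsilon-prediction update, conditioned on past-frame context."""
    t = torch.randint(0, T, (x0.shape[0],))   # random diffusion timestep
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)                # forward-noise the clean frame
    eps_hat = denoiser(x_t, t, context)       # predict the injected noise
    loss = torch.nn.functional.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```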

Many AR-VDMs introduce additional regularization or scheduling, such as non-decreasing per-frame timesteps (Sun et al., 10 Mar 2025) or progressive noise schedules that re-corrupt previous frames (Xie et al., 2024).

3. Causal Attention, Memory, and Compression Mechanisms

Architectural innovations focus on efficient temporal context:

| Model | Temporal Mechanism | Memory Handling |
|---|---|---|
| ViD-GPT | Causal attention, KV-cache | Reuse of all past KV features |
| Ca²-VDM | Causal attention, cache sharing | Fixed-size FIFO, O(K·l·Pₘₐₓ) cost |
| RAD / VideoSSM | LSTM/SSM hybrid | Sliding window + state compression |
| AR-Diffusion | Temporal causal attention | Non-decreasing timesteps |
| DiCoDe | AR LLM over tokens | 1000× deep-token compression |
| MarDini | Masked AR planning over VQ | Asymmetric high/low-res attention |

Notably, cache reuse in Ca²-VDM (Gao et al., 2024) converts AR generation complexity from quadratic to linear in sequence length, while memory-augmented models (RAD, VideoSSM (Chen et al., 17 Nov 2025, Yu et al., 4 Dec 2025)) enable hundreds to thousands of temporally coherent frames via fusion of local attention with state-space or LSTM compression.
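
The linear-in-length behavior comes from never recomputing features for past frames: each new frame's keys and values are appended once, and old entries are evicted. A hedged sketch of such a fixed-size FIFO key/value cache (illustrative, not Ca²-VDM's code):

```python
import torch

class FIFOKVCache:
    """Bounded key/value cache: O(1) work per new frame, so an N-frame
    rollout attends over at most `capacity` frames instead of all N."""
    def __init__(self, capacity_frames: int):
        self.capacity = capacity_frames
        self.keys, self.values = [], []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """k, v: (B, tokens_per_frame, d) features of the newest frame."""
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.capacity:   # evict the oldest frame
            self.keys.pop(0)
            self.values.pop(0)

    def kv(self):
        return torch.cat(self.keys, dim=1), torch.cat(self.values, dim=1)
```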

Compression bottlenecks are analytically demonstrated in Meta-ARVDM (Wang et al., 12 Mar 2025): the KL-divergence between generated and true videos grows with both error accumulation and an unavoidable “memory bottleneck.” Practical architectures mitigate this by prepending, channel-concatenating, or cross-attending to compressed summaries of past frames, with ablation studies confirming empirical trade-offs.
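
Of the conditioning routes named above, channel concatenation is the simplest; a sketch (the temporal average pooling is an assumed, deliberately cheap compressor):

```python
import torch

def with_memory_channels(x_t: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
    """Concatenate a compressed summary of past latents (B, F, C, H, W)
    onto the noisy current latent (B, C, H, W) along the channel axis."""
    summary = past.mean(dim=1)               # (B, C, H, W) temporal average
    return torch.cat([x_t, summary], dim=1)  # (B, 2C, H, W) denoiser input
```

A cross-attention variant would instead flatten `past` into tokens and feed them as keys/values of the denoiser's attention layers.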

4. Sampling Pipelines, Inference Efficiency, and Streaming Generation

AR-VDMs support variable-length and streaming inference by virtue of their causal structure. Typical sampling involves the following steps (a condensed code sketch follows the list):

  1. Prepare conditioning (text prompt, motion trajectory, previous frame cache).
  2. For each new frame/chunk:
    • Sample the initial latent from $\mathcal{N}(0, I)$, or re-corrupt previous frames for progressive schedules.
    • Iteratively denoise via diffusion steps, using cached context and adaptive attention.
    • Update memory or cache (local or global as applicable).
  3. Concatenate or stitch frames to assemble the video sequence.
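
A condensed sketch of this loop, reusing `T`, `alpha_bar`, and the assumed `denoiser` from the Section 2 sketches, with a deterministic DDIM-style update between timesteps (all plumbing here is illustrative):

```python
import torch

@torch.no_grad()
def generate(denoiser, num_frames, steps=50, shape=(1, 4, 64, 64)):
    """Streaming AR sampling; text/trajectory conditioning is assumed to be
    bound inside `denoiser` (step 1 of the recipe above)."""
    frames, cache = [], []
    ts = torch.linspace(T - 1, 0, steps).long()      # coarse timestep grid
    for _ in range(num_frames):
        x = torch.randn(shape)                       # 2a: fresh noise latent
        for i, t in enumerate(ts):
            ctx = torch.stack(cache, dim=1) if cache else None
            eps = denoiser(x, t.repeat(shape[0]), ctx)   # 2b: denoise w/ context
            ab = alpha_bar[t]
            x0_hat = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()
            if i + 1 < len(ts):                      # DDIM-style deterministic step
                ab_next = alpha_bar[ts[i + 1]]
                x = ab_next.sqrt() * x0_hat + (1.0 - ab_next).sqrt() * eps
            else:
                x = x0_hat
        cache.append(x)                              # 2c: update the frame cache
        frames.append(x)
    return torch.stack(frames, dim=1)                # 3: (B, T, C, H, W) video
```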

Optimizations include KV-cache reuse and sharing across denoising steps (Gao et al., 2024), fixed-size FIFO caches that bound context cost, and few-step denoising schedules that reach sub-second per-frame latency (Liu et al., 2024, Zhao et al., 9 Oct 2025).

5. Evaluation Methods and Empirical Results

State-of-the-art AR-VDMs are evaluated on short-clip (MSR-VTT, UCF-101), long-form (minute-scale), and action-conditioned benchmarks (DMLab, Minecraft). Metrics include:

  • Fréchet Video Distance (FVD): Lower is better; models such as Ca²-VDM (Gao et al., 2024), AR-Diffusion (Sun et al., 10 Mar 2025), and DiCoDe (Li et al., 2024) consistently achieve or approach SOTA FVD scores across datasets (e.g., Ca²-VDM: FVD=181 on MSR-VTT; AR-Diffusion: FVD₁₆=186.6 on UCF-101); the metric itself is sketched after this list.
  • CLIPSIM/IS scores: Semantic alignment to the prompt; DiCoDe (Li et al., 2024) and ART·V (Weng et al., 2023) are competitive with much larger pretrained baselines.
  • Motion Consistency/Smoothness: AR-Drag (Zhao et al., 9 Oct 2025) reports 4.37 motion consistency at 0.44 s latency, clearly exceeding previous controllable-motion VDMs.
  • Temporal continuity and drift: ViD-GPT (Gao et al., 2024) introduces Step-FVD/ΔEdgeFD for chunkwise drift analysis; qualitative plots show flat frame-difference curves and lower stepwise FVD versus bidirectional or less causal baselines.
  • Streaming and interpolation: MarDini (Liu et al., 2024) achieves SOTA FVD=99.05 for interpolation (17 frames @512) and sub-second per-frame latency without image pretraining.
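
For reference, FVD is the Fréchet distance between Gaussians fitted to real and generated video features (conventionally extracted by a pretrained I3D network); given pre-extracted (N, D) feature arrays, the distance reduces to a few lines:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (N, D) feature sets."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):    # numerical noise can leave tiny imaginaries
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```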

Theoretical analysis (Meta-ARVDM (Wang et al., 12 Mar 2025)) reveals a Pareto frontier: increased context reduces memory bottleneck at the expense of error accumulation and efficiency.

6. Limitations, Open Problems, and Prospective Directions

Several persistent limitations arise from both theory and empirical studies:

  • Memory Bottleneck: Information-theoretically unavoidable unless unbounded context is accommodated; compressing the memory budget trades global consistency against error propagation (Wang et al., 12 Mar 2025).
  • Error Accumulation: AR sampling accumulates KL-divergence with rollout length; mitigated by richer caches, SSM states, or progressive noise schedules (Xie et al., 2024, Yu et al., 4 Dec 2025); one such schedule is sketched after this list.
  • Scene boundaries and stochastic motion: Tokenizer architecture (e.g., DiCoDe (Li et al., 2024)) assumes smooth reconstructibility between head/tail tokens; hard scene cuts or stochastic dynamics may degrade results.
  • Data and domain bias: WebVid/YouTube sources lead to imbalance (nonrigid > rigid scenes); long-form rigid motion modeling remains weaker (Li et al., 2024, Li et al., 2024).
  • Scalability: Fixed context windows or linear memory capacity saturate at minute-scale; hybrid or multi-scale memory modules remain an active research area (Yu et al., 4 Dec 2025, Chen et al., 17 Nov 2025).
  • Efficient adaptation and robustness: Noise schedule adaptation, adaptive compression, LoRA-style feedforward refiners (e.g., AutoRefiner (Yu et al., 12 Dec 2025)), and robust teacher/student pipelines are evolving to minimize artifacts under long autoregressive rollouts.
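
As an illustration of the progressive-schedule idea referenced in the error-accumulation item above (a hedged sketch, not any specific paper's schedule): each frame in a rolling window is held at a different noise level that increases with recency, so the oldest frame is a nearly clean anchor while the newest is still close to pure noise.

```python
import torch

def progressive_timesteps(window: int, t_max: int = 999) -> torch.Tensor:
    """Per-frame noise levels for a rolling window, increasing with recency."""
    return torch.linspace(0, t_max, window).long()

# Example: a 5-frame window is denoised jointly at staggered levels; after each
# pass every frame moves one level cleaner, the fully clean oldest frame is
# emitted, and a fresh pure-noise frame enters at the right.
print(progressive_timesteps(5))   # tensor([  0, 249, 499, 749, 999])
```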

Future directions target unbounded or multi-scale memory architectures, adaptive noise scheduling and compression, and tighter coupling of AR planning with high-fidelity diffusion generation.

7. Representative Models and Comparative Table

| Model | AR Mechanism | Core Innovation | Notable Results |
|---|---|---|---|
| DiCoDe (Li et al., 2024) | AR-LM over diffusion-compressed tokens | 1000× token compression | Minute-long, scalable video; FVD=367 (16f) |
| Ca²-VDM (Gao et al., 2024) | Causal attention, cache sharing | Linear complexity, KV reuse | FVD=181 (MSR-VTT); 52 s / 80 frames |
| RAD (Chen et al., 17 Nov 2025) | DiT+LSTM hybrid, pre-fetch | Frame-wise AR, memory fusion | Improved SSIM/LPIPS; >1000 frames |
| MarDini (Liu et al., 2024) | Masked AR planning, asymmetric | S-T planning/generation split | SOTA FVD for interpolation; 0.5 s/frame |
| AR-Diffusion (Sun et al., 10 Mar 2025) | Non-decreasing steps, causal attention | FoPP/AD scheduler, async-gen | FVD=40.8 (Sky); best cross-domain |
| ARLON (Li et al., 2024) | AR VQ-VAE + DiT fusion | Norm-based semantic injection | Top dynamics/consistency/efficiency |
| VideoSSM (Yu et al., 4 Dec 2025) | State-space global memory | Hybrid SSM + local cache | Best minute-scale fidelity/stability |
| AutoRefiner (Yu et al., 12 Dec 2025) | Pathwise reflective LoRA | Context-sensitive noise refinement | +0.7 VBench, 6 fps, no reward hacking |

Each model’s strengths and empirical advances are defined by architectural choices concerning temporal causality, memory handling, tokenization/compression, and cache efficiency.


Autoregressive video diffusion models have redefined the scalability and fidelity envelope for generative video synthesis, achieving minute-long coherent motion, interactive real-time control, and streaming-friendly architectures. Their foundation in causal modeling, efficient memory, and compressor-guided AR mechanisms continues to drive active research at the intersection of multimodal generation, memory theory, and scalable inference.
