Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Published 24 May 2025 in cs.CV and cs.LG | (arXiv:2505.21545v2)
Abstract: Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art generative quality for image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (e.g., Gaussian, Uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose CAT-LVDM, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators, Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), align perturbations with batch semantics or spectral dynamics to preserve coherence. CAT-LVDM yields substantial gains: BCNI reduces FVD by 31.9 percent on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by 12.3 percent, outperforming Gaussian, Uniform, and large diffusion baselines such as DEMO (2.3B) and LaVie (3B) despite training on 5x less data. Ablations confirm the unique value of low-rank, data-aligned noise, and theoretical analysis establishes why these operators tighten robustness and generalization bounds. CAT-LVDM thus introduces a principled framework for robust video diffusion and further demonstrates transferability to autoregressive generation and multimodal video understanding models.
The paper introduces CAT-LVDM, a corruption-aware training method that uses structured noise injection (BCNI and SACN) to combat error propagation in video diffusion models.
It achieves significant empirical gains with a 31.9% reduction in FVD and enhanced robustness across multiple T2V benchmarks, even with reduced training data.
The method’s theoretical framework compresses risk bounds from O(D) to O(d), offering improved sample efficiency and generalization under realistic corruption.
Corruption-Aware Training for Robust Latent Video Diffusion Models
Motivation and Problem Formulation
Latent Video Diffusion Models (LVDMs) have set benchmarks for generative text-to-video (T2V) synthesis by leveraging pretrained autoencoders and diffusion processes within highly compressed latent spaces. However, their generative performance exhibits pronounced brittleness under corrupted conditioning, i.e., noise or misalignment in text or multimodal embeddings. This sensitivity differs fundamentally from static-image generative models: error propagation across diffusion steps induces catastrophic semantic drift and loss of temporal coherence, with degradations amplifying recursively rather than remaining locally bounded.
Standard corruption strategies adapted from image diffusion (Gaussian, uniform, etc.) are not sufficient for video. The iterative structure and temporal dependencies intrinsic to video synthesis mean that indiscriminate, static noise disrupts both motion and semantic consistency. Thus, there is a need for corruption-aware training paradigms that respect video-specific statistical dependencies and alignment properties.
Method: Corruption-Aware Training (CAT-LVDM)
The paper introduces CAT-LVDM, a corruption-aware training protocol designed for video diffusion. Central to this approach are two structured, data-aligned noise injection operators:
Batch-Centered Noise Injection (BCNI): BCNI perturbs the conditioning embeddings along directions defined by each sample's deviation from the batch mean. This acts as a Mahalanobis-type regularizer, selectively increasing robustness along the most semantically variable axes, which empirically correlate with higher-entropy batchwise features. Theoretical analysis shows that BCNI confines noise to a subspace of dimension d ≪ D (with d the effective semantic dimension), controlling the propagation of error in reverse diffusion and improving generalization bounds.
Spectrum-Aware Contextual Noise (SACN): SACN confines perturbations to embedding directions associated with the largest singular values of the batch covariance—corresponding to dominant, low-frequency spectral modes. This aligns corruption with globally coherent motion and content, rather than arbitrary, high-frequency details, and as such explicitly preserves temporal consistency.
Both techniques are parameter-free (outside of the noise scale p), lightweight to compute, and easily integrated into standard LVDM pipelines. Theoretical results rigorously establish that structured, low-rank corruption significantly tightens entropy, 2-Wasserstein, and score-drift bounds compared to full-rank isotropic corruption, with all core quantities scaling with d rather than D.
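The two operators can be sketched in a few lines; this is a minimal illustration assuming conditioning embeddings arrive as a (batch, D) array. The function names, the per-sample Gaussian scaling in BCNI, and the top-k truncation in SACN are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def bcni(embeddings: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Batch-Centered Noise Injection (sketch).

    Perturbs each embedding along its deviation from the batch mean,
    so noise lives in the batch's semantic-variation directions
    rather than isotropically in R^D.
    """
    if rng is None:
        rng = np.random.default_rng()
    mean = embeddings.mean(axis=0, keepdims=True)           # (1, D)
    deviation = embeddings - mean                           # (B, D)
    scale = rng.standard_normal((embeddings.shape[0], 1))   # per-sample magnitude
    return embeddings + p * scale * deviation

def sacn(embeddings: np.ndarray, p: float, k: int = 8, rng=None) -> np.ndarray:
    """Spectrum-Aware Contextual Noise (sketch).

    Confines noise to the top-k singular directions of the centered
    batch, i.e. the dominant spectral modes of the batch covariance.
    """
    if rng is None:
        rng = np.random.default_rng()
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Right singular vectors span the principal embedding directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                                            # (k, D)
    coeffs = rng.standard_normal((embeddings.shape[0], k)) * s[:k]
    return embeddings + p * (coeffs @ top)
```

During training, the corrupted embeddings would replace the clean text embeddings as conditioning for the denoising network; at inference, clean embeddings are used unchanged.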
Empirical Results
Extensive experiments span four canonical T2V benchmarks (WebVid-2M, MSR-VTT, MSVD, UCF-101) and include 73 LVDM variants trained under seven embedding-level and five token-level noise strategies.
Generalization Across Regimes: BCNI produces stronger gains on caption-rich datasets (preserving appearance and action semantics), while SACN yields optimal results on action-centric, class-labeled data by minimizing motion flicker and misalignment.
Robustness and Sample Efficiency: Sensitivity analysis shows that structured corruption yields smoother FVD degradation as conditioning noise increases, whereas unstructured baselines degrade abruptly. CAT-LVDM matches or exceeds baseline models even when the available training data is reduced by an order of magnitude.
Transferability: CAT-LVDM's structured corruption is not model-specific—it transfers to autoregressive generation backbones (e.g., NOVA, MAGVIT) and large video-LLMs (e.g., PAVE) without loss in robustness, validating the operator's backbone-agnostic utility.
Adversarial and Downstream Robustness: CAT-LVDM improves robustness under adversarial and textual corruption, outperforming adversarial training and heuristic text-perturbation defenses in both generative and scene-aware dialog settings (e.g., surpassing LLaVA-OV-0.5B-FT and PAVE on AVSD by a significant margin in CIDEr).
Ablation and Hyperparameter Stability: Across classifier-free guidance scales and DDIM step counts, CAT-LVDM exhibits Pareto-optimal robustness across perceptual and pixel-level metrics (SSIM, LPIPS, PSNR), while unstructured corruptions show non-monotonic, brittle trends.
Theoretical Contributions
The paper provides a comprehensive theoretical framework showing that:
Low-rank, structured corruption increases conditional entropy and enlarges the support of the conditional distribution proportionally to the effective semantic rank d, not D.
Wasserstein and KL divergences between clean and corrupted distributions scale as O(p√d), compared to O(p√D) under isotropic noise.
Mixing times, error magnification, functional inequalities (log-Sobolev, Talagrand T2), and Rademacher complexity bounds are all dimensionally compressed under CAT-LVDM.
Minimax lower bounds and entropic OT gaps substantiate a nontrivial improvement in statistical risk with structured corruption.
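The headline scalings above can be summarized schematically; the constant C and the distributions shown are placeholders, and the paper's formal statements carry additional regularity assumptions:

```latex
% Schematic dimension-compressed bound: rank-d structured corruption
% at noise scale p versus full-rank isotropic corruption (notation illustrative).
W_2\!\left(q_{\mathrm{clean}},\, q_{\mathrm{corrupt}}\right)
  \;\le\; C\, p\, \sqrt{d} \quad \text{(structured)}
\qquad \text{vs.} \qquad
  C\, p\, \sqrt{D} \quad \text{(isotropic)},
\qquad d \ll D.
```

The entropy, mixing-time, and Rademacher-complexity bounds inherit the same d-for-D substitution.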
These analytical results explain the empirical efficiency and robustness gains seen across tasks.
Practical and Theoretical Implications
Practically, CAT-LVDM renders LVDMs robust to realistic input corruptions and web-scale noisy supervision, reducing the need for massive clean datasets. By aligning injected noise with the batch's semantic geometry or spectral energy, CAT-LVDM regularizes LVDMs for both generative fidelity and downstream video-language tasks, with minimal additional compute.
Theoretically, CAT-LVDM reframes the role of noise in diffusion pretraining—from a necessary evil to a controlled, geometry-aware regularizer that compresses every key risk bound from O(D) to O(d) scaling. This framework may be directly extendable to sequential reasoning, reinforcement learning, and embodied agents, where compounding error underlies performance collapse.
Limitations include lack of validation on very long-form, 3D, or ultra-HD videos, and the open question of interaction with varying encoder qualities and massive-scale multimodal data. Future work will investigate adaptive, end-to-end corruption operator learning, scaling to video transformers and multimodal LLMs, and explicit modeling of sequential entropy in embodied agents.
Conclusion
CAT-LVDM establishes a new paradigm for robust text-to-video generation via structured, data-aligned corruption. By aligning training perturbations with the intrinsic structure of video data, it yields not only strong empirical gains—and in several cases, state-of-the-art sample efficiency and robustness—but also a unified theoretical account of dimension-compressed generalization in video diffusion. The framework's extensibility to diverse generative and multimodal backbones suggests wide utility for future research on resilience under imperfect supervision.
Reference:
"Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation" (arXiv:2505.21545)