
MOVA: Towards Scalable and Synchronized Video-Audio Generation

Published 9 Feb 2026 in cs.CV and cs.SD (arXiv:2602.08794v2)

Abstract: Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.

Summary

  • The paper presents an open-source dual-tower diffusion transformer for synchronized video-audio generation, achieving robust lip-sync and semantic alignment.
  • It integrates a 14B video diffusion transformer and a 1.3B audio DiT via a bidirectional bridge module with modified RoPE for precise temporal alignment.
  • Empirical results demonstrate significant improvements in audiovisual fidelity and multimodal alignment over cascaded and proprietary systems.

MOVA: Scalable Synchronized Video-Audio Generation via Asymmetric Dual-Tower Diffusion

Motivation and Problem Formulation

The integration of audio and video in generative modeling is an unresolved challenge due to modality-specific complexities and bidirectional alignment requirements. Existing video generation models, including diffusion-based transformers such as Sora, Wan, and OpenSora, have advanced visual fidelity, long-form temporal consistency, and semantic controllability, but fail to capture the indispensable audio component. Cascaded pipelines (e.g., video-to-audio or audio-to-video) propagate alignment errors, incur increased computational overhead, and restrict cross-modal interaction, resulting in suboptimal multimodal synthesis. Proprietary end-to-end systems (Veo 3, Sora 2) demonstrate joint generation capabilities, but their closed-source nature limits reproducibility and benchmarking, stalling research progress.

MOVA directly addresses these gaps by introducing an open-source, scalable diffusion transformer for joint video-audio generation, explicitly targeting synchronized lip-synced speech, environmental sound effects, and content-aligned music. The model is designed for the IT2VA (Image-Text to Video-Audio) task; the released weights and code include comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement, positioning MOVA as a community baseline for audio-visual modeling.

Model Architecture: Asymmetric Dual-Tower Diffusion Transformer

MOVA implements a dual-tower architecture composed of a 14B-param video diffusion transformer (Wan2.2 I2V A14B backbone) and a 1.3B-param audio DiT, coupled via a 2.6B bidirectional Bridge module. Both towers operate on latent spaces defined by pretrained VAEs: Wan2.1 video VAE for spatiotemporal compression and a DAC-style audio VAE from HunyuanVideo-Foley for audio representation. The Bridge module integrates hidden states from each modality through cross-attention at every interaction layer, facilitating rich bidirectional information transfer.

Temporal token alignment, critical for synchronization, is achieved by modifying RoPE positional encoding: video indices are rescaled to match dense audio token timelines, ensuring cross-modal queries/keys correspond to consistent physical times and mitigating drift (see [18] for comparative approaches). Dual sigma-shift noise scheduling enables independent and modality-specific denoising trajectories during flow-matching-based training and inference, decoupling the innate complexities and token densities of video and audio streams.
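The RoPE rescaling described above can be sketched as an index mapping onto a shared physical timeline. The token rates below are illustrative assumptions (the summary does not fix them); only the rescaling logic is the point:

```python
# Sketch: rescale video RoPE indices onto the denser audio token timeline so
# that cross-modal attention compares positions in physical time.
# Token rates are illustrative assumptions, not the paper's exact values.

VIDEO_TOKEN_HZ = 6.0    # e.g. 24 fps with 4x temporal VAE compression
AUDIO_TOKEN_HZ = 50.0   # e.g. a DAC-style audio latent rate

def rescaled_video_positions(num_video_tokens: int) -> list:
    """Map each video token index into the audio timeline's position units."""
    return [(i * AUDIO_TOKEN_HZ) / VIDEO_TOKEN_HZ for i in range(num_video_tokens)]

def audio_positions(num_audio_tokens: int) -> list:
    """Audio tokens keep their native integer positions."""
    return [float(i) for i in range(num_audio_tokens)]

# One second of content: 6 video tokens, 50 audio tokens.
v = rescaled_video_positions(6)
a = audio_positions(50)
# Video token 3 and audio token 25 both sit at t = 0.5 s and now share
# the same positional index, so their RoPE phases agree.
print(v[3], a[25])  # 25.0 25.0
```

Because both streams index the same timeline, a video query attending to an audio key sees a rotation offset proportional to their true temporal distance, which is what prevents drift.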

Classifier-Free Guidance (CFG) is generalized to dual conditioning, allowing fine-grained control over text and cross-modal alignment via tunable scaling parameters. This dual CFG architecture provides explicit control over the trade-off between semantic fidelity, lip-synchronization, and instruction following at inference.
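A common way to realize such dual conditioning is to combine three denoiser branches with two independent scales. The parameterization below is a sketch of that generic formulation, not necessarily MOVA's exact one:

```python
# Sketch of dual classifier-free guidance: one scale for text conditioning,
# one for cross-modal conditioning. Scalars stand in for denoiser output
# tensors; the three-branch layout is an assumed, common dual-CFG form.

def dual_cfg(v_uncond, v_text, v_full, s_text=5.0, s_cross=2.0):
    """Combine unconditional, text-only, and fully conditioned outputs."""
    return (v_uncond
            + s_text * (v_text - v_uncond)    # text-instruction guidance
            + s_cross * (v_full - v_text))    # cross-modal alignment guidance

# With both scales at 1.0 the combination collapses to the fully
# conditioned branch; raising either scale amplifies that direction.
print(dual_cfg(0.0, 1.0, 1.5, s_text=1.0, s_cross=1.0))  # 1.5
```

Tuning `s_text` against `s_cross` is exactly the semantic-fidelity versus lip-sync trade-off described above.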

Data Engineering and Captioning Pipeline

High-quality bimodal training data is a central requirement. MOVA curates >100,000 hours of synchronized video-audio content with fine-grained multimodal annotations. The data pipeline consists of:

  • Stage 1: Distributed preprocessing (Ray framework) for decoding, remuxing, aspect-ratio normalization, cropping, segmentation (8.05 s, 24 fps, 720p), and VAD/scene transition analysis.
  • Stage 2: Multidimensional quality assessment. Audiobox evaluates signal and aesthetic audio; DOVER assesses video technical and aesthetic quality; SynchFormer and ImageBind quantify audio-visual temporal and semantic alignment. Aggressive thresholds filter for high-fidelity, well-aligned samples.
  • Stage 3: Modality-specific captioning. MiMo-VL-7B (video) and Qwen3-Omni-Instruct/Captioner (audio/speech) generate detailed descriptions. GPT-OSS-120B merges single-modality captions, resolving cross-modal conflicts and producing unified, context-rich natural language narratives.

This pipeline retains only data with strong cross-modal correspondence, enabling robust phoneme-to-viseme mapping essential for accurate lip-sync and general sound alignment.
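The Stage-2 gating logic reduces to a per-metric threshold test. The metric names mirror the tools cited above; the cutoff values here are hypothetical placeholders, not the paper's actual thresholds:

```python
# Sketch of Stage-2 multidimensional filtering: a clip survives only if it
# clears every quality and alignment cutoff. Threshold values are
# hypothetical placeholders.

THRESHOLDS = {
    "audiobox_quality": 0.7,   # Audiobox: audio signal/aesthetic score
    "dover_quality": 0.6,      # DOVER: video technical/aesthetic score
    "sync_score": 0.5,         # SynchFormer: temporal alignment
    "imagebind_sim": 0.3,      # ImageBind: semantic alignment
}

def keep_clip(scores: dict) -> bool:
    """Retain a clip only if every metric meets or exceeds its cutoff."""
    return all(scores.get(k, 0.0) >= t for k, t in THRESHOLDS.items())

clips = [
    {"audiobox_quality": 0.9, "dover_quality": 0.8, "sync_score": 0.7, "imagebind_sim": 0.4},
    {"audiobox_quality": 0.9, "dover_quality": 0.8, "sync_score": 0.2, "imagebind_sim": 0.4},
]
kept = [c for c in clips if keep_clip(c)]
print(len(kept))  # 1 — the second clip fails the temporal-alignment gate
```

The conjunctive gate is what makes the thresholds "aggressive": weakness on any one axis discards the sample, which biases the corpus toward well-aligned pairs.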

Training and Optimization Strategies

Training proceeds in two stages:

  • Audio tower pretraining: Wan2.1-style DiT trained on diverse domains (WavCaps, VGGSound, JamendoMaxCaps, in-house TTS) with explicit duration control.
  • Joint training: End-to-end optimization of the video, audio, and Bridge modules with heterogeneous learning rates (Bridge: 2×10⁻⁵, towers: 1×10⁻⁵) to balance fast cross-modal convergence against preservation of the unimodal priors. Dual sigma-shift schedules let each modality follow its natural denoising trajectory: Phase 1 applies aggressive denoising for video and gradual denoising for audio, with elevated text dropout to force the Bridge to learn cross-modal cues; Phase 2 aligns the audio denoising and noise schedules for timbre fidelity; Phase 3 upscales to 720p with increased context parallelism and fine-tunes on the highest-quality data subset.
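The sigma-shift idea can be illustrated with the widely used timestep-shift warp t' = s·t / (1 + (s − 1)·t). Whether MOVA uses this exact warp is an assumption, and the per-modality shift values below are illustrative:

```python
# Sketch of a dual sigma-shift schedule, assuming the common flow-matching
# timestep warp t' = s*t / (1 + (s-1)*t). Different shifts per modality give
# video and audio independent denoising trajectories. Values are illustrative.

def shift_sigma(t: float, s: float) -> float:
    """Warp a flow-matching timestep t in [0, 1] by shift factor s."""
    return s * t / (1.0 + (s - 1.0) * t)

SHIFT_VIDEO = 5.0   # aggressive: concentrates steps at high noise levels
SHIFT_AUDIO = 1.5   # gradual: stays close to the unshifted schedule

t = 0.5
print(round(shift_sigma(t, SHIFT_VIDEO), 3),
      round(shift_sigma(t, SHIFT_AUDIO), 3))  # 0.833 0.6
```

The warp fixes the endpoints (t = 0 and t = 1 map to themselves) and only redistributes the interior steps, so the two modalities still start and finish denoising together.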

Optimization exploits FSDP parameter sharding and sequence parallelism with custom memory management, sustaining stable 1024-GPU runs at 35% model FLOPs utilization (MFU). Alternating optimization of the high- and low-noise branches in the MoE video tower ensures computational efficiency.
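The heterogeneous learning rates from the joint-training stage map naturally onto optimizer parameter groups. This is a minimal sketch with illustrative module-name prefixes:

```python
# Sketch of the heterogeneous learning-rate setup: the Bridge trains faster
# than the pretrained towers. The grouping mirrors a PyTorch param-group
# config; module-name prefixes are illustrative assumptions.

LR_BRIDGE = 2e-5   # fast cross-modal convergence
LR_TOWER = 1e-5    # preserve unimodal priors

def param_groups(named_params):
    """Split parameter names by module prefix into per-group learning rates."""
    bridge = [n for n in named_params if n.startswith("bridge.")]
    towers = [n for n in named_params if not n.startswith("bridge.")]
    return [
        {"params": bridge, "lr": LR_BRIDGE},
        {"params": towers, "lr": LR_TOWER},
    ]

groups = param_groups(["bridge.attn.w", "video.block0.w", "audio.block0.w"])
print(groups[0]["lr"], len(groups[1]["params"]))  # 2e-05 2
```

In a real PyTorch run the same list-of-dicts structure would be passed directly to the optimizer constructor, so only the grouping logic differs across setups.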

Generation and Inference Protocol

MOVA offers a multi-stage inference workflow tailored for IT2VA and T2VA generation. Visual grounding extraction via Qwen3-VL provides structured descriptions (style, cinematography, elements, OCR text), facilitating prompt enhancement through LLMs (e.g., Gemini 2.5). Synthesized prompts integrate static attributes and temporal dynamics, bridging distributional gaps between user inputs and training data. MOVA then leverages dual conditioning for temporally consistent video and audio generation. The workflow supports zero-shot synthesis for text-only prompts with placeholder images, showcasing emergent capabilities in unrestricted T2VA scenarios.

Dual CFG allows the trade-off between text fidelity and cross-modal alignment to be tuned at inference, with two- or three-branch guidance regimes trading flexibility against the number of function evaluations (NFE).

Empirical Evaluation and Quantitative Results

MOVA is systematically evaluated on Verse-Bench and a custom benchmark covering diverse categories (multi-speaker, movie, sports, games, camera motion, anime). Metrics include:

  • Audio fidelity/diversity: Inception Score (IS) up to 4.269 (360p), DNSMOS up to 3.797.
  • Semantic alignment: IB-Score (0.315), DeSync (0.351 with dual CFG), substantial improvements over Ovi [18] and LTX-2 [66].
  • Lip-synchronization: LSE-C (7.800) and LSE-D (7.004, dual CFG), outperforming baselines.
  • Speaker identity attribution: cpCER (0.149, 720p), demonstrating robust multi-speaker attribution.
  • Emergent T2VA capability: Text-only conditioning (MOVA-360p-T2VA) surpasses image-conditioned baseline in IS and DeSync.

Human preference studies via Arena-style evaluation confirm MOVA's superiority: win rates exceed 50% and Elo ratings exceed 1113.8 against WAN+MMAudio cascades, Ovi, and LTX-2 baselines.

Ablation results demonstrate the impact of scaling (360p→720p), dual CFG parameters, and prompt refinement. Dual CFG improves alignment and lip-sync metrics but induces conditional interference that degrades instruction following on multi-speaker sequences as the guidance scale increases.

Implications, Limitations, and Future Directions

MOVA establishes an open, scalable baseline for joint video-audio generation, with practical implications for real-world multimedia content creation, avatar animation, dubbing, and synchronous audiovisual synthesis. The architecture and training protocols demonstrate that large-scale, unified bimodal diffusion models achieve competitive or superior synchronization and perceptual quality to modular or proprietary systems.

Limitations include constrained audio modeling capacity for singing/music and complex sound textures, sequence length bottlenecks, and residual errors in multi-speaker synchronization due to covert speaker transitions and annotation reliability. Future research directions include hierarchical and blockwise generation, improved compression schemes for long-context modeling, more advanced audio tower architectures, and enhanced active-speaker detection.

On the theoretical side, decoupled noise schedules and dual CFG reveal intricate dependencies between modality-specific priors and alignment signals. There is scope for explicit physical reasoning and causal structure enforcement, particularly in event-driven synchronization scenarios and physics-aware audiovisual modeling.

Conclusion

MOVA introduces a scalable, open-source dual-tower diffusion transformer for synchronized video-audio generation, coupled with a bidirectional Bridge and temporally aligned RoPE. Advanced data curation and captioning pipelines provide rich multimodal supervision. Empirical results demonstrate strong numerical performance across audio fidelity, alignment, lip-sync, and speaker attribution metrics, with evidence of emergent capabilities in zero-shot T2VA generation. MOVA addresses crucial challenges in modeling, scaling, and data, and provides system-level optimizations for high-throughput training. By releasing weights and code, MOVA enables reproducible benchmarking, facilitating the advancement of joint audiovisual generation architectures in multimodal generative AI.
