SkyReels-V3: Unified Multimodal Video Generation

Updated 1 February 2026
  • SkyReels-V3 is a unified multimodal video generation system that synthesizes videos from images, text, and audio, ensuring temporal coherence and high fidelity.
  • It leverages a diffusion Transformer backbone with progressive denoising and cross-attention to handle tasks such as image-to-video synthesis, video continuation, and audio-guided avatar generation.
  • Benchmark evaluations show state-of-the-art performance in visual quality, temporal consistency, and instruction adherence, making it a valuable tool for cinematic synthesis and realistic avatar creation.

SkyReels-V3 is a conditional video generation system built upon a unified multimodal in-context learning framework employing diffusion Transformers. It supports three principal generative tasks—reference image-to-video synthesis, video-to-video extension, and audio-guided talking avatar generation—within a single architecture. The framework aims to enable high-fidelity, temporally coherent, and context-controllable video creation, approaching the performance of leading closed-source systems in quantitative and human-evaluated benchmarks (Li et al., 24 Jan 2026).

1. Unified Multimodal In-Context Learning Framework

SkyReels-V3 is centered on a multimodal in-context paradigm, enabling flexible combinations of visual references (images or videos), textual prompts, and audio waveforms as conditioning inputs. The model ingests a sequence of latent representations $z_0, \ldots, z_T \in \mathbb{R}^{H \times W \times C}$ corresponding to frames and applies a progressive denoising process via a Transformer-based U-Net backbone. Conditioning information is encapsulated in $c = \{c_{img}, c_{txt}, c_{audio}\}$.

The forward corruption process follows a fixed-variance diffusion schedule:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\; \sqrt{\alpha_t}\, z_{t-1},\; (1-\alpha_t)I\right)$$

and the reverse denoising model:

$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\; \mu_\theta(z_t, t, c),\; \Sigma_\theta(z_t, t, c)\right)$$

Each Transformer block incorporates a cross-attention stage: at every step $t$, latents $z_t$ are projected to tokens, self-attended, and then cross-attended against modality embeddings. The embeddings are encoded as follows: text with a frozen CLIP-like encoder, images/videos via a video VAE, and audio with a lightweight speech encoder. These are fused through a learned "token-fusion" head prior to denoising.
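The forward corruption process above can be simulated directly. The following numpy sketch (toy latent shapes and an illustrative noise schedule, not the paper's actual values) samples $z_t$ from $z_0$ using the closed form implied by the fixed-variance schedule:

```python
import numpy as np

def forward_diffuse(z0, alphas, t, rng):
    """Sample z_t ~ q(z_t | z_0) via the closed form
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of the schedule."""
    alpha_bar = np.prod(alphas[: t + 1])
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8, 3))   # toy latent: 4 frames, 8x8 spatial, C=3
alphas = np.linspace(0.999, 0.95, 50)    # illustrative fixed-variance schedule
zt, eps = forward_diffuse(z0, alphas, t=49, rng=rng)
```

With $\alpha_t$ near 1 little noise is injected per step; as $t$ grows, $z_t$ approaches pure Gaussian noise, which is what the reverse model learns to invert.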

2. Diffusion Transformer Backbone

The denoiser, denoted $\epsilon_\theta$, utilizes a stacked, U-shaped Diffusion Transformer (DiT) designed for video latents at variable resolutions $r \in \{64, 128, 256, 512\}$. At each resolution, the latent tensor $z_t^r \in \mathbb{R}^{h_r \times w_r \times C}$ is tokenized and embedded with positional information. Generic Transformer blocks operate sequentially through self-attention, cross-attention with conditioning inputs, and MLP-residual connections, with sinusoidal time embeddings injected throughout.
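A single block of this kind can be sketched in numpy as follows; the shapes, weight names, and single-head attention are illustrative simplifications, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no masking).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def dit_block(tokens, cond_tokens, p):
    """Self-attention over latent tokens, cross-attention against
    conditioning embeddings, then an MLP, each with a residual add."""
    x = tokens + attention(tokens @ p["wq"], tokens @ p["wk"], tokens @ p["wv"])
    x = x + attention(x @ p["wq_c"], cond_tokens @ p["wk_c"], cond_tokens @ p["wv_c"])
    return x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]

rng = np.random.default_rng(1)
d, n, m = 8, 6, 4                         # toy dims: model, latent tokens, cond tokens
p = {k: 0.1 * rng.standard_normal((d, d))
     for k in ["wq", "wk", "wv", "wq_c", "wk_c", "wv_c"]}
p["w1"] = 0.1 * rng.standard_normal((d, 4 * d))
p["w2"] = 0.1 * rng.standard_normal((4 * d, d))
out = dit_block(rng.standard_normal((n, d)), rng.standard_normal((m, d)), p)
```

The cross-attention stage is where the fused $c = \{c_{img}, c_{txt}, c_{audio}\}$ tokens enter: latent tokens supply queries, conditioning embeddings supply keys and values.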

Training employs the DDPM loss with classifier-free guidance:

$$L(\theta) = \mathbb{E}_{z_0, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; c\right)\right\|^2\right]$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Sampling interpolates denoiser outputs conditioned and unconditioned on $c$ to enforce stronger adherence to input conditions.
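The guidance interpolation at sampling time reduces to a single linear combination; a minimal sketch (the weight `w` and function name are illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w.
    w = 0 ignores c entirely; w > 1 over-emphasizes the conditions."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At each denoising step the model is evaluated twice, once with $c$ and once with $c$ dropped, and the two noise predictions are combined this way before the update.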

3. Core Generative Paradigms

SkyReels-V3 addresses three generative objectives:

3.1 Reference Image-to-Video Synthesis

Given $K$ reference images and a textual prompt $c_{txt}$, SkyReels-V3 synthesizes sequences of up to 30 s (720 frames at 24 fps). The data pipeline comprises:

  • Cross-frame pairing: selecting K frames from videos to maximize pose/appearance diversity;
  • Image editing & semantic rewriting: subject regions are isolated/inpainted, and backgrounds restyled using text-to-image models to mitigate copy-paste artifacts;
  • Quality filtering: frames with excessive distortion are removed using a trained filter.
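The cross-frame pairing step can be approximated with greedy farthest-point selection over per-frame embeddings. This is a plausible sketch of "maximize pose/appearance diversity", not the paper's exact selection rule:

```python
import numpy as np

def pick_diverse_frames(embeddings, k):
    """Greedily pick k frames whose embeddings are maximally spread out
    (farthest-point sampling), a simple diversity proxy that avoids
    selecting near-duplicate frames."""
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))          # frame farthest from all chosen
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```

On 1-D toy embeddings [0, 0.1, 5, 10], picking k = 3 selects the extremes and the midpoint rather than the near-duplicate pair.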

Hybrid training mixes samples drawn from 1.4B web images and 10M video clips within each batch, employing 2D denoising for images and full 3D attention for video segments. Multi-resolution joint optimization trains over all four target sizes, with the total loss summed over active resolutions.

Temporal and identity consistency are enforced through cross-frame pairing and hybrid training. Quantified metrics include:

| Metric | Value |
| --- | --- |
| Reference Consistency | 0.6698 |
| Instruction Following | 27.22 |
| Visual Quality (MOS) | 0.8119 |

Reference Consistency is measured as the cosine similarity of per-frame FaceNet embeddings, Instruction Following as a CLIP score, and Visual Quality as a human mean opinion score (Li et al., 24 Jan 2026, Table 3.1).
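The consistency metric described here is straightforward to compute once identity embeddings are available; a sketch assuming FaceNet-style embedding vectors:

```python
import numpy as np

def reference_consistency(frame_embs, ref_emb):
    """Mean cosine similarity between each generated frame's identity
    embedding and the reference image's embedding."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(np.mean(frames @ ref))
```

A score of 1.0 means every frame's identity vector points the same way as the reference; drift in subject identity pulls the average down.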

3.2 Video-to-Video Extension

Given an input video of $N$ frames and a prompt $c_{txt}$, two extension modes are supported:

  • Single-shot continuation: seamless extrapolation preserving camera angle and motion;
  • Multi-shot switching: transitions mimicking professional cinematography (cut-in, cut-out, shot/reverse, multi-angle, cut-away).

A learned CNN+Transformer module detects shot boundaries and tags transition patterns, passing this segment marker via positional encoding. A unified multi-segment positional encoding, with an index $s$ identifying shot segments, supports hierarchical clips of up to 3 segments. Five shot-switching templates correspond to learned transition kernels that bias denoising interpolation at segment boundaries.
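One plausible realization of a multi-segment positional encoding concatenates a sinusoidal code over the frame index with one over the segment index $s$; this is an illustrative construction, not the paper's exact scheme:

```python
import numpy as np

def sincos(vals, d):
    # Standard sinusoidal encoding: d/2 sine and d/2 cosine channels.
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    ang = np.asarray(vals, dtype=float)[:, None] * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def multi_segment_pe(frame_ids, segment_ids, dim):
    """Concatenate a frame-position code with a shot-segment code so the
    model can distinguish both a frame's place in time and which shot
    segment it belongs to."""
    half = dim // 2
    return np.concatenate([sincos(frame_ids, half), sincos(segment_ids, half)], axis=1)
```

Two frames at the same temporal position but in different shot segments then receive distinct encodings, which is what lets transition kernels act specifically at segment boundaries.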

3.3 Audio-Guided Talking-Avatar Generation

For talking-avatar generation, given a portrait $c_{img}$, an audio waveform $c_{audio}$ (≤ 1 minute), and an optional transcript $c_{txt}$, the model synthesizes a temporally consistent video:

  • First-and-last frame insertion: synthetic anchor frames at the first and last positions are masked to the reference identity and held fixed during denoising;
  • Key-frame inference: key audio frames are identified at phoneme boundaries and matched with the output of a trained lip-sync network;
  • Audio-visual synchronization: auxiliary loss encourages correlation between generated lip landmarks and expected mouth-shape phoneme embeddings.
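The first-and-last-frame anchoring can be emulated by re-imposing the fixed frames after every denoising step, in the style of inpainting samplers. The toy `denoise_step` below is a stand-in for the real model:

```python
import numpy as np

def denoise_with_anchors(z, anchors, mask, steps, denoise_step):
    """Iterative denoising in which anchor frames (mask == True) are reset
    to their fixed values after every step, so the first and last frames
    stay pinned to the reference identity throughout sampling."""
    for t in range(steps, 0, -1):
        z = denoise_step(z, t)
        z = np.where(mask, anchors, z)   # re-impose the fixed anchor frames
    return z

rng = np.random.default_rng(2)
z = rng.standard_normal((5, 4, 4))       # 5 toy frames of 4x4 latents
anchors = np.zeros_like(z)
anchors[0], anchors[-1] = 1.0, -1.0      # fixed first/last anchor frames
mask = np.zeros((5, 1, 1), dtype=bool)
mask[0] = mask[-1] = True
out = denoise_with_anchors(z, anchors, mask, 10, lambda z, t: 0.9 * z)
```

Only the interior frames evolve; the anchors survive every step unchanged, which is what propagates the reference identity into the generated middle of the clip.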

Performance on internal benchmarks:

| Metric | Value |
| --- | --- |
| Audio–Visual Sync (out of 10) | 8.18 |
| Visual Quality (MOS) | 4.60 |
| Character Consistency | 0.80 |

(Li et al., 24 Jan 2026, Table 3.2).

4. Model Training Procedure

Training utilizes large-scale datasets:

  • Images: 1.4B web images (e.g., LAION-5B filtered)
  • Videos: 10M in-house clips, 1M public clips (WebVid, HowTo100M)
  • Audio: 500 hours of speech/singing (LibriSpeech, MEAD)

Optimization employs AdamW (standard momentum coefficients $\beta_1$, $\beta_2$ with weight decay) on 16,384 A100 GPUs over 200B frames/tokens. The learning rate follows linear warmup to a peak value, then cosine decay. Each GPU's batch includes 32 videos (16–32 frames each) and 32 images. Data augmentations involve random cropping/flipping, color jitter, and temporal jitter.
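The warmup-to-cosine schedule has this standard shape; the step counts and peak value below are placeholders, not the paper's settings:

```python
import math

def lr_at(step, warmup_steps, total_steps, peak_lr):
    """Linear warmup to peak_lr over warmup_steps,
    then cosine decay to zero by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The warmup phase avoids unstable early updates at large batch sizes, and the cosine tail anneals the model smoothly into its final weights.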

Training milestones include:

  • 4 weeks image-only pretraining,
  • 2 weeks image–video hybrid,
  • 1 week video-only fine-tuning on multi-shot transitions,
  • 1 week audio-aligned fine-tuning for avatars.

5. Performance Evaluation and Benchmarking

SkyReels-V3 is benchmarked on both proprietary and public datasets, measuring:

  • Reference Consistency (cosine identity embedding similarity)
  • Instruction Following (CLIP-based score)
  • Visual Quality (FID, FVD, human MOS)
  • Audio–Visual Sync (landmark–phoneme correlation)
  • Shot-switch quality (human continuity rating)
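As an example of the sync metric's flavor, a landmark–phoneme correlation can be scored with a plain Pearson coefficient over aligned trajectories; this is an illustrative proxy, not the benchmark's exact definition:

```python
import numpy as np

def av_sync_score(lip_traj, phoneme_traj):
    """Pearson correlation between a generated lip-landmark trajectory
    and the expected mouth-shape trajectory derived from phonemes."""
    a = lip_traj - lip_traj.mean()
    b = phoneme_traj - phoneme_traj.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A score near 1 indicates lip motion tracks the audio; lag or dropped visemes push it toward 0.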

Comparative results indicate SkyReels-V3 matches or surpasses leading closed-source models (e.g., Sora, Seedance, Veo) across all core video generation tasks, reporting 2–5% relative gains in visual quality and 0.3-point improvements in sync accuracy (Li et al., 24 Jan 2026, Tables 3.1 and 3.2).

6. Component Ablation and System Contributions

Ablation studies were conducted to quantify the impact of principal components:

| Component | Impact when removed/ablated |
| --- | --- |
| Hybrid image–video training | –4% visual quality |
| Multi-resolution joint loss | –2% reference consistency (512p) |
| Cross-frame pairing pipeline | –7% identity preservation |
| Shot-switch positional encoding | –0.12 MOS in cut transitions |
| Key-frame lip sync constraint | –0.5 sync score |

These findings affirm measurable contributions from multimodal in-context fusion, hierarchical spatiotemporal modeling, and specialized audio–visual alignment modules. Each component is instrumental for achieving fidelity, coherence, and controllability in generative outcomes.

7. Context and Significance within Video Generation Research

SkyReels-V3 demonstrates that a unified conditional generative system, leveraging diffusion Transformers with carefully designed pipelines and training regimes, can achieve state-of-the-art or near state-of-the-art video generation under diverse, multimodal prompt conditions. Its ability to sustain subject identity, comply with narrative instructions, enforce cinematographic transitions, and synchronize audio–visual content highlights its relevance as an open-source reference for cinematic-quality generative modeling (Li et al., 24 Jan 2026). Its architectural and methodological advances address major challenges in video synthesis, compositional stability, and instruction-conditioned fidelity, providing a scalable foundation for future research in world modeling and context-aware generative systems.
