SkyReels-V2: Infinite-length Film Generative Model

Published 17 Apr 2025 in cs.CV | (2504.13074v3)

Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal LLM (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.


Summary

The paper introduces SkyReels-V2, a generative model that produces infinite-length, high-resolution videos with consistent cinematic quality and detailed shot grammar.
It employs a multi-stage pretraining pipeline and a novel structured captioning strategy to enhance temporal coherence and prompt adherence.
The framework integrates rigorous data curation and motion-specific reinforcement learning to mitigate visual distortions and error accumulation in extended video synthesis.

SkyReels-V2: A Framework for Infinite-Length, Cinematic-Quality Video Generation

Introduction and Motivation

SkyReels-V2 introduces an open-source generative model designed to produce virtually unlimited-length, high-resolution, and cinematically consistent video. This work responds to persistent limitations of prior video generation methods, particularly regarding prompt adherence (especially to shot-scenario descriptions), motion dynamics, temporal consistency, and extended sequence scalability. Existing systems often optimize either visual fidelity or temporal coherence, but fail to harmonize both, especially for film-style outputs that demand complex shot grammar, entity consistency, and dynamic motion. SkyReels-V2 proposes architectural, data, and training innovations to advance this frontier.

Figure 1: SkyReels-V2 produces cinematic, distortion-free, and visually consistent high-resolution videos of virtually unlimited length, excelling at maintaining the main subject’s integrity across all frames.

Methodology

Data Processing and Curation Pipeline

A robust data pipeline underpins model performance. It draws on large-scale, diverse video sources (commercial films, TV series, web-mined content) and applies systematic filtering for quality and genre balance, subtitle and logo removal, shot segmentation, and intensive human-in-the-loop validation. Subtitle and logo overlays, which are prevalent in media content, are located with OCR and logo detectors and then cropped out so that the usable frame area is maximized.

Figure 3: Schematic of the multi-stage data processing pipeline, emphasizing automated filtering and human validation for quality assurance.

Figure 2: Subtitle and logo cropping pipeline—candidate regions are detected, then maximal interior rectangles beyond detected overlays are found and cropped.
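
The cropping step can be illustrated with a small sketch. It assumes the detectors have already produced overlay boxes in pixel coordinates; the function name and the brute-force search are illustrative choices, not the pipeline's actual implementation. Because the largest overlay-free rectangle always has its edges on the frame border or on an overlay edge, it is enough to search over those coordinates.

```python
from itertools import combinations

def largest_clean_crop(width, height, overlays):
    """Return the largest axis-aligned crop (x0, y0, x1, y1) that avoids all overlays.

    overlays: list of (x0, y0, x1, y1) boxes in pixels (hypothetical detector output).
    """
    # Candidate crop edges: the frame border plus every overlay edge inside the frame.
    xs = sorted({0, width} | {x for b in overlays for x in (b[0], b[2]) if 0 < x < width})
    ys = sorted({0, height} | {y for b in overlays for y in (b[1], b[3]) if 0 < y < height})
    best, best_area = (0, 0, 0, 0), 0
    for x0, x1 in combinations(xs, 2):
        for y0, y1 in combinations(ys, 2):
            area = (x1 - x0) * (y1 - y0)
            if area <= best_area:
                continue
            # Reject the candidate if it intersects any overlay box.
            overlaps = any(x0 < b[2] and b[0] < x1 and y0 < b[3] and b[1] < y1
                           for b in overlays)
            if not overlaps:
                best, best_area = (x0, y0, x1, y1), area
    return best

# A 1280x720 frame with a subtitle band near the bottom and a logo at top right.
crop = largest_clean_crop(1280, 720, [(200, 640, 1080, 700), (1100, 20, 1260, 80)])
```

The search is quartic in the number of candidate edges, which is acceptable for the handful of overlays found in a typical frame; a production pipeline would likely use something faster.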

Concept balance during the post-training data curation reduces bias toward prevalent subject categories, ensuring improved generalization and facilitating effective supervised fine-tuning.

Figure 4: Distribution comparison between unbalanced (left) and concept-balanced (right) training data, demonstrating enhanced category uniformity.
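
One simple way to obtain a more uniform category distribution is to cap the number of samples kept per concept. The sketch below shows that idea only; the summary does not state how SkyReels-V2 balances concepts, so the capping rule, the function name, and the (sample_id, concept) input format are all assumptions.

```python
import random
from collections import defaultdict

def balance_by_concept(samples, cap, seed=0):
    """Downsample over-represented concepts to at most `cap` samples each.

    samples: list of (sample_id, concept) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    by_concept = defaultdict(list)
    for sample_id, concept in samples:
        by_concept[concept].append(sample_id)
    balanced = []
    for concept in sorted(by_concept):
        ids = by_concept[concept]
        # Rare concepts are kept whole; frequent ones are randomly subsampled.
        kept = ids if len(ids) <= cap else rng.sample(ids, cap)
        balanced.extend((sample_id, concept) for sample_id in kept)
    return balanced

samples = ([(i, "person") for i in range(10)]
           + [(100, "animal"), (101, "animal"), (200, "vehicle")])
balanced = balance_by_concept(samples, cap=3)
```

Capping trades data volume for uniformity; reweighting or upsampling rare concepts are alternatives when discarding data is undesirable.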

Structural Video Captioning

One central advancement is the shift to a structured video captioning approach, fusing hierarchical MLLM-based descriptions with specialized shot-type, expression, and camera motion experts. This yields rich, interpretable representations critical for both model conditioning and downstream prompt adherence.

Figure 5: Visual depiction of the structural caption, integrating subject, shot, expression, and camera movement fields.

SkyCaptioner-V1, the unified captioner, is distilled from domain-general and domain-specific models. It reaches an average accuracy of 76.3% across caption fields and 93.7% on the shot-type field, outperforming existing SOTA captioners on multiple fields pertinent to cinematic grammar.

Multi-Stage Pretraining and Enhancement

The base video generator is trained with three-stage, progressive-resolution pretraining, moving from 256p to 540p, jointly on videos and concept-balanced images. A dual-axis bucketing strategy over duration and aspect ratio, together with FPS normalization, handles the spatiotemporal heterogeneity of the training clips. Flow matching is used as the generative objective, chosen to model complex temporal distributions efficiently.
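
Bucketing over duration and aspect ratio can be sketched as a nearest-bucket assignment on each axis, so that clips sharing a bucket can be batched at one shape. The bucket values and function names below are hypothetical; the summary does not give the boundaries actually used.

```python
from collections import defaultdict

# Hypothetical bucket definitions, for illustration only.
DURATION_BUCKETS_S = (2.0, 4.0, 8.0)
ASPECT_BUCKETS = {"9:16": 9 / 16, "1:1": 1.0, "16:9": 16 / 9}

def assign_bucket(duration_s, width, height):
    """Map a clip to its nearest (duration, aspect-ratio) bucket."""
    aspect = width / height
    duration = min(DURATION_BUCKETS_S, key=lambda d: abs(d - duration_s))
    name = min(ASPECT_BUCKETS, key=lambda k: abs(ASPECT_BUCKETS[k] - aspect))
    return duration, name

def group_clips(clips):
    """Group (clip_id, duration_s, width, height) tuples by bucket."""
    buckets = defaultdict(list)
    for clip_id, duration_s, width, height in clips:
        buckets[assign_bucket(duration_s, width, height)].append(clip_id)
    return dict(buckets)

groups = group_clips([("a", 4.2, 1280, 720), ("b", 3.9, 1920, 1080), ("c", 7.0, 512, 512)])
```

In this example clips "a" and "b" land in the same 4-second, 16:9 bucket and could share a batch, while "c" goes to the 8-second, 1:1 bucket.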

Post-Training Optimization

Key advancements are realized in four post-training stages:

Concept-balanced SFT (540p) for initialization
Motion-specific Reinforcement Learning via Direct Preference Optimization (DPO) on human/model-annotated motion-distortion pairs
Diffusion Forcing Training: converts the full-sequence diffusion model into a diffusion-forcing transformer in which each frame receives its own noise level, with the levels constrained to be non-decreasing across frames; this enables variable-length, "infinite" rollouts, with adaptive denoising at inference for long-form, temporally consistent synthesis
High-Resolution SFT (720p) for final quality refinement
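
The non-decreasing constraint in the diffusion forcing stage can be made concrete with a toy sketch. It is not the scheduler used in the paper: the sorting-based sampler, the staggered inference schedule, and all parameter names are assumptions chosen only to show what "per-frame noise levels that never decrease along the frame axis" looks like.

```python
import random

def sample_training_noise_levels(num_frames, rng):
    """Draw one noise level in [0, 1) per frame, sorted so later frames are never cleaner."""
    return sorted(rng.random() for _ in range(num_frames))

def staggered_inference_schedule(num_frames, num_steps, delay):
    """Noise level of every frame at every denoising step.

    Frame i starts denoising `delay` steps after frame i - 1, so each row is
    non-decreasing along the frame axis: earlier frames are always at least as clean.
    """
    total_steps = num_steps + delay * (num_frames - 1)
    schedule = []
    for step in range(total_steps + 1):
        row = []
        for frame in range(num_frames):
            progress = (step - frame * delay) / num_steps
            progress = min(1.0, max(0.0, progress))
            row.append(1.0 - progress)  # 1.0 is pure noise, 0.0 is clean
        schedule.append(row)
    return schedule

levels = sample_training_noise_levels(8, random.Random(0))
schedule = staggered_inference_schedule(num_frames=4, num_steps=4, delay=2)
```

With delay set to 0 every frame shares one noise level, which recovers ordinary full-sequence denoising; larger delays move toward frame-by-frame autoregressive generation.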

Motion preference annotation leverages a hybrid pipeline—curating and distorting real videos to explicitly simulate failure modes such as entity distortion and physics violations—to construct effective DPO training pairs.
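
For reference, the standard DPO objective on a single preference pair is sketched below. It is the generic form over scalar log-likelihoods; for diffusion or flow-matching models these terms are usually replaced by a surrogate, and the summary does not specify which variant SkyReels-V2 uses. The value of beta is an arbitrary illustrative choice.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# A model that has shifted probability toward the undistorted clip gets a lower loss.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In the motion setting, the preferred sample would be the clean clip and the rejected sample its distorted counterpart.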

Figure 8: Example of V2V-induced slight facial corruption in generated video—typical of progressive motion-related artifacts addressed via motion-specific RL.

Model Performance

Systematic evaluation is conducted on both a proprietary benchmark (SkyReels-Bench) and a public one (V-Bench 1.0). In human assessments, SkyReels-V2 scores 3.15/5 on instruction adherence, 3.35/5 on visual consistency, and 3.34/5 on visual quality, ahead of other leading open and closed-source systems on these dimensions, with robust motion quality scores. On V-Bench, SkyReels-V2 attains the highest total score (83.9%) and quality score (84.7%), marginally ahead of strong open-source baselines. It trails slightly on the semantic score, a gap attributed to V-Bench's lower emphasis on shot grammar.

Figure 6: Long-form video (30s+) generated from a single prompt—demonstrates temporal extension and intra-shot consistency.

Figure 7: Ultra-long video generated using sequential prompts for dynamic narrative progression while maintaining subject consistency.

Applications

SkyReels-V2 supports several advanced generative settings:

Story Generation: Enables multi-prompt, seamless narrative videos of unprecedented length while mitigating error accumulation through stabilization of previous frames during autoregressive rollouts.
Image-to-Video (I2V) Synthesis: Two approaches, supervised fine-tuning conditioned on the input frame and first-frame conditioning in diffusion forcing, deliver state-of-the-art open-source results for preserving frame fidelity and subject identity.
Camera Director: Incorporates balanced and guided camera motion datasets for controllable, cinematic camera behaviors.
Elements-to-Video (E2V) Synthesis: SkyReels-A2 enables compositional, controllable video by combining reference images and textual prompts while preserving entity consistency.
Figure 9: Elements-to-video synthesis examples—compositions of multiple entity references and prompt-driven narrative.
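
The autoregressive rollout behind story generation can be pictured as a sequence of overlapping windows, where each new chunk is conditioned on the last few frames of the previous one. The planner below is a minimal sketch of that bookkeeping only; the function name and the chunk and overlap sizes are illustrative assumptions, and the actual generation and frame stabilization steps are not shown.

```python
def plan_rollout(total_frames, chunk_len, overlap):
    """Split a long video into overlapping generation windows.

    Returns a list of (start, end, context) tuples, where frames [start, end)
    are produced in one pass and the first `context` of them are already-generated
    frames reused as conditioning.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, total_frames)
        context = 0 if start == 0 else overlap
        windows.append((start, end, context))
        if end >= total_frames:
            break
        start = end - overlap  # the next window re-reads the tail of this one
    return windows

windows = plan_rollout(total_frames=250, chunk_len=97, overlap=17)
```

For multi-prompt stories, each window could be paired with its own text prompt while the overlapping frames carry subject identity across the boundary.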

Theoretical and Practical Implications

The integration of structured shot-aware captioning with multi-stage curriculum, motion-specific RL alignment, and diffusion-forcing inference establishes a new paradigm for aligning generative models with fine-grained cinematic conditioning and scalable generative fidelity. By uniting diffusion’s high spatial quality with autoregressive scalability (via diffusion-forcing and scheduling adaptations), the framework circumvents previous trade-offs—approximating the flexibility of LLM token generation at the frame level.

Practically, open-sourcing SkyReels-V2, its captioner, and elements-to-video codebases provides a resource for both research and industry to further probe, extend, and specialize film-quality video generation, supporting downstream applications in content creation, virtual cinematography, automated storyboarding, and beyond.

Limitations and Future Directions

Error accumulation remains a limitation in truly indefinite-duration rollouts, where artifacts compound over long autoregressive continuations. Handling cross-scene, character, and story consistency at feature-film scale will require improved modeling of long-term dependencies, possibly through hierarchical memory, scene-graph tracking, or learned retrieval. Extending the model to richer conditioning modalities (audio, pose, interactive editing) and further tightening physics-aware motion priors are promising research directions.

Conclusion

SkyReels-V2 delivers an empirically validated, publicly available framework for infinite-length, cinematic-caliber video generation. By addressing data, conditioning, alignment, and model architecture holistically, it reports the strongest prompt adherence, temporal and motion quality, and video fidelity among the open-source systems evaluated. Its innovations provide a reference point and open numerous avenues for research and deployment in generative video modeling (2504.13074).