SkyReels-V2: Infinite-length Film Generative Model

Published 17 Apr 2025 in cs.CV (arXiv:2504.13074v3)

Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal LLM (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

Summary

  • The paper introduces SkyReels-V2, a generative model that produces infinite-length, high-resolution videos with consistent cinematic quality and detailed shot grammar.
  • It employs a multi-stage pretraining pipeline and a novel structured captioning strategy to enhance temporal coherence and prompt adherence.
  • The framework integrates rigorous data curation and motion-specific reinforcement learning to mitigate visual distortions and error accumulation in extended video synthesis.

SkyReels-V2: A Framework for Infinite-Length, Cinematic-Quality Video Generation

Introduction and Motivation

SkyReels-V2 introduces an open-source generative model designed to produce virtually unlimited-length, high-resolution, and cinematically consistent video. This work responds to persistent limitations of prior video generation methods, particularly regarding prompt adherence (especially to shot-scenario descriptions), motion dynamics, temporal consistency, and extended sequence scalability. Existing systems often optimize either visual fidelity or temporal coherence, but fail to harmonize both, especially for film-style outputs that demand complex shot grammar, entity consistency, and dynamic motion. SkyReels-V2 proposes architectural, data, and training innovations to advance this frontier.

Figure 1: SkyReels-V2 produces cinematic, distortion-free, and visually consistent high-resolution videos of virtually unlimited length, excelling at maintaining the main subject’s integrity across all frames.

Methodology

Data Processing and Curation Pipeline

A robust data pipeline underpins model performance, incorporating large-scale, diverse video sources (commercial films, television, web-mined content), systematic filtering for quality and genre balance, subtitle and logo removal, shot segmentation, and intensive human-in-the-loop validation. Subtitle and logo overlays, prevalent in media content, are handled through spatial detection and cropping: OCR and logo detectors isolate overlay regions so that the usable frame area can be maximized.

Figure 3: Schematic of the multi-stage data processing pipeline, emphasizing automated filtering and human validation for quality assurance.

Figure 2: Subtitle and logo cropping pipeline—candidate regions are detected, then maximal interior rectangles beyond detected overlays are found and cropped.
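The overlay-avoiding crop described above can be sketched in a much-simplified form. The paper computes maximal interior rectangles; the version below only considers the four crops obtained by cutting the frame on each side of the overlays' union bounding box, which already handles the common case of a single subtitle band or corner logo. The function name and box convention are illustrative, not from the paper.

```python
def crop_away_overlays(frame_w, frame_h, overlays):
    """Pick the largest axis-aligned crop that avoids all overlay boxes.

    Simplified sketch: considers only the four crops obtained by cutting
    the frame above, below, left, or right of the overlays' union
    bounding box. overlays: list of (x0, y0, x1, y1) boxes from OCR /
    logo detection. Returns a (x0, y0, x1, y1) crop, or the full frame
    when no overlays were detected.
    """
    if not overlays:
        return (0, 0, frame_w, frame_h)
    ux0 = min(b[0] for b in overlays)
    uy0 = min(b[1] for b in overlays)
    ux1 = max(b[2] for b in overlays)
    uy1 = max(b[3] for b in overlays)
    candidates = [
        (0, 0, frame_w, uy0),        # strip above the overlays
        (0, uy1, frame_w, frame_h),  # strip below
        (0, 0, ux0, frame_h),        # strip left of them
        (ux1, 0, frame_w, frame_h),  # strip right of them
    ]
    def area(b):
        return max(0, b[2] - b[0]) * max(0, b[3] - b[1])
    return max(candidates, key=area)
```

For a bottom subtitle band on a 1080p frame, this keeps the large region above the band; a true maximal-rectangle search would additionally handle multiple scattered overlays.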

Concept balancing during post-training data curation reduces bias toward prevalent subject categories, improving generalization and enabling effective supervised fine-tuning.

Figure 4: Distribution comparison between unbalanced (left) and concept-balanced (right) training data, demonstrating enhanced category uniformity.
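One standard way to realize the concept balancing shown above is inverse-frequency resampling, where each clip is weighted by the reciprocal of its category count so every category contributes roughly equal probability mass. This is a generic sketch, not the paper's exact curation procedure.

```python
from collections import Counter

def balanced_weights(labels):
    """Inverse-frequency sampling weights so each concept category
    contributes roughly equal probability mass when clips are
    resampled for SFT. Returns per-item weights summing to 1."""
    counts = Counter(labels)
    raw = [1.0 / counts[label] for label in labels]
    total = sum(raw)
    return [w / total for w in raw]
```

With labels `["cat", "cat", "dog"]` the two cat clips each get weight 0.25 and the single dog clip gets 0.5, so both categories carry equal total mass.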

Structural Video Captioning

A central advancement is the shift to a structured video captioning approach, fusing hierarchical MLLM-based descriptions with specialized shot-type, expression, and camera-motion experts. This yields rich, interpretable representations critical for both model conditioning and downstream prompt adherence.

Figure 5: Visual depiction of the structural caption, integrating subject, shot, expression, and camera movement fields.

SkyCaptioner-V1, the unified captioner, leverages distillation from domain-general and domain-specific models, achieving superior accuracy (76.3% on average across fields, and 93.7% on the shot-type field) compared to existing SOTA captioners on the fields pertinent to cinematic grammar.
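The fusion of general and sub-expert outputs can be sketched as assembling one structured caption record. The field names below are illustrative placeholders; the paper's actual schema is richer (subjects, scene, shot metadata, and more).

```python
def build_structural_caption(general, shot_type, expression, camera_motion):
    """Fuse a general MLLM description with sub-expert fields into one
    structured caption string. Field names are illustrative; the
    paper's schema covers more fields (subjects, scene, etc.)."""
    fields = {
        "description": general,
        "shot_type": shot_type,
        "expression": expression,
        "camera_motion": camera_motion,
    }
    # Drop fields a sub-expert could not label, then serialize.
    return "; ".join(f"{k}: {v}" for k, v in fields.items() if v)
```

A clip with no visible face simply omits the expression field, keeping the caption well-formed for conditioning.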

Multi-Stage Pretraining and Enhancement

The base video generator is built with three-stage, progressive-resolution pretraining (moving from 256p to 540p), jointly on videos and concept-balanced images. Durations and aspect ratios are normalized using a dual-axis bucketing strategy together with FPS normalization to stabilize spatiotemporal heterogeneity. Flow matching is used as the generative objective to efficiently model complex temporal distributions.
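The flow matching objective can be written down compactly. Under the common linear-interpolation convention (one of several; the paper does not spell out its exact parameterization here), the model regresses the constant velocity between a data latent and a noise sample:

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolation flow matching: the noisy sample at time t
    and the velocity target the model regresses onto.
    x0: data latent, x1: Gaussian noise, t in [0, 1] (t=0 is data)."""
    x_t = (1.0 - t) * x0 + t * x1   # interpolant between data and noise
    v_target = x1 - x0              # constant velocity along the path
    return x_t, v_target

def flow_matching_loss(pred_v, v_target):
    """Mean-squared error between predicted and target velocity."""
    return float(np.mean((pred_v - v_target) ** 2))
```

At each training step, `t` is sampled uniformly, the model predicts a velocity from `x_t` and `t`, and the MSE above is minimized.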

Post-Training Optimization

Key advancements are realized in four post-training stages:

  • Concept-balanced SFT (540p) for initialization
  • Motion-specific Reinforcement Learning via Direct Preference Optimization (DPO) on human/model-annotated motion-distortion pairs
  • Diffusion Forcing Training: converts the full-sequence diffusion model into a diffusion-forcing transformer, enabling variable-length, "infinite" rollouts by assigning per-frame independent noise schedules under non-decreasing constraints, with adaptive denoising at inference for long-form, temporally consistent synthesis
  • High-Resolution SFT (720p) for final quality refinement
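The diffusion-forcing stage's key constraint, non-decreasing per-frame noise levels, can be sketched schematically: already-generated context frames stay clean while frames still being synthesized carry progressively more noise. The linear ramp below is an illustrative choice, not the paper's exact schedule.

```python
def frame_noise_levels(num_frames, num_clean, max_noise=1.0):
    """Assign per-frame noise levels under a non-decreasing constraint:
    the first `num_clean` frames (already-generated context) get zero
    noise, and noise rises linearly over the frames still being
    denoised. A schematic of the constraint, not the exact schedule."""
    levels = []
    gen = num_frames - num_clean  # number of frames being generated
    for i in range(num_frames):
        if i < num_clean:
            levels.append(0.0)
        else:
            # linear ramp from low noise up to max_noise across new frames
            levels.append(max_noise * (i - num_clean + 1) / gen)
    return levels
```

Because the schedule never decreases with frame index, earlier frames act as stable anchors during rollout, which is what permits variable-length, temporally consistent extension.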

Motion preference annotation leverages a hybrid pipeline that curates and deliberately distorts real videos to simulate failure modes such as entity distortion and physics violations, yielding effective DPO training pairs.
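The per-pair DPO objective used over these preference pairs has a standard closed form; the sketch below is the generic formulation, with log-likelihoods of the preferred and dispreferred videos under the policy and a frozen reference model as inputs.

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    logp_*: log-likelihoods of the preferred (w) and dispreferred (l)
    samples under the policy and a frozen reference model. Lower loss
    means the policy favors the preferred sample more strongly than
    the reference does."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

At initialization (policy equal to reference) the margin is zero and the loss is ln 2; raising the preferred sample's likelihood relative to the reference drives the loss down.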

Figure 8: Example of V2V-induced slight facial corruption in generated video—typical of progressive motion-related artifacts addressed via motion-specific RL.

Model Performance

Systematic evaluation is conducted on both proprietary (SkyReels-Bench) and public (V-Bench 1.0) benchmarks. SkyReels-V2 demonstrates superior instruction adherence (3.15/5), visual consistency (3.35/5), and visual quality (3.34/5) on human assessments compared to other leading open- and closed-source systems, with robust motion quality scores. On V-Bench, SkyReels-V2 attains the highest total and quality scores (83.9%, 84.7%), marginally outperforming strong open-source baselines and slightly trailing on purely semantic scores due to V-Bench's lower emphasis on shot grammar.

Figure 6: Long-form video (30s+) generated from a single prompt—demonstrates temporal extension and intra-shot consistency.

Figure 7: Ultra-long video generated using sequential prompts for dynamic narrative progression while maintaining subject consistency.

Applications

SkyReels-V2 supports several advanced generative settings:

  • Story Generation: Enables multi-prompt, seamless narrative videos of unprecedented length while mitigating error accumulation through stabilization of previous frames during autoregressive rollouts.
  • Image-to-Video (I2V) Synthesis: Two approaches—input-frame conditioned SFT-fine-tuning and first-frame conditioning in diffusion-forcing—deliver state-of-the-art open-source results for preserving frame fidelity and subject identity.
  • Camera Director: Incorporates balanced and guided camera motion datasets for controllable, cinematic camera behaviors.
  • Elements-to-Video (E2V) Synthesis: SkyReels-A2 enables compositional, controllable video by combining reference images and textual prompts while preserving entity consistency.

    Figure 9: Elements-to-video synthesis examples—compositions of multiple entity references and prompt-driven narrative.
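The story-generation setting above can be sketched as an autoregressive rollout loop in which each prompt's segment is conditioned on a fixed window of previously generated frames. `generate_segment` is a placeholder for the diffusion-forcing sampler; the loop structure, not the sampler, is the point here.

```python
def rollout_story(prompts, frames_per_segment, context_len, generate_segment):
    """Autoregressive multi-prompt rollout: each segment is conditioned
    on the last `context_len` frames, which are kept fixed (stabilized)
    to limit error accumulation. `generate_segment` stands in for the
    diffusion-forcing sampler."""
    video = []
    for prompt in prompts:
        context = video[-context_len:]  # frozen anchor frames
        segment = generate_segment(prompt, context, frames_per_segment)
        video.extend(segment)
    return video
```

Freezing the context frames (rather than re-denoising them) is what keeps the subject stable across prompt boundaries during long rollouts.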

Theoretical and Practical Implications

The integration of structured shot-aware captioning with multi-stage curriculum, motion-specific RL alignment, and diffusion-forcing inference establishes a new paradigm for aligning generative models with fine-grained cinematic conditioning and scalable generative fidelity. By uniting diffusion’s high spatial quality with autoregressive scalability (via diffusion-forcing and scheduling adaptations), the framework circumvents previous trade-offs—approximating the flexibility of LLM token generation at the frame level.

Practically, open-sourcing SkyReels-V2, its captioner, and elements-to-video codebases provides a resource for both research and industry to further probe, extend, and specialize film-quality video generation, supporting downstream applications in content creation, virtual cinematography, automated storyboarding, and beyond.

Limitations and Future Directions

Deterministic error accumulation remains a limitation in truly indefinite-duration rollouts, especially under long autoregressive continuations. Handling cross-scene, character, and story consistency at feature-film length will require improved modeling of long-term dependencies, possibly through hierarchical memory, scene-graph tracking, or learned retrieval. Extension to richer conditioning modalities (audio, pose, interactive editing) and further tightening of physics-aware motion priors present promising directions.

Conclusion

SkyReels-V2 delivers an empirically validated, publicly available framework for infinite-length, cinematic-caliber video generation. By addressing data, conditioning, alignment, and model architecture holistically, it achieves unmatched prompt adherence, temporal and motion quality, and video fidelity for open-source systems. Its innovations set foundational benchmarks and open numerous avenues for research and deployment in generative video modeling (2504.13074).
