
Adapting VACE for Real-Time Autoregressive Video Diffusion

Published 16 Feb 2026 in cs.CV and cs.AI | (2602.14381v1)

Abstract: We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.

Summary

  • The paper presents a dual-path conditioning architecture that decouples reference frames from video latents to preserve the autoregressive cache in real-time settings.
  • It demonstrates effective integration of control modalities such as depth, scribble/edge, and optical flow with minimal latency overhead, achieving 17–22 FPS at 1.3B scale.
  • The approach is retraining-free, though quality issues in reference-to-video generation highlight trade-offs between causal inference and bidirectional attention.

Adapting VACE for Real-Time Autoregressive Video Diffusion: An Expert Analysis

Introduction

The paper "Adapting VACE for Real-Time Autoregressive Video Diffusion" (2602.14381) addresses the integration of VACE’s unified video control mechanisms—originally developed for batch, bidirectional diffusion inference—into streaming, autoregressive video generation pipelines. Such pipelines require strict causal attention for real-time, chunked frame production, a constraint that fundamentally conflicts with the batch-oriented architecture of VACE. The study proposes an architectural adaptation that relocates reference frame conditioning outside the diffusion latent sequence and into a parallel pathway, enabling efficient autoregressive generation while supporting depth control, scribble/edge conditioning, optical flow, inpainting, outpainting, and temporal extension.

Architectural Adaptation for Streaming Generation

The principal contribution is a dual-path conditioning architecture. In conventional VACE, reference images are concatenated to the noisy latent sequence and processed with bidirectional attention—an approach that impedes fixed-size chunking and corrupts the KV cache in streaming transformers through semantic contamination and positional encoding misalignments.
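To see why concatenating reference frames into the latent sequence conflicts with streaming, consider a minimal sketch of chunked causal attention with a KV cache (the toy attention and all names here are illustrative assumptions, not the paper's implementation). Each fixed-size chunk attends to itself and to cached past chunks; any extra frames spliced into the sequence would shift chunk boundaries and cached positions.

```python
import torch

def toy_attend(chunk, k_cache, v_cache):
    # Minimal scaled dot-product attention over this chunk plus all cached
    # past chunks (causal at chunk granularity).
    k, v = chunk, chunk
    keys = torch.cat(k_cache + [k], dim=1)
    vals = torch.cat(v_cache + [v], dim=1)
    scores = chunk @ keys.transpose(1, 2) / chunk.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ vals, k, v

def stream_chunks(latents, chunk_len, attend=toy_attend):
    # Fixed-size chunks keep the KV cache positionally aligned. If reference
    # frames were concatenated into `latents`, chunk boundaries and cached
    # positions would shift, which is why the adaptation routes references
    # through a separate pathway instead.
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, latents.shape[1], chunk_len):
        chunk = latents[:, start:start + chunk_len]
        out, k, v = attend(chunk, k_cache, v_cache)
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(out)
    return torch.cat(outputs, dim=1)
```

Because the cache only ever grows by whole chunks of a fixed length, the first chunk's output is identical whether the rest of the sequence exists or not.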

The adaptation segregates reference frames, funneling them into a set of pretrained Context Blocks that operate independently from the chunked video latents. Contextual “hint” signals, derived from these blocks, are injected into the main DiT path via additive, zero-initialized projections, preserving the integrity of the autoregressive cache and supporting flexible control modalities.
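The additive, zero-initialized injection described above can be sketched as follows (module and parameter names are assumptions, not the reference implementation's). Because the projection starts at zero, the injection is an exact no-op at initialization, which is what lets pretrained weights be reused without disturbing the base model's behavior.

```python
import torch
import torch.nn as nn

class HintInjection(nn.Module):
    """Additive, zero-initialized projection of a context hint into the
    main DiT hidden states (illustrative sketch, not the paper's module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init: the hint contributes
        nn.init.zeros_(self.proj.bias)    # nothing until weights are trained

    def forward(self, hidden: torch.Tensor, hint: torch.Tensor) -> torch.Tensor:
        # Hidden states pass through unchanged, plus an (initially zero)
        # additive hint term from the Context Block pathway.
        return hidden + self.proj(hint)
```

With zero-initialized weights, `forward` returns `hidden` exactly; once the pretrained VACE projection weights are loaded in place of the zeros, the hint pathway contributes its learned control signal.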

Reference Conditioning and Model Reuse

A significant finding is that no retraining is necessary for this adaptation: the Context Blocks' zero-initialized projections and their learned weights from batch VACE are directly portable. For most VACE primitives (structural conditioning, masking, layout), the additive hints function as intended, matching the capabilities of batch inference within a streaming framework. For reference-to-video generation (R2V), however, quality is severely degraded due to the lack of cross-attention between references and video latents under causal-only attention.

Streaming Control Capabilities

Extensive experiments demonstrate streaming compatibility for nearly all VACE features. Structural control modes (depth, scribble/edge, optical flow, and grayscale/colorization), as well as layout-based control and masked generation (inpainting/outpainting), operate in real time (Figure 1).

Figure 2: Structural control modes with input, extracted conditioning, and generated outputs for depth, scribble/edge, optical flow, and grayscale/colorization controls.

Auto-detection of operational modes from inputs, dual-stream encoding to distinguish inactive from reactive content, and complete cache-management strategies support robust operation across five Wan2.1-based autoregressive pipelines (including LongLive and Krea Realtime Video) without model-specific modifications. Control adherence is quantified for depth (RMSE 0.157) and inpainting (SSIM 0.983) on challenging prompt sets (Figure 3).
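Auto-detection of the operational mode from which inputs are supplied might look like the following hypothetical dispatch (the reference implementation's actual detection rules and names may differ):

```python
def detect_mode(control_video=None, mask=None, references=None):
    """Infer the VACE task from which conditioning inputs are present.

    Hypothetical dispatch logic for illustration only; the reference
    implementation's auto-detection may use different rules.
    """
    if mask is not None:
        return "inpainting/outpainting"   # masked generation
    if control_video is not None:
        return "structural-control"       # depth, scribble/edge, flow, grayscale
    if references is not None:
        return "reference-to-video"       # R2V via the parallel hint pathway
    return "text-to-video"                # no conditioning supplied
```

The appeal of input-driven dispatch is that a single streaming pipeline serves every mode without per-task configuration flags.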

Figure 4: Masked generation, layout control, and temporal extension results, each achieved in real-time autoregressive mode.

Performance and Efficiency

The approach adds 20–30% latency overhead for control or masking, with VRAM consumption incrementally increased by the Context Blocks (~1.4 GB on 1.3B models and negligible on 14B models). On commodity GPUs, real-time throughputs of 17–22 FPS are demonstrated at 1.3B scale for 368×640 resolution. Temporal extension exhibits near-baseline latency once reference hints are cached in extension mode, especially at 1.3B where the overhead becomes marginal. No specialized tuning was required for the context scale.
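The overhead arithmetic is straightforward: a fractional per-chunk latency increase divides throughput by the same factor. In the sketch below, only the 20–30% overhead range comes from the paper; the baseline FPS value is an assumption chosen for illustration.

```python
def fps_with_overhead(base_fps: float, overhead: float) -> float:
    """Throughput after a fractional per-chunk latency overhead.

    Chunk time grows from t to t * (1 + overhead), so frames per second
    shrink by the reciprocal factor. The baseline FPS passed in is an
    illustrative assumption, not a number from the paper.
    """
    return base_fps / (1.0 + overhead)
```

For instance, a hypothetical 27.5 FPS baseline with a 25% overhead lands at 22 FPS, consistent with the 17–22 FPS range reported at 1.3B scale.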

Limitations

The architecture inherits several limitations from both the streaming and VACE paradigms:

  • Temporal coherence degrades for sequences exceeding 100 frames, demanding external anchoring or periodic re-initialization.
  • Control fidelity varies with the quality and type of the provided conditioning signal.
  • R2V quality is substantially degraded; bidirectional attention is fundamentally required for high-fidelity reference transfer.
  • Temporal extension is less expressive than in batch setup due to chunking constraints.
  • No direct perceptual quality comparison against batch-mode VACE is feasible because of architectural and data differences confounding attribution.

Practical and Theoretical Implications

The work provides a pathway for unifying advanced video control within streaming autoregressive models using pretrained control adapters, without necessitating new model retraining or intrusive architectural changes. This is particularly relevant as real-time video synthesis finds application in content creation, interactive media, and live streaming, where consistent frame latency and memory requirements are critical.

However, the inherent shortcomings in reference-based and cross-attentive controls expose the tension between causal inference and information fusion. The study suggests further work on hybrid attention architectures, chunk-overlapping strategies, or knowledge distillation techniques to bridge this quality gap without sacrificing streaming compatibility.

The compositional approach (additive hint injection, mode detection) also facilitates future integrations with modular LoRA techniques and real-time object-driven controls (e.g., via YOLO), expanding the flexibility of the pipeline.

Conclusion

This adaptation of VACE paves the way for deploying advanced, multi-modal video controls in real-time, autoregressive diffusion models, with minimal engineering effort and inference overhead. The solution demonstrates that architectural decoupling of reference frames and chunk-local video latents is effective for streaming, provided that bidirectional dependency is unnecessary. Persistent challenges in reference-to-video fidelity and long-range coherence remain and point to fruitful directions for future research on controlled video diffusion under causal generation constraints.
