Real-time Streaming Joint Audio-Visual Generation

Develop real-time streaming methods for joint audio-visual generation that can produce audio and video concurrently within a single framework.

Background

Prior streaming diffusion efforts primarily address unimodal (video-only) generation and do not jointly synthesize audio and video in real time. Cascaded pipelines (e.g., generate video then audio, or vice versa) break the joint temporal distribution and hinder low-latency streaming, underscoring the need for a unified streaming solution.

The paper frames this gap as a central unresolved challenge and introduces OmniForcing, a framework that addresses it by distilling a bidirectional joint audio-visual model into a real-time autoregressive generator.
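To make the distinction concrete, the contrast between a cascaded pipeline and a unified streaming generator can be sketched as a chunk-wise loop that emits both modalities together. Everything below is an illustrative toy (the class name, rates, and "context" cache are assumptions, not the OmniForcing implementation): an autoregressive generator that, at each step, produces a synchronized audio/video chunk conditioned on shared causal context.

```python
# Toy sketch of chunk-wise autoregressive joint audio-visual streaming.
# All names and constants here are illustrative assumptions, not the
# actual OmniForcing architecture.

FPS = 4               # toy video frame rate (frames/sec)
AUDIO_RATE = 16       # toy audio rate (samples/sec)
CHUNK_SECONDS = 0.25  # duration generated per streaming step

class JointAVGenerator:
    """Emits one synchronized audio/video chunk per call, conditioned on
    a shared causal context so the two modalities stay jointly coherent."""

    def __init__(self):
        self.context = []  # stands in for a causal cache over past AV tokens

    def step(self, t):
        # A real model would run a joint denoising/decoding step here;
        # we emit placeholder tokens tagged with their chunk index.
        video = [("frame", t, i) for i in range(int(FPS * CHUNK_SECONDS))]
        audio = [("sample", t, i) for i in range(int(AUDIO_RATE * CHUNK_SECONDS))]
        self.context.append((video, audio))  # grow the causal context
        return video, audio

def stream(seconds):
    """Yield synchronized AV chunks as they are generated (low latency),
    rather than finishing one modality before starting the other."""
    gen = JointAVGenerator()
    for t in range(int(seconds / CHUNK_SECONDS)):
        yield gen.step(t)

chunks = list(stream(1.0))  # 4 chunks, each with video and audio together
```

The key property is that each yielded chunk covers the same time span in both modalities, so audio and video are sampled from one joint process; a cascaded pipeline instead fixes one modality first, which both breaks the joint temporal distribution and blocks streaming output.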

References

Achieving real-time streaming for joint audio-visual generation remains an open and highly compelling problem.

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation (2603.11647 - Su et al., 12 Mar 2026), Section 2, Related Work