UniMAGE: Unified Imaginative Audio-Video Generation
- UniMAGE is a unified model that integrates script drafting and key-shot design by routing text and image tokens through specialized expert transformers.
- It employs a 'first interleaving, then disentangling' training paradigm to separately optimize narrative reasoning and visual generation for better multimodal outcomes.
- The model achieves state-of-the-art performance with measurable improvements, including enhanced character identity similarity and prompt adherence compared to previous methods.
UniMAGE is a unified director model for imaginative audio-video generation, integrating the traditionally disjoint processes of script drafting and key-shot design into a single framework. By leveraging the Mixture-of-Transformers (MoT) architecture, UniMAGE routes text and image tokens to specialized “expert” transformer branches, bridging user prompts to long-context, multi-shot film scripts and visually consistent keyframe images. The model introduces a “first interleaving, then disentangling” training paradigm and achieves state-of-the-art performance on open-source benchmarks for narrative coherence and visual quality (Zhang et al., 29 Dec 2025).
1. Architecture and Token Routing
UniMAGE is constructed atop the Mixture-of-Transformers (MoT) backbone derived from Bagel, employing two parallel expert sub-transformers per layer:
- Understanding Expert: Text-oriented, responsible for script reasoning.
- Generation Expert: Image-oriented, responsible for keyframe synthesis.
A lightweight router computes a gating distribution for each token (or block), modulating the expert contributions via

$$h = g_U \, f_U(x) + g_G \, f_G(x),$$

where $g_e$ is the scalar gating probability and $f_e(x)$ is the output from expert $e \in \{U, G\}$.
All token types (text tokens from a shared BPE vocabulary, ViT tokens from reference frames via a frozen SigLIP2 ViT, and VAE tokens, i.e., latents from a FLUX-based VAE) are projected to a shared hidden size $d$, mixed via multimodal self-attention, and routed accordingly.
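The routing mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the experts are stand-in feed-forward blocks rather than full transformer branches, and the hidden size, weights, and router are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared hidden size (illustrative; the paper's value differs)

# Two "expert" blocks standing in for the full transformer experts.
W_understand = rng.standard_normal((D, D)) * 0.1  # text-oriented expert
W_generate   = rng.standard_normal((D, D)) * 0.1  # image-oriented expert
W_router     = rng.standard_normal((D, 2)) * 0.1  # router: token -> 2 logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mot_layer(tokens):
    """Soft-route each token between the understanding and generation experts:
    h = g_U * f_U(x) + g_G * f_G(x), with (g_U, g_G) = softmax(router(x))."""
    gates = softmax(tokens @ W_router)    # (T, 2) gating probabilities
    f_u = np.tanh(tokens @ W_understand)  # understanding-expert output
    f_g = np.tanh(tokens @ W_generate)    # generation-expert output
    return gates[:, :1] * f_u + gates[:, 1:] * f_g

x = rng.standard_normal((5, D))  # 5 tokens of mixed modality
y = mot_layer(x)
print(y.shape)  # (5, 8)
```

In practice MoT-style models often route by modality tag rather than a learned soft gate; the soft formulation here simply mirrors the gating equation above.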
2. Script and Data Representation
Scripts are linearized into sequences encapsulating user prompts; character, environment, and frame identities; frame and video text annotations; and interleaved keyframe latents. A typical sequence is

$$S = \big(P,\; T_1, I_1,\; T_2, I_2,\; \dots,\; T_N, I_N\big),$$

where $P$ is the prompt, $T_i$ are text annotations, and $I_i$ are VAE latents. Embeddings for all tokens are unified for multimodal processing.
In-Context ID Prompting is implemented by interspersing identity tokens among image tokens, facilitating character/scene consistency throughout the narrative.
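A sketch of this linearization, with In-Context ID Prompting, might look as follows. The special tokens (`<prompt>`, `<ID:...>`, `<img>`, etc.) and the shot dictionary keys are hypothetical names for illustration, not the paper's actual vocabulary.

```python
def linearize_script(prompt, shots):
    """Flatten a multi-shot script into one interleaved token sequence.

    `shots` is a list of dicts with hypothetical keys: 'char_ids' (character
    identity tags), 'text' (frame/video annotation), and 'image' (a keyframe
    latent placeholder).
    """
    seq = ["<prompt>", prompt, "</prompt>"]
    for shot in shots:
        # In-Context ID Prompting: identity tokens interspersed before images
        seq += [f"<ID:{c}>" for c in shot["char_ids"]]
        seq += ["<text>", shot["text"], "</text>"]
        seq += ["<img>", shot["image"], "</img>"]
    return seq

script = linearize_script(
    "A fox explores a neon city",
    [{"char_ids": ["fox"], "text": "Shot 1: fox on a rooftop", "image": "z1"},
     {"char_ids": ["fox"], "text": "Shot 2: fox in an alley", "image": "z2"}],
)
print(script[:5])
```

Keeping the identity tokens adjacent to each image block is what lets the attention layers tie a character's appearance to its recurring ID across shots.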
3. Training Paradigm: Interleaving then Disentangling
The UniMAGE training regime is organized into two sequential stages:
3.1 Interleaved Concept Learning (ICL)
- Joint training over multi-shot scripts of interleaved pairs $(T_i, I_i)$, where $T_i$ is text and $I_i$ is the keyframe latent.
- Unified autoregressive objective:
$$\mathcal{L}_{\text{ICL}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t}) + \sum_{i} \mathcal{L}_{\text{RF}}\big(I_i \mid x_{<s_i}\big),$$
where $s_i$ denotes the position preceding the $i$-th image block.
- Both $E_U$ (understanding) and $E_G$ (generation) receive gradient updates.
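The unified ICL objective, a next-token term over text positions plus a rectified-flow term at image positions, can be sketched numerically. Shapes and the toy vocabulary below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ntp_loss(logits, targets):
    """Cross-entropy next-token loss over text positions."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def rf_loss(v_pred, z0, z1):
    """Rectified-flow MSE between the predicted drift and target (z1 - z0)."""
    return np.mean((v_pred - (z1 - z0)) ** 2)

# Toy interleaved sample: 4 text positions (vocab 10) + one image latent block.
logits  = rng.standard_normal((4, 10))
targets = rng.integers(0, 10, size=4)
z0, z1  = rng.standard_normal((2, 16))  # noise and data latents
v_pred  = rng.standard_normal(16)       # model's drift prediction

# Unified ICL objective: both experts receive gradients from the sum.
loss = ntp_loss(logits, targets) + rf_loss(v_pred, z0, z1)
print(loss > 0)  # True
```

Because text and image terms are summed into one scalar, backpropagation through the shared attention updates both experts in this stage.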
3.2 Disentangled Expert Learning (DEL)
- Script-only (pure-text) samples update $E_U$; text+image pairs update $E_G$ with stop-gradient on $E_U$.
- Objectives:
  - $\mathcal{L}_{\text{text}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$ (next-token prediction over script tokens)
  - $\mathcal{L}_{\text{img}} = \mathbb{E}\big[\| v_\theta(z_t, t, c) - (z_1 - z_0) \|^2\big]$ (MSE over drift fields)
- Joint optimization:
$$\mathcal{L}_{\text{DEL}} = \lambda_{\text{text}} \, \mathcal{L}_{\text{text}} + \lambda_{\text{img}} \, \mathcal{L}_{\text{img}},$$
with weighting coefficients $\lambda_{\text{text}}$ and $\lambda_{\text{img}}$.
- Pre-Context Script Splitting augments the text-only phase: scripts are randomly partitioned, and the understanding expert learns narrative continuation from the preceding context.
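The DEL routing logic, which expert a sample updates and how Pre-Context Script Splitting partitions a script, can be sketched schematically. Function names, sample-type labels, and the split mechanics are illustrative assumptions, not the paper's implementation.

```python
def del_update(sample, params_U, params_G):
    """Route a DEL training sample to the parameter group it should update.

    'Stop-gradient' on the understanding expert means text+image pairs
    leave params_U frozen while params_G is trained.
    """
    if sample["type"] == "script_only":     # pure-text script sample
        return {"understanding": params_U}  # update E_U only
    elif sample["type"] == "text_image":    # single-shot text+image pair
        return {"generation": params_G}     # update E_G; stop-grad on E_U
    raise ValueError(sample["type"])

def pre_context_split(script_tokens, split_at):
    """Pre-Context Script Splitting: the prefix serves as context, the
    suffix is the continuation target for the understanding expert."""
    return script_tokens[:split_at], script_tokens[split_at:]

ctx, target = pre_context_split(["s1", "s2", "s3", "s4"], 2)
updated = del_update({"type": "script_only"}, "U", "G")
print(ctx, target, list(updated))
```

The point of the hard separation is that each expert sees only gradients from its own modality in this stage, which is what "disentangling" refers to.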
4. Optimization and Losses
Optimization employs AdamW with the learning rate and total step count reported by the authors. Primary losses include:
- Next-Token Prediction: $\mathcal{L}_{\text{NTP}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$
- Rectified Flow for images: $\mathcal{L}_{\text{RF}} = \mathbb{E}_{t, z_0, z_1}\big[\| v_\theta(z_t, t, c) - (z_1 - z_0) \|^2\big]$, with $z_t = (1 - t)\, z_0 + t\, z_1$
- Expert-Load Balancing (optional): an auxiliary term $\mathcal{L}_{\text{bal}}$ encouraging uniform expert utilization
The total stage-wise loss adds balancing terms as needed.
5. Inference Workflow
The inference pipeline follows these steps:
- Given a user prompt $P$, the Understanding Expert ($E_U$) generates the full script autoregressively: $S \sim p_{E_U}(\cdot \mid P)$.
- Optionally, an extension or continuation is requested via designated tokens.
- The script is segmented into shots.
- For each shot $i$, the Generation Expert ($E_G$) samples VAE keyframe latents $I_i$, conditioned on the corresponding text.
- Latents are decoded into images.
- Resulting script and images are forwarded to downstream audio-video generators (e.g., Veo3), along with extracted dialogue and sound tokens.
High-level pseudocode formally describes both training and inference routines, specifying dataset partitions and parameter initialization.
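A minimal sketch of that inference routine is shown below. The three callables are placeholders for the fine-tuned experts and the VAE decoder, and segmentation on a `<shot>` marker is an illustrative assumption.

```python
def generate_story(prompt, understand_expert, generate_expert, decode_vae):
    """End-to-end inference sketch: script first, then per-shot keyframes."""
    script = understand_expert(prompt)           # autoregressive script draft
    shots = [s.strip() for s in script.split("<shot>") if s.strip()]
    keyframes = []
    for shot_text in shots:
        latent = generate_expert(shot_text)      # sample a VAE keyframe latent
        keyframes.append(decode_vae(latent))     # decode latent -> image
    return script, keyframes

# Stub experts for demonstration only.
script, frames = generate_story(
    "a lighthouse storm",
    lambda p: f"<shot>{p}: wide shot<shot>{p}: close-up",
    lambda t: f"latent({t})",
    lambda z: f"image({z})",
)
print(len(frames))  # 2
```

The resulting script and decoded keyframes would then be handed to the downstream audio-video generator together with the extracted dialogue and sound tokens.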
6. Datasets and Evaluation Metrics
Three principal datasets underpin the training stages:
- ICL: 450K multi-shot text–image scripts.
- DEL (text expert): 250K pure text scripts.
- DEL (image expert): 250K single-shot text–image pairs.
UniMAGE is evaluated on ViStoryBench using the following metrics:

| Metric | Description | Reported UniMAGE SOTA |
|---|---|---|
| Style Similarity (CSD) | Consistency of visual style | - |
| Character ID Similarity (CIDS) | Character identity preservation | 59.2 (vs. 57.0) |
| Prompt Adherence (Alignment) | Fidelity to user instructions | 80.8 (vs. 62.5) |
| OCCM | On-stage character count matching | 88.07 (vs. 87.0) |
| Image Quality (Inception) | Standard image quality measure | - |
| Aesthetics | Human-rated visual appeal | ≈4.55 (vs. 5.76) |
A human study with 50 participants ranks UniMAGE highest for narrative logic (GSB=0.72), character consistency, and overall quality (Zhang et al., 29 Dec 2025).
7. Significance and Implementation Context
UniMAGE demonstrates the technical feasibility of unifying imaginative reasoning and visual generation in a single scalable framework, substantially improving character consistency and narrative logic in script-to-video pipelines. The architecture, grounded in modality-routed multimodal attention with explicit expert specialization, enables direct re-implementation using Bagel/MoT codebases, provided access to the described datasets and hyperparameters. This suggests broad applicability for automated film production and creative multimedia authoring, especially in contexts requiring long-context story coherence and complex visual composition (Zhang et al., 29 Dec 2025).