UniMAGE: Unified Imaginative Audio-Video Generation
- UniMAGE is a unified model that integrates script drafting and key-shot design by routing text and image tokens through specialized expert transformers.
- It employs a 'first interleaving, then disentangling' training paradigm to separately optimize narrative reasoning and visual generation for better multimodal outcomes.
- The model achieves state-of-the-art performance with measurable improvements, including enhanced character identity similarity and prompt adherence compared to previous methods.
UniMAGE is a unified director model for imaginative audio-video generation, integrating the traditionally disjoint processes of script drafting and key-shot design into a single framework. By leveraging the Mixture-of-Transformers (MoT) architecture, UniMAGE routes text and image tokens to specialized “expert” transformer branches, bridging user prompts to long-context, multi-shot film scripts and visually consistent keyframe images. The model introduces a “first interleaving, then disentangling” training paradigm and achieves state-of-the-art performance on open-source benchmarks for narrative coherence and visual quality (Zhang et al., 29 Dec 2025).
1. Architecture and Token Routing
UniMAGE is constructed atop the Mixture-of-Transformers (MoT) backbone derived from Bagel, employing two parallel expert sub-transformers per layer:
- Understanding Expert: Text-oriented, responsible for script reasoning.
- Generation Expert: Image-oriented, responsible for keyframe synthesis.
A lightweight router computes a gating distribution for each token (or block), modulating the expert contributions via

$$h = g_U \, f_U(x) + g_G \, f_G(x),$$

where $g_e$ is the scalar gating probability and $f_e(x)$ is the output from expert $e \in \{U, G\}$.
All token types (text tokens from a shared BPE vocabulary, ViT tokens from reference frames via a frozen SigLIP2 ViT, and VAE tokens, i.e., latents from a FLUX-based VAE) are projected to a shared hidden size $d$, mixed via multimodal self-attention, and routed accordingly.
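The routing mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the experts are stand-in feed-forward blocks rather than full transformer branches, and the hidden size, weights, and router are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared hidden size (illustrative; the paper's value differs)

# Two "expert" blocks standing in for the full transformer experts.
W_understand = rng.standard_normal((D, D)) * 0.1  # text-oriented expert
W_generate   = rng.standard_normal((D, D)) * 0.1  # image-oriented expert
W_router     = rng.standard_normal((D, 2)) * 0.1  # router: token -> 2 logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mot_layer(tokens):
    """Soft-route each token between the understanding and generation experts:
    h = g_U * f_U(x) + g_G * f_G(x), with (g_U, g_G) = softmax(router(x))."""
    gates = softmax(tokens @ W_router)    # (T, 2) gating probabilities
    f_u = np.tanh(tokens @ W_understand)  # understanding-expert output
    f_g = np.tanh(tokens @ W_generate)    # generation-expert output
    return gates[:, :1] * f_u + gates[:, 1:] * f_g

x = rng.standard_normal((5, D))  # 5 tokens of mixed modality
y = mot_layer(x)
print(y.shape)  # (5, 8)
```

In practice MoT-style models often route by modality tag rather than a learned soft gate; the soft formulation here simply mirrors the gating equation above.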
2. Script and Data Representation
Scripts are linearized into sequences encapsulating user prompts; character, environment, and frame identities; frame and video text annotations; and interleaved keyframe latents. A typical sequence is

$$S = \big(P,\; T_1, I_1,\; T_2, I_2,\; \dots,\; T_N, I_N\big),$$

where $P$ is the prompt, $T_i$ are text annotations, and $I_i$ are VAE latents. Embeddings for all tokens are unified for multimodal processing.
In-Context ID Prompting is implemented by interspersing identity tokens among image tokens, facilitating character/scene consistency throughout the narrative.
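A sketch of this linearization, with In-Context ID Prompting, might look as follows. The special tokens (`<prompt>`, `<ID:...>`, `<img>`, etc.) and the shot dictionary keys are hypothetical names for illustration, not the paper's actual vocabulary.

```python
def linearize_script(prompt, shots):
    """Flatten a multi-shot script into one interleaved token sequence.

    `shots` is a list of dicts with hypothetical keys: 'char_ids' (character
    identity tags), 'text' (frame/video annotation), and 'image' (a keyframe
    latent placeholder).
    """
    seq = ["<prompt>", prompt, "</prompt>"]
    for shot in shots:
        # In-Context ID Prompting: identity tokens interspersed before images
        seq += [f"<ID:{c}>" for c in shot["char_ids"]]
        seq += ["<text>", shot["text"], "</text>"]
        seq += ["<img>", shot["image"], "</img>"]
    return seq

script = linearize_script(
    "A fox explores a neon city",
    [{"char_ids": ["fox"], "text": "Shot 1: fox on a rooftop", "image": "z1"},
     {"char_ids": ["fox"], "text": "Shot 2: fox in an alley", "image": "z2"}],
)
print(script[:5])
```

Keeping the identity tokens adjacent to each image block is what lets the attention layers tie a character's appearance to its recurring ID across shots.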
3. Training Paradigm: Interleaving then Disentangling
The UniMAGE training regime is organized into two sequential stages:
3.1 Interleaved Concept Learning (ICL)
- Joint training over multi-shot scripts of interleaved pairs $(T_i, I_i)$, where $T_i$ is text and $I_i$ is the keyframe latent.
- Unified autoregressive objective:
$$\mathcal{L}_{\text{ICL}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t}) + \sum_{i} \mathcal{L}_{\text{RF}}\big(I_i \mid x_{<s_i}\big),$$
where $s_i$ denotes the position preceding the $i$-th image block.
- Both $E_U$ (understanding) and $E_G$ (generation) receive gradient updates.
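The unified ICL objective, a next-token term over text positions plus a rectified-flow term at image positions, can be sketched numerically. Shapes and the toy vocabulary below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ntp_loss(logits, targets):
    """Cross-entropy next-token loss over text positions."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def rf_loss(v_pred, z0, z1):
    """Rectified-flow MSE between the predicted drift and target (z1 - z0)."""
    return np.mean((v_pred - (z1 - z0)) ** 2)

# Toy interleaved sample: 4 text positions (vocab 10) + one image latent block.
logits  = rng.standard_normal((4, 10))
targets = rng.integers(0, 10, size=4)
z0, z1  = rng.standard_normal((2, 16))  # noise and data latents
v_pred  = rng.standard_normal(16)       # model's drift prediction

# Unified ICL objective: both experts receive gradients from the sum.
loss = ntp_loss(logits, targets) + rf_loss(v_pred, z0, z1)
print(loss > 0)  # True
```

Because text and image terms are summed into one scalar, backpropagation through the shared attention updates both experts in this stage.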
3.2 Disentangled Expert Learning (DEL)
- Script-only (pure-text) samples update $E_U$; text+image pairs update $E_G$ with stop-gradient on $E_U$.
- Objectives:
  - $\mathcal{L}_{\text{text}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$ (next-token prediction over script tokens)
  - $\mathcal{L}_{\text{img}} = \mathbb{E}\big[\| v_\theta(z_t, t, c) - (z_1 - z_0) \|^2\big]$ (MSE over drift fields)
- Joint optimization:
$$\mathcal{L}_{\text{DEL}} = \lambda_{\text{text}} \, \mathcal{L}_{\text{text}} + \lambda_{\text{img}} \, \mathcal{L}_{\text{img}},$$
with weighting coefficients $\lambda_{\text{text}}$ and $\lambda_{\text{img}}$.
- Pre-Context Script Splitting augments the text-only phase: scripts are randomly partitioned, and the understanding expert learns narrative continuation from the preceding context.
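The DEL routing logic, which expert a sample updates and how Pre-Context Script Splitting partitions a script, can be sketched schematically. Function names, sample-type labels, and the split mechanics are illustrative assumptions, not the paper's implementation.

```python
def del_update(sample, params_U, params_G):
    """Route a DEL training sample to the parameter group it should update.

    'Stop-gradient' on the understanding expert means text+image pairs
    leave params_U frozen while params_G is trained.
    """
    if sample["type"] == "script_only":     # pure-text script sample
        return {"understanding": params_U}  # update E_U only
    elif sample["type"] == "text_image":    # single-shot text+image pair
        return {"generation": params_G}     # update E_G; stop-grad on E_U
    raise ValueError(sample["type"])

def pre_context_split(script_tokens, split_at):
    """Pre-Context Script Splitting: the prefix serves as context, the
    suffix is the continuation target for the understanding expert."""
    return script_tokens[:split_at], script_tokens[split_at:]

ctx, target = pre_context_split(["s1", "s2", "s3", "s4"], 2)
updated = del_update({"type": "script_only"}, "U", "G")
print(ctx, target, list(updated))
```

The point of the hard separation is that each expert sees only gradients from its own modality in this stage, which is what "disentangling" refers to.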
4. Optimization and Losses
Optimization employs AdamW with the learning rate and total step count reported by the authors. Primary losses include:
- Next-Token Prediction: $\mathcal{L}_{\text{NTP}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$
- Rectified Flow for images: $\mathcal{L}_{\text{RF}} = \mathbb{E}_{t, z_0, z_1}\big[\| v_\theta(z_t, t, c) - (z_1 - z_0) \|^2\big]$, with $z_t = (1 - t)\, z_0 + t\, z_1$
- Expert-Load Balancing (optional): an auxiliary term $\mathcal{L}_{\text{bal}}$ encouraging uniform expert utilization
The total stage-wise loss adds balancing terms as needed.
5. Inference Workflow
The inference pipeline follows these steps:
- Given a user prompt $P$, the Understanding Expert ($E_U$) generates the full script autoregressively: $S \sim p_{E_U}(\cdot \mid P)$.
- Optionally, an extension or continuation is requested via designated tokens.
- The script is segmented into shots.
- For each shot $i$, the Generation Expert ($E_G$) samples VAE keyframe latents $I_i$, conditioned on the corresponding text.
- Latents are decoded into images.
- Resulting script and images are forwarded to downstream audio-video generators (e.g., Veo3), along with extracted dialogue and sound tokens.
High-level pseudocode formally describes both training and inference routines, specifying dataset partitions and parameter initialization.
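A minimal sketch of that inference routine is shown below. The three callables are placeholders for the fine-tuned experts and the VAE decoder, and segmentation on a `<shot>` marker is an illustrative assumption.

```python
def generate_story(prompt, understand_expert, generate_expert, decode_vae):
    """End-to-end inference sketch: script first, then per-shot keyframes."""
    script = understand_expert(prompt)           # autoregressive script draft
    shots = [s.strip() for s in script.split("<shot>") if s.strip()]
    keyframes = []
    for shot_text in shots:
        latent = generate_expert(shot_text)      # sample a VAE keyframe latent
        keyframes.append(decode_vae(latent))     # decode latent -> image
    return script, keyframes

# Stub experts for demonstration only.
script, frames = generate_story(
    "a lighthouse storm",
    lambda p: f"<shot>{p}: wide shot<shot>{p}: close-up",
    lambda t: f"latent({t})",
    lambda z: f"image({z})",
)
print(len(frames))  # 2
```

The resulting script and decoded keyframes would then be handed to the downstream audio-video generator together with the extracted dialogue and sound tokens.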
6. Datasets and Evaluation Metrics
Three principal datasets underpin the training stages:
- ICL: 450K multi-shot text–image scripts.
- DEL (text expert): 250K pure text scripts.
- DEL (image expert): 250K single-shot text–image pairs.
UniMAGE is evaluated on ViStoryBench using the following metrics:

| Metric | Description | Reported UniMAGE SOTA |
|---|---|---|
| Style Similarity (CSD) | Consistency of visual style | - |
| Character ID Similarity (CIDS) | Character identity preservation | 59.2 (vs. 57.0) |
| Prompt Adherence (Alignment) | Fidelity to user instructions | 80.8 (vs. 62.5) |
| OCCM | On-stage character count matching | 88.07 (vs. 87.0) |
| Image Quality (Inception) | Standard image quality measure | - |
| Aesthetics | Human-rated visual appeal | ≈4.55 (vs. 5.76) |
A human study with 50 participants ranks UniMAGE highest for narrative logic (GSB=0.72), character consistency, and overall quality (Zhang et al., 29 Dec 2025).
7. Significance and Implementation Context
UniMAGE demonstrates the technical feasibility of unifying imaginative reasoning and visual generation in a single scalable framework, substantially improving character consistency and narrative logic in script-to-video pipelines. The architecture, grounded in modality-routed multimodal attention with explicit expert specialization, enables direct re-implementation using Bagel/MoT codebases, provided access to the described datasets and hyperparameters. This suggests broad applicability for automated film production and creative multimedia authoring, especially in contexts requiring long-context story coherence and complex visual composition (Zhang et al., 29 Dec 2025).