
UniMAGE: Unified Imaginative Audio-Video Generation

Updated 5 January 2026
  • UniMAGE is a unified model that integrates script drafting and key-shot design by routing text and image tokens through specialized expert transformers.
  • It employs a 'first interleaving, then disentangling' training paradigm to separately optimize narrative reasoning and visual generation for better multimodal outcomes.
  • The model achieves state-of-the-art performance with measurable improvements, including enhanced character identity similarity and prompt adherence compared to previous methods.

UniMAGE is a unified director model for imaginative audio-video generation, integrating the traditionally disjoint processes of script drafting and key-shot design into a single framework. By leveraging the Mixture-of-Transformers (MoT) architecture, UniMAGE routes text and image tokens to specialized “expert” transformer branches, bridging user prompts to long-context, multi-shot film scripts and visually consistent keyframe images. The model introduces a “first interleaving, then disentangling” training paradigm and achieves state-of-the-art performance on open-source benchmarks for narrative coherence and visual quality (Zhang et al., 29 Dec 2025).

1. Architecture and Token Routing

UniMAGE is constructed atop the Mixture-of-Transformers (MoT) backbone derived from Bagel, employing $E = 2$ expert sub-transformers per layer:

  • Understanding Expert: Text-oriented, responsible for script reasoning.
  • Generation Expert: Image-oriented, responsible for keyframe synthesis.

A lightweight router computes the gating distribution $r(x)$ for each token (or block), modulating the expert contributions via

$$\mathrm{MoT}(x) = \sum_{i=1}^{E} r_i(x)\,\mathrm{Expert}_i(x), \qquad \sum_{i=1}^{E} r_i(x) = 1,$$

where $r_i(x)$ is the scalar gating probability and $\mathrm{Expert}_i(x)$ is the output of expert $i$.

All token types are projected to a shared hidden size $d$, mixed via multimodal self-attention, and routed accordingly: text tokens $y_t$ (shared BPE vocabulary), ViT tokens $v_j$ (reference frames, from a frozen SigLIP2 ViT), and VAE tokens $z_k$ (latents from a FLUX-based VAE).
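The gated mixture above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the two "experts" are stand-in linear maps rather than full transformer branches, the router is a single softmax projection, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E = 8, 2  # toy shared hidden size; E = 2 experts (understanding, generation)

# Stand-in "experts": independent linear maps in place of full transformer branches.
W_und = rng.standard_normal((d, d)) / np.sqrt(d)   # understanding expert
W_gen = rng.standard_normal((d, d)) / np.sqrt(d)   # generation expert
W_router = rng.standard_normal((d, E)) / np.sqrt(d)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mot_layer(x):
    """MoT(x) = sum_i r_i(x) * Expert_i(x), with gates summing to 1 per token."""
    r = softmax(x @ W_router)                      # (T, E) gating distribution
    out = r[:, :1] * (x @ W_und) + r[:, 1:] * (x @ W_gen)
    return out, r

tokens = rng.standard_normal((5, d))               # 5 projected multimodal tokens
y, r = mot_layer(tokens)
```

In the described setup routing follows token modality (text toward the understanding expert, image toward the generation expert); the soft gate here is just the general MoT form of the equation.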

2. Script and Data Representation

Scripts are linearized into sequences encapsulating user prompts, character/environment/frame identities, frame and video text annotations, and interleaved keyframe latents. A typical sequence is

$$\bigl[\langle\text{User}\rangle,\ \rho,\ \langle\text{Character}_1\rangle,\ \ldots,\ \langle\text{Frame}_1\rangle,\ c_1,\ \langle\text{Video}_1\rangle,\ c'_1,\ z_1,\ \langle\text{Frame}_2\rangle,\ c_2,\ \ldots\bigr],$$

where $\rho$ is the prompt, $c_i$ and $c'_i$ are the frame and video text annotations, and $z_i$ are VAE latents. Embeddings for all tokens are unified for multimodal processing.

In-Context ID Prompting is implemented by interspersing identity tokens among image tokens, facilitating character/scene consistency throughout the narrative.
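The linearized layout above can be made concrete with a small builder. This is a hypothetical sketch: the tag strings, field names, and `linearize_script` helper are all illustrative, not the paper's tokenizer.

```python
# Hypothetical linearization of a two-shot script into one interleaved sequence,
# following the layout <User>, rho, <Character_i>, <Frame_i>, c_i, <Video_i>, c'_i, z_i, ...
def linearize_script(prompt, characters, shots):
    seq = [("<User>", prompt)]
    for name, ident in characters.items():
        # In-context ID prompting: identity tokens interspersed with the rest
        seq.append((f"<Character:{name}>", ident))
    for i, shot in enumerate(shots, start=1):
        seq.append((f"<Frame{i}>", shot["frame_text"]))   # c_i
        seq.append((f"<Video{i}>", shot["video_text"]))   # c'_i
        seq.append((f"<Latent{i}>", shot["vae_latent"]))  # z_i (VAE tokens)
    return seq

seq = linearize_script(
    "A fox explores a ruined library.",
    {"Fox": "red fur, green scarf"},
    [
        {"frame_text": "wide shot of the library",
         "video_text": "camera dollies in", "vae_latent": "z1"},
        {"frame_text": "close-up of the fox",
         "video_text": "fox turns to camera", "vae_latent": "z2"},
    ],
)
```

In the real model each slot is a span of embedding vectors rather than a tagged string, but the ordering is the point: identity tokens precede the frames they constrain.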

3. Training Paradigm: Interleaving then Disentangling

The UniMAGE training regime is organized into two sequential stages:

3.1 Interleaved Concept Learning (ICL)

  • Joint training over multi-shot scripts $(s, I)$, where $s$ is the text and $I$ the sequence of keyframe latents.
  • Unified autoregressive objective:

$$\mathcal{L}_{\mathrm{ICL}} = \mathbb{E}_{(s,I)\sim D_{\mathrm{inter}}}\Bigl[-\sum_t \log p_\theta(s_t \mid s_{<t}, I_{<t}) - \sum_k \log p_\theta(I_k \mid s_{\leq t(k)}, I_{<k})\Bigr],$$

    where $t(k)$ denotes the position preceding the $k$th image block.
  • Both $\theta_U$ (understanding) and $\theta_G$ (generation) receive gradient updates.
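As a small worked example of the objective above, the ICL loss is just the sum of negative log-likelihoods over the text tokens and the image blocks of one interleaved sample. The probabilities below are made-up toy values, not model outputs.

```python
import math

# Toy per-token likelihoods for one interleaved sample (illustrative values):
p_text = [0.9, 0.8, 0.7]   # p(s_t | s_<t, I_<t) for three text tokens
p_img  = [0.6, 0.5]        # p(I_k | s_<=t(k), I_<k) for two image blocks

# L_ICL for this sample: both sums contribute, so both experts get gradients.
loss_icl = -sum(math.log(p) for p in p_text) - sum(math.log(p) for p in p_img)
```

The key property is that every image block conditions on all text up to position $t(k)$, so script reasoning and keyframe synthesis are learned against the same shared context.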

3.2 Disentangled Expert Learning (DEL)

  • Script-only (pure-text) samples update $\theta_U$; text+image pairs update $\theta_G$ with a stop-gradient on $\theta_U$.
  • Objectives:

    • $\mathcal{L}_{\mathrm{text}} = -\sum_t \log p_{\theta_U}(y_t \mid y_{<t})$
    • $\mathcal{L}_{\mathrm{img}} = \mathrm{RectifiedFlowLoss}(I)$ (MSE over drift fields)
    • Joint optimization:

$$\mathcal{L}_{\mathrm{DEL}} = \lambda_{\mathrm{text}}\,\mathcal{L}_{\mathrm{text}} + \lambda_{\mathrm{img}}\,\mathcal{L}_{\mathrm{img}},$$

    with $\lambda_{\mathrm{text}} = \lambda_{\mathrm{img}} = 1$.

  • Pre-Context Script Splitting augments the text-only phase: scripts are randomly partitioned, and the understanding expert learns to continue the narrative from the prefixed context.
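The DEL routing rule can be sketched as a training step that touches only one expert per batch. This is an assumed loop shape, not the paper's code: the gradient functions are placeholders for real backpropagation, and parameters are plain dicts.

```python
def grad_text_loss(theta_U, batch):
    # Placeholder for dL_text/dtheta_U; real code would backprop the text NLL.
    return {k: 0.1 * v for k, v in theta_U.items()}

def grad_flow_loss(theta_G, theta_U, batch):
    # Placeholder for dL_img/dtheta_G; theta_U enters only as a frozen constant
    # (the stop-gradient), so no gradient flows back into it.
    return {k: 0.1 * v for k, v in theta_G.items()}

def del_step(batch, theta_U, theta_G, lr=1e-5):
    """One DEL update: text-only batches move theta_U, paired batches move theta_G."""
    if batch.get("images") is None:                       # pure-text script sample
        g = grad_text_loss(theta_U, batch)
        theta_U = {k: v - lr * g[k] for k, v in theta_U.items()}
    else:                                                 # text+image pair
        g = grad_flow_loss(theta_G, theta_U, batch)       # stop-grad on theta_U
        theta_G = {k: v - lr * g[k] for k, v in theta_G.items()}
    return theta_U, theta_G

theta_U, theta_G = {"w": 1.0}, {"w": 1.0}
theta_U2, theta_G2 = del_step({"text": "...", "images": None}, theta_U, theta_G)
```

In an autograd framework the same effect would come from detaching the understanding expert's activations in the image branch; the dict arithmetic here only makes the routing explicit.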

4. Optimization and Losses

Optimization employs AdamW with learning rate $1\mathrm{e}{-5}$ and a total of $30\mathrm{K} + 10\mathrm{K}$ training steps. Primary losses include:

  • Next Token Prediction:

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t})$$

  • Rectified-Flow Matching:

$$\mathcal{L}_{\mathrm{flow}} = \int_0^1 \mathbb{E}\left[\left\| (X_1 - X_0) - v_\phi(X_t, t) \right\|^2 \right] dt$$

  • Expert-Load Balancing (Optional):

$$\mathcal{L}_{\mathrm{bal}} = \alpha \sum_{i=1}^{E} \bar r_i \log \bar r_i, \qquad \bar r_i = \mathbb{E}_x[r_i(x)]$$

The total stage-wise loss adds balancing terms as needed.
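A one-sample Monte Carlo estimate of $\mathcal{L}_{\mathrm{flow}}$ is easy to write down: sample $t$, interpolate $X_t = (1-t)X_0 + tX_1$, and regress the drift prediction onto the straight-line velocity $X_1 - X_0$. The sketch below assumes this standard rectified-flow form; the drift networks are stand-in lambdas.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(x0, x1, drift_fn, t):
    """Single-sample estimate of the integrand || (X1 - X0) - v(X_t, t) ||^2."""
    xt = (1 - t) * x0 + t * x1          # straight-line interpolant X_t
    target = x1 - x0                     # constant velocity along the line
    pred = drift_fn(xt, t)               # v_phi(X_t, t)
    return np.mean((target - pred) ** 2)

x0 = rng.standard_normal(16)             # noise latent X_0
x1 = rng.standard_normal(16)             # data latent X_1
oracle = lambda xt, t: x1 - x0           # perfect drift for this pair
zero = lambda xt, t: np.zeros_like(xt)   # deliberately wrong drift
```

In training, `t` would itself be sampled uniformly per example, so averaging over a batch estimates the integral over $[0,1]$.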

5. Inference Workflow

The inference pipeline follows these steps:

  1. Given the user prompt $\rho$, the Understanding Expert ($\theta_U$) generates the full script autoregressively: $\hat S = (\mathcal{G}, \mathcal{C})$.
  2. Optionally, an extension or continuation is requested via designated tokens.
  3. The script $\hat S$ is segmented into $n$ shots.
  4. For each shot $i$, the Generation Expert ($\theta_G$) samples VAE keyframe latents $f_i \equiv I_i$, conditioned on the corresponding text.
  5. The latents $f_i$ are decoded into images.
  6. The resulting script and images $(\hat S, \{f_i\})$ are forwarded to downstream audio-video generators (e.g., Veo3), along with extracted dialogue and sound tokens.

High-level pseudocode formally describes both training and inference routines, specifying dataset partitions and parameter initialization.
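The six steps can be condensed into one driver function. Everything here is a placeholder sketch, not the paper's API: `split_shots` and the three callables stand in for the script generator, keyframe sampler, and VAE decoder.

```python
def split_shots(script):
    # Placeholder segmentation; the real pipeline parses shot boundaries
    # from designated tokens in the generated script.
    return [s for s in script.split("|") if s]

def unimage_inference(prompt, understanding_expert, generation_expert, vae_decode):
    script = understanding_expert(prompt)        # step 1: autoregressive script
    shots = split_shots(script)                  # step 3: segment into n shots
    keyframes = []
    for shot_text in shots:                      # steps 4-5: per-shot keyframes
        latent = generation_expert(shot_text)    # sample VAE latent f_i
        keyframes.append(vae_decode(latent))     # decode latent to an image
    return script, keyframes                     # step 6: hand off downstream

# Toy stand-ins so the driver runs end to end:
script_fn = lambda p: f"wide shot about {p}|close-up reaction"
gen_fn = lambda text: f"latent({text})"
dec_fn = lambda z: f"image[{z}]"
script, frames = unimage_inference("a fox", script_fn, gen_fn, dec_fn)
```

The actual handoff in step 6 would pass the script text plus decoded keyframes to an external audio-video generator rather than returning strings.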

6. Datasets and Evaluation Metrics

Three principal datasets underpin the training stages:

  • ICL: 450K multi-shot text–image scripts.
  • DEL (text expert): 250K pure text scripts.
  • DEL (image expert): 250K single-shot text–image pairs.

UniMAGE is evaluated on ViStoryBench using the following metrics:

| Metric | Description | Reported UniMAGE score (vs. baseline) |
|---|---|---|
| Style Similarity (CSD) | Consistency of visual style | – |
| Character ID Similarity (CIDS) | Character identity preservation | 59.2 (vs. 57.0) |
| Prompt Adherence (Alignment) | Fidelity to user instructions | 80.8 (vs. 62.5) |
| OCCM | On-stage character count matching | 88.07 (vs. 87.0) |
| Image Quality (Inception) | Standard image quality measure | – |
| Aesthetics | Human-rated visual appeal | ≈4.55 (vs. 5.76) |

A human study with 50 participants ranks UniMAGE highest for narrative logic (GSB=0.72), character consistency, and overall quality (Zhang et al., 29 Dec 2025).

7. Significance and Implementation Context

UniMAGE demonstrates the technical feasibility of unifying imaginative reasoning and visual generation in a single scalable framework, substantially improving character consistency and narrative logic in script-to-video pipelines. The architecture, grounded in expert-routed multimodal attention with explicit expert specializations, enables direct re-implementation using Bagel/MoT codebases, provided access to the described datasets and hyperparameters. This suggests broad applicability for automated film production and creative multimedia authoring, especially in contexts requiring long-context story coherence and complex visual composition (Zhang et al., 29 Dec 2025).
