
Gemini Storybook Framework

Updated 8 February 2026
  • Gemini Storybook is a modular, multimodal framework that integrates narrative text, images, speech, and music into cohesive storybook videos using a multi-agent DAG architecture.
  • It employs specialized agents for tasks like brainstorming, chapter writing, and synchronized media generation to achieve iterative refinement and precise cross-modal alignment.
  • The framework allows seamless substitution of core models via structured JSON interfaces and supports rigorous evaluation with both automated metrics and human review.

Gemini Storybook refers to a modular, extensible framework for producing fully multimodal, AI-generated narrated storybooks, building on the MM-StoryAgent system. It delivers end-to-end pipelines that generate immersive storybook videos—synthesizing narrative text, semantically aligned images, speech narration, sound effects, and music—by orchestrating a suite of generative models and tools via a multi-agent paradigm. The architecture formalizes agent interactions as a directed acyclic graph (DAG), enabling compositionality, agent modularity, and seamless substitution of core models (including text, image, and audio backbones) using unified data interfaces (Xu et al., 7 Mar 2025).

1. System Architecture and Multi-Agent Coordination

The core of Gemini Storybook is a network of autonomous agents, each specialized for a stage or modality in story creation. The process begins with a user-provided setting and proceeds through sequential and parallelized modules, formalized as agents A₁ through A₁₄. Key responsibilities include:

  • Story Setting (pipeline input): the user supplies the initial scenario or theme.
  • Attractiveness-Oriented Story Agents:
    • QA Dialogue Agent (A₁): Iterative brainstorming via turn-based Q&A with LLMs.
    • Outline Writer (A₂): Produces structured outlines from dialogue transcripts.
    • Chapter Writer (A₃): Expands outline into sequential narrative chapters.
  • Modality-Specific Agents:
    • Image Prompt Generator (A₄): Encodes chapter semantics into concise prompts.
    • Role Extractor (A₅): Identifies and canonizes main characters.
    • Prompt Revisers & Reviewers (A₆↔A₇): Iterative refinement of image prompts.
    • Image Generator (A₈): Utilizes StoryDiffusion (Stable Diffusion XL variant) for coherent frame synthesis with self-attention over prior images.
    • Speech Agent (A₉): Deploys CosyVoice TTS for narration.
    • Sound/Music Agents (A₁₀–A₁₃): Extract, refine, and synthesize SFX via AudioLDM 2/Freesound and background music via MusicGen.
  • Video Composition Agent (A₁₄): Integrates all media (via MoviePy) into an aligned video artifact.
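As a concrete illustration of the typed payloads these agents exchange, the sketch below shows a hypothetical message from the Chapter Writer (A₃) to the Image Prompt Generator (A₄). The field names are illustrative assumptions, not the framework's published schema.

```python
import json

# Hypothetical inter-agent message: A3 (Chapter Writer) -> A4 (Image Prompt
# Generator). Field names are invented for illustration only.
message = {
    "sender": "A3_chapter_writer",
    "receiver": "A4_image_prompt_generator",
    "payload": {
        "story_setting": "a fox learns to share",
        "chapters": [
            {"index": 1, "title": "The Lonely Fox", "text": "Once upon a time..."},
            {"index": 2, "title": "A New Friend", "text": "One morning..."},
        ],
    },
}

def validate(msg: dict) -> bool:
    """Check the minimal structure every payload in this sketch must carry."""
    return (
        {"sender", "receiver", "payload"} <= msg.keys()
        and all({"index", "text"} <= ch.keys() for ch in msg["payload"]["chapters"])
    )

encoded = json.dumps(message)  # the serialized form that travels between agents
assert validate(json.loads(encoded))
```

Because every agent only depends on the schema of its inbound message, a replacement model that emits the same fields can be dropped in without touching downstream consumers.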

Inter-agent communication utilizes typed JSON payloads, formalized as $m_{i \to j}$, in a DAG topology. The pipeline can be sketched as:

$$\begin{aligned}
&\text{Setting} \xrightarrow{A_1} \{\text{dialogs}\} \xrightarrow{A_2} \{\text{outline}\} \xrightarrow{A_3} \{\text{chapters}\} \\
&\{\text{chapters}\} \xrightarrow{A_4, A_5, A_6, A_7} \{\text{img\_prompts}\} \xrightarrow{A_8} \{\text{images}\} \\
&\{\text{chapters}\} \xrightarrow{A_9} \{\text{speech}\} \\
&\{\text{chapters}\} \xrightarrow{A_{10}, A_{11}} \{\text{sfx\_prompts}\} \xrightarrow{A_{12}} \{\text{sfx}\} \\
&\{\text{chapters}\} \xrightarrow{A_{13}} \{\text{music}\} \xrightarrow{A_{14}} \text{final video}
\end{aligned}$$

This system architecture enables both pipeline extensibility and the integration of new model APIs, as all agents operate with structured, swappable I/O schemas.
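The DAG orchestration itself can be sketched with Python's standard-library topological sorter. The agent callables and edge set below are simplified stand-ins for a subset of the pipeline, not the framework's actual implementation.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each "agent" is a callable taking a dict of upstream outputs and returning
# its own output dict; only a slice of the full A1..A14 pipeline is shown.
agents = {
    "A1_dialogue": lambda deps: {"dialogs": f"Q&A about {deps['setting']['theme']}"},
    "A2_outline": lambda deps: {"outline": ["ch1", "ch2"]},
    "A3_chapters": lambda deps: {
        "chapters": [f"text for {c}" for c in deps["A2_outline"]["outline"]]
    },
    "A9_speech": lambda deps: {
        "speech": [f"tts({c})" for c in deps["A3_chapters"]["chapters"]]
    },
    "A14_compose": lambda deps: {"video": "final.mp4"},  # would call MoviePy
}
edges = {  # node -> set of predecessors, mirroring the DAG above
    "A1_dialogue": {"setting"},
    "A2_outline": {"A1_dialogue"},
    "A3_chapters": {"A2_outline"},
    "A9_speech": {"A3_chapters"},
    "A14_compose": {"A9_speech"},
}

def run_pipeline(setting: dict) -> dict:
    """Execute agents in dependency order, threading outputs through the DAG."""
    outputs = {"setting": setting}
    for node in TopologicalSorter(edges).static_order():
        if node == "setting":  # the user-provided input, not an agent
            continue
        deps = {p: outputs[p] for p in edges[node]}
        outputs[node] = agents[node](deps)
    return outputs

result = run_pipeline({"theme": "friendship"})
```

The topological order guarantees every agent sees its predecessors' outputs; independent branches (images, speech, SFX, music) could equally be dispatched in parallel once their shared `chapters` dependency is ready.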

2. Multi-Stage Writing and Brainstorming Pipeline

Gemini Storybook applies a multi-stage authorial process designed to maximize narrative expressiveness and control:

  1. Dialogue-Enhanced Brainstorming (A₁):

For $t = 1 \ldots T_d$ turns:
  • $Q_t \leftarrow$ LLM simulating an "Amateur Writer" poses a question based on the setting and dialogue history.
  • $A_t \leftarrow$ LLM acting as an "Expert Writer" provides a domain-refined answer.

  2. Outline Generation (A₂):

$$\text{Outline} = \mathrm{LLM}(\text{Prompt}_{\text{outline}} \mid \text{Dialogue history})$$

  3. Sequential Chapter Writing (A₃):

For $k = 1 \ldots K$:
  • $\text{Context}_{<k} \leftarrow \text{concat}(\text{setting, outline, chapters}[1 \ldots k{-}1])$
  • $\text{Chapter}_k = \arg\max_s \mathrm{LLM}(\text{Prompt}_{\text{chapter}}(k) \mid \text{Context}_{<k})$

This pipeline supports iterative control and refinement at every writing stage, facilitating authoring of both plot and role continuity.
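Under this formulation, the three writing stages can be sketched as plain Python loops around an LLM call; the `llm` stub and the prompt strings below are assumptions for illustration, not the framework's actual prompts.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    return f"<generated for: {prompt[:40]}>"

def brainstorm(setting: str, n_turns: int = 3) -> list[tuple[str, str]]:
    """A1: alternate an 'Amateur Writer' question with an 'Expert Writer' answer."""
    history: list[tuple[str, str]] = []
    for t in range(n_turns):
        q = llm(f"[Amateur Writer] Ask question {t + 1} about: {setting} | {history}")
        a = llm(f"[Expert Writer] Answer: {q}")
        history.append((q, a))
    return history

def write_outline(dialogue: list) -> str:
    """A2: condense the dialogue transcript into a structured outline."""
    return llm(f"Write an outline from: {dialogue}")

def write_chapters(setting: str, outline: str, k: int) -> list[str]:
    """A3: generate chapters sequentially, conditioning on all prior chapters."""
    chapters: list[str] = []
    for i in range(k):
        context = f"{setting} | {outline} | {chapters}"  # Context_{<k}
        chapters.append(llm(f"Write chapter {i + 1} given: {context}"))
    return chapters

dialogue = brainstorm("a fox learns to share")
chapters = write_chapters("a fox learns to share", write_outline(dialogue), k=3)
```

The key design point is that chapter $k$ is generated with the full concatenated context of everything written so far, which is what sustains plot and role continuity across chapters.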

3. Generative Modules for Multimodal Asset Synthesis

3.1 Image Generation

  • Prompt Engineering: A₅ extracts canonical role vectors $R = \{R_1, \ldots, R_M\}$, A₄ generates visual prompts from chapter text, and a loop of A₆ (reviser) and A₇ (reviewer) iteratively refines prompts within $T_p$ steps.
  • StoryDiffusion Backbone: Enforces cross-frame role consistency. For timestep $t$, image $I_t$ is generated with self-attention over prior frames:

$$\tilde{Q}_t, \tilde{K}_t, \tilde{V}_t = \mathrm{attn}\bigl(Q_t,\ [K_{<t}, K_t],\ [V_{<t}, V_t]\bigr)$$

and the generator is trained to reconstruct each frame conditioned on its prompt and its predecessors:

$$\min_\theta \sum_{t=1}^{K} \mathbb{E}_\varepsilon \bigl\| \hat{I}_t - G_\theta(P_t, I_{<t}, \varepsilon) \bigr\|^2$$
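The reviser/reviewer loop bounded by $T_p$ can be sketched as follows; the pass criterion in `review` and the heuristic in `revise` are placeholder assumptions standing in for what are, in the framework, LLM calls.

```python
T_P = 4  # maximum refinement iterations (the T_p bound)

def review(prompt: str, roles: list[str]) -> bool:
    """A7 stand-in: accept once every canonical role is named in the prompt."""
    return all(r in prompt for r in roles)

def revise(prompt: str, roles: list[str]) -> str:
    """A6 stand-in: append the first missing role description."""
    missing = [r for r in roles if r not in prompt]
    return f"{prompt}, featuring {missing[0]}" if missing else prompt

def refine_prompt(draft: str, roles: list[str]) -> str:
    """Iterate revise/review until the reviewer accepts or T_P is exhausted."""
    prompt = draft
    for _ in range(T_P):
        if review(prompt, roles):
            break
        prompt = revise(prompt, roles)
    return prompt

final = refine_prompt("a sunny meadow", ["Felix the fox", "Hana the hare"])
```

Capping the loop at $T_p$ steps bounds cost while still letting the reviewer reject prompts that drop a canonical character, which is what keeps roles consistent across frames.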

3.2 Audio Synthesis

  • Narration: CosyVoice TTS processes the entire narrative, producing $S_{\text{speech}}$.
  • Sound Effects: an A₁₀/A₁₁ loop extracts and polishes SFX prompts; AudioLDM 2 or the Freesound API synthesizes each $S_{\text{sfx},i}$.
  • Music: a single prompt to MusicGen (or retrieval) generates $S_{\text{music}}$, matched to the story context.

4. Temporal and Multimodal Alignment Strategies

Precise multimedia synchronization is critical for immersive storybook experiences.

  • Temporal Assignment: For chapter/page $k$, the narrated duration $t_k$ determines how long image $I_k$ is displayed; each SFX track $s_i$ is stretched or truncated to $t_k$; music is adjusted to fit $T_{\text{vid}} = \sum_k t_k$.
  • Fine-Grained Audio-Visual Sync: Frame timestamps $x$ of $I_k$ and mel-spectrogram frames $y$ of $S_{\text{speech}}$ are temporally aligned via dynamic time warping (DTW):

$$\mathrm{DTW}(x, y) = \min_\pi \sum_{(i, j) \in \pi} d(x_i, y_j)$$

This allows optional speech warping to match visual transitions.

  • Visual Effects: Includes random slow pans/zooms for each $I_k$, with slide transitions for inter-chapter continuity.
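A minimal sketch of both alignment mechanisms, assuming scalar per-frame features for readability: a dynamic-programming DTW matching the objective above, and a naive resample standing in for production-grade time-stretching.

```python
import numpy as np

def dtw_cost(x: np.ndarray, y: np.ndarray) -> float:
    """Classic DTW: min over monotone alignments pi of sum d(x_i, y_j),
    computed by dynamic programming with absolute-difference frame distance."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(float(x[i - 1]) - float(y[j - 1]))
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def fit_to_duration(audio: np.ndarray, target_len: int) -> np.ndarray:
    """Stretch or truncate a track to the page's narrated duration t_k.
    Naive index resampling here; a real system would time-stretch instead."""
    idx = np.linspace(0, len(audio) - 1, target_len).round().astype(int)
    return audio[idx]

# Identical sequences align at zero cost; offset ones cost more.
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
assert dtw_cost(a, a) == 0.0
```

In practice $x$ and $y$ would be vector-valued (frame embeddings and mel frames) with a vector distance $d$, but the recursion is identical.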

5. Evaluation Metrics and Benchmarking

Gemini Storybook includes rigorous objective and subjective evaluation protocols.

  • Automated Metrics:
    • Textual Story Quality: scored by GPT-4 on Attractiveness (A), Warmth (W), and Education (E), each on a 1–5 scale.
    • Cross-Modal Alignment:
      • I–T (image–text): CLIPScore
      • S–T / M–T (speech/music–text): CLAPScore
      • I–S / I–M (image–sound/music): Wav2CLIP
  Metric              Direct Baseline   StoryAgent
  Attractiveness (A)       3.80            3.94
  Warmth (W)               4.18            4.21
  Education (E)            3.58            3.79
  I–T (CLIPScore)          0.297           0.316
  S–T                      0.214           0.240
  M–T                      0.301           0.525
  I–S                      0.054           0.049
  I–M                      0.042           0.049
  • Human Evaluation: Three raters review 20 videos, scoring 1–5 on text quality and modality alignment. In human ratings, StoryAgent outperforms the Direct pipeline by +0.17 on both Warmth and Education, and improves image alignment from 2.77 to 3.47 and music alignment from 2.57 to 2.93.
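The cross-modal scores above all reduce to cosine similarity between unit-normalized encoder embeddings. The sketch below uses synthetic vectors; a real evaluation would obtain them from CLIP, CLAP, or Wav2CLIP encoders, and CLIPScore additionally rescales the clipped cosine.

```python
import numpy as np

def alignment_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two modality embeddings, clipped at zero
    (the clipping convention used by CLIPScore-style metrics)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return max(0.0, float(a @ b))

rng = np.random.default_rng(0)
img_emb = rng.normal(size=512)                   # stand-in image embedding
txt_emb = img_emb + 0.1 * rng.normal(size=512)   # near-aligned text embedding
score = alignment_score(img_emb, txt_emb)        # close to 1 for aligned pairs
```

This also explains why the table's absolute values differ by metric family: each encoder pair embeds into its own space, so scores are comparable across systems but not across rows.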

6. Extensibility, Agent Swappability, and Open Source

A defining feature is agent modularity. Each module communicates using structured JSON, permitting its replacement with alternative models that comply with the same schema. To integrate an AI model such as a future Gemini component:

  • Textual Generation: Swap LLM (A₁–A₃) with Gemini-Text.
  • Image Synthesis: Replace StoryDiffusion with Gemini-Image.
  • Audio Synthesis: Substitute CosyVoice, MusicGen, and AudioLDM with Gemini-Audio API.

No core pipeline reengineering is necessary; new endpoints are registered via the agent registry. The open-source codebase (on GitHub) and demo (via HuggingFace Spaces) provide reference implementations and extensibility blueprints.
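The swap-by-registration pattern can be sketched as below; the registry name, decorator, and backend functions are hypothetical illustrations, not the project's actual API.

```python
from typing import Callable

# Hypothetical agent registry: one callable per pipeline slot.
AGENT_REGISTRY: dict[str, Callable[[str], str]] = {}

def register(slot: str):
    """Decorator that binds a backend to a named agent slot (last write wins)."""
    def deco(fn: Callable[[str], str]):
        AGENT_REGISTRY[slot] = fn
        return fn
    return deco

@register("image_generator")
def story_diffusion(prompt: str) -> str:
    return f"story_diffusion_render({prompt})"

# Swapping in a different backbone is a single registration; the rest of the
# pipeline keeps calling the same slot.
@register("image_generator")
def gemini_image(prompt: str) -> str:
    return f"gemini_image_render({prompt})"

out = AGENT_REGISTRY["image_generator"]("a fox in a meadow")
```

Because callers resolve agents through the registry rather than importing a model directly, the substitution described above requires no changes to the DAG or to any downstream agent.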

This design enables research and enterprise users to adopt, extend, and customize Gemini Storybook pipelines for specialized or production-grade multimodal storytelling scaffolds, maintaining strict separation of concerns and promoting rapid innovation.
