Gemini Storybook Framework
- Gemini Storybook is a modular, multimodal framework that integrates narrative text, images, speech, and music into cohesive storybook videos using a multi-agent DAG architecture.
- It employs specialized agents for tasks like brainstorming, chapter writing, and synchronized media generation to achieve iterative refinement and precise cross-modal alignment.
- The framework allows seamless substitution of core models via structured JSON interfaces and supports rigorous evaluation with both automated metrics and human review.
Gemini Storybook refers to a modular, extensible framework for producing fully multimodal, AI-generated narrated storybooks, building on the MM-StoryAgent system. It delivers end-to-end pipelines that generate immersive storybook videos—synthesizing narrative text, semantically aligned images, speech narration, sound effects, and music—by orchestrating a suite of generative models and tools via a multi-agent paradigm. The architecture formalizes agent interactions as a directed acyclic graph (DAG), enabling compositionality, agent modularity, and seamless substitution of core models (including text, image, and audio backbones) using unified data interfaces (Xu et al., 7 Mar 2025).
1. System Architecture and Multi-Agent Coordination
The core of Gemini Storybook is a network of autonomous agents, each specialized for a stage or modality in story creation. The process begins with a user-provided setting and proceeds through sequential and parallelized modules, formalized as agents A₁ through A₁₄. Key responsibilities include:
- Story-Setting Input: Receives the user's initial scenario or theme and seeds the pipeline.
- Attractiveness-Oriented Story Agents:
- QA Dialogue Agent (A₁): Iterative brainstorming via turn-based Q&A with LLMs.
- Outline Writer (A₂): Produces structured outlines from dialogue transcripts.
- Chapter Writer (A₃): Expands outline into sequential narrative chapters.
- Modality-Specific Agents:
- Image Prompt Generator (A₄): Encodes chapter semantics into concise prompts.
- Role Extractor (A₅): Identifies and canonizes main characters.
- Prompt Revisers & Reviewers (A₆↔A₇): Iterative refinement of image prompts.
- Image Generator (A₈): Utilizes StoryDiffusion (Stable Diffusion XL variant) for coherent frame synthesis with self-attention over prior images.
- Speech Agent (A₉): Deploys CosyVoice TTS for narration.
- Sound/Music Agents (A₁₀–A₁₃): Extract, refine, and synthesize SFX via AudioLDM 2/Freesound and background music via MusicGen.
- Video Composition Agent (A₁₄): Integrates all media (via MoviePy) into an aligned video artifact.
Inter-agent communication utilizes typed JSON payloads, formalized as messages m_{i→j} passed from agent Aᵢ to agent Aⱼ along edges of the DAG topology. The pipeline can be sketched as:

A₁ → A₂ → A₃ → { (A₄–A₈) ∥ A₉ ∥ (A₁₀–A₁₃) } → A₁₄

where the bracketed image, speech, and sound/music stages run in parallel before video composition.
This system architecture enables both pipeline extensibility and the integration of new model APIs, as all agents operate with structured, swappable I/O schemas.
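The DAG execution and typed-JSON messaging described above can be sketched as follows. This is an illustrative minimal implementation, not the framework's actual code: the agent names, the `Agent` class, and the `run_dag` helper are assumptions for the sketch.

```python
import json

# Hypothetical sketch of DAG-based agent coordination with typed JSON payloads.
class Agent:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # maps a dict payload to a dict payload

    def run(self, payload: dict) -> dict:
        # Round-trip through JSON to enforce a serializable, swappable schema.
        return json.loads(json.dumps(self.fn(payload)))

def run_dag(agents, edges, inputs):
    """Execute agents in topological order; edges[name] lists its parent agents."""
    outputs = {}
    for name in agents:  # assumed already topologically sorted
        merged = dict(inputs)
        for parent in edges.get(name, []):
            merged.update(outputs[parent])  # consume parent outputs
        outputs[name] = agents[name].run(merged)
    return outputs

# Toy three-agent chain mirroring A1 -> A2 -> A3.
agents = {
    "A1_qa": Agent("A1_qa", lambda p: {"dialogue": f"Q&A on: {p['setting']}"}),
    "A2_outline": Agent("A2_outline", lambda p: {"outline": ["ch1", "ch2"]}),
    "A3_chapters": Agent("A3_chapters",
                         lambda p: {"chapters": [f"text of {c}" for c in p["outline"]]}),
}
edges = {"A2_outline": ["A1_qa"], "A3_chapters": ["A2_outline"]}
result = run_dag(agents, edges, {"setting": "a forest adventure"})
```

Because every agent consumes and emits plain JSON, replacing any node with a different model backend only requires the replacement to honor the same payload schema.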
2. Multi-Stage Writing and Brainstorming Pipeline
Gemini Storybook applies a multi-stage authorial process designed to maximize narrative expressiveness and control:
- Dialogue-Enhanced Brainstorming (A₁): For a fixed number of turns:
  - An LLM simulating an "Amateur Writer" initiates questions based on the setting and dialogue history.
  - An LLM acting as the "Expert Writer" provides domain-refined answers.
- Outline Generation (A₂): Distills the brainstorming transcript into a structured outline of chapters, each with a synopsis and its roles.
- Sequential Chapter Writing (A₃): For each chapter in order, the writer conditions on the outline entry and all previously written chapters, preserving plot and role continuity.
This pipeline supports iterative control and refinement at every writing stage, facilitating authoring of both plot and role continuity.
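The brainstorming and sequential chapter-writing loops above can be sketched as follows. This is a hedged illustration: `ask_llm` is a hypothetical stand-in for any chat-completion backend, and the function names are assumptions, not the framework's API.

```python
# Illustrative sketch of dialogue-enhanced brainstorming (A1) and
# sequential chapter writing (A3).

def ask_llm(role: str, prompt: str) -> str:
    # Placeholder: in practice this calls an LLM with a role-specific system prompt.
    return f"[{role}] response to: {prompt[:40]}"

def brainstorm(setting: str, turns: int = 3) -> list:
    history = []
    for _ in range(turns):
        context = setting + " | " + " ".join(history)
        question = ask_llm("Amateur Writer", context)  # poses a probing question
        answer = ask_llm("Expert Writer", question)    # gives a domain-refined answer
        history += [question, answer]
    return history

def write_chapters(outline: list) -> list:
    chapters = []
    for synopsis in outline:
        # Condition on the outline entry plus all previously written chapters,
        # so plot and role continuity carry across chapters.
        prior = " ".join(chapters)
        chapters.append(ask_llm("Chapter Writer", synopsis + " | " + prior))
    return chapters
```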
3. Generative Modules for Multimodal Asset Synthesis
3.1 Image Generation
- Prompt Engineering: A₅ extracts canonical role descriptions (name, appearance, traits) for each main character; A₄ generates visual prompts from chapter text; and a loop of A₆ (reviser) and A₇ (reviewer) iteratively refines each prompt within a bounded number of revision steps.
- StoryDiffusion Backbone: Enforces cross-frame role consistency via self-attention over previously generated frames. For chapter t, image Iₜ is generated conditioned on the refined prompt pₜ, the canonical role descriptions r, and the prior images:

Iₜ = G(pₜ, r, {I₁, …, Iₜ₋₁})
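The reviser/reviewer loop can be sketched as a bounded fixed-point iteration. Both callables below are hypothetical stand-ins for the LLM-backed A₆ and A₇ agents; the function name and signature are assumptions for illustration.

```python
# Hedged sketch of the A6/A7 prompt-refinement loop: the reviewer either
# accepts a prompt (returns None) or emits a critique, which the reviser
# applies, up to a fixed iteration budget.

def refine_prompt(prompt, revise, review, max_steps=3):
    for _ in range(max_steps):
        feedback = review(prompt)          # reviewer critique, or None if accepted
        if feedback is None:
            break
        prompt = revise(prompt, feedback)  # reviser applies the critique
    return prompt

# Usage with toy agents: the reviewer demands a style keyword once.
revise = lambda p, f: p + ", " + f
review = lambda p: None if "watercolor" in p else "watercolor style"
refined = refine_prompt("a fox in the forest", revise, review)
```

Bounding the loop at `max_steps` guarantees termination even when the reviewer never accepts.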
3.2 Audio Synthesis
- Narration: CosyVoice TTS converts each chapter's text into narration audio, yielding per-chapter waveforms and their durations.
- Sound Effects: An A₁₀/A₁₁ loop extracts and polishes SFX prompts; AudioLDM 2 (or the Freesound API) synthesizes a short effect clip per chapter.
- Music: A single prompt to MusicGen (or retrieval) generates one background-music track matched to the story context.
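The per-chapter audio assets feed directly into the alignment stage of Section 4. A minimal sketch of a container for them, with field names that are assumptions for illustration rather than the framework's schema:

```python
from dataclasses import dataclass

# Illustrative container for per-chapter audio assets; the narrated duration
# (seconds) later drives image display time and SFX fitting.
@dataclass
class ChapterAudio:
    narration_path: str       # TTS output for this chapter (e.g. CosyVoice)
    narration_seconds: float  # display duration for the matching image
    sfx_path: str             # synthesized or retrieved effect clip
    music_gain_db: float = -12.0  # keep background music under the narration

def total_duration(chapters):
    # The single background-music track is fitted to the sum of narrated durations.
    return sum(c.narration_seconds for c in chapters)
```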
4. Temporal and Multimodal Alignment Strategies
Precise multimedia synchronization is critical for immersive storybook experiences.
- Temporal Assignment: For each chapter/page, the narrated duration determines how long the corresponding image is displayed; the SFX clip is stretched or truncated to that duration; the music track is looped or trimmed to span the total video length.
- Fine-Grained Audio-Visual Sync: Per-frame visual embeddings and per-mel-frame audio embeddings are temporally aligned via Dynamic Time Warping (DTW), which finds the warp path minimizing cumulative pairwise embedding distance.
This allows optional speech warping to match visual transitions.
- Visual Effects: Random slow pans and zooms are applied to each displayed image, with slide transitions for inter-chapter continuity.
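The DTW alignment used for audio-visual sync can be sketched as follows. This is the standard dynamic-programming recurrence at illustrative scale, not the framework's implementation; the sequences stand in for visual-frame and mel-frame embeddings.

```python
# Minimal DTW: warp a visual embedding sequence against an audio mel-frame
# sequence by minimizing cumulative pairwise distance.

def dtw(visual, audio, dist=lambda a, b: abs(a - b)):
    n, m = len(visual), len(audio)
    INF = float("inf")
    # D[i][j] = minimal cost of aligning visual[:i] with audio[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(visual[i - 1], audio[j - 1])
            # step pattern: diagonal match, repeat a visual frame, or repeat an audio frame
            D[i][j] = cost + min(D[i - 1][j - 1], D[i][j - 1], D[i - 1][j])
    return D[n][m]  # total alignment cost; backtracking D yields the warp path
```

In practice the warp path (recovered by backtracking through `D`) is what drives the optional speech warping to match visual transitions; scalar sequences here would be embedding vectors with a cosine or Euclidean `dist`.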
5. Evaluation Metrics and Benchmarking
Gemini Storybook includes rigorous objective and subjective evaluation protocols.
- Automated Metrics:
- Textual Story Quality: Scored using GPT-4 on [Attractiveness (A), Warmth (W), Education (E)], 1–5 scale.
- Cross-Modal Alignment:
- I–T (image-text): CLIPScore
- S–T / M–T (speech/music-text): CLAPScore
- I–S / I–M (image-sound/music): Wav2CLIP
| Metric | Direct Baseline | StoryAgent |
|---|---|---|
| Attractiveness (A) | 3.80 | 3.94 |
| Warmth (W) | 4.18 | 4.21 |
| Education (E) | 3.58 | 3.79 |
| I–T (CLIPScore) | 0.297 | 0.316 |
| S–T | 0.214 | 0.240 |
| M–T | 0.301 | 0.525 |
| I–S | 0.054 | 0.049 |
| I–M | 0.042 | 0.049 |
- Human Evaluation: Three raters review 20 videos, scoring 1–5 on text and modality alignment. Gains for StoryAgent over Direct pipeline include +0.17 for Warmth and Education, image alignment (2.77→3.47), and music alignment (2.57→2.93).
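The automated cross-modal metrics in the table (CLIPScore, CLAPScore, Wav2CLIP) all reduce to a cosine similarity between embeddings of the two modalities in a shared space. A minimal sketch, ignoring CLIPScore's rescaling constant and assuming the embeddings have already been computed:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_score(emb_a, emb_b):
    # CLIPScore-style: clamp at zero so anti-correlated pairs score 0.
    return max(0.0, cosine_similarity(emb_a, emb_b))
```

Swapping in image/text, speech/text, or image/music embedding pairs yields the I–T, S–T, and I–M columns respectively.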
6. Extensibility, Agent Swappability, and Open Source
A defining feature is agent modularity. Each module communicates using structured JSON, permitting its replacement with alternative models that comply with the same schema. To integrate an AI model such as a future Gemini component:
- Textual Generation: Swap LLM (A₁–A₃) with Gemini-Text.
- Image Synthesis: Replace StoryDiffusion with Gemini-Image.
- Audio Synthesis: Substitute CosyVoice, MusicGen, and AudioLDM with Gemini-Audio API.
No core pipeline reengineering is necessary; new endpoints are registered via the agent registry. The open-source codebase (on GitHub) and demo (via HuggingFace Spaces) provide reference implementations and extensibility blueprints.
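The registry-based swap can be sketched as a name-to-callable mapping: because every agent honors the same JSON schema, replacing a backend only re-binds a registry entry. The registry structure and backend names below are hypothetical illustrations, not the project's actual API.

```python
# Hedged sketch of agent swappability via a registry of JSON-in/JSON-out callables.
AGENT_REGISTRY = {}

def register(name):
    def wrap(fn):
        AGENT_REGISTRY[name] = fn  # later registrations override earlier ones
        return fn
    return wrap

@register("image_generator")
def storydiffusion_backend(payload: dict) -> dict:
    return {"image": f"storydiffusion:{payload['prompt']}"}

# Swapping in a different image model touches only the registry entry; the
# rest of the pipeline keeps calling AGENT_REGISTRY["image_generator"].
@register("image_generator")
def gemini_image_backend(payload: dict) -> dict:
    return {"image": f"gemini-image:{payload['prompt']}"}
```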
This design enables research and enterprise users to adopt, extend, and customize Gemini Storybook pipelines for specialized or production-grade multimodal storytelling scaffolds, maintaining strict separation of concerns and promoting rapid innovation.