
StoryGAN: A Sequential Conditional GAN for Story Visualization

Published 6 Dec 2018 in cs.CV (arXiv:1812.02784v2)

Abstract: We propose a new task, called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity in generated images (frames), but more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation methods. We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators at the story and image levels, to enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperforms state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.

Citations (200)

Summary

  • The paper introduces a novel StoryGAN framework that leverages a deep Context Encoder and dual discriminators to generate globally consistent image sequences from multi-sentence stories.
  • The paper demonstrates superior performance on CLEVR-SV and Pororo-SV datasets, achieving higher image quality and improved SSIM and human evaluation scores over state-of-the-art models.
  • The paper suggests that the innovative Text2Gist approach and dual discriminator design can inspire future advancements in AI-driven automated comic strip and narrative video generation.


The paper introduces StoryGAN, a novel approach to a task it terms "Story Visualization": transforming a multi-sentence paragraph into a coherent sequence of images, one per sentence. Unlike video generation, which prioritizes frame-to-frame continuity, story visualization emphasizes global consistency of characters and scenes across the whole sequence. The paper identifies maintaining this consistency as the central challenge and proposes the StoryGAN framework as a solution.

StoryGAN Framework

StoryGAN is built upon the sequential conditional GAN framework, introducing innovative components not previously seen in image or video generation methods. It features a deep Context Encoder combining a GRU cell and a newly developed Text2Gist cell, specifically designed to capture and dynamically track the flow of the story. This Context Encoder effectively incorporates prior contextual information to enhance semantic coherence across generated images. The model further distinguishes itself with two discriminators operating at different levels: an image-level discriminator to ensure sentence-image relevance and a story-level discriminator to uphold the global consistency between image sequences and the overarching story.
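To make the Context Encoder concrete, here is a minimal NumPy sketch of the generation loop. It is illustrative, not the paper's implementation: random matrices stand in for learned weights, sentence and story embeddings are plain vectors, and Text2Gist is reduced to an elementwise sentence-conditioned gate rather than the paper's learned convolutional filter.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding / hidden dimension (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized weights stand in for learned parameters.
W = {name: rng.standard_normal((D, D)) * 0.1
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh", "Wf")}

def gru_step(h, s):
    """One GRU update: fold sentence embedding s into story context h."""
    z = sigmoid(W["Wz"] @ s + W["Uz"] @ h)          # update gate
    r = sigmoid(W["Wr"] @ s + W["Ur"] @ h)          # reset gate
    h_cand = np.tanh(W["Wh"] @ s + W["Uh"] @ (r * h))
    return (1 - z) * h + z * h_cand

def text2gist(h, s):
    """Simplified Text2Gist: a sentence-conditioned gate filters the
    story context into a 'gist' vector that conditions image generation.
    (The paper generates a learned filter from the sentence; this
    elementwise gate is a stand-in.)"""
    return sigmoid(W["Wf"] @ s) * h

def context_encoder(sentences, story_embedding):
    """Produce one gist vector per sentence while tracking story flow."""
    h = story_embedding.copy()   # initialize context from the full story
    gists = []
    for s in sentences:
        h = gru_step(h, s)       # dynamically update the story context
        gists.append(text2gist(h, s))
    return gists

story = rng.standard_normal(D)                       # full-story embedding
sents = [rng.standard_normal(D) for _ in range(4)]   # sentence embeddings
gists = context_encoder(sents, story)
print(len(gists), gists[0].shape)  # one gist per sentence
```

Each gist vector would then condition the image generator for its sentence; the shared, evolving context `h` is what carries character and scene information across the sequence.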

Evaluation and Empirical Results

Two novel datasets, CLEVR-SV and Pororo-SV, were crafted from existing datasets for comparative evaluation of StoryGAN. Results indicate that StoryGAN surpasses state-of-the-art models in image quality, contextual consistency, and human evaluation metrics. These enhancements are attributed primarily to the Context Encoder's efficacy and the dual discriminator framework, which collaboratively work to produce a sequence of high-fidelity images that maintain story coherence.
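The two-level discriminator design can be sketched in the same spirit. The function names, mean-pooling of image features, and random weights below are illustrative simplifications, not the paper's architecture; they only show the division of labor between the two critics.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # feature / embedding dimension (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized weights stand in for learned discriminator parameters.
W_img = rng.standard_normal((D, 2 * D)) * 0.1
w_img = rng.standard_normal(D) * 0.1
W_story = rng.standard_normal((D, 2 * D)) * 0.1
w_story = rng.standard_normal(D) * 0.1

def image_discriminator(image_feat, sentence_emb):
    """Image-level score: does this single image match its sentence?"""
    pair = np.concatenate([image_feat, sentence_emb])
    return sigmoid(w_img @ np.tanh(W_img @ pair))

def story_discriminator(image_feats, story_emb):
    """Story-level score: does the whole image sequence match the story?
    Mean-pooling the image features is a stand-in for the paper's
    learned sequence representation."""
    pooled = np.mean(image_feats, axis=0)
    pair = np.concatenate([pooled, story_emb])
    return sigmoid(w_story @ np.tanh(W_story @ pair))

img_feats = rng.standard_normal((4, D))   # one feature vector per image
sent_embs = rng.standard_normal((4, D))   # one embedding per sentence
story_emb = rng.standard_normal(D)
img_score = image_discriminator(img_feats[0], sent_embs[0])
seq_score = story_discriminator(img_feats, story_emb)
print(img_score, seq_score)  # both scores lie in (0, 1)
```

During training, both scores would feed the adversarial loss: the image-level term pushes per-image fidelity and sentence relevance, while the story-level term penalizes sequences that drift from the overall narrative.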

Numerical Results and Contributions

The paper presents a quantitative analysis using the Structural Similarity Index (SSIM) on the CLEVR-SV dataset, showing that StoryGAN's outputs are closer to the ground-truth images than those of competing methods. Human evaluation on the Pororo-SV dataset supplements these numerical results with subjective quality assessments, in which StoryGAN consistently scored higher than the baselines. Together, these findings support the paper's claims about StoryGAN's superior performance on the story visualization task.
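For reference, SSIM compares two images through their means, variances, and covariance. The sketch below is a single-window (global) version of the formula; the standard metric instead averages SSIM over small local windows, so treat this as a simplification rather than the evaluation code used in the paper.

```python
import numpy as np

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global (single-window) SSIM between two images in [0, 255].
    c1 and c2 are the standard stabilizing constants."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(img + rng.normal(0, 25, size=img.shape), 0, 255)
print(ssim_global(img, img))    # identical images give SSIM of 1
print(ssim_global(img, noisy))  # added noise lowers the score
```

Higher SSIM against the ground-truth frame indicates a structurally more faithful generation, which is why the paper uses it alongside human judgments.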

Implications and Future Directions

The implications of this research extend beyond the field of story visualization. The novel Text2Gist and dual discriminator approaches introduce promising avenues for further research in sequential data generation, potentially influencing areas such as automated comic strip generation and narrative-based video creation. Furthermore, this work sets the stage for advancements in understanding complex multi-modal sequences, contributing to the larger field of AI-driven content creation.

As for future developments, refining these methods on more intricate stories or more diverse datasets could push the boundaries of AI-assisted storytelling. Improved computational techniques and richer training data might likewise expand the capabilities of StoryGAN and similar models.

In summary, this paper presents significant advancements in the visualization of textual stories through images, suggesting substantial theoretical and practical contributions to the future of AI-generated content.
