Emu: Generative Pretraining in Multimodality

Published 11 Jul 2023 in cs.CV | (2307.05222v2)

Abstract: We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.

Summary

The paper introduces Emu, a unified multimodal model that uses a one-model-for-all autoregressive training process to integrate text, image, and video data.
The architecture combines EVA-CLIP, LLaMA, and Stable Diffusion; the model reaches a zero-shot CIDEr score of 112.4 on COCO image captioning.
The paper demonstrates that leveraging diverse large-scale datasets and few-shot prompting can significantly enhance multimodal AI performance across various tasks.

Analysis of "Generative Pretraining in Multimodality"

The paper introduces Emu, a Transformer-based multimodal foundation model that generates images and text from a multimodal context. Emu accepts single-modality or interleaved input (text, images, and video) without modality-specific handling. It is trained with a one-model-for-all autoregressive process: at each position in the sequence, the model either predicts the next text token or regresses the next visual embedding. Because a single objective covers every modality, the same training procedure applies to diverse data sources at scale.

Model Architecture and Training

Emu consists of four components: a Visual Encoder based on EVA-CLIP, a Causal Transformer that maps visual signals into a latent space, a Multimodal Modeling component built on LLaMA, and a Visual Decoder initialized from Stable Diffusion. Training uses a unified autoregressive objective that predicts the next element in a multimodal sequence, with a cross-entropy classification loss for text tokens and an L2 regression loss for visual embeddings. Central to the design is the Causal Transformer, which converts spatial (2D) visual signals into a 1D sequence of latent embeddings, so that autoregressive prediction operates on embeddings rather than in pixel space.
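
The unified objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `("text", ...)` / `("visual", ...)` step format, the use of a summed squared error for the L2 term, and the equal weighting and averaging of the two loss types are all assumptions made for illustration.

```python
import math

def cross_entropy(logits, target_id):
    """Classification loss for one text position (numerically stable log-softmax)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def l2_loss(pred, target):
    """Regression loss for one visual position: squared L2 distance (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def unified_loss(steps):
    """Average the per-position loss over an interleaved sequence.

    Each step is a hypothetical tuple:
      ("text", logits, target_token_id)            -> cross-entropy
      ("visual", predicted_embedding, target_emb)  -> L2 regression
    Equal weighting of the two terms is an assumption.
    """
    total = 0.0
    for kind, pred, target in steps:
        if kind == "text":
            total += cross_entropy(pred, target)
        elif kind == "visual":
            total += l2_loss(pred, target)
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    return total / len(steps)
```

The point of the sketch is that one loop handles both modalities: the position type alone decides whether the next element is classified or regressed.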

Emu is pretrained on large datasets including LAION-2B, LAION-COCO, MMC4, WebVid-10M, and the newly introduced YT-Storyboard-1B. Training runs on large-scale infrastructure, with batch sizes set separately for each dataset modality.

Evaluation and Results

Emu is evaluated on image captioning, visual question answering, video question answering, and text-to-image generation. In zero-shot evaluation, Emu surpasses state-of-the-art models on multiple benchmarks, and few-shot prompting improves its task-specific performance further. Emu also shows in-context learning ability, handling new tasks from only a few examples.
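
As a rough illustration of few-shot prompting with interleaved input, the sketch below assembles a prompt from a handful of image-caption examples followed by a query image. The function name, the `("image", ...)` / `("text", ...)` segment format, and the file names are hypothetical; the paper's actual prompt template and special tokens are not reproduced here.

```python
def build_few_shot_prompt(examples, query_image, instruction=""):
    """Return an interleaved list of ("image", x) and ("text", y) segments.

    examples: list of (image, caption) pairs used as in-context demonstrations.
    query_image: the image the model should describe next.
    instruction: optional text placed before the demonstrations.
    """
    sequence = []
    if instruction:
        sequence.append(("text", instruction))
    for image, caption in examples:
        sequence.append(("image", image))
        sequence.append(("text", caption))
    # The prompt ends on the query image, so the next element to generate is text.
    sequence.append(("image", query_image))
    return sequence

# Hypothetical two-shot captioning prompt.
prompt = build_few_shot_prompt(
    [("cat.jpg", "A cat on a sofa."), ("dog.jpg", "A dog in a park.")],
    "bird.jpg",
    instruction="Describe each image.",
)
```

With zero examples the same function yields a zero-shot prompt, which mirrors how the two evaluation settings differ only in the context supplied.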

The paper reports that Emu achieves a zero-shot CIDEr score of 112.4 in image captioning on the COCO benchmark, a substantial improvement over contemporary models. The instruction-tuned variant, Emu-I, aligns the model more closely with human intent and compares favorably with several larger models on the reported metrics.

Implications and Future Directions

Emu's contributions are multifaceted. The model's ability to perform diverse tasks such as image captioning and text-to-image generation positions it as a generalist multimodal interface. Emu's framework underlines the potential benefits of large-scale, diverse data integration, particularly when video-text datasets are incorporated into training.

The research has implications both for the design of multimodal Transformer architectures and for deploying large multimodal models (LMMs) in real-world use cases. Future work could refine the model's text-to-image generation, potentially improving the fidelity and relevance of generated images through more extensive fine-tuning or alternative architectures. Emu's use of video-derived data also opens avenues for richer applications in video content understanding and generation.

Overall, the paper provides a comprehensive evaluation of a versatile multimodal model and a strong reference point for multimodal AI research. Its use of diverse multimodal data and a unified training objective suggests compelling directions for further work on multimodal AI systems.
