
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

Published 30 Nov 2023 in cs.CV, cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS | (2311.18775v1)

Abstract: We present CoDi-2, a versatile and interactive Multimodal LLM (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers LLMs to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.


Summary

  • The paper introduces CoDi-2, an innovative multimodal model that performs in-context learning across text, vision, and audio.
  • Its architecture integrates specialized encoders, decoders, and diffusion models to achieve coherent interleaved generation.
  • Empirical results demonstrate superior performance in complex tasks like image editing, audio fusion, and reasoning in zero/few-shot settings.


Introduction

The paper "CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation" (2311.18775) presents CoDi-2, an advanced Multimodal LLM (MLLM) capable of handling complex multimodal interleaved instructions across any combination of modalities—text, vision, and audio. The unique architecture empowers the model to perform in-context learning, reasoning, and chatting tasks in an any-to-any input-output modality paradigm, marking a significant step in multimodal generation.

Architecture and Methodology

CoDi-2 is built upon a framework that integrates a multimodal LLM with specialized encoders and decoders for processing and generating content across different modalities. The core innovation lies in aligning these modalities with language, for both encoding and generation, to exploit the strong reasoning capabilities inherent in LLMs (Figure 1).

Figure 1: CoDi-2's architecture, incorporating encoder and decoder mechanisms for audio and vision inputs alongside an LLM.

The model employs diffusion models as decoders for image and audio generation, training with both a token loss and a pixel-level diffusion loss. This approach allows CoDi-2 to process interleaved inputs, such as language mixed with visual and auditory data, and to generate coherent, grounded multimodal outputs.
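To make the alignment idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: modality features are projected into the language model's embedding space, the backbone reads out both discrete text logits and continuous features that would condition a diffusion decoder, and training combines a token loss with a feature-level generation loss. All module names, dimensions, and the bidirectional stand-in backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnyToAnySketch(nn.Module):
    """Illustrative stand-in for the alignment idea: project modality
    features into the LLM's embedding space, then read out both text
    logits and continuous features for a diffusion decoder."""

    def __init__(self, d_feat=64, d_model=256, vocab=1000):
        super().__init__()
        self.proj_in = nn.Linear(d_feat, d_model)   # modality -> language space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in LLM
        self.lm_head = nn.Linear(d_model, vocab)    # discrete text tokens
        self.proj_out = nn.Linear(d_model, d_feat)  # features for a diffusion decoder

    def forward(self, text_emb, modality_feats):
        # Interleave text embeddings with projected image/audio features.
        x = torch.cat([text_emb, self.proj_in(modality_feats)], dim=1)
        h = self.backbone(x)
        return self.lm_head(h), self.proj_out(h)

model = AnyToAnySketch()
text_emb = torch.randn(2, 8, 256)   # pretend-tokenized text
feats = torch.randn(2, 4, 64)       # pretend image/audio encoder features
logits, out_feats = model(text_emb, feats)

# Combined objective: cross-entropy on text tokens plus a regression loss on
# the continuous features (a stand-in for the pixel/diffusion loss).
targets = torch.randint(0, 1000, (2, 12))
loss = F.cross_entropy(logits.transpose(1, 2), targets) \
     + F.mse_loss(out_feats[:, -4:], feats)
```

In the actual system the backbone is a pretrained autoregressive LLM and the output features drive full diffusion decoders; the sketch only shows where the two losses attach.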

Dataset Construction

Training CoDi-2 involves a large-scale generation dataset comprising in-context multimodal instructions that span text, vision, and audio. The dataset is constructed to support a wide range of zero-shot capabilities, including multimodal generation, reasoning, and compositionality tasks. In addition, text-only datasets are built to facilitate multimodal in-context learning, with textual descriptions standing in for multimodal components.
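The paper does not publish its data schema; as a purely hypothetical illustration of what one interleaved in-context example could look like, placeholder tags mark where encoded modality features would be spliced into the token stream:

```python
# Hypothetical schema for one interleaved in-context training example.
# Tags like <img_0> mark positions where encoder features are spliced in;
# for text-only ICL data, the tags would be replaced by textual descriptions.
example = {
    "instruction": [
        "Here is a photo: <img_0>.",
        "Edited in watercolor style, it becomes: <img_1>.",
        "Apply the same edit to this photo: <img_2>.",
    ],
    "target": "<img_3>",  # the model must generate the edited image
    "assets": {
        "<img_0>": "dog.png",
        "<img_1>": "dog_watercolor.png",
        "<img_2>": "cat.png",
        "<img_3>": "cat_watercolor.png",
    },
}
```

At training time, each tag would be resolved either to encoder features (for inputs) or to generation targets for the diffusion decoders (for outputs).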

Experimental Evaluation

Empirical evaluations demonstrate CoDi-2's strong performance on a variety of multimodal generation tasks, such as audio fusion and editing, complex image composition, reasoning, and exemplar-based learning (Figure 2).

Figure 2: Multi-round conversation between a human and CoDi-2, with in-context multimodal instructions for image editing.

CoDi-2 achieves substantial performance advancements over domain-specific models, particularly in subject-driven image generation, vision transformation, and audio editing. The model's adaptability and robustness in both zero-shot and few-shot settings highlight its potential for practical applications in interactive multimodal systems.

Conclusion

CoDi-2 represents a comprehensive advancement in the development of multimodal foundation models, effectively bridging the gap between in-context language-vision-audio interleaved instructions and multimodal output generation. Its architecture and training methodology pave the way for future exploration and enhancement of multimodal capabilities, positioning CoDi-2 as a critical tool in expanding the boundaries of AI-generated content.

Through its robust design and extensive dataset utilization, CoDi-2 holds promise for further advancements in AI, potentially inspiring new research directions aimed at refining and extending multimodal instructional generation systems.
