
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

Published 30 Nov 2023 in cs.CV, cs.AI, cs.CL, cs.LG, cs.SD, and eess.AS | (2311.18775v1)

Abstract: We present CoDi-2, a versatile and interactive Multimodal LLM (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers LLMs to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.


Summary

  • The paper introduces CoDi-2, an innovative multimodal model that performs in-context learning across text, vision, and audio.
  • Its architecture integrates specialized encoders, decoders, and diffusion models to achieve coherent interleaved generation.
  • Empirical results demonstrate superior performance in complex tasks like image editing, audio fusion, and reasoning in zero/few-shot settings.


Introduction

The paper "CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation" (2311.18775) presents CoDi-2, an advanced Multimodal LLM (MLLM) capable of handling complex multimodal interleaved instructions across any combination of modalities—text, vision, and audio. The unique architecture empowers the model to perform in-context learning, reasoning, and chatting tasks in an any-to-any input-output modality paradigm, marking a significant step in multimodal generation.

Architecture and Methodology

CoDi-2 is built upon a framework that integrates a multimodal LLM with specialized encoders and decoders for processing and generating content across different modalities. The core innovation lies in aligning these modalities with language, for both encoding and generation, to exploit the strong reasoning capabilities inherent in LLMs (Figure 1).

Figure 1: CoDi-2's architecture, incorporating encoder and decoder mechanisms for audio and vision inputs alongside an LLM.

The model employs diffusion models as decoders for image and audio generation, training with both a token loss and a pixel-level diffusion loss. This approach allows CoDi-2 to process interleaved inputs, such as language mixed with visual and auditory data, and to generate coherent, grounded multimodal outputs.
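To make the alignment idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: modality features are projected into the language model's embedding space, the backbone reads out both discrete text logits and continuous features that would condition a diffusion decoder, and training combines a token loss with a feature-level generation loss. All module names, dimensions, and the bidirectional stand-in backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnyToAnySketch(nn.Module):
    """Illustrative stand-in for the alignment idea: project modality
    features into the LLM's embedding space, then read out both text
    logits and continuous features for a diffusion decoder."""

    def __init__(self, d_feat=64, d_model=256, vocab=1000):
        super().__init__()
        self.proj_in = nn.Linear(d_feat, d_model)   # modality -> language space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in LLM
        self.lm_head = nn.Linear(d_model, vocab)    # discrete text tokens
        self.proj_out = nn.Linear(d_model, d_feat)  # features for a diffusion decoder

    def forward(self, text_emb, modality_feats):
        # Interleave text embeddings with projected image/audio features.
        x = torch.cat([text_emb, self.proj_in(modality_feats)], dim=1)
        h = self.backbone(x)
        return self.lm_head(h), self.proj_out(h)

model = AnyToAnySketch()
text_emb = torch.randn(2, 8, 256)   # pretend-tokenized text
feats = torch.randn(2, 4, 64)       # pretend image/audio encoder features
logits, out_feats = model(text_emb, feats)

# Combined objective: cross-entropy on text tokens plus a regression loss on
# the continuous features (a stand-in for the pixel/diffusion loss).
targets = torch.randint(0, 1000, (2, 12))
loss = F.cross_entropy(logits.transpose(1, 2), targets) \
     + F.mse_loss(out_feats[:, -4:], feats)
```

In the actual system the backbone is a pretrained autoregressive LLM and the output features drive full diffusion decoders; the sketch only shows where the two losses attach.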

Dataset Construction

Training CoDi-2 involves a large-scale generation dataset comprising in-context multimodal instructions that span text, vision, and audio. The dataset is constructed to support a wide range of zero-shot capabilities, including multimodal generation, reasoning, and compositionality tasks. In addition, text-only datasets are built to facilitate multimodal in-context learning, with textual descriptions standing in for multimodal components.
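The paper does not publish its data schema; as a purely hypothetical illustration of what one interleaved in-context example could look like, placeholder tags mark where encoded modality features would be spliced into the token stream:

```python
# Hypothetical schema for one interleaved in-context training example.
# Tags like <img_0> mark positions where encoder features are spliced in;
# for text-only ICL data, the tags would be replaced by textual descriptions.
example = {
    "instruction": [
        "Here is a photo: <img_0>.",
        "Edited in watercolor style, it becomes: <img_1>.",
        "Apply the same edit to this photo: <img_2>.",
    ],
    "target": "<img_3>",  # the model must generate the edited image
    "assets": {
        "<img_0>": "dog.png",
        "<img_1>": "dog_watercolor.png",
        "<img_2>": "cat.png",
        "<img_3>": "cat_watercolor.png",
    },
}
```

At training time, each tag would be resolved either to encoder features (for inputs) or to generation targets for the diffusion decoders (for outputs).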

Experimental Evaluation

Empirical evaluations demonstrate CoDi-2's strong performance on a variety of multimodal generation tasks, such as audio fusion and editing, complex image composition, reasoning, and exemplar-based learning (Figure 2).

Figure 2: Multi-round conversation between a human and CoDi-2, with in-context multimodal instructions for image editing.

CoDi-2 achieves substantial performance advancements over domain-specific models, particularly in subject-driven image generation, vision transformation, and audio editing. The model's adaptability and robustness in both zero-shot and few-shot settings highlight its potential for practical applications in interactive multimodal systems.

Conclusion

CoDi-2 represents a comprehensive advancement in the development of multimodal foundation models, effectively bridging the gap between in-context language-vision-audio interleaved instructions and multimodal output generation. Its architecture and training methodology pave the way for future exploration and enhancement of multimodal capabilities, positioning CoDi-2 as a critical tool in expanding the boundaries of AI-generated content.

Through its robust design and extensive dataset utilization, CoDi-2 holds promise for further advancements in AI, potentially inspiring new research directions aimed at refining and extending multimodal instructional generation systems.
