
Kosmos-2: Grounding Multimodal Large Language Models to the World

Published 26 Jun 2023 in cs.CL and cs.CV | arXiv:2306.14824v3

Abstract: We introduce Kosmos-2, a Multimodal LLM (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.

Citations (562)

Summary

  • The paper introduces Kosmos-2, a model that integrates text with visual inputs using a novel Markdown-style linking technique.
  • Its methodology creates large-scale grounded image-text pairs from datasets like LAION-2B and COYO-700M to train on referring expressions and visual grounding.
  • Evaluations on phrase grounding and visual tasks demonstrate Kosmos-2's competitive performance and robust zero-shot multimodal comprehension.

Overview of Kosmos-2: Grounding Multimodal LLMs to the World

The paper introduces Kosmos-2, a Multimodal LLM (MLLM) designed to enhance the capabilities of grounding LLMs with visual inputs. Kosmos-2 stands out by integrating object and region recognition directly into LLMs, allowing for seamless interaction between text and visual data.

Key Contributions

The authors have developed Kosmos-2 to incorporate grounding and referring capabilities by using a novel approach that represents referential expressions through Markdown-style links connecting text spans to bounding boxes in images. This is achieved using a dataset called GrIT, consisting of grounded image-text pairs constructed from multimodal corpora.
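As a concrete illustration, the grounded representation can be sketched as a plain string transformation: a text span in the caption becomes a Markdown-style link whose target is a sequence of location tokens. The token names below are illustrative; Kosmos-2's actual special-token vocabulary differs.

```python
def ground_caption(caption, span, loc_tokens):
    """Rewrite `caption` so that `span` becomes a Markdown-style link
    whose target is a sequence of discrete location tokens.
    Token names here are illustrative, not the model's real vocabulary."""
    target = "".join(f"<loc_{i}>" for i in loc_tokens)
    return caption.replace(span, f"[{span}]({target})", 1)

# Example: linking the noun phrase "a snowman" to two corner tokens.
print(ground_caption("a snowman warming by a fire", "a snowman", (68, 571)))
# -> [a snowman](<loc_68><loc_571>) warming by a fire
```

Because the grounded caption is still ordinary text, it can be fed to the model with the same next-token training objective as ungrounded data.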

Methodology

Kosmos-2 builds on previous MLLMs, advancing their ability to associate text directly with specific visual elements:

  • Grounded Image-Text Pair Creation: Using sources like LAION-2B and COYO-700M, the authors created a large-scale dataset, GrIT, by identifying and linking noun phrases in captions to image regions.
  • Model Architecture: Kosmos-2 utilizes a Transformer-based architecture, trained with both traditional text inputs and grounded image-text pairs to predict next-word tokens.
  • Input Representation: The model connects text spans to visual data using a sequence of tokens representing bounding box coordinates, which are then integrated into the text input via a Markdown-style hyperlink format.
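The location tokens come from discretizing the image into a grid of bins: a bounding box is reduced to two tokens, one for the bin containing its top-left corner and one for its bottom-right corner. A minimal sketch of this quantization, assuming a 32x32 grid (1,024 location tokens); the exact grid size and indexing scheme are implementation details:

```python
def box_to_location_tokens(box, image_size, num_bins=32):
    """Quantize a pixel-space bounding box (x1, y1, x2, y2) into two
    discrete location-token indices: top-left bin and bottom-right bin.
    num_bins=32 yields a 32x32 grid, i.e. 1024 possible tokens."""
    x1, y1, x2, y2 = box
    w, h = image_size

    def bin_index(x, y):
        col = min(int(x / w * num_bins), num_bins - 1)
        row = min(int(y / h * num_bins), num_bins - 1)
        return row * num_bins + col

    return bin_index(x1, y1), bin_index(x2, y2)

# A box spanning the whole 224x224 image maps to the first and last bins.
print(box_to_location_tokens((0, 0, 224, 224), (224, 224)))  # -> (0, 1023)
```

Quantizing boxes this way keeps the vocabulary finite, so box coordinates can be predicted with the same softmax over tokens as ordinary words.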

Evaluation

Kosmos-2 was evaluated across a broad spectrum of tasks to demonstrate its enhanced grounding capabilities:

  • Multimodal Grounding: Phrase grounding and referring expression comprehension tasks were used to quantify the model's proficiency in linking text to specific visual regions.
  • Multimodal Referring: The model was also assessed on its ability to generate descriptive text based on specified image regions, showcasing its capability to interpret visual information.
  • Perception-Language Tasks: These included standard image captioning and visual question answering, where the model's results were competitive with existing models.
  • Language Tasks: Evaluations confirmed that Kosmos-2 retains solid performance on conventional language understanding and generation tasks while integrating the new grounding capabilities.

Numerical Performance

On the phrase grounding task with the Flickr30k Entities dataset, Kosmos-2 achieved R@1 scores competitive with, and in some cases surpassing, models that require task-specific fine-tuning. Zero-shot performance on referring expression comprehension was also strong, particularly on RefCOCOg.
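The scoring rule behind these numbers is the standard intersection-over-union criterion: a prediction counts as correct when the top-ranked box overlaps the ground-truth box with IoU of at least 0.5. A minimal sketch of that rule (standard practice for these benchmarks, not code from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, golds, threshold=0.5):
    """Fraction of top-1 predicted boxes whose IoU with the gold box
    meets the threshold -- the criterion behind R@1 for phrase grounding
    and accuracy for referring expression comprehension."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(preds)
```

With this criterion, a predicted box that covers only a quarter of the target region scores as a miss even though it partially overlaps, which is why IoU@0.5 is a meaningfully stricter test than mere overlap.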

Implications and Future Directions

The introduction of grounding capabilities marks a significant step in the progression towards more sophisticated artificial intelligence systems capable of complex multimodal tasks. The potential applications span interactive AI systems, enhanced image captioning, and refined visual question answering. Future considerations include refining the model's understanding of varied human expressions and extending its capabilities to more diverse and nuanced multimodal datasets.

Kosmos-2 presents an innovative approach to the intersection of language and visual perception, setting the groundwork for future developments towards artificial general intelligence where LLMs can seamlessly integrate and interact with the visual world.
