
Compositional Chain-of-Thought Prompting for Large Multimodal Models

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, and cs.LG (arXiv:2311.17076v3)

Abstract: The combination of strong visual backbones and LLM reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT


Summary

  • The paper introduces a zero-shot method, Compositional Chain-of-Thought (CCoT), that generates scene graphs using LMMs to improve visual reasoning without additional tuning.
  • It employs a two-step prompting process that first creates structured scene graphs and then integrates them into response generation for enhanced compositional understanding.
  • Experiments on benchmarks like Winoground and MMBench show that CCoT significantly boosts compositional reasoning in models such as GPT-4V and LLaVA.

Introduction

The paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" addresses the limitations in compositional visual reasoning present in even the most advanced Large Multimodal Models (LMMs), such as LLaVA and GPT-4V. These models often treat images merely as collections of objects, which impedes the understanding of complex visual scenes involving relationships between objects and their attributes. Scene graphs (SGs) have been shown to bridge the gap between visual and textual data by formalizing these elements. However, they require extensive annotation, making them expensive and impractical for large-scale use, and fine-tuning LMMs with SGs can cause catastrophic forgetting of pretraining objectives.

To overcome these challenges, the authors propose a novel zero-shot Chain-of-Thought prompting mechanism dubbed Compositional Chain-of-Thought (CCoT). This method leverages SG representations without the need for annotated SG data or model fine-tuning. Instead of relying solely on pre-existing SG annotations, the CCoT approach generates SGs using LMMs and employs them in a two-step prompting process to extract and utilize compositional knowledge effectively (Figure 1).

Figure 1: A high-level overview of the Compositional Chain-of-Thought (CCoT) approach.

Background

Large Multimodal Models (LMMs): LMMs integrate the powerful reasoning capabilities of LLMs with visual perception models to achieve strong performance on vision-language tasks. Despite these advances, adapting such models to new structured objectives typically demands extensive annotated data and risks eroding pretrained capabilities through fine-tuning.

Scene Graphs and Multimodal Prompting: Scene graphs provide structured representations of visual scenes, capturing objects together with their attributes and interrelations. Chain-of-Thought (CoT) methodologies have previously demonstrated improved reasoning in LLMs. Recent strategies such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enrich sequential reasoning with more structured thought processes, and CCoT builds on this line of work by integrating SGs into multimodal prompting without additional training.

Compositionality: In the VL context, compositionality refers to understanding and reasoning over multi-component structures within visual data. Studies have identified significant gaps in existing models' capacity for compositional reasoning, largely attributing these to oversimplified, object-centered visual processing; the toy example below illustrates the distinction.
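
As a toy illustration (not drawn from the paper) of why object-centered processing falls short: the two captions below contain exactly the same words, so a bag-of-words view cannot separate them, yet their scene graphs differ in the direction of a single relation.

```python
# Same objects ("dog", "cat") and same relation vocabulary ("chasing"),
# but the direction of the edge distinguishes the two scenes.
caption_a = "a dog chasing a cat"
caption_b = "a cat chasing a dog"

sg_a = {"objects": ["dog", "cat"],
        "relationships": [{"subject": "dog", "relation": "chasing", "object": "cat"}]}
sg_b = {"objects": ["dog", "cat"],
        "relationships": [{"subject": "cat", "relation": "chasing", "object": "dog"}]}

assert sorted(caption_a.split()) == sorted(caption_b.split())  # identical bag of words
assert sg_a != sg_b  # the scene graphs still tell the scenes apart
```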

Compositional Chain-of-Thought (CCoT)

The CCoT approach is a two-step prompting procedure designed to enhance LMMs' compositional reasoning without the drawbacks of annotated SG data or model fine-tuning.

Step 1: Scene Graph Generation

To circumvent the need for annotated SG data, the method first uses the LMM itself to generate a scene graph S_g. This graph lays out an organized structure of objects, their attributes, and their relationships, conditioned on the given image and task prompt. Requesting the graph in JSON format standardizes the representation and makes it straightforward for the model to consume in the next step (Figure 2).

Figure 2: Full prompt example of CCoT.
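
To make this step concrete, the following minimal sketch shows one way the scene-graph generation prompt could be assembled and issued. It is an illustration under stated assumptions, not the paper's exact implementation: the `lmm` callable is a hypothetical stand-in for any LMM interface (LLaVA, GPT-4V, etc.), and the prompt wording paraphrases the structure described above.

```python
import json

def lmm(image, prompt: str) -> str:
    """Hypothetical stand-in for an LMM call: takes an image and a text
    prompt, returns the model's text completion."""
    raise NotImplementedError("plug in an LMM of your choice")

# Step 1: ask the LMM for a question-relevant scene graph in JSON.
SG_PROMPT = (
    "For the provided image and its associated question, generate a scene "
    "graph in JSON format that includes: (1) objects relevant to answering "
    "the question, (2) their attributes, and (3) the relationships between "
    "them.\nQuestion: {question}"
)

def generate_scene_graph(image, question: str) -> dict:
    raw = lmm(image, SG_PROMPT.format(question=question))
    # e.g. {"objects": [{"name": "dog", "attributes": ["brown"]}, ...],
    #       "relationships": [{"subject": "dog", "relation": "chasing",
    #                          "object": "cat"}]}
    return json.loads(raw)
```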

Step 2: Response Generation

In this phase, the generated scene graph serves as an intermediate representation that anchors reasoning and response generation. Because the scene graph is supplied directly in the prompt, the model can produce better-informed answers to visual questions without the risk of catastrophic forgetting that fine-tuning would incur.
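
Continuing the sketch above (same assumed `lmm` interface and illustrative prompt wording), the second step serializes the generated scene graph back into the prompt as textual context for the final answer:

```python
import json  # reuses lmm() and generate_scene_graph() from the step 1 sketch

def answer_with_scene_graph(image, question: str, scene_graph: dict) -> str:
    # Step 2: the scene graph produced in step 1 becomes in-context
    # evidence; the LMM answers the original question with it.
    prompt = (
        f"Scene graph: {json.dumps(scene_graph)}\n"
        "Use the image and the scene graph above as context to answer the "
        f"following question: {question}"
    )
    return lmm(image, prompt)

# End-to-end usage: two prompting calls, no fine-tuning, no annotated SGs.
# sg = generate_scene_graph(image, question)
# answer = answer_with_scene_graph(image, question, sg)
```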

Experiments and Results

The CCoT methodology was tested on a variety of popular LMM architectures, including InstructBLIP-13B, LLaVA-1.5, SPHINX, and GPT-4V, showing marked improvements across vision-language benchmarks such as Winoground, WHOOPS!, SEED-Bench, and MMBench (Figure 3).

Figure 3: Example outputs showcasing the method's successes and failures.

CCoT delivered significant gains on compositional benchmarks without any additional training. In particular, the prompted models improved on tasks requiring reasoning over object attributes and relationships, exactly the cases where conventional LMM prompting tends to fall short.

Conclusion

The Compositional Chain-of-Thought (CCoT) method offers an innovative way to advance the compositional reasoning capacities of large multimodal models. By using generated scene graphs in a zero-shot prompting procedure, CCoT improves visual-linguistic understanding and reasoning across a range of datasets without annotation-heavy SGs or fine-tuning. The approach not only mitigates existing model limitations but also provides a scalable, training-free recipe for more complex reasoning applications.
