
Compositional Chain-of-Thought Prompting for Large Multimodal Models

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, and cs.LG (arXiv:2311.17076v3)

Abstract: The combination of strong visual backbones and LLM reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT


Summary

  • The paper introduces a zero-shot method, Compositional Chain-of-Thought (CCoT), that generates scene graphs using LMMs to improve visual reasoning without additional tuning.
  • It employs a two-step prompting process that first creates structured scene graphs and then integrates them into response generation for enhanced compositional understanding.
  • Experiments on benchmarks like Winoground and MMBench show that CCoT significantly boosts compositional reasoning in models such as GPT-4V and LLaVA.

Introduction

The paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" addresses the limitations in compositional visual reasoning present in even the most advanced Large Multimodal Models (LMMs), such as LLaVA and GPT-4V. These models often treat images merely as collections of objects, which impedes the understanding of complex visual scenes involving relationships between objects and their attributes. Scene graphs (SGs) have been shown to bridge the gap between visual and textual data by formalizing these elements. However, they require extensive annotation, making them expensive and impractical for large-scale use, and fine-tuning LMMs with SGs can cause catastrophic forgetting of pretraining objectives.

To overcome these challenges, the authors propose a novel zero-shot Chain-of-Thought prompting mechanism dubbed Compositional Chain-of-Thought (CCoT). This method leverages SG representations without the need for annotated SG data or model fine-tuning. Instead of relying solely on pre-existing SG annotations, the CCoT approach generates SGs using LMMs and employs them in a two-step prompting process to extract and utilize compositional knowledge effectively (Figure 1).

Figure 1: A high-level overview of the Compositional Chain-of-Thought (CCoT) approach.

Background

Large Multimodal Models (LMMs): LMMs integrate the powerful reasoning capabilities of LLMs with visual perception models to achieve strong performance on vision-language tasks. Despite these advances, adapting such models to new structured objectives typically demands extensive annotated data and risks eroding pretrained capabilities through fine-tuning.

Scene Graphs and Multimodal Prompting: Scene graphs provide structured representations of visual scenes, capturing objects together with their attributes and interrelations. Chain-of-Thought (CoT) methodologies have previously demonstrated improved reasoning in LLMs. Recent strategies such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enrich sequential reasoning with more structured thought processes, and CCoT builds on this line of work by integrating SGs into multimodal prompting without additional training.

Compositionality: In the VL context, compositionality refers to understanding and reasoning over multi-component structures within visual data. Studies have identified significant gaps in existing models' capacity for compositional reasoning, largely attributing these to oversimplified, object-centered visual processing; the toy example below illustrates the distinction.
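
As a toy illustration (not drawn from the paper) of why object-centered processing falls short: the two captions below contain exactly the same words, so a bag-of-words view cannot separate them, yet their scene graphs differ in the direction of a single relation.

```python
# Same objects ("dog", "cat") and same relation vocabulary ("chasing"),
# but the direction of the edge distinguishes the two scenes.
caption_a = "a dog chasing a cat"
caption_b = "a cat chasing a dog"

sg_a = {"objects": ["dog", "cat"],
        "relationships": [{"subject": "dog", "relation": "chasing", "object": "cat"}]}
sg_b = {"objects": ["dog", "cat"],
        "relationships": [{"subject": "cat", "relation": "chasing", "object": "dog"}]}

assert sorted(caption_a.split()) == sorted(caption_b.split())  # identical bag of words
assert sg_a != sg_b  # the scene graphs still tell the scenes apart
```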

Compositional Chain-of-Thought (CCoT)

The CCoT approach is a two-step prompting procedure designed to enhance LMMs' compositional reasoning without the drawbacks of annotated SG data or model fine-tuning.

Step 1: Scene Graph Generation

To circumvent the need for annotated SG data, the method first uses the LMM itself to generate a scene graph S_g. This graph lays out an organized structure of objects, their attributes, and their relationships, conditioned on the given image and task prompt. Requesting the graph in JSON format standardizes the representation and makes it straightforward for the model to consume in the next step (Figure 2).

Figure 2: Full prompt example of CCoT.
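
To make this step concrete, the following minimal sketch shows one way the scene-graph generation prompt could be assembled and issued. It is an illustration under stated assumptions, not the paper's exact implementation: the `lmm` callable is a hypothetical stand-in for any LMM interface (LLaVA, GPT-4V, etc.), and the prompt wording paraphrases the structure described above.

```python
import json

def lmm(image, prompt: str) -> str:
    """Hypothetical stand-in for an LMM call: takes an image and a text
    prompt, returns the model's text completion."""
    raise NotImplementedError("plug in an LMM of your choice")

# Step 1: ask the LMM for a question-relevant scene graph in JSON.
SG_PROMPT = (
    "For the provided image and its associated question, generate a scene "
    "graph in JSON format that includes: (1) objects relevant to answering "
    "the question, (2) their attributes, and (3) the relationships between "
    "them.\nQuestion: {question}"
)

def generate_scene_graph(image, question: str) -> dict:
    raw = lmm(image, SG_PROMPT.format(question=question))
    # e.g. {"objects": [{"name": "dog", "attributes": ["brown"]}, ...],
    #       "relationships": [{"subject": "dog", "relation": "chasing",
    #                          "object": "cat"}]}
    return json.loads(raw)
```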

Step 2: Response Generation

In this phase, the generated scene graph serves as an intermediate representation that anchors reasoning and response generation. Because the scene graph is supplied directly in the prompt, the model can produce better-informed answers to visual questions without the risk of catastrophic forgetting that fine-tuning would incur.
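
Continuing the sketch above (same assumed `lmm` interface and illustrative prompt wording), the second step serializes the generated scene graph back into the prompt as textual context for the final answer:

```python
import json  # reuses lmm() and generate_scene_graph() from the step 1 sketch

def answer_with_scene_graph(image, question: str, scene_graph: dict) -> str:
    # Step 2: the scene graph produced in step 1 becomes in-context
    # evidence; the LMM answers the original question with it.
    prompt = (
        f"Scene graph: {json.dumps(scene_graph)}\n"
        "Use the image and the scene graph above as context to answer the "
        f"following question: {question}"
    )
    return lmm(image, prompt)

# End-to-end usage: two prompting calls, no fine-tuning, no annotated SGs.
# sg = generate_scene_graph(image, question)
# answer = answer_with_scene_graph(image, question, sg)
```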

Experiments and Results

The CCoT methodology was tested on a variety of popular LMM architectures, including InstructBLIP-13B, LLaVA-1.5, SPHINX, and GPT-4V, showing marked improvements across vision-language benchmarks such as Winoground, WHOOPS!, SEED-Bench, and MMBench (Figure 3).

Figure 3: Example outputs showcasing the method's successes and failures.

CCoT delivered significant gains on compositional benchmarks without any additional training. In particular, the prompted models improved on tasks requiring reasoning over object attributes and relationships, exactly the cases where conventional LMM prompting tends to fall short.

Conclusion

The Compositional Chain-of-Thought (CCoT) method offers an innovative way to advance the compositional reasoning capacities of large multimodal models. By using generated scene graphs in a zero-shot prompting procedure, CCoT improves visual-linguistic understanding and reasoning across a range of datasets without annotation-heavy SGs or fine-tuning. The approach not only mitigates existing model limitations but also provides a scalable, training-free recipe for more complex reasoning applications.
