- The paper introduces SparklesChat, a multimodal model that interleaves text and images to achieve natural, coherent dialogues.
- The paper presents SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
- The paper proposes SparklesEval, a GPT-assisted benchmark that quantitatively measures multi-image reasoning and dialogue coherence, on which SparklesChat significantly outperforms MiniGPT-4.
An Overview of the SparklesChat Paper
The paper presents a novel approach to enhancing multimodal instruction-following models by introducing SparklesChat, a model designed to conduct open-ended dialogues involving multiple images. The study targets a significant challenge in existing models such as MiniGPT-4, which struggle to maintain dialogue coherence in interactions spanning multiple images, a limitation attributed primarily to the absence of a specialized dataset tailored for such applications. To bridge this gap, the authors developed SparklesDialogue, a machine-generated dataset specifically designed for word-level interleaved interactions involving multiple images and text. The paper also introduces SparklesEval, a benchmark created to evaluate a model's conversational competence in this setting.
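To make the idea of word-level interleaving concrete, a dialogue record in such a dataset might look roughly like the following. This is a hypothetical sketch: the field names, the `IMAGE#N` placeholder convention, and the helper `make_turn` are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a word-level interleaved multi-image dialogue
# record. Field names and the IMAGE#N placeholder are assumptions for
# illustration, not the schema used by SparklesDialogue.

IMAGE_TOKEN = "IMAGE#{idx}"  # placeholder spliced into text at word level


def make_turn(role, text, image_ids):
    """A single dialogue turn whose text references images inline."""
    return {"role": role, "text": text, "images": image_ids}


dialogue = [
    make_turn(
        "user",
        f"Compare the mood of {IMAGE_TOKEN.format(idx=1)} with "
        f"{IMAGE_TOKEN.format(idx=2)} and suggest a caption for both.",
        ["img_001.jpg", "img_002.jpg"],
    ),
    make_turn(
        "assistant",
        f"{IMAGE_TOKEN.format(idx=1)} feels serene, while "
        f"{IMAGE_TOKEN.format(idx=2)} is energetic; a shared caption "
        "could be 'two tempos of the same city'.",
        [],
    ),
]

# The key property: image references sit inside sentences, at the word
# level, rather than being prepended as a block before the text.
assert "IMAGE#1" in dialogue[0]["text"]
```

The point of the sketch is the placement of image references mid-sentence, which is what distinguishes word-level interleaving from the common pattern of attaching all images before the prompt text.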
Key Contributions
- SparklesChat Model: SparklesChat is a multimodal instruction-following model that facilitates dialogues over multiple images at a fine-grained word level. It integrates image representations interleaved within text to mimic natural human interaction more effectively.
- SparklesDialogue Dataset: This is the first dataset generated for interleaved multi-image and text interactions. It leverages the capabilities of GPT-4 to simulate user-assistant dialogues, creating robust and diverse conversational data with subsets sourced from different image and description datasets.
- SparklesEval Benchmark: A quantitative evaluation benchmark, SparklesEval uses a GPT-assisted system to assess models on three key criteria: image understanding and reasoning, cross-image and cross-turn coherence, and relevance and completeness of responses. This creates a comprehensive framework to measure multimodal conversational competence.
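The three-criterion judging scheme above can be sketched as a simple score aggregation. This is a hypothetical illustration: the criterion names follow the paper's three axes, but the 1-10 scale handling and the plain-mean aggregation are assumptions, not the paper's actual judging protocol.

```python
# Hypothetical sketch of combining per-criterion judge scores into a
# single SparklesEval-style dialogue score. The plain mean used here is
# an assumption about the aggregation, not the paper's exact method.

CRITERIA = (
    "image_understanding_and_reasoning",
    "cross_image_cross_turn_coherence",
    "relevance_and_completeness",
)


def combined_score(judge_scores):
    """Average 1-10 scores for the three criteria into one score."""
    missing = [c for c in CRITERIA if c not in judge_scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(judge_scores[c] for c in CRITERIA) / len(CRITERIA)


# Example: a dialogue judged 9, 8, and 9 on the three criteria.
score = combined_score({
    "image_understanding_and_reasoning": 9,
    "cross_image_cross_turn_coherence": 8,
    "relevance_and_completeness": 9,
})
print(round(score, 2))  # -> 8.67
```

Scoring each criterion separately, rather than asking a judge for one holistic number, is what lets the benchmark attribute failures to image reasoning versus cross-turn coherence versus answer completeness.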
Experimental Evaluation
The research team conducted extensive experiments to validate SparklesChat's capabilities against existing models. On established vision-language tasks, SparklesChat outperformed MiniGPT-4, achieving 56.7% accuracy on the BISON binary image selection task and 58.0% on the NLVR2 visual reasoning task. On the SparklesEval benchmark, SparklesChat scored 8.56 out of 10, markedly surpassing MiniGPT-4's score of 3.91 and approaching GPT-4's score of 9.26.
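For context, accuracy on a binary image-selection task such as BISON reduces to the fraction of examples where the model picks the correct image of the two candidates. The following is a generic sketch of that computation, not the paper's evaluation code.

```python
# Generic sketch of accuracy for a binary image-selection task: the
# model picks one of two candidate images per example, and accuracy is
# the fraction of correct picks. Chance level is therefore 50%.


def binary_selection_accuracy(predictions, gold):
    """Fraction of examples where the chosen image matches the answer."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy example: 3 of 4 picks correct.
print(binary_selection_accuracy(["A", "B", "A", "A"],
                                ["A", "B", "B", "A"]))  # -> 0.75
```

Against the 50% chance baseline, the reported 56.7% on BISON corresponds to roughly 567 correct selections per 1000 examples.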
Implications and Future Directions
The introduction of SparklesChat along with its supporting dataset and benchmark has significant practical and theoretical implications. Practically, it enhances the capacity of multimodal models to engage in more coherent and contextually aware dialogues involving intricate image and text interactions, expanding the applicability of AI in fields requiring rich visual dialogue understanding, such as smart assistants and educational tools. Theoretically, the work highlights the importance of specialized datasets and evaluation benchmarks tailored to specific capabilities of AI models, providing new opportunities for improving and evaluating multimodal systems.
Future developments could explore advancements in vision encoders to further improve image understanding, as well as the expansion of datasets to include a wider array of image types and more complex dialogues. Additionally, improving the interpretability and efficiency of these models remains a promising area for research, alongside enhancing their ability to handle real-world applications through fine-grained contextual and visual understanding.
Overall, this paper lays the groundwork for future research and development in multimodal instruction-following models, offering substantial resources and insights into the challenge of integrating and reasoning over diverse visual inputs within conversational frameworks.