
InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

Published 9 May 2023 in cs.CV (arXiv:2305.05662v4)

Abstract: We present an interactive visual framework named InternGPT, or iGPT for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions like pointing movements that enable users to directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) can provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternGPT stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions and thereby significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2. Additionally, in iGPT, an auxiliary control mechanism is used to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (rated at 93.89% GPT-4 quality by ChatGPT-3.5-turbo). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternGPT.

Citations (72)

Summary

  • The paper introduces InternGPT, a novel framework that integrates large language models with gesture-based inputs to enhance precision in visual tasks.
  • It employs a perception unit, an LLM controller, and an open-world toolkit to process images and videos using methods like SAM and OCR.
  • User studies reveal that InternGPT outperforms traditional systems in efficiency and quality, underscoring its potential in fields such as healthcare and surveillance.

Overview of InternGPT: A Framework for Vision-Centric Tasks

The paper presents InternGPT (iGPT), an innovative framework designed to enhance interaction with vision-centric tasks by integrating LLMs with non-verbal instructions. This system advances the current state-of-the-art in interactive systems by incorporating pointing gestures in addition to text-based communication, thereby improving the efficiency and accuracy of AI-driven visual tasks.

Problem Statement

Contemporary methods for vision-centric tasks primarily rely on purely language-based instructions, which can be inefficient and imprecise, particularly in complex visual scenarios involving multiple objects. iGPT aims to overcome these limitations by permitting users to interact through both verbal and non-verbal cues, offering a more intuitive and precise interface for task completion.

Key Contributions

InternGPT integrates LLMs with visual and pointing interactions through three main components:

  1. Perception Unit: This component processes pointing instructions on images or videos, enabling precise object selection and manipulation. It uses techniques such as SAM for promptable segmentation and OCR for text extraction.
  2. LLM Controller: The controller facilitates the parsing and execution of complex language commands. It leverages an auxiliary control mechanism to ensure precise task execution, even when the LLM struggles with API invocation.
  3. Open-World Toolkit: This toolkit incorporates a variety of online models and applications, enabling the system to perform a wide range of tasks, from image editing to video annotation. Notable tools include Stable Diffusion and Husky, a large vision-language model optimized for high-quality multi-modal dialogue.
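The interplay of the three components can be sketched as a simple pipeline: a pointing input is localized by the perception unit, and the controller maps a language command onto a registered tool, falling back to a default when parsing fails. This is a minimal illustrative sketch, not the paper's implementation; all class names, the stub tools, and the keyword-matching "planner" are assumptions standing in for SAM, the LLM, and the real open-world toolkit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class PerceptionUnit:
    """Turns a pointing instruction (e.g. a click) into a region record.
    Stand-in for SAM segmentation / OCR in the real system."""
    def localize(self, image_id: str, click_xy: Tuple[int, int]) -> Dict:
        return {"image": image_id, "region": {"center": click_xy}}

@dataclass
class Toolkit:
    """Registry of callable tools (image editing, captioning, etc.)."""
    tools: Dict[str, Callable[[Dict], str]] = field(default_factory=dict)
    def register(self, name: str, fn: Callable[[Dict], str]) -> None:
        self.tools[name] = fn

@dataclass
class LLMController:
    """Maps a command to a tool; the fallback mimics (very loosely) the
    paper's auxiliary control mechanism for failed API invocations."""
    toolkit: Toolkit
    def run(self, command: str, region: Dict) -> str:
        # Toy "plan": pick the first registered tool whose name appears
        # in the command; fall back to 'describe' when nothing matches.
        plan = next((t for t in self.toolkit.tools if t in command), "describe")
        return self.toolkit.tools[plan](region)

perception = PerceptionUnit()
toolkit = Toolkit()
toolkit.register("describe", lambda r: f"object near {r['region']['center']}")
toolkit.register("remove", lambda r: f"removed object at {r['region']['center']}")
controller = LLMController(toolkit)

region = perception.localize("img_001", (120, 80))
print(controller.run("please remove this", region))  # routes to the 'remove' tool
```

In the actual system the planner is an LLM emitting tool invocations and the tools are heavyweight models, but the control flow (localize, plan, dispatch, recover on failure) follows this shape.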

Numerical Results and Evaluation

In user studies, iGPT demonstrated increased efficiency over traditional interactive systems like Visual ChatGPT, requiring fewer attempts and shorter prompts to achieve satisfactory results in vision-centric tasks. Additionally, the framework received favorable rankings in user preference due to its improved interactivity and output quality.

The paper also showcases the capabilities of Husky, a significant component of the system, which achieves 93.89% GPT-4 quality on various dialogue tasks, as judged by ChatGPT-3.5-turbo.

Implications and Future Developments

iGPT has the potential to reshape human-computer interaction by offering a more responsive and adaptable framework for vision tasks. Its design allows it to cater to various interaction levels, from basic command execution to complex reasoning involving multi-modal instructions.

The introduction of pointing gestures enriches communication paradigms between humans and machines, potentially fostering advancements in fields such as autonomous vehicles, healthcare imaging, and smart surveillance. Moreover, integrating more sophisticated task allocation mechanisms could further enhance the system’s scalability and adaptability.

Future directions include improving model performance and interaction scalability, refining user interfaces, and exploring additional applications requiring intricate coordination between language and vision models.

Conclusion

InternGPT represents a forward step in interactive visual frameworks, merging the strengths of LLMs with intuitive gesture-based control. It provides a robust baseline for future development, emphasizing user-centric design and multi-modal interaction to improve the accuracy and efficiency of vision-centric tasks.
