
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Published 3 Jan 2025 in cs.CV, cs.SD, and eess.AS (arXiv:2501.01957v3)

Abstract: Recent Multimodal LLMs (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

Summary

  • The paper presents a three-stage training methodology that seamlessly integrates vision and speech modalities for real-time interaction.
  • It employs advanced architectures like InternViT and TiCodec to effectively process multimodal data, achieving competitive benchmark results.
  • Extensive evaluations demonstrate robust performance in vision-language tasks, video understanding, and multilingual ASR, eliminating the need for separate ASR and TTS systems.

VITA-1.5: Advancements in Multimodal LLMs with Vision and Speech Integration

This paper presents "VITA-1.5," a novel Multimodal LLM (MLLM) designed to integrate visual and speech modalities for seamless real-time interactions. This research addresses the significant challenge of effectively combining visual and speech information to enhance multimodal dialogue systems, which have traditionally focused more heavily on visual-textual modalities. The researchers propose a multi-stage training methodology that allows the model to manage and optimize the distinct features of vision and speech data while maintaining efficient processing capabilities.

Key Contributions

  1. Three-stage Training Methodology:
    • Stage 1 (Vision-Language Training): This stage focuses on integrating visual data into the LLM, employing a strategy that emphasizes vision alignment, vision understanding, and vision supervised fine-tuning (SFT).
    • Stage 2 (Audio Input Tuning): Here, the model undergoes audio alignment to bridge the gap between speech and language. A subset of vision-language adaptations is utilized to help the model understand and respond to audio input.
    • Stage 3 (Audio Output Tuning): This final stage introduces speech output capabilities, removing the need for external text-to-speech (TTS) modules and enhancing the user experience through end-to-end speech generation.
  2. Model Architecture: VITA-1.5 incorporates a sophisticated architecture that includes vision and audio encoders, adapters, and a non-autoregressive speech decoder. Through the use of advanced components like InternViT and TiCodec, the model is structured to manage multimodal inputs and outputs efficiently.
  3. Data Utilization: The model utilizes a comprehensive dataset covering a variety of modalities, including images, videos, speech transcription pairs, and text-speech pairs, sourced from diverse benchmarks to inform its training strategy.
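The three-stage schedule above can be made concrete as a staged-freezing loop. The sketch below is an illustrative assumption, not the paper's exact recipe: the module names follow the components the summary mentions (InternViT vision encoder, audio encoder, adapters, LLM backbone, TiCodec-based non-autoregressive speech decoder), and which modules are trainable in each stage is a simplified guess at the progressive-training idea.

```python
# Hypothetical sketch of a three-stage multimodal training schedule.
# Module names follow the paper's components; the per-stage freezing
# policy is an illustrative assumption, not the published recipe.

class Module:
    def __init__(self, name):
        self.name = name
        self.trainable = False

model = {name: Module(name) for name in [
    "vision_encoder",   # InternViT
    "vision_adapter",
    "audio_encoder",
    "audio_adapter",
    "llm",              # LLM backbone
    "speech_decoder",   # non-autoregressive, TiCodec-based
]}

# Assumed trainable sets per stage (simplified for illustration).
STAGES = {
    1: {"vision_adapter", "llm"},   # vision-language training
    2: {"audio_adapter", "llm"},    # audio input tuning
    3: {"speech_decoder"},          # audio output tuning
}

def configure_stage(model, stage):
    """Freeze everything, then unfreeze only the modules tuned in this stage."""
    for m in model.values():
        m.trainable = False
    for name in STAGES[stage]:
        model[name].trainable = True
    return sorted(n for n, m in model.items() if m.trainable)

for stage in (1, 2, 3):
    print(f"stage {stage}: trainable = {configure_stage(model, stage)}")
```

The design point this illustrates is that each stage perturbs only a small part of the model, which is how the approach can add speech capabilities without eroding the vision-language capacity learned in Stage 1.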

Evaluation and Results

VITA-1.5 undergoes extensive evaluation across a broad array of benchmarks:

  • Vision-Language Capabilities: The model performs on par with leading open-source MLLMs and even outperforms some proprietary models on image understanding and reasoning tasks. The architecture effectively retains its vision-language strengths after both speech-tuning stages.
  • Video Understanding: Results indicate comparable performance to open-source models, though there exists potential for further development relative to top proprietary models.
  • Speech Recognition (ASR): VITA-1.5 shows strong results in both Mandarin and English ASR tasks, outperforming specialized speech models and thereby confirming the model's robust multimodal integration.

Implications and Future Directions

The research outlined in this paper has both practical and theoretical implications. Practically, VITA-1.5 is a substantial step toward more capable and efficient interactive multimodal dialogue systems: eliminating separate ASR and TTS components reduces end-to-end system latency.
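The latency argument can be seen as a matter of pipeline shape. The toy sketch below (a simplification, with made-up stand-in functions; none of these are the paper's actual interfaces) contrasts a cascaded pipeline, where speech must be serialized into text twice, with the end-to-end path in which one model maps input speech directly to output speech.

```python
# Conceptual contrast between a cascaded speech pipeline and an
# end-to-end speech-to-speech model. All functions are toy stand-ins.

def cascaded_pipeline(audio_in, asr, llm, tts):
    """Three separate models; each hop adds a serialization boundary."""
    text_in = asr(audio_in)    # speech -> text
    reply = llm(text_in)       # text -> text
    return tts(reply)          # text -> speech

def end_to_end_pipeline(audio_in, mllm):
    """One model maps speech directly to speech, no intermediate text hops."""
    return mllm(audio_in)

# Toy stand-ins: each tag marks one model hop the audio passes through.
asr = lambda a: a + "->asr"
llm = lambda t: t + "->llm"
tts = lambda t: t + "->tts"
mllm = lambda a: a + "->mllm"

print(cascaded_pipeline("audio", asr, llm, tts))  # audio->asr->llm->tts
print(end_to_end_pipeline("audio", mllm))         # audio->mllm
```

Fewer hops means fewer points where output must be fully materialized before the next stage starts, which is the intuition behind the reduced response latency claimed above.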

Theoretically, the novel training methodology demonstrates a viable framework for future endeavors in multimodal integration. Future developments in AI can build on this approach to further enhance the harmony between disparate modalities, potentially leading to even greater advancements in real-time human-computer interaction systems.

Overall, VITA-1.5 represents a significant contribution to the field of multimodal LLMs, providing a flexible and efficient architecture and training strategy that balances and optimizes both vision-language and speech interaction without compromising on performance in any individual domain.
