OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Published 23 Oct 2024 in cs.CL, cs.AI, cs.SD, and eess.AS (arXiv:2410.17799v2)

Abstract: Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text LLM backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).


Summary

  • The paper introduces OmniFlatten, a full-duplex GPT-based model that enables real-time, natural voice conversations.
  • It employs a three-stage training process—modality alignment, half-duplex, and full-duplex dialogue learning—to merge speech and text inputs into unified sequences.
  • Experimental results show improved dialogue quality and a low average turn-taking response time of 160 ms, underscoring the model's suitability for real-time applications.

Introduction

The development of full-duplex spoken dialogue systems marks a significant advance over traditional turn-based systems: by enabling simultaneous bidirectional communication, they capture the dynamics of human dialogue, including interruptions, backchannels, and overlapping speech. Achieving low latency and natural interaction in full-duplex systems, however, remains challenging. The paper "OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation" (2410.17799) introduces OmniFlatten, a novel end-to-end GPT-based model that addresses these challenges by modeling complex conversational behaviors with low latency.

Methodology

OmniFlatten employs a multi-stage post-training method to convert a pre-existing text-based LLM into a speech-text dialogue model capable of real-time conversation generation. This process involves three key stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning, with data standardized using a flattening operation that merges multi-modal inputs into a unified sequence (Figure 1).

Figure 1: The overview of OmniFlatten as an end-to-end full-duplex spoken dialogue model.
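
To make the flattening operation concrete, here is a minimal sketch in Python. It interleaves fixed-size chunks from several token streams into one flat sequence that a decoder-only GPT can model autoregressively; the chunk size, stream order, and placeholder token ids are illustrative assumptions, not the paper's exact configuration.

```python
from typing import List

def flatten_streams(streams: List[List[int]], chunk_size: int = 4) -> List[int]:
    """Interleave fixed-size chunks from each stream, round-robin, into one sequence."""
    flat: List[int] = []
    max_len = max(len(s) for s in streams)
    for start in range(0, max_len, chunk_size):
        for stream in streams:
            flat.extend(stream[start:start + chunk_size])  # tail chunks may be shorter
    return flat

# Example: two placeholder token streams flattened into a single GPT input.
user_speech = list(range(100, 112))       # hypothetical user speech-token ids
assistant_speech = list(range(200, 212))  # hypothetical assistant speech-token ids
print(flatten_streams([user_speech, assistant_speech], chunk_size=4))
```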

Audio Tokenization and Modality Alignment

The approach begins with audio tokenization, where continuous speech signals are converted into discrete speech tokens using a vector quantization layer, enabling the model to encode semantic information from audio. The initial training stage aligns the speech and text modalities, turning the text LLM into a multimodal model proficient in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS). This is achieved through supervised fine-tuning on paired speech-text data drawn from diverse datasets, ensuring accurate interpretation and generation in both modalities.
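
The toy sketch below illustrates the core idea behind vector-quantization-based tokenization: each frame-level speech embedding is replaced by the index of its nearest codebook entry, yielding a sequence of discrete speech tokens. The codebook size, embedding dimension, and random inputs are placeholders; the tokenizer actually used in the paper is more sophisticated than this toy quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 128))  # 1024 codes x 128 dims (assumed sizes)
frames = rng.normal(size=(50, 128))      # 50 frame-level speech embeddings (placeholder)

# Nearest-neighbour lookup: squared L2 distance from each frame to every code.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
speech_tokens = dists.argmin(axis=1)     # one discrete token id per frame
print(speech_tokens[:10])
```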

Dialogue Learning

Half-duplex Dialogue Training

After modality alignment, OmniFlatten is first trained on half-duplex dialogues, which fit naturally within the modality framework established in the previous stage. This training simulates turn-based exchanges, teaching the model to predict interleaved chunks of conversation and refining its response-generation capabilities, as shown in Figure 2.

Figure 2: Half-duplex Dialogue Training based on all four streams of speech and text tokens.
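
As one way to picture the half-duplex stage, the hedged sketch below serializes complete, non-overlapping turns, each carrying its text and speech tokens behind role markers, into a single flattened training sequence. The marker names and the text-before-speech ordering are assumptions introduced here for illustration.

```python
def flatten_half_duplex(turns):
    """turns: list of dicts with 'role', 'text_tokens', and 'speech_tokens'."""
    seq = []
    for turn in turns:
        seq.append(f"[{turn['role']}]")   # role marker (illustrative)
        seq.extend(turn["text_tokens"])   # text first, then speech (assumed order)
        seq.extend(turn["speech_tokens"])
        seq.append("[end_of_turn]")       # turn-boundary marker (illustrative)
    return seq

sample = flatten_half_duplex([
    {"role": "user", "text_tokens": ["t1", "t2"], "speech_tokens": ["s1", "s2", "s3"]},
    {"role": "assistant", "text_tokens": ["t3"], "speech_tokens": ["s4", "s5"]},
])
print(sample)
```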

Full-duplex Dialogue Training

Subsequent training stages progressively adapt the model to full-duplex dialogue: first with three-stream data (excluding the user text stream), then with two-stream data (removing the assistant text stream as well). This staged process reduces reliance on intermediate text, focusing the model on speech-to-speech interaction, minimizing latency, and mirroring real conversational dynamics (Figures 3 and 4). A sketch of the stream-selection schedule follows the figure captions below.

Figure 3: Full-duplex Dialogue Training based on three streams of full-duplex dialogue data.

Figure 4: Full-duplex Dialogue Training based on two streams of full-duplex dialogue data.
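
The stream-dropping schedule can be summarized in a few lines. This sketch selects which token streams enter the flattened sequence at each full-duplex stage; the selected streams could then be interleaved with a helper like flatten_streams from the earlier sketch. The dictionary keys are hypothetical names introduced here for illustration.

```python
def select_streams(sample: dict, stage: int) -> list:
    """Pick the ordered token streams used at each full-duplex training stage."""
    if stage == 1:   # three-stream data: user text stream excluded
        keys = ["user_speech", "assistant_text", "assistant_speech"]
    else:            # two-stream data: assistant text stream removed as well
        keys = ["user_speech", "assistant_speech"]
    return [sample[k] for k in keys]

example = {
    "user_speech": [1, 2, 3, 4],
    "assistant_text": [10, 11],
    "assistant_speech": [20, 21, 22, 23],
}
print(select_streams(example, stage=1))  # three streams
print(select_streams(example, stage=2))  # speech-to-speech only
```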

Experimental Results

OmniFlatten performs solidly across its training stages. The initial modality alignment stage yields reasonable ASR and TTS accuracy, demonstrating that the model learns to understand and produce speech directly, without heavy reliance on intermediate text.

The subsequent full-duplex training stages improve dialogue quality, with evaluations using strong LLMs as judges indicating substantive gains in conversational flow and response accuracy. The model achieves an average assistant turn-taking response time of 160 ms, confirming its low latency.

Conclusion

OmniFlatten's approach to full-duplex spoken dialogue presents a promising direction for achieving natural, human-like interaction in AI-driven systems without modifying the architecture of existing LLMs. Its streamlined, flattening-based training recipe offers a template for building efficient full-duplex systems capable of real-time, natural conversation. Future work will explore further improvements in data synthesis and expansion to additional modalities, including vision, to enrich multi-modal conversational agents.
