OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Published 23 Oct 2024 in cs.CL, cs.AI, cs.SD, and eess.AS (arXiv:2410.17799v2)

Abstract: Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text LLM backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).


Summary

  • The paper introduces OmniFlatten, a full-duplex GPT-based model that enables real-time, natural voice conversations.
  • It employs a three-stage training process—modality alignment, half-duplex, and full-duplex dialogue learning—to merge speech and text inputs into unified sequences.
  • Experimental results show improved dialogue quality and a low average turn-taking response time of 160 ms, underscoring the model's suitability for real-time applications.

Introduction

The development of full-duplex spoken dialogue systems marks a significant advance over traditional turn-based systems: by enabling simultaneous bidirectional communication, they capture the dynamics of human dialogue, including interruptions, backchannels, and overlapping speech. Achieving low latency and natural interaction in full-duplex systems, however, remains challenging. The paper "OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation" (2410.17799) introduces OmniFlatten, a novel end-to-end GPT-based model that addresses these challenges by modeling complex conversational behaviors with low latency.

Methodology

OmniFlatten employs a multi-stage post-training method to convert a pre-existing text-based LLM into a speech-text dialogue model capable of real-time conversation generation. This process involves three key stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning, with data standardized using a flattening operation that merges multi-modal inputs into a unified sequence (Figure 1).

Figure 1: The overview of OmniFlatten as an end-to-end full-duplex spoken dialogue model.
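
To make the flattening operation concrete, here is a minimal sketch in Python. It interleaves fixed-size chunks from several token streams into one flat sequence that a decoder-only GPT can model autoregressively; the chunk size, stream order, and placeholder token ids are illustrative assumptions, not the paper's exact configuration.

```python
from typing import List

def flatten_streams(streams: List[List[int]], chunk_size: int = 4) -> List[int]:
    """Interleave fixed-size chunks from each stream, round-robin, into one sequence."""
    flat: List[int] = []
    max_len = max(len(s) for s in streams)
    for start in range(0, max_len, chunk_size):
        for stream in streams:
            flat.extend(stream[start:start + chunk_size])  # tail chunks may be shorter
    return flat

# Example: two placeholder token streams flattened into a single GPT input.
user_speech = list(range(100, 112))       # hypothetical user speech-token ids
assistant_speech = list(range(200, 212))  # hypothetical assistant speech-token ids
print(flatten_streams([user_speech, assistant_speech], chunk_size=4))
```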

Audio Tokenization and Modality Alignment

The approach begins with audio tokenization, where continuous speech signals are converted into discrete speech tokens using a vector quantization layer, enabling the model to encode semantic information from audio. The initial training stage aligns the speech and text modalities, turning the text LLM into a multimodal model proficient in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS). This is achieved through supervised fine-tuning on paired speech-text data drawn from diverse datasets, ensuring accurate interpretation and generation in both modalities.
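
The toy sketch below illustrates the core idea behind vector-quantization-based tokenization: each frame-level speech embedding is replaced by the index of its nearest codebook entry, yielding a sequence of discrete speech tokens. The codebook size, embedding dimension, and random inputs are placeholders; the tokenizer actually used in the paper is more sophisticated than this toy quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 128))  # 1024 codes x 128 dims (assumed sizes)
frames = rng.normal(size=(50, 128))      # 50 frame-level speech embeddings (placeholder)

# Nearest-neighbour lookup: squared L2 distance from each frame to every code.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
speech_tokens = dists.argmin(axis=1)     # one discrete token id per frame
print(speech_tokens[:10])
```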

Dialogue Learning

Half-duplex Dialogue Training

After modality alignment, OmniFlatten is first trained on half-duplex dialogues, which fit naturally within the modality framework established in the previous stage. This training simulates turn-based exchanges, teaching the model to predict interleaved chunks of conversation and refining its response-generation capabilities, as shown in Figure 2.

Figure 2: Half-duplex Dialogue Training based on all four streams of speech and text tokens.
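
As one way to picture the half-duplex stage, the hedged sketch below serializes complete, non-overlapping turns, each carrying its text and speech tokens behind role markers, into a single flattened training sequence. The marker names and the text-before-speech ordering are assumptions introduced here for illustration.

```python
def flatten_half_duplex(turns):
    """turns: list of dicts with 'role', 'text_tokens', and 'speech_tokens'."""
    seq = []
    for turn in turns:
        seq.append(f"[{turn['role']}]")   # role marker (illustrative)
        seq.extend(turn["text_tokens"])   # text first, then speech (assumed order)
        seq.extend(turn["speech_tokens"])
        seq.append("[end_of_turn]")       # turn-boundary marker (illustrative)
    return seq

sample = flatten_half_duplex([
    {"role": "user", "text_tokens": ["t1", "t2"], "speech_tokens": ["s1", "s2", "s3"]},
    {"role": "assistant", "text_tokens": ["t3"], "speech_tokens": ["s4", "s5"]},
])
print(sample)
```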

Full-duplex Dialogue Training

Subsequent training stages progressively adapt the model to full-duplex dialogue: first with three-stream data (excluding the user text stream), then with two-stream data (removing the assistant text stream as well). This staged process reduces reliance on intermediate text, focusing the model on speech-to-speech interaction, minimizing latency, and mirroring real conversational dynamics (Figures 3 and 4). A sketch of the stream-selection schedule follows the figure captions below.

Figure 3: Full-duplex Dialogue Training based on three streams of full-duplex dialogue data.

Figure 4: Full-duplex Dialogue Training based on two streams of full-duplex dialogue data.
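
The stream-dropping schedule can be summarized in a few lines. This sketch selects which token streams enter the flattened sequence at each full-duplex stage; the selected streams could then be interleaved with a helper like flatten_streams from the earlier sketch. The dictionary keys are hypothetical names introduced here for illustration.

```python
def select_streams(sample: dict, stage: int) -> list:
    """Pick the ordered token streams used at each full-duplex training stage."""
    if stage == 1:   # three-stream data: user text stream excluded
        keys = ["user_speech", "assistant_text", "assistant_speech"]
    else:            # two-stream data: assistant text stream removed as well
        keys = ["user_speech", "assistant_speech"]
    return [sample[k] for k in keys]

example = {
    "user_speech": [1, 2, 3, 4],
    "assistant_text": [10, 11],
    "assistant_speech": [20, 21, 22, 23],
}
print(select_streams(example, stage=1))  # three streams
print(select_streams(example, stage=2))  # speech-to-speech only
```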

Experimental Results

OmniFlatten performs solidly across its training stages. The initial modality alignment stage yields reasonable ASR and TTS accuracy, demonstrating that the model learns to understand and produce speech directly, without heavy reliance on intermediate text.

The subsequent full-duplex training stages improve dialogue quality, with evaluations using strong LLMs as judges indicating substantive gains in conversational flow and response accuracy. The model achieves an average assistant turn-taking response time of 160 ms, confirming its low latency.

Conclusion

OmniFlatten's approach to full-duplex spoken dialogue presents a promising direction for achieving natural, human-like interaction in AI-driven systems without modifying the architecture of existing LLMs. Its streamlined, flattening-based training recipe offers a template for building efficient full-duplex systems capable of real-time, natural conversation. Future work will explore further improvements in data synthesis and expansion to additional modalities, including vision, to enrich multi-modal conversational agents.
