VoiceAssistant-400K Dataset Overview
- VoiceAssistant-400K is a purpose-built multimodal dataset containing ≈400K synthetic English dialogue examples, each organized as a paired audio–text quadruple.
- The dataset is generated using GPT-4o for text and zero-shot TTS for audio, ensuring natural alignment between questions and answers.
- It plays a critical role in the Mini-Omni training pipeline as the fine-tuning corpus for end-to-end, real-time, speech-targeted question answering.
VoiceAssistant-400K is a supervised multimodal dataset introduced by Xie et al. in the context of Mini-Omni, an audio-based, end-to-end conversational LLM. Distinct from traditional TTS or ASR corpora, VoiceAssistant-400K is constructed specifically to fine-tune models for real-time, speech-targeted question answering in the style of a modern voice assistant. The dataset exclusively involves English dialogues synthetically generated and rendered by GPT-4o and zero-shot TTS systems, and comprises quadruple-modality items to facilitate alignment between and across speech and text modalities (Xie et al., 2024).
1. Dataset Composition and Structural Properties
VoiceAssistant-400K consists of approximately 400,000 examples, precisely structured as follows:
| Statistic | Value/Aggregate Details |
|---|---|
| Total examples | ≈367,000 RLHF-based + ≈39,000 MC/OpenAssistant (≈406,000 total) |
| Modalities per example | Audio question (A₁), Text question (T₁), Audio answer (A₂), Text answer (T₂) |
| Languages | English only (single-language) |
| Voice sources | Zero-shot TTS (synthetic, not human speakers) |
| Speaker demographics | Not provided; no diversity metrics reported |
Each sample is a quadruple: an audio-formulated question (A₁), a written question (T₁), a spoken answer (A₂), and a written answer (T₂). Both questions and answers originate from GPT-4o and are converted to audio using a zero-shot TTS engine. The dataset does not report total audio hours, nor does it supply information regarding the demographic diversity of synthetic voices [(Xie et al., 2024), Sect. 4.1].
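As a concrete illustration, one sample can be modeled as a simple record. The field names and types below are hypothetical, since the source does not specify a serialization format:

```python
from dataclasses import dataclass

@dataclass
class VoiceAssistantSample:
    """One VoiceAssistant-400K item: the four synchronized streams
    (A1, T1, A2, T2). Field names here are illustrative placeholders."""
    audio_question: bytes  # A1: zero-shot-TTS rendering of the question
    text_question: str     # T1: GPT-4o-generated question text
    audio_answer: bytes    # A2: zero-shot-TTS rendering of the answer
    text_answer: str       # T2: GPT-4o-generated answer text

sample = VoiceAssistantSample(
    audio_question=b"", text_question="What's the weather like?",
    audio_answer=b"", text_answer="It looks sunny today.",
)
```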
2. Data Generation and Processing Pipeline
The construction pipeline begins with GPT-4o, prompted to simulate authentic voice-assistant QA dialogues while intentionally excluding code snippets and excessively long passages. Each text question–answer pair is then synthesized into speech with a zero-shot TTS engine. Unlike datasets that archive raw audio waveforms, each audio stream is discretized on the fly at training time into a sequence of SNAC codec tokens (seven codebook layers), facilitating multimodal token alignment with the text BPE tokens [(Xie et al., 2024), Sect. 2, 3.2].
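The training-time discretization can be sketched as follows. The seven-codes-per-frame flattening and the per-layer vocabulary offsets are illustrative assumptions; the source does not document Mini-Omni's exact token layout:

```python
def flatten_snac_frames(codes, codebook_size=4096):
    """Flatten per-frame SNAC codes (seven codebook layers per frame)
    into one token stream. Offsetting each layer by its own codebook
    range keeps layer identity distinct in a shared vocabulary; this
    offset scheme is an assumption, not Mini-Omni's published layout."""
    flat = []
    for frame in codes:                      # frame: 7 codec indices
        assert len(frame) == 7, "expected seven codebook layers"
        for layer, code in enumerate(frame):
            flat.append(layer * codebook_size + code)
    return flat

tokens = flatten_snac_frames([[1, 2, 3, 4, 5, 6, 7]])  # 7 tokens, one per layer
```

Once flattened, the audio token stream can be interleaved or concatenated with text BPE tokens in a single autoregressive sequence, which is what enables the joint modeling described above.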
There is no mention of manual curation or complex quality filtering. The data pipeline relies on the generative stability of GPT-4o outputs and the consistency of the zero-shot TTS synthesis process. Text normalization protocols, sampling rates, and codec specifications at the point of synthesis are not explicitly detailed in the source publication.
3. Annotation Schema and Alignment
Each dataset item is annotated with four synchronized streams, designated as A₁ (audio question), T₁ (text question), A₂ (audio answer), and T₂ (text answer). Alignment between text and audio streams occurs naturally, as every text token sequence generated by GPT-4o corresponds to a directly synthesized audio sequence; however, there is no published token- or frame-level forced alignment. The only explicit metadata per sample beyond the four modality tags is a special <answer> token signifying the beginning of each answer in the streams [(Xie et al., 2024), Table 1; Fig. 4].
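A minimal sketch of how the &lt;answer&gt; marker could drive supervision. Note the source does not state whether the question span is masked from the loss; restricting supervision to the answer span is a common convention and an assumption here:

```python
ANSWER = "<answer>"

def loss_mask(tokens):
    """Return a 0/1 mask that activates at the <answer> marker and stays
    on for the rest of the sequence. Whether Mini-Omni masks the question
    portion is not specified in the source; this is an assumption."""
    mask, active = [], 0
    for tok in tokens:
        if tok == ANSWER:
            active = 1
        mask.append(active)
    return mask

mask = loss_mask(["how", "are", "you", ANSWER, "fine", "thanks"])
# question tokens masked out, marker and answer tokens supervised
```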
There is no explicit description of additional annotation such as speaker characteristics, utterance style tags, or environmental noise information. All utterances used are “clean” English conforming to typical assistant dialogue expectations.
4. Dataset Splits and Accessibility
VoiceAssistant-400K does not include designated train, validation, or test splits. The entire corpus is used in the final voice-QA fine-tuning stage of Mini-Omni; no cross-validation or held-out evaluation partition is reported. The dataset is intended for release under an open-source license (the exact license is not yet specified) and will be available, alongside code, via https://github.com/gpt-omni/mini-omni. There are no stated access restrictions beyond those typical of a public code repository, and citation of the Mini-Omni paper is requested for academic use [(Xie et al., 2024), Table 1; Conclusion].
5. Role in Mini-Omni Training Methodology
VoiceAssistant-400K is exclusively employed during the third and final stage of the Mini-Omni multi-stage training recipe, summarized as follows:
- Stage 1 (modality alignment): Train adapters on external ASR/TTS data.
- Stage 2 (adapter adaptation): Freeze adapters and train the core model on cross-modal QA (speech in → text out).
- Stage 3 (multi-modal fine-tuning): Unfreeze entire model and incorporate VoiceAssistant-400K. Both audio and text streams are included, with loss computed jointly over both modalities.
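The freeze/unfreeze schedule above can be summarized in code. The parameter-group names are placeholders, not the repository's actual module names:

```python
def trainable_components(stage):
    """Which parameter groups train at each Mini-Omni stage, following
    the three-stage recipe above. 'audio_adapters' and 'core_lm' are
    illustrative labels, not the repository's module names."""
    if stage == 1:                      # modality alignment: adapters only
        return {"audio_adapters"}
    if stage == 2:                      # adapters frozen, core LM trains
        return {"core_lm"}
    return {"audio_adapters", "core_lm"}  # stage 3: full model unfrozen
```

In a real training loop, this set would gate `requires_grad` on the corresponding parameter groups before each stage begins.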
The fine-tuning protocol encodes audio with SNAC tokens and text with BPE tokens, and applies a joint negative log-likelihood objective operationally identical to:

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P\left(y_{i,t} \mid y_{i,<t}\right),$$

where $N$ (the total number of samples) is approximately 400,000 and $T_i$ indicates the token-sequence length for item $i$ [(Xie et al., 2024), Sect. 3.1].
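The joint objective described above can be sketched numerically. Here `log_probs[i][t]` is assumed to hold the model's log-probability of the gold token at position t of item i, whether that token is a SNAC audio code or a text BPE token:

```python
import math

def joint_nll(log_probs):
    """Joint negative log-likelihood over a batch: sum over samples i
    and token positions t of -log P(y_{i,t}). Audio and text tokens are
    treated uniformly, matching the joint-loss description above."""
    return -sum(lp for seq in log_probs for lp in seq)

loss = joint_nll([[math.log(0.5), math.log(0.25)]])  # -(ln 0.5 + ln 0.25)
```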
No isolated ablation for the dataset is provided, but the full-stage model achieves speech output “on par with common TTS systems” and maintains original language functionality with “minimal degradation” after VoiceAssistant-400K-based fine-tuning.
6. Distinctiveness and Limitations
VoiceAssistant-400K is the only purpose-built, speech-targeted supervised fine-tuning set described for Mini-Omni. Unlike legacy ASR/TTS datasets, it is intentionally designed to elicit output imitating a “friendly, concise voice assistant.” All content is synthetic, with both prompts and TTS conversion generated from model outputs rather than human performance. No explicit metrics (e.g., speaker diversity, accent inclusion, or environmental robustness) are provided, and the data is exclusively monolingual in English.
A plausible implication is that while the dataset is well-suited for modeling stylistically appropriate, real-time assistant speech, it may not generalize across heterogeneous real-world speaker conditions or broader linguistic settings without augmentation. Lack of explicit train/test splits or demographic metadata should be considered when benchmarking broader conversational AI systems with this resource.
7. Broader Context and Applications
VoiceAssistant-400K addresses a critical bottleneck in the real-time multimodal modeling pipeline—specifically, the inability of academic large language/hybrid models to generate streaming, TTS-grade speech output without auxiliary external systems. By integrating this dataset during joint text–audio fine-tuning, Mini-Omni achieves end-to-end, rapid conversational capabilities with audio generation at minimal latency. This operational framework offers a foundation for subsequent research targeting streaming, open-source, and user-interactive LLMs capable of coherent speech production. The expected open release is positioned to facilitate widespread experimentation and further methodical development in streaming multimodal LLMs (Xie et al., 2024).