OV-Speech: Annotated InstructTTS Dataset
- OV-Speech is a large-scale dataset comprising Mandarin audiobook utterances enriched with contextual, instruction-driven annotations for refined text-to-speech synthesis.
- It employs a rigorous, multi-stage annotation protocol that includes instruction deconstruction, LLM-generated reasoning chains, and paralinguistic event tagging to boost model interpretability.
- The dataset supports diverse directive types—emotional, acoustic, paralinguistic—with detailed acoustic metrics such as F0, phoneme duration, and speaking rate for comprehensive TTS analysis.
OV-Speech is a large-scale, richly annotated InstructTTS dataset that provides single-sentence speech utterances paired with open-vocabulary director-style instructions, explicit reasoning chains mapping instructions to low-level acoustic features, and transcriptions augmented with paralinguistic event tags. Built on the ContextSpeech Mandarin audiobook corpus, OV-Speech advances instruction-conditioned text-to-speech research by supporting complex, context-grounded generation via reasoning-driven annotation. The dataset and accompanying framework are described in "OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech" (Ren et al., 4 Jan 2026).
1. Dataset Composition and Scope
OV-Speech is constructed atop ContextSpeech, containing 476.8 hours of Mandarin audiobook audio aligned to novel text, spanning 83 novels and an estimated 100–200 unique narrators. Each utterance is contextualized to its narrative environment and associated with multiple annotation layers.
The primary data split structure is as follows:
| Split | Utterances | Instructions | #Speakers | Avg. Length (s) |
|---|---|---|---|---|
| Train | 316,807 | 950,421 | ~150 | 2.8 ± 1.2 |
| Test | 1,500 | 4,500 | held-out | 2.7 ± 1.3 |
Each training utterance is paired with three diverse open-vocabulary instructions, yielding 950,421 distinct instruction samples; each test utterance with three instructions, totaling 4,500. There is no designated validation set.
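The split arithmetic above can be sanity-checked directly (counts are those stated in the table; the snippet is only a consistency check, not part of the dataset tooling):

```python
# Each utterance carries exactly three instructions, so instruction counts
# should be three times the utterance counts in each split.
train_utts, test_utts = 316_807, 1_500

assert train_utts * 3 == 950_421  # training instruction count
assert test_utts * 3 == 4_500     # test instruction count
print("split counts consistent")
```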
Instructions are categorized for analysis into three high-level types: emotional (48%), acoustic (34%), and paralinguistic (18%). Emotional instructions (e.g., “sound hopeful and relieved”) dominate, followed by acoustic (e.g., “speak more slowly and softly”) and explicit paralinguistic directives (e.g., “insert a light cough before speaking”) (Ren et al., 4 Jan 2026).
2. Data Structure and Annotation Protocol
Each data point is encapsulated as a JSON record with the following fields:
- `audio_filepath`: 16 kHz, 16-bit mono WAV.
- `transcript_raw`: UTF-8 text utterance.
- `transcript_tagged`: transcript augmented with in-line paralinguistic tags (e.g., “Why did you <|Breathing|> choose such a servant?”).
- `instruction`: open-vocabulary, context-sensitive director-style phrase.
- `reasoning_chain`: a stepwise explanation mapping the instruction to acoustic and emotional characteristics.
- `labels`: discrete emotion label(s), quantized acoustic descriptors (e.g., pitch_shape, rate), and the paralinguistic tags present.
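A record following this schema might look as below; all field values are invented for illustration, and only the field names follow the schema described above:

```python
import json

# Illustrative OV-Speech record. Values are made up; field names match
# the documented schema.
record = {
    "audio_filepath": "wavs/novel_012/utt_000123.wav",
    "transcript_raw": "Why did you choose such a servant?",
    "transcript_tagged": "Why did you <|Breathing|> choose such a servant?",
    "instruction": "Sound wary but keep your voice low.",
    "reasoning_chain": "Deconstruction: ... Attribute inference: ...",
    "labels": {
        "emotion": ["apprehension"],
        "pitch_shape": "falling",
        "rate": "slow",
        "paralinguistic": ["Breathing"],
    },
}

line = json.dumps(record, ensure_ascii=False)  # one JSON line per utterance
parsed = json.loads(line)
assert set(parsed) == {"audio_filepath", "transcript_raw", "transcript_tagged",
                       "instruction", "reasoning_chain", "labels"}
```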
Annotation follows a five-stage LLM-facilitated protocol:
- Context Extraction: Utterance alignment and ±1,000 word narrative context extraction.
- Context Distillation: Qwen3-32B performs environment, event, personality, interlocutor, and intent parsing.
- Instruction Generation: Elements from step 2 are sampled and composed into director-style instructions via Qwen3-32B.
- Consistency Filtering: Deepseek-R1 predicts emotional/acoustic attributes; Qwen3-32B judges correspondence, discarding samples with <6/10 emotion or <5/10 acoustic alignment scores.
- Reasoning & Paralinguistic Annotation: Qwen3-32B generates two-part reasoning chains (deconstruction and attribute inference). Paralinguistic tags are inserted using Qwen2-Audio-7B fine-tuned with PC-PTI on NVSpeech170k, covering 18 paralinguistic event types.
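The consistency-filtering stage (step 4) can be sketched as follows; the sample scores are invented, but the keep/discard thresholds are those stated in the protocol (discard below 6/10 emotion or 5/10 acoustic alignment):

```python
# Sketch of step 4: drop samples whose LLM-judged alignment scores fall
# below the stated thresholds. Scores here are illustrative.
EMOTION_MIN, ACOUSTIC_MIN = 6, 5  # minimum passing scores, out of 10

def passes_filter(emotion_score: int, acoustic_score: int) -> bool:
    return emotion_score >= EMOTION_MIN and acoustic_score >= ACOUSTIC_MIN

samples = [
    {"id": "a", "emotion_score": 8, "acoustic_score": 7},
    {"id": "b", "emotion_score": 5, "acoustic_score": 9},  # fails emotion
    {"id": "c", "emotion_score": 9, "acoustic_score": 4},  # fails acoustic
]
kept = [s["id"] for s in samples
        if passes_filter(s["emotion_score"], s["acoustic_score"])]
print(kept)  # → ['a']
```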
3. Acoustic Feature Specification
Although the dataset does not provide precomputed continuous acoustic profiles, it is structured for downstream extraction and analysis of standard features:
- Fundamental frequency contour F0, in Hz.
- Phoneme duration, in ms.
- Signal energy, in arbitrary units.
- Speaking rate, in phones per second.
- Intensity, in dB.
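The energy, intensity, and speaking-rate features above can be derived with a few lines of NumPy; this is a hedged sketch over a synthetic one-second tone and a toy phone alignment, whereas a real pipeline would use a forced aligner and a dedicated pitch tracker for F0:

```python
import numpy as np

sr = 16_000                                   # dataset audio is 16 kHz mono
t = np.arange(sr)                             # one second of samples
y = 0.1 * np.sin(2 * np.pi * 148.0 * t / sr)  # synthetic 148 Hz tone

energy = float(np.mean(y ** 2))               # mean signal energy (a.u.)
intensity_db = 10 * np.log10(energy + 1e-12)  # intensity in dB

phone_durs_ms = [70.0, 80.0, 75.0, 90.0, 60.0]           # toy alignment (ms)
rate = 1000.0 * len(phone_durs_ms) / sum(phone_durs_ms)  # phones per second
print(round(rate, 2))  # → 13.33
```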
Summary statistics for the training set include:
| Feature | Mean | Std-Dev |
|---|---|---|
| F0 (Hz) | 148.2 | 32.5 |
| Phoneme duration (ms) | 75.4 | 28.7 |
| Energy (a.u.) | 0.018 | 0.005 |
| Speaking rate (phones/s) | 5.1 | 1.2 |
Recommended practice is to normalize feature extraction to consistent analysis windows, such as vowel nuclei for F0 (Ren et al., 4 Jan 2026).
4. Reasoning Chain Annotation Mechanism
A central innovation is explicit, LLM-generated “reasoning_chain” annotations. Each chain uses a two-phase structure:
A. Instruction Deconstruction: Identifies which contextual factors—environment, event, personality, interlocutor, or intent—the instruction leverages.
B. Attribute Inference: Maps the high-level directive to discrete and quantized acoustic/emotional targets:
- Explicit emotion class(es) (e.g., joy, contempt).
- Pitch dynamics (F0 contour) and range description.
- Quantified rate and energy adjustments, e.g., “–10% slower,” “–6 dB.”
- Paralinguistic event types and their placements.
For example, the chain for “speak more softly” details inferred intent (calm reassurance), setting (private conversation), emotion (gentle), pitch (slightly falling, decrease ≃ –15 Hz), unchanged rate, and an energy reduction of about 6 dB.
This structure enables multi-step supervision and interpretability for instruction-following TTS architectures (Ren et al., 4 Jan 2026).
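When using these chains as supervision, the quantized targets must be recovered from text. The helper below is an assumption about phrasing (not part of the released schema): it pulls numeric values with Hz, dB, or % units out of a chain string like the worked example above:

```python
import re

# Illustrative reasoning_chain text, paraphrasing the worked example above.
CHAIN = ("Intent: calm reassurance. Pitch: slightly falling, decrease about "
         "-15 Hz. Rate: unchanged. Energy: reduce by -6 dB.")

def extract_targets(chain: str) -> dict:
    """Pull the first signed number attached to each unit (Hz, dB, %)."""
    targets = {}
    for unit in ("Hz", "dB", "%"):
        m = re.search(rf"([+-]?\d+(?:\.\d+)?)\s*{unit}", chain)
        if m:
            targets[unit] = float(m.group(1))
    return targets

print(extract_targets(CHAIN))  # → {'Hz': -15.0, 'dB': -6.0}
```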
5. Paralinguistic Event Tagging and Representation
Transcripts are augmented to mark paralinguistic events, such as breathing, coughing, and other vocal gestures, using token insertions (e.g., <|Breathing|>, <|Cough|>). Tags are derived via a two-stage pipeline: Qwen2-Audio-7B is fine-tuned with PC-PTI on NVSpeech170k to first classify the event type and then insert tags at temporal offsets. Eighteen distinct paralinguistic events from NVSpeech170k are represented (Ren et al., 4 Jan 2026).
Paralinguistic tag presence is indexed in the structured label field and directly exposes temporal event positions for fine-grained modeling.
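Recovering event types and their positions from a tagged transcript is straightforward; this sketch assumes only the `<|Event|>` token convention shown above and returns character offsets into the untagged text:

```python
import re

TAG = re.compile(r"<\|([A-Za-z]+)\|>")

def find_events(tagged: str):
    """Return (event_type, char_offset_in_clean_text) pairs plus clean text."""
    events, clean, cursor = [], [], 0
    for m in TAG.finditer(tagged):
        clean.append(tagged[cursor:m.start()])
        events.append((m.group(1), sum(len(c) for c in clean)))
        cursor = m.end()
    clean.append(tagged[cursor:])
    return events, "".join(clean)

evts, text = find_events("Why did you <|Breathing|> choose such a servant?")
print(evts)  # → [('Breathing', 12)]
```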
6. Licensing, Access, and Recommended Practices
OV-Speech is distributed for non-commercial research under the CC BY-NC-4.0 license. Public access to audio and metadata (JSON+WAV) is provided via the project site (https://y-ren16.github.io/OV-InstructTTS) and a HuggingFace mirror (“Insects/OV-Speech”).
- File formats: Audio is stored as 16 kHz mono WAV; metadata as JSON lines reflecting the annotation schema.
- Usage: Splits should be preserved for reproducibility; users may carve additional validation subsets if required.
- Research facilitation: Reasoning chains are intended to supervise multistep or reasoning-augmented TTS models, or to support Large Audio LLM (LALM) fine-tuning.
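Since no validation split ships with the dataset, one way to carve a validation subset while leaving the published train/test split intact is a seeded random sample of training utterance IDs; the IDs and the subset size here are illustrative assumptions:

```python
import random

random.seed(0)  # fix the seed so the carved split is reproducible

# Hypothetical utterance IDs standing in for the 316,807 training items.
train_ids = [f"utt_{i:06d}" for i in range(316_807)]

val_size = 3_000  # assumed size; the dataset defines no validation split
val_ids = set(random.sample(train_ids, val_size))
new_train = [u for u in train_ids if u not in val_ids]

assert len(new_train) + len(val_ids) == 316_807
print(len(new_train))  # → 313807
```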
A plausible implication is that the presence of explicit reasoning chains and rich event tagging supports research into interpretable and controllable TTS systems, highlighting OV-Speech’s utility in developing user-centric, instruction-driven speech synthesis models (Ren et al., 4 Jan 2026).
OV-Speech should not be confused with the VOICES corpus (Richey et al., 2018), which is occasionally mislabeled “OV-Speech” in some repositories; VOICES is a far-field English speech corpus for noise-robust modeling, not an InstructTTS dataset.