LibriEdit: Controllable TTS Dataset
- LibriEdit is a speech dataset that uses a delta-pair paradigm to allow selective manipulation of paralinguistic attributes in TTS generation.
- It employs meticulous segmentation via forced alignment and robust emotion annotation using multi-model agreement to ensure precise data units.
- The dataset comprises 708 hours of annotated audio with controls over emotion, pitch, energy, and speaker embeddings to facilitate advanced neural speech editing.
LibriEdit is a specialized speech dataset constructed to enable selective attribute control in editable text-to-speech (TTS) generation. Developed to address the need for fine-grained control over paralinguistic characteristics—such as emotion, prosody, and speaker identity—LibriEdit introduces a delta (difference-aware) pairing paradigm. It provides researchers with large-scale, meticulously annotated training data for the explicit modeling and manipulation of individual acoustic features, supporting advances in controllable neural codec language modeling for speech generation tasks (Pei et al., 18 Jan 2026).
1. Source Material and Segmentation
LibriEdit is derived from the LibriHeavy corpus, which comprises 50,000 hours of read speech with verified speaker identities. The dataset construction begins by selecting the “large” split of LibriHeavy, restricting segments to those at least 2 seconds in duration. Segmentation leverages the Montreal Forced Aligner to conduct precise splits based on breath groups and punctuation-aligned pauses, resulting in prosodically coherent units. This hierarchical segmentation process—transitioning from audiobook chapters to sentences and then to prosodic segments—ensures consistent minimal duration and granularity suited for attribute annotation and style transfer (see Sect. 4.2 and related workflow).
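The pause-based splitting step can be sketched as follows. This is a minimal illustration, assuming forced-alignment output as a list of `(word, start, end)` intervals; the 0.3 s pause threshold and the helper names are assumptions, not specified by the paper (only the 2-second minimum duration is).

```python
# Sketch: split an aligned utterance into prosodic segments at pauses,
# keeping only segments of at least 2 s. The alignment tuple format and
# the 0.3 s pause threshold are illustrative assumptions.
MIN_SEGMENT_SEC = 2.0
PAUSE_SEC = 0.3  # assumed minimum silence treated as a breath-group boundary

def split_on_pauses(words, pause_sec=PAUSE_SEC, min_len=MIN_SEGMENT_SEC):
    """words: list of (token, start, end) tuples from forced alignment."""
    segments, current = [], []
    for i, (tok, start, end) in enumerate(words):
        current.append((tok, start, end))
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        # Close the segment at a long pause or at the end of the utterance.
        if next_start is None or next_start - end >= pause_sec:
            seg_start, seg_end = current[0][1], current[-1][2]
            if seg_end - seg_start >= min_len:
                segments.append((seg_start, seg_end, [t for t, _, _ in current]))
            current = []
    return segments
```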
2. Attribute Annotation and Filtering
Emotion and prosody are systematically annotated at the segment level. An 8-way speech emotion recognition (SER) model predicts among {Neutral, Happy, Sad, Angry, Surprise, Fear, Disgust, Contempt}. The dataset discards segments labeled as Fear, Disgust, and Contempt owing to their low perceptual consistency. Confidence thresholds, visualized in Fig. 4 (left), are applied per emotion category, and additional validation employs two cross-check classifiers (emotion2Vec-plus-large; Audio Flamingo 3). Only segments on which at least two of the three models agree are retained, yielding 129 hours of emotion-labeled data across five final classes. Continuous prosodic features—fundamental frequency (F0), energy, and speaking rate—are extracted via standard DSP tools and quantized to five ordinal levels: {Very Low, Low, Medium, High, Very High}. This dual-stage annotation addresses both the reliability and the granularity of attribute control required for downstream applications.
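The five-level ordinal quantization can be sketched in pure Python. The paper does not state the exact binning rule; equal-mass quantile bins computed over corpus-level statistics are one plausible choice and are assumed here.

```python
# Sketch of five-level ordinal quantization for continuous prosodic
# features (F0, energy, speaking rate). Equal-mass quantile binning is an
# assumption; the paper does not specify the binning rule.
LEVELS = ["Very Low", "Low", "Medium", "High", "Very High"]

def quantile_edges(values, n_bins=5):
    """Inner bin edges (n_bins - 1 of them) from sorted corpus statistics."""
    xs = sorted(values)
    return [xs[int(len(xs) * k / n_bins)] for k in range(1, n_bins)]

def quantize(value, edges):
    """Map a continuous value to one of the five ordinal level names."""
    level = sum(value >= e for e in edges)
    return LEVELS[level]
```

For example, fitting edges on per-segment mean F0 across the corpus and then calling `quantize` per segment yields the five categorical pitch levels.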
3. Dataset Composition and Delta-Pair Sampling Protocol
LibriEdit totals 708 hours, covering 2,566 unique speakers (Appendix Table A.1). The subset with emotion annotation (129 hours) reflects natural imbalances from audiobook narration (Sad being most frequent). Nearly all segments are labeled with prosodic attributes. Training pairs, or delta-pairs (Editor's term), are sampled with equal probability from same-speaker and cross-speaker combinations (ratio 1:1), ensuring the model is exposed both to within-speaker and inter-speaker style variations (Sect. 3.2 & 5.1). No fixed train/dev/test splits are specified; in experimental usage, the entire dataset is employed for delta-pair training following VALL-E style pre-training.
| Feature | Value/Description | Source Section |
|---|---|---|
| Total hours | 708 | Sect. 4.1, Appendix A.1 |
| Speaker IDs | 2,566 | Sect. 4.1 |
| Emotion-labeled hours | 129 (5 categories) | Step 2, Fig. 4 |
| Attributes | Emotion, pitch, energy, speed | Sect. 4.1 |
| Pair sampling mix | 50% same-speaker / 50% cross-speaker | Sect. 3.2/5.1 |
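The 50/50 same-speaker / cross-speaker pair sampling can be sketched as below. `segments` is a list of per-segment records; the `speaker_id` field follows the metadata schema in Sect. 4, while the helper names are assumptions.

```python
import random

# Sketch of the 1:1 same-speaker / cross-speaker delta-pair sampling.
# Record fields follow the metadata schema; helper names are illustrative.
def index_by_speaker(segments):
    by = {}
    for s in segments:
        by.setdefault(s["speaker_id"], []).append(s)
    return by

def sample_delta_pair(segments, by_speaker, rng=random):
    src = rng.choice(segments)
    if rng.random() < 0.5:  # same-speaker pair
        pool = [s for s in by_speaker[src["speaker_id"]] if s is not src]
    else:                   # cross-speaker pair
        pool = [s for s in segments if s["speaker_id"] != src["speaker_id"]]
    tgt = rng.choice(pool) if pool else src
    return src, tgt
```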
4. Data Representation, Metadata Schema, and API
Audio is sampled at 24 kHz and tokenized using the EnCodec codec at 6 kbps; each frame consists of 8 codebook indices, yielding a discrete representation A ∈ {1, …, K}^(T×8), where K is the codebook size and T is the number of codec frames (Sect. 3). Categorical control instructions utilize dedicated BPE tokens (e.g., “<pitch-high>”, “<Angry>”), integrated into the shared text embedding space. Speaker identity is encoded as a 384-dimensional global speaker embedding, e_spk ∈ ℝ^384, extracted from a pretrained voice-print system and mapped into the model’s embedding space.
Per-segment metadata is distributed in CSV/JSON and includes:
- segment_id
- audio_path
- speaker_id
- emotion_label ∈ {Neutral, Happy, Sad, Angry, Surprise}
- emotion_confidence ∈ [0,1]
- pitch_level, energy_level, speed_level ∈ {1…5}
- duration_seconds
A dataset loader in Python yields delta pairs as objects containing the audio tokens, the source text, style difference tags, and the target audio representation (see code example in the paper).
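A loader along these lines can be sketched from the schema above. The field names mirror the CSV columns; the tag-string format and function names are assumptions (the paper's examples, e.g. “<pitch-high>” and “<Angry>”, suggest per-attribute BPE tags for changed attributes only).

```python
import csv

# Hypothetical loader sketch built on the metadata schema listed above.
# Tag formatting and function names are illustrative assumptions.
ATTRS = ("emotion_label", "pitch_level", "energy_level", "speed_level")

def style_delta(src, tgt):
    """Control tags for every attribute that differs between src and tgt."""
    tags = []
    for attr in ATTRS:
        if src[attr] != tgt[attr]:
            if attr == "emotion_label":
                tags.append(f"<{tgt[attr]}>")              # e.g. "<Angry>"
            else:
                name = attr.split("_")[0]                  # e.g. "pitch"
                tags.append(f"<{name}-{tgt[attr]}>")       # e.g. "<pitch-4>"
    return tags

def load_metadata(path):
    """Read the per-segment CSV into a list of dict records."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def delta_pair_record(src, tgt):
    """One training example: source audio/text, delta tags, target audio."""
    return {
        "source_audio": src["audio_path"],
        "source_text": src.get("text", ""),
        "delta_tags": style_delta(src, tgt),
        "target_audio": tgt["audio_path"],
    }
```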
5. Delta-Pair Triplets and Training Input Construction
Each training instance takes the form (A_src, y, Δ, A_tgt), where:
- A_src: source utterance codec tokens
- y: transcript of the source utterance
- Δ: set of attribute tags covering every style attribute whose source and target values differ, encoded as control instructions
- A_tgt: target utterance codec tokens (teacher forcing)
The model input sequence concatenates the transcript, control tags, speaker embedding, and source codec tokens, y ⊕ Δ ⊕ e_spk ⊕ A_src, with A_tgt as the autoregressive prediction target.
This enables localized control: only the attributes flagged in Δ are modified between prompt and target, fostering disentanglement in TTS synthesis (Pei et al., 18 Jan 2026).
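A minimal sketch of this input layout follows. The placeholder markers (`<spk_emb>`, `<audio>`) and the exact concatenation order are assumptions for illustration, not the paper's vocabulary.

```python
# Sketch of the training input construction: transcript tokens, delta
# control tags, a speaker-embedding slot, and source codec tokens form the
# prompt; the target codec tokens are the autoregressive prediction.
# Marker strings and ordering are illustrative assumptions.
def build_input(text_tokens, delta_tags, src_codec, tgt_codec):
    prompt = (
        list(text_tokens)
        + list(delta_tags)
        + ["<spk_emb>"]                  # slot where e_spk is injected
        + ["<audio>"] + list(src_codec)  # source acoustic prompt
    )
    # Teacher forcing: the model predicts tgt_codec given the prompt.
    return prompt, list(tgt_codec)
```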
6. Licensing, Accessibility, and Known Limitations
LibriEdit is distributed under the CC-BY-NC 4.0 license, inheriting licensing conditions from LibriHeavy. Annotation files, preprocessed audio tokens, and source code are available at https://speech-editing.github.io/speech-editing/. Usage is confined to non-commercial research, with explicit attribution required.
Limitations include a skewed emotional class distribution favoring Sad, exclusion of Fear, Disgust, and Contempt emotions, and restriction to read-speech (no conversational or noisy audio). Prosody estimation based on DSP may introduce errors at segment boundaries. The global speaker embedding captures fixed speaker identity but may not represent time-varying speaker characteristics; only five emotions and three prosodic attributes are currently modeled. Suggested future extensions encompass natural-language control instructions, dynamic or flow-based speaker representations, expanded paralinguistic annotation (e.g., breathiness, accent), and the inclusion of multi-speaker conversational speech to broaden coverage.
7. Context and Research Relevance
LibriEdit establishes a new benchmark for controllable TTS data by introducing difference-aware (delta) attribute modeling at scale. This design explicitly addresses the limitations of holistic acoustic imitation by enabling localized intervention in latent attribute space—a key capability for advancing research on editable neural codec LLMs, selective TTS editing, and paralinguistic style transfer (Pei et al., 18 Jan 2026). The delta-pair paradigm generalizes to scenarios where detailed and compositional control over output speech is essential, setting the stage for further research into disentangleable speech editing and customizable voice generation.