Libri-Light Corpus for ASR Research
- Libri-Light is a large-scale spoken English corpus curated from LibriVox audiobooks, providing over 60,000 hours of segmented audio and extensive metadata for robust ASR research.
- It features diverse evaluation protocols across zero-resource, semi-supervised, and distant-supervision settings, employing metrics such as ABX, PER/CER, and WER for thorough benchmarking.
- The dataset offers standardized 16 kHz FLAC recordings and labeled subsets, enabling practical applications in unsupervised representation learning and fine-tuning of ASR systems.
Libri-Light is a large-scale spoken English corpus specifically curated for the development and benchmarking of automatic speech recognition (ASR) systems in settings with limited or no supervision. Constructed from open-source audiobooks from the LibriVox project, it comprises over 60,000 hours of segmented, read speech audio in the public domain. The resource is characterized by extensive metadata, diverse speaker representation, and the provision of multiple evaluation protocols designed to probe zero-resource, semi-supervised, and distant-supervision ASR learning scenarios (Kahn et al., 2019).
1. Data Compilation, Structure, and Metadata
The Libri-Light corpus sources all recordings from LibriVox, consisting of read speech from out-of-copyright books released into the public domain. The final unlabelled data pool encompasses approximately 60,000 hours of audio after duplicate removal, with 7,439 unique speakers in the largest split. The dataset is released in three nested unlabelled splits:
| Split | Hours | Speakers |
|---|---|---|
| unlab-60k | 57,706 | 7,439 |
| unlab-6k | 5,770 | 1,742 |
| unlab-600 | 577 | 489 |
All recordings are standardized to 16 kHz FLAC format. Segmentation utilizes a CTC-trained TDS acoustic model (wav2letter++), generating frame-level posteriors for SPEECH versus NONSPEECH. Speech segments of at least 500 ms are extracted to form the dataset units. Each segment is paired with a JSON metadata file containing the following fields: speaker_id, book_id, title, macro-genre (seven categories), average SNR (computed from VAD-labeled speech/noise frames), optional perplexity (LM score), and the list of VAD-detected blocks.
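Filtering segments by these metadata fields is a common preprocessing step. The sketch below assumes illustrative JSON key names (`speaker_id`, `snr`, `voice_activity`) based on the fields listed above; the actual key names in the released metadata files may differ.

```python
import json

# Hypothetical per-segment metadata, mirroring the fields described above
# (actual key names in the released JSON schema may differ).
segment_meta = json.loads("""
{
  "speaker_id": "1234",
  "book_id": "5678",
  "title": "Example Audiobook",
  "genre": "fiction",
  "snr": 18.3,
  "voice_activity": [[0.00, 1.75], [2.10, 4.90]]
}
""")

def keep_segment(meta, min_snr=15.0, min_speech_s=0.5):
    """Keep a segment if its average SNR is high enough and it contains
    at least one VAD block of min_speech_s seconds (cf. the 500 ms floor)."""
    long_enough = any(end - start >= min_speech_s
                      for start, end in meta["voice_activity"])
    return meta["snr"] >= min_snr and long_enough

print(keep_segment(segment_meta))
```

The same predicate can be mapped over all metadata files to build SNR-filtered training subsets.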
Labeled subsets are included for supervised and semi-supervised benchmarking: train-10h (10 hours), train-1h (1 hour), and train-10m (six 10-minute slices), each with corresponding .flac, .trans.txt (orthographic transcript), and .phones.txt (phonetic transcript) files. Additionally, an unaligned text corpus, librispeech-LM.txt (800 million tokens, 200,000-word vocabulary, sourced from 14,500 Project Gutenberg books), supports distant-supervision experiments.
2. Benchmark Protocols and Evaluation Metrics
Libri-Light is designed to facilitate benchmarking under three distinct supervision regimes, all reporting results on standard LibriSpeech dev and test sets ("dev-clean," "dev-other," "test-clean," "test-other"):
2.1 Zero-resource / Unsupervised Setting
The objective in this protocol is to learn speech representations without access to textual resources. The principal evaluation measure is minimal-pair ABX phone discrimination, assessed in both within- and across-speaker conditions. For two phonetic categories $A$ and $B$, the asymmetric ABX error compares vector distances over triplets:

$$\hat{e}(A,B) = \frac{1}{|A|(|A|-1)\,|B|} \sum_{a \in A} \sum_{b \in B} \sum_{\substack{x \in A \\ x \neq a}} \left( \mathbb{1}\!\left[d(b,x) < d(a,x)\right] + \frac{1}{2}\,\mathbb{1}\!\left[d(b,x) = d(a,x)\right] \right)$$

where $d(\cdot,\cdot)$ is a distance between representations (typically aggregated over frames via DTW); the reported score symmetrizes over $(A,B)$ and $(B,A)$, and lower scores indicate better discrimination. Typical baseline (MFCC) performance yields ABX of approximately 11% within and 21% across speakers on dev-clean, while CPC-trained representations on unlab-60k reach approximately 6.1% and 8.1%, respectively.
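ABX discrimination can be sketched directly from its triplet definition. The toy example below uses 2-D point "embeddings" and Euclidean distance for illustration; real evaluations compare variable-length frame sequences, typically with a DTW-aggregated frame distance.

```python
from itertools import permutations

def dist(u, v):
    # Toy Euclidean distance; real ABX evaluations typically use
    # DTW over frame-wise distances between representation sequences.
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def abx_error(A, B):
    """Asymmetric ABX error: fraction of (a, x, b) triplets, with a and x
    drawn from category A and b from B, where x is closer to b than to a
    (ties count 1/2). Lower is better."""
    errs, n = 0.0, 0
    for a, x in permutations(A, 2):
        for b in B:
            da, db = dist(a, x), dist(b, x)
            errs += 1.0 if db < da else (0.5 if db == da else 0.0)
            n += 1
    return errs / n

# Toy 2-D "embeddings" for two well-separated phone categories:
A = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
B = [(1.0, 1.0), (0.9, 1.1)]
score = 0.5 * (abx_error(A, B) + abx_error(B, A))  # symmetrized
print(score)  # 0.0 for these well-separated clusters
```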
2.2 Semi-supervised Setting
This track utilizes limited aligned speech-text data (10 minutes, 1 hour, or 10 hours) from LibriSpeech. Systems are trained as phoneme-CTC or character-CTC models. Performance is evaluated with Phoneme Error Rate (PER) and Character Error Rate (CER), i.e. the Levenshtein distance between hypothesis and reference, normalized by reference length. Representative PER results (%, lower is better):
| System | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| no pretrain + train-10h | 45.9 | 55.7 | 43.7 | 58.6 |
| CPC unlab60k + train-10h | 28.4 | 41.4 | 27.9 | 43.6 |
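PER, CER, and the WER of Section 2.3 all reduce to a normalized Levenshtein distance over different token units. A minimal sketch (the `error_rate` helper is illustrative, not part of the released tooling):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences
    (substitutions, deletions, insertions all cost 1)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def error_rate(ref, hyp):
    """PER, CER, or WER depending on whether the tokens are
    phones, characters, or words."""
    return edit_distance(ref, hyp) / len(ref)

ref = "HH AH L OW W ER L D".split()   # hypothetical phone sequence
hyp = "HH AH L OW W ER D".split()     # one phone deleted
print(round(100 * error_rate(ref, hyp), 1))  # 12.5 (1 error / 8 phones)
```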
2.3 Distant-supervision Setting
In this regime, large quantities of unlabelled speech and unaligned text (librispeech-LM.txt) are employed to build ASR models using a CTC acoustic model and a 4-gram KenLM language model for decoding. The main metric is Word Error Rate (WER):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the numbers of word substitutions, deletions, and insertions in the alignment, and $N$ is the number of words in the reference. Baseline results include:
| System | test-clean WER |
|---|---|
| Supervised 1000h (GatedConv+4-gram) | 4.8 |
| CPC unlab-60k + train-10h + 4-gram | 43.9 |
| MFSC TDS train-10h + 60k pseudo-labels + 4-gram | 30.1 |
| MFSC TDS train-10h + 60k pseudo (phoneme) | 29.3 |
3. Baseline Systems and Training Methodologies
3.1 Unsupervised (CPC) Baseline
The baseline Contrastive Predictive Coding (CPC) model comprises:
- Encoder: 5 convolutional layers (strides [5,4,2,2,2], filter sizes [10,8,4,4,4], 256 channels), outputting one frame per 10 ms.
- Autoregressor: single-layer LSTM (256-dimensional).
- Predictor: one-layer Transformer, trained with a contrastive loss to predict future frames. Training is performed on unlab-600, unlab-6k, and unlab-60k (≈2 days on 128×V100 GPUs; batch size of 32 sequences of 1,280 ms per GPU).
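The 10 ms frame rate stated for the encoder follows directly from the product of the convolutional strides at the corpus's 16 kHz sampling rate; a quick arithmetic check:

```python
from functools import reduce

SAMPLE_RATE = 16_000          # Hz, the corpus-wide standard
strides = [5, 4, 2, 2, 2]     # encoder convolution strides

hop = reduce(lambda a, b: a * b, strides)  # samples between output frames
print(hop)                                 # 160
print(1000 * hop / SAMPLE_RATE)            # 10.0 ms per frame

# A 1,280 ms training sequence therefore yields this many CPC frames:
print(int(1.280 * SAMPLE_RATE) // hop)     # 128
```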
3.2 Semi-supervised (CPC+CTC) Baseline
The CPC encoder and autoregressor are frozen. A linear layer with CTC loss is appended to their outputs and fine-tuned on the phonetic labels of the 10-minute, 1-hour, or 10-hour subsets. As a control, an identical architecture is trained end-to-end from scratch.
3.3 Distant-supervision Baseline
(A) CPC Pretrain + CTC (LSTM) + 4-gram LM:
The CTC network is fine-tuned on limited labels and decoded with wav2letter++ beam search and KenLM 4-gram.
(B) Mel-Filterbank + TDS CTC + Pseudo-labels:
A small TDS model (20M parameters, stride 2, blocks [2,2,3]) is trained on the limited labels and used to generate pseudo-labels for unlab-60k (36 s segments) via beam-search decoding; a larger TDS model (37M parameters, 11 blocks, stride 2) is then retrained from scratch on this pseudo-labeled corpus. Optimization: SGD with momentum 0.5; LR 0.1, halved every 30 epochs over 150 total epochs; dropout 0.4 (20M) and 0.1 (37M); beam size 1,000.
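The stated learning-rate schedule (base LR 0.1, halved every 30 epochs over a 150-epoch run) can be written as a one-line step function; a sketch, with `step_lr` as an illustrative helper name:

```python
def step_lr(epoch, base_lr=0.1, halve_every=30):
    """Step schedule matching the recipe above: the learning rate is
    halved once every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

# The final epochs of the 150-epoch run train at 1/16 of the base rate:
print([step_lr(e) for e in (0, 30, 60, 149)])
```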
4. Comparison with Existing Speech Resources
Libri-Light’s scale and annotation distinguish it from contemporaneous corpora:
| Corpus | Unsupervised Hours | Languages | Transcript Coverage | Metadata Scope |
|---|---|---|---|---|
| Libri-Light | 60,000 | 1 (en) | Partially transcribed | SNR, VAD, genre, speaker ID |
| LibriSpeech | 1,000 | 1 (en) | Fully transcribed | Minimal metadata |
| CommonVoice | 2,900 | 37 | Mixed | Typically minimal |
Libri-Light is primarily untranscribed, but offers labeled slices (10 minutes to 10 hours) for semi-supervised and distant-supervision work, as well as extensive metadata for SNR, VAD, genre, and speaker identity. Principal use cases include zero-resource acoustic representation research, semi-supervised ASR pretraining, and large-scale pseudo-labeled training via distant-supervision.
5. Guidelines and Recommendations for Usage
- Zero-resource Research: Utilize unlab-600, unlab-6k, and unlab-60k to systematically study embedding quality as a function of data scale. Official VAD segments are suitable for frame extraction; SNR tags assist with filtering noisy segments. ABX evaluations should use the provided forced-alignments on dev/test sets.
- Semi-supervised Learning: Adopt CPC or alternative pretrained models, fine-tuned sequentially from train-10m to train-10h. Both orthographic and phonetic transcripts are provided, supporting CTC and sequence-to-sequence models. Standard Levenshtein alignment is recommended for PER and CER computation.
- Distant-supervision Strategies: Train 4-gram language models on librispeech-LM.txt with KenLM. Fine-tune acoustic models on the limited labeled data, produce pseudo-labels on the unlab-60k split, and retrain larger models on the combined set.
- Metadata Utilization: SNR enables quality-based file filtering. Genre tags support the creation of domain-balanced subsets and facilitate domain adaptation experiments. Use speaker_id fields to ensure speaker disjointness in custom train/test splits.
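Speaker disjointness is easiest to enforce by splitting at the speaker level rather than the utterance level. A minimal sketch, assuming utterances carry a `speaker_id` field as in the per-segment metadata (the helper name is illustrative):

```python
import random

def speaker_disjoint_split(utterances, test_fraction=0.1, seed=0):
    """Split utterances so that no speaker_id appears on both sides.
    `utterances` is a list of dicts, each with a 'speaker_id' field."""
    speakers = sorted({u["speaker_id"] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u["speaker_id"] not in test_speakers]
    test = [u for u in utterances if u["speaker_id"] in test_speakers]
    return train, test

# Toy check: 20 utterances spread evenly over 5 speakers.
utts = [{"speaker_id": f"spk{i % 5}", "utt": i} for i in range(20)]
train, test = speaker_disjoint_split(utts, test_fraction=0.2)
print(len(train), len(test))  # 16 4
```

Splitting by speaker first (rather than sampling utterances directly) is what guarantees no speaker leakage between custom train and test sets.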
All datasets, metadata schema, preprocessing scripts, evaluation tools (ABX, PER/CER/WER calculators), and baseline recipes are publicly accessible at https://github.com/facebookresearch/libri-light (Kahn et al., 2019).