Libri-Light Corpus for ASR Research

Updated 7 January 2026
  • Libri-Light is a large-scale spoken English corpus curated from LibriVox audiobooks, providing over 60,000 hours of segmented audio and extensive metadata for robust ASR research.
  • It features diverse evaluation protocols across zero-resource, semi-supervised, and distant-supervision settings, employing metrics such as ABX, PER/CER, and WER for thorough benchmarking.
  • The dataset offers standardized 16 kHz FLAC recordings and labeled subsets, enabling practical applications in unsupervised representation learning and fine-tuning of ASR systems.

Libri-Light is a large-scale spoken English corpus specifically curated for the development and benchmarking of automatic speech recognition (ASR) systems in settings with limited or no supervision. Constructed from open-source audiobooks from the LibriVox project, it comprises over 60,000 hours of segmented, read speech audio in the public domain. The resource is characterized by extensive metadata, diverse speaker representation, and the provision of multiple evaluation protocols designed to probe zero-resource, semi-supervised, and distant-supervision ASR learning scenarios (Kahn et al., 2019).

1. Data Compilation, Structure, and Metadata

The Libri-Light corpus sources all recordings from LibriVox, consisting of read speech from out-of-copyright books in the public domain. The final unlabelled data pool encompasses approximately 60,000 hours of audio after duplicate removal, involving 7,439 unique speakers in the largest split. The dataset is released in three nested unlabelled splits:

| Split | Hours | Speakers |
|---|---|---|
| unlab-60k | 57,706 | 7,439 |
| unlab-6k | 5,770 | 1,742 |
| unlab-600 | 577 | 489 |

All recordings are standardized to 16 kHz FLAC format. Segmentation utilizes a CTC-trained TDS acoustic model (wav2letter++), generating frame-level posteriors for SPEECH versus NONSPEECH. Speech segments of at least 500 ms are extracted to form the dataset units. Each segment is paired with a JSON metadata file containing the following fields: speaker_id, book_id, title, macro-genre (seven categories), average SNR (computed from VAD-labeled speech/noise frames), optional perplexity (LM score), and the list of VAD-detected blocks.
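As a rough illustration, the per-segment JSON metadata can be scanned and filtered by SNR with only the standard library; the directory layout and exact key names used here (e.g. `snr`) are assumptions for the sketch, not the corpus's documented schema:

```python
import json
from pathlib import Path

def load_segments(meta_dir, min_snr=10.0):
    """Collect per-segment metadata files and keep those above an SNR
    threshold. Assumes one JSON file per segment with an 'snr' field;
    the exact key names may differ in the released corpus."""
    kept = []
    for meta_path in Path(meta_dir).glob("**/*.json"):
        with open(meta_path) as f:
            meta = json.load(f)
        if meta.get("snr", 0.0) >= min_snr:
            kept.append(meta)
    return kept
```

The same loop extends naturally to filtering on genre or speaker_id when building custom subsets.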

Labeled subsets are included for supervised and semi-supervised benchmarking: train-10h (10 hours), train-1h (1 hour), and train-10m (six 10-minute slices), each with corresponding .flac, .trans.txt (orthographic transcript), and .phones.txt (phonetic transcript) files. Additionally, an unaligned text corpus, librispeech-LM.txt (800 million tokens, 200,000-word vocabulary, sourced from 14,500 Project Gutenberg books), supports distant-supervision experiments.
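The labeled slices follow LibriSpeech conventions, so a .trans.txt file can be read as one `<utterance-id> <TRANSCRIPT>` pair per line. A minimal parser, assuming that line format:

```python
def read_transcripts(path):
    """Parse a LibriSpeech-style .trans.txt file, where each line is
    '<utterance-id> <TRANSCRIPT...>'. Returns {utt_id: transcript}."""
    transcripts = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, text = line.split(" ", 1)
            transcripts[utt_id] = text
    return transcripts
```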

2. Benchmark Protocols and Evaluation Metrics

Libri-Light is designed to facilitate benchmarking under three distinct supervision regimes, all reporting results on standard LibriSpeech dev and test sets ("dev-clean," "dev-other," "test-clean," "test-other"):

2.1 Zero-resource / Unsupervised Setting

The objective in this protocol is to learn speech representations without access to textual resources. The principal evaluation measure is minimal-pair ABX phone discrimination, assessed both within- and across-speaker conditions. The ABX score, based on comparing vector distances between phonetic categories, is defined as:

\theta(x, y) = \frac{1}{n\,m\,(m-1)} \sum_{a \in S(x)} \sum_{b \in S(y)} \sum_{c \in S(x) \setminus \{a\}} \left[ \mathbf{1}\bigl(d(a, c) < d(a, b)\bigr) + \tfrac{1}{2}\, \mathbf{1}\bigl(d(a, c) = d(a, b)\bigr) \right]

ABX error is computed as \mathrm{average}_{(x,y)}[1-\theta(x,y)], with lower scores indicating better discrimination. Typical baseline (MFCC) performance yields ABX of approximately 11% within and 21% across speakers on dev-clean, while CPC-trained representations on unlab-60k reach approximately 6.1% and 8.1%, respectively.
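The score can be implemented directly as a triple loop over tokens. The sketch below uses scalar stand-in "embeddings" and an absolute-difference distance purely for illustration; real evaluations compare frame sequences with a suitable sequence distance:

```python
def abx_theta(S_x, S_y, d):
    """theta(x, y) per the formula above: over all triples with a, c drawn
    from category x (c != a) and b from category y, the fraction for which
    the same-category token c is closer to a than b is (ties count 1/2)."""
    m, n = len(S_x), len(S_y)
    score = 0.0
    for i, a in enumerate(S_x):
        for b in S_y:
            for j, c in enumerate(S_x):
                if j == i:
                    continue
                if d(a, c) < d(a, b):
                    score += 1.0
                elif d(a, c) == d(a, b):
                    score += 0.5
    return score / (n * m * (m - 1))

# Scalar "embeddings" for two well-separated phone categories:
theta = abx_theta([0.0, 0.1, 0.2], [1.0, 1.1], d=lambda u, v: abs(u - v))
abx_error = 1.0 - theta  # 0.0: perfectly discriminated
```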

2.2 Semi-supervised Setting

This track utilizes limited aligned speech-text data (10 minutes, 1 hour, or 10 hours) from LibriSpeech. Systems are trained as phoneme-CTC or character-CTC listeners. Performance is evaluated using Phoneme Error Rate (PER) and Character Error Rate (CER), computed with the Levenshtein distance. Representative results with 10-hour fine-tuning (PER, %):

| System | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| no pretrain + train-10h | 45.9 | 55.7 | 43.7 | 58.6 |
| CPC unlab-60k + train-10h | 28.4 | 41.4 | 27.9 | 43.6 |
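PER and CER both reduce to a Levenshtein distance normalized by the reference length. A minimal dynamic-programming sketch (the function names are illustrative, not the official scorer):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (phone lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """PER when ref/hyp are phone lists, CER when they are strings (%)."""
    return 100.0 * levenshtein(ref, hyp) / len(ref)
```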

2.3 Distant-supervision Setting

In this regime, large quantities of unlabelled speech and unaligned text (librispeech-LM.txt) are employed to build ASR models using a CTC acoustic model and a 4-gram KenLM language model for decoding. The main metric is Word Error Rate (WER):

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%,

where S denotes the number of substitutions, D deletions, I insertions, and N the number of words in the reference. Baseline results include:

| System | test-clean WER |
|---|---|
| Supervised 1,000 h (GatedConv + 4-gram) | 4.8 |
| CPC unlab-60k + train-10h + 4-gram | 43.9 |
| MFSC TDS train-10h + 60k pseudo-labels + 4-gram | 30.1 |
| MFSC TDS train-10h + 60k pseudo-labels (phoneme) | 29.3 |
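The S, D, and I counts in the WER formula come from a minimum-cost alignment between the reference and hypothesis word sequences. A small sketch that recovers them by backtracing the edit-distance table (illustrative, not the wav2letter++ scorer):

```python
def wer_counts(ref, hyp):
    """Align two word sequences with DP and return (S, D, I): substitution,
    deletion, and insertion counts from a minimum-cost alignment."""
    R, H = len(ref), len(hyp)
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i
    for j in range(1, H + 1):
        cost[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost[i][j] = min(cost[i - 1][j] + 1,      # deletion
                             cost[i][j - 1] + 1,      # insertion
                             cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    # Backtrace to attribute each edit to a type.
    S = D = I = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, D, I

def wer(ref, hyp):
    S, D, I = wer_counts(ref, hyp)
    return 100.0 * (S + D + I) / len(ref)
```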

3. Baseline Systems and Training Methodologies

3.1 Unsupervised (CPC) Baseline

The baseline Contrastive Predictive Coding (CPC) model comprises:

  • Encoder g_c: 5 convolutional layers (strides [5,4,2,2,2], filter sizes [10,8,4,4,4], 256 units), outputting one frame per 10 ms.
  • Autoregressor g_{ar}: single-layer LSTM (256-dimensional).
  • Predictor g_p: one-layer Transformer, trained with a contrastive loss to predict frames k = 1, \ldots, 12 ahead. Training is performed on unlab-600, unlab-6k, and unlab-60k (≈2 days on 128×V100 GPUs, batch size 32×1,280 ms sequences per GPU).
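A quick sanity check on the encoder geometry: the product of the strides fixes the output frame rate, and together with the filter sizes it fixes the receptive field. The arithmetic below reproduces the "one frame per 10 ms" figure at 16 kHz (the receptive-field value is derived here, not stated in the source):

```python
from functools import reduce

strides = [5, 4, 2, 2, 2]
filters = [10, 8, 4, 4, 4]

# Total stride: input samples consumed per output frame.
total_stride = reduce(lambda a, b: a * b, strides)  # 160 samples
frame_ms = 1000.0 * total_stride / 16000            # 10.0 ms at 16 kHz

# Receptive field: how many input samples one output frame can see.
receptive, jump = 1, 1
for k, s in zip(filters, strides):
    receptive += (k - 1) * jump
    jump *= s
receptive_ms = 1000.0 * receptive / 16000           # ~29 ms
```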

3.2 Semi-supervised (CPC+CTC) Baseline

The CPC encoder and autoregressor are frozen. A linear layer with CTC loss is appended to the g_{ar} outputs and fine-tuned on phonetic labels of length 10 min, 1 h, or 10 h. As a control, an identical architecture is trained end-to-end from scratch.

3.3 Distant-supervision Baseline

(A) CPC Pretrain + CTC (LSTM) + 4-gram LM:

The CTC network is fine-tuned on limited labels and decoded with wav2letter++ beam search and KenLM 4-gram.

(B) Mel-Filterbank + TDS CTC + Pseudo-labels:

A small TDS model (20M parameters, stride 2, blocks [2,2,3]) is trained on the limited labels and used with beam-search decoding to generate pseudo-labels for unlab-60k (36 s segments); a larger TDS (37M parameters, 11 blocks, stride 2) is then retrained from scratch on this pseudo-labeled corpus. Optimization: SGD with momentum 0.5, LR 0.1 halved every 30 epochs, 150 epochs total; dropout 0.4 (20M) and 0.1 (37M); beam size 1,000.
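The two-stage recipe in (B) can be sketched as a pipeline. Here `small_model` and `large_model_trainer` are stubs standing in for the trained 20M-parameter decoder and the 37M-parameter retraining step; only the data flow is real:

```python
def pseudo_label_pipeline(small_model, large_model_trainer, labeled, unlabeled):
    """Sketch of the two-stage pseudo-labeling recipe.

    small_model: any callable audio -> transcript (stand-in for the small
    TDS model plus beam-search decoding).
    large_model_trainer: stand-in for retraining the larger TDS from
    scratch on the combined (audio, transcript) pairs.
    """
    # Stage 1: decode the unlabelled pool with the small supervised model.
    pseudo_labeled = [(audio, small_model(audio)) for audio in unlabeled]
    # Stage 2: retrain a larger model from scratch on the original labeled
    # slice concatenated with the pseudo-labels.
    return large_model_trainer(labeled + pseudo_labeled)
```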

4. Comparison with Existing Speech Resources

Libri-Light’s scale and annotation distinguish it from contemporaneous corpora:

| Corpus | Unsupervised Hours | Languages | Transcript Coverage | Metadata Scope |
|---|---|---|---|---|
| Libri-Light | 60,000 | 1 (en) | Partially transcribed | SNR, VAD, genre, speaker ID |
| LibriSpeech | 1,000 | 1 (en) | Fully transcribed | Minimal metadata |
| CommonVoice | 2,900 | 37 | Mixed | Typically minimal |

Libri-Light is primarily untranscribed, but offers labeled slices (10 minutes to 10 hours) for semi-supervised and distant-supervision work, as well as extensive metadata for SNR, VAD, genre, and speaker identity. Principal use cases include zero-resource acoustic representation research, semi-supervised ASR pretraining, and large-scale pseudo-labeled training via distant-supervision.

5. Guidelines and Recommendations for Usage

  • Zero-resource Research: Utilize unlab-600, unlab-6k, and unlab-60k to systematically study embedding quality as a function of data scale. Official VAD segments are suitable for frame extraction; SNR tags assist with filtering noisy segments. ABX evaluations should use the provided forced-alignments on dev/test sets.
  • Semi-supervised Learning: Adopt CPC or alternative pretrained models, fine-tuned sequentially from train-10m to train-10h. Both orthographic and phonetic transcripts are provided, supporting CTC and sequence-to-sequence listeners. Standard Levenshtein aligners are recommended for PER and CER computation.
  • Distant-supervision Strategies: Train 4-gram language models on librispeech-LM.txt with KenLM. Fine-tune acoustic models on the limited labeled data, produce pseudo-labels on the unlab-60k split, and retrain larger models on the combined set.
  • Metadata Utilization: SNR enables quality-based file filtering. Genre tags support the creation of domain-balanced subsets and facilitate domain adaptation experiments. Use speaker_id fields to ensure speaker disjointness in custom train/test splits.
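For the last point, speaker disjointness is easy to enforce by splitting at the speaker level rather than the segment level; a sketch assuming each segment's metadata dict carries a `speaker_id` field:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(segments, test_fraction=0.1, seed=0):
    """Split segment metadata into train/test so that no speaker_id
    appears on both sides."""
    by_speaker = defaultdict(list)
    for seg in segments:
        by_speaker[seg["speaker_id"]].append(seg)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train = [s for spk in speakers[n_test:] for s in by_speaker[spk]]
    test = [s for spk in test_speakers for s in by_speaker[spk]]
    return train, test
```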

All datasets, metadata schema, preprocessing scripts, evaluation tools (ABX, PER/CER/WER calculators), and baseline recipes are publicly accessible at https://github.com/facebookresearch/libri-light (Kahn et al., 2019).
