
LRS Dataset: Audio-Visual Lip Reading

Updated 28 January 2026
  • LRS Dataset is a large-scale audio-visual corpus of UK television broadcasts that supports sentence-level lip reading and audio-visual speech recognition.
  • It provides hundreds of hours of face-cropped videos paired with aligned transcripts, covering diverse speakers, genres, accents, and challenging in-the-wild conditions.
  • Benchmark evaluations report detailed metrics such as CER, WER, and SAR, and the corpus supports architectures ranging from CTC-based models to seq2seq transformers.

Lip Reading Sentences (LRS) Dataset refers to a family of large-scale audio-visual corpora of unconstrained spoken sentences extracted from UK television broadcasts. These datasets, pioneered by Chung et al., serve as the principal benchmarks for sentence-level visual speech recognition and audio-visual Automatic Speech Recognition (AVASR), supporting audio-only, video-only (lip reading), and audio-visual modeling. The datasets, including LRS, LRS2, and LRS2-BBC, feature hundreds of hours of face-cropped videos synchronized with aligned text transcripts, supporting open-vocabulary, continuous-speech recognition tasks "in the wild" across diverse speakers, genres, accents, and environmental conditions.

1. Data Sources, Collection, and Annotation

LRS and its successors (notably LRS2 and LRS2-BBC) were constructed by systematically mining thousands of hours of British television archives, predominantly BBC news, debates, interviews, documentaries, and talk shows recorded between 2010 and 2016 (Chung et al., 2016; Afouras et al., 2018). The canonical pipeline consists of the following:

  • Shot boundary detection: Color-histogram comparison between frames segments the continuous broadcasts into shots.
  • Face detection, tracking, and mouth localization: Detectors such as SSD (or HOG/dlib for early datasets) operate framewise, with spatial association for persistent tracking. Facial landmark localization (e.g., regression trees) identifies the mouth region to crop a fixed window (e.g., 120×120 px to 224×224 px).
  • Audio-visual synchronization and speaker verification: Two-stream CNNs are trained to maximize lip–audio synchrony, rejecting off-screen (voice-over) segments and misalignments.
  • Subtitle and transcript alignment: Closed-caption subtitles are extracted and force-aligned to the audio via Penn Phonetics Lab Forced Aligner (P2FA) or Kaldi. IBM Watson ASR checks filter gross alignment errors.
  • Utterance segmentation: Sentences are segmented at punctuation, with length constraints (≤100 characters/≤10 seconds), removing clips with occlusions, low quality, or multiple speakers.
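The first stage of this pipeline can be sketched with a plain color-histogram comparison between successive frames. The sketch below uses NumPy only; the bin count and cut threshold are illustrative assumptions, not values published for the LRS pipeline.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """L1-normalized per-channel color histogram of an HxWx3 uint8 frame."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(frame.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.5):
    """Flag a shot cut wherever successive frame histograms differ by more
    than `threshold` in L1 distance. `threshold` is a hypothetical setting."""
    cuts = []
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

In the real pipeline the shots found this way are then passed to the face-detection and synchronization stages listed above.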

Each clip comes with a ground-truth transcript and metadata including program, date, and temporal information.

2. Dataset Composition and Statistical Properties

The LRS2 and LRS2-BBC datasets contain a large and diverse set of face-cropped videos mapped to natural language sentences:

| Split | Utterances | Word Tokens | Vocabulary Size | Duration (hours) |
|---|---:|---:|---:|---:|
| Pre-train (LRS2/LRS2-BBC) | 96,000 | 2,000,000 | 41,000 | 195 |
| Train/Validation | 47,000 | 337,000 | 18,000 | 29 |
| Test | 1,243 | 6,663 | 1,693 | 0.5 |

The dataset supports an open vocabulary (41,000 words in the pre-train split, 1,693 in the test split) with a pronounced Zipf-like frequency distribution, and includes sentence-level, word-aligned, and character-aligned transcripts for rich modeling (Afouras et al., 2018). The average sentence length on the test split is approximately 5.36 words (Fenghour et al., 2020). Supplemental text corpora offer over 8 million utterances and 26 million word tokens for external language-model training.
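The quoted average sentence length follows directly from the test-split statistics above: word tokens divided by utterances.

```python
# Average test-split sentence length implied by the statistics above.
test_word_tokens = 6663
test_utterances = 1243

avg_len = test_word_tokens / test_utterances
print(round(avg_len, 2))  # → 5.36
```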

3. Preprocessing, Tokenization, and Viseme Mapping

Preprocessing includes grayscale conversion, per-channel normalization, central cropping of the mouth region (typically 112×112 px or 120×120 px), and spatial/temporal data augmentation during training (horizontal flips, shifts, random frame dropout) (Afouras et al., 2018). Transcripts are force-aligned to audio at word/character level.
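A minimal NumPy sketch of this preprocessing, assuming per-clip normalization and illustrative augmentation probabilities (the papers' exact statistics and ranges may differ):

```python
import numpy as np

def preprocess_clip(frames, crop=112, train=False, rng=None):
    """Sketch of the preprocessing described above.
    `frames`: (T, H, W, 3) uint8 video. Returns (T, crop, crop) float32."""
    rng = rng or np.random.default_rng()
    # grayscale conversion (ITU-R BT.601 luma weights)
    gray = frames.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    # central crop of the mouth region
    T, H, W = gray.shape
    y0, x0 = (H - crop) // 2, (W - crop) // 2
    gray = gray[:, y0:y0 + crop, x0:x0 + crop]
    # per-clip normalization to zero mean / unit variance (an assumption;
    # per-channel statistics could be used instead)
    gray = (gray - gray.mean()) / (gray.std() + 1e-8)
    if train:
        if rng.random() < 0.5:           # random horizontal flip
            gray = gray[:, :, ::-1]
        drop = rng.random(T) < 0.05      # random frame dropout (~5%, illustrative)
        gray[drop] = 0.0
    return gray
```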

For visual-only models leveraging a viseme mapping, phonetic transcriptions (from the CMU Pronouncing Dictionary) are mapped to 14 viseme classes as per Lee & Yook, comprising:

  • 6 consonant visemes: {p, t, k, ch, f, w}
  • 7 vowel visemes: {iy, ey, aa, ah, ao, uh, er}
  • 1 silence viseme ('s' = sil)

This abstraction is necessary because of the high incidence of homophemes—distinct words sharing indistinguishable viseme sequences—present in LRS2 (Fenghour et al., 2020).
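The mapping can be represented as a simple lookup table. Only the 14 class labels come from the text above; the grouping of additional CMU phonemes into each class is an assumption for illustration and may differ from the Lee & Yook tables.

```python
# Illustrative phoneme-to-viseme lookup for the 14 classes listed above.
PHONEME_TO_VISEME = {
    # consonant visemes (class representatives plus assumed members)
    "P": "p", "B": "p", "M": "p",
    "T": "t", "D": "t",
    "K": "k", "G": "k",
    "CH": "ch", "JH": "ch", "SH": "ch",
    "F": "f", "V": "f",
    "W": "w", "R": "w",
    # vowel visemes (representatives from the list above)
    "IY": "iy", "EY": "ey", "AA": "aa", "AH": "ah",
    "AO": "ao", "UH": "uh", "ER": "er",
    # silence
    "SIL": "s",
}

def to_visemes(phonemes):
    """Map a CMU phoneme sequence to visemes, stripping stress digits.
    Unknown phonemes fall back to silence here, purely for illustration."""
    return [PHONEME_TO_VISEME.get(p.rstrip("012"), "s") for p in phonemes]
```

For example, "beat" (B IY1 T) and "peat" (P IY1 T) both map to (p, iy, t), which is exactly the homopheme collision the surrounding text describes.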

4. Benchmark Splits, Evaluation Metrics, and Model Setup

Data splits: LRS2 provides distinct “pre-train” (short fragments), main “train/val” (full sentences), and test sets, partitioned chronologically with minimal speaker overlap between splits (Chung et al., 2016; Afouras et al., 2018).

Recognition metrics (for character-, word-, or viseme-level tasks):

  • $N$: total reference tokens
  • $S$: substitutions; $D$: deletions; $I$: insertions

$\mathrm{ErrorRate} = \frac{S + D + I}{N}$

Sentence accuracy (SAR) is defined as $1$ if the prediction matches the reference exactly, $0$ otherwise (Fenghour et al., 2020).
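Both metrics can be computed with a standard Levenshtein alignment; the sketch below works at character, word, or viseme level depending on how the sequences are tokenized.

```python
def error_rate(ref, hyp):
    """(S + D + I) / N via Levenshtein alignment of two token sequences."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # all deletions
    for j in range(m + 1):
        d[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

def sentence_accuracy(ref, hyp):
    """SAR for one utterance: 1 for an exact match, 0 otherwise."""
    return int(ref == hyp)
```

Passing word lists gives WER, character lists give CER, and viseme lists give the viseme error rate.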

Performance on LRS2 test split (Fenghour et al., 2020):

| Scenario | CER (%) | WER (%) | SAR (%) |
|---|---:|---:|---:|
| Known word boundaries | 10.7 | 18.0 | 56.8 |
| Unknown word boundaries | 36.1 | 48.3 | 35.1 |

Additional error rates for lips-only and AVSR settings are reported in original and related works (Chung et al., 2016).

5. Model Architectures and Decoding Strategies

LRS2 and LRS2-BBC support diverse decoder architectures:

  • CTC-based models: Sequence models with monotonic alignment between labels and frame sequences; decoding uses a prefix beam search. External character-level LMs are integrated via log-linear interpolation with a length penalty (Afouras et al., 2018).
  • Seq2seq Transformer models: Decoder generates predictions autoregressively, optionally with LM fusion during beam search; scoring is formulated as:

$\text{score}(x, y) = \dfrac{\log p_{\mathrm{AM}}(y \mid x) + \alpha \log p_{\mathrm{LM}}(y)}{\left(\frac{5 + |y|}{6}\right)^{\beta}}$

where $p_{\mathrm{AM}}$ is the acoustic-model likelihood, $p_{\mathrm{LM}}$ the LM probability, and $(\alpha, \beta)$ are tunable hyperparameters.

  • Viseme-to-word mapping using LMs: (Fenghour et al., 2020) applies a beam search (beam width 50) over candidate word sequences constrained by viseme clusters, scoring each sentence with the perplexity of a pre-trained GPT. This black-box LM is not fine-tuned on LRS2; it resolves the one-to-many homopheme mapping inherent in visual speech by perplexity minimization.
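The length-normalized fusion score defined above is straightforward to implement. In this sketch, `alpha` and `beta` defaults are placeholders; the papers tune them on validation data.

```python
def beam_score(log_p_am, log_p_lm, length, alpha=0.5, beta=0.9):
    """Length-normalized log-linear fusion of acoustic-model and LM scores,
    following the score(x, y) formula above.

    log_p_am: log-likelihood of hypothesis y under the acoustic/visual model
    log_p_lm: log-probability of y under the external language model
    length:   |y|, the hypothesis length in tokens
    """
    penalty = ((5.0 + length) / 6.0) ** beta
    return (log_p_am + alpha * log_p_lm) / penalty
```

Because log-probabilities are negative, dividing by a penalty that grows with |y| pulls long hypotheses' scores toward zero, counteracting the beam search's bias toward short outputs.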

6. Qualitative and Quantitative Challenges

LRS2 and its variants pose several formidable challenges for sentence-level lip reading:

  • In-the-wild variability: Clips feature speakers with diverse accents, ages, and identities, with dynamic lighting, pose variation, occlusions, background clutter, and environmental noise (Afouras et al., 2019, Afouras et al., 2018).
  • Long sentences and open vocabulary: Sentences span up to 100 characters with a long-tailed vocabulary, including many rare proper nouns and out-of-vocabulary terms (Afouras et al., 2018).
  • Alignment uncertainties: Forced-alignment of subtitles is imperfect; subtitles are often non-verbatim and sentence boundaries are estimated by punctuation (Chung et al., 2016).
  • Homopheme ambiguity: Many distinct words share identical viseme sequences, particularly acute in LRS2, motivating advanced language-model-guided decoding (Fenghour et al., 2020).
  • Word boundary discovery: Error rates sharply increase if word boundaries are unknown, as demonstrated in viseme-based decoding scenarios (e.g., WER rising from 18.0% to 48.3% (Fenghour et al., 2020)).

A plausible implication is that even with large datasets and advanced models, pure visual decoding remains fundamentally limited by the ambiguities of the visual speech channel, regardless of model scale or pre-training.

7. Licensing, Usage, and Limitations

LRS2-BBC is publicly released by the authors, with data and resources available via the provided project websites. The original LRS dataset was under a research-use license; formal licensing details for LRS2 should be confirmed at the dataset project page (Afouras et al., 2018, Chung et al., 2016).

Known dataset limitations include:

  • Domain bias: Largely restricted to British English in broadcast news and interview settings.
  • Speaker demographics: Faces are only those visible in the selected BBC programs, limiting representation.
  • Fixed head pose/crop: Extreme head rotations and off-axis mouth views are excluded.
  • Sentence length truncation: No monologues longer than 100 characters or 10 seconds.
  • Transcript fidelity: Subtitles are not always verbatim and may omit certain disfluencies or colloquialisms.

These properties should be considered when generalizing model performance or deploying models trained on LRS2-like data to other domains.

