
Multilingual Audio-Visual Speech Recognition Dataset

Updated 2 February 2026
  • Multilingual AV speech recognition datasets are curated collections that synchronize audio, visual, and textual data for robust multilingual model development.
  • They employ meticulous alignment of lip movements with audio streams and time-synced transcripts to support applications like lip reading and speech translation.
  • These datasets drive advances in noise robustness and cross-modal, zero-shot recognition, serving as benchmarks for developing universal AVSR systems.

A multilingual audio-visual speech recognition (AVSR) dataset is a curated collection of synchronized audio, video, and textual transcriptions supporting speech recognition and related tasks across multiple languages. These datasets are foundational for developing, training, and evaluating AVSR models capable of leveraging both auditory and visual information (typically lip/mouth movements) in diverse linguistic and acoustic settings. Recent datasets exhibit considerable scaling in size, language coverage, and the complexity of data collection and alignment pipelines, alongside a trend toward noise robustness and domain generalization.

1. Core Principles, Objectives, and Modalities

Multilingual AVSR datasets are characterized by the following principles:

  • Audio-Visual Synchronization: Accurate temporal alignment of video (focusing on the lip/mouth region) with the corresponding audio stream per utterance.
  • Multilingual Coverage: Inclusion of multiple languages—spanning broad language families and resource levels—to facilitate multilingual and cross-lingual system development.
  • Textual Annotation: Provision of high-quality, time-aligned transcripts in the utterance language, with some datasets also offering translated targets and/or romanized forms.
  • Environmental Diversity: Sourcing speech "in the wild" (e.g., YouTube, TED) introduces variability in noise, accents, lighting, and backgrounds—intentionally stress-testing robustness.

Data modalities typically include 16 kHz mono audio (log-Mel or raw waveform), high-resolution (e.g., 224×224 px or upsampled) lip/mouth crops at 25-30 frames per second, and UTF-8 or ASCII-encoded transcripts.
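A minimal sketch of how such a synchronized sample might be represented and validated — the class name, field layout, and one-frame drift tolerance are illustrative assumptions, not any dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AVUtterance:
    """One synchronized sample: audio length, lip-crop frame count, transcript."""
    n_audio_samples: int   # 16 kHz mono waveform length
    n_video_frames: int    # 224x224 lip crops at 25 fps
    text: str              # UTF-8 transcript
    sample_rate: int = 16000
    fps: float = 25.0

    def sync_error_seconds(self) -> float:
        """Absolute difference between audio and video durations."""
        return abs(self.n_audio_samples / self.sample_rate
                   - self.n_video_frames / self.fps)

def is_synced(u: AVUtterance, tol: float = 0.04) -> bool:
    # one video frame at 25 fps spans 40 ms, so tolerate up to one frame of drift
    return u.sync_error_seconds() <= tol
```

A duration-consistency check of this kind is a cheap first-pass filter before the heavier cross-modal verification (e.g., SyncNet-style scoring) described below.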

2. Major Dataset Initiatives: Composition and Scale

The field is defined by several prominent multilingual AVSR datasets with varying size, annotation protocol, and language breadth:

| Dataset | Hours (Total) | Languages (Count) | Annotation | Distinctives |
|---|---|---|---|---|
| ViSpeR | 3,653 | 5 | Auto (Whisper, Seamless-M4T) | Largest non-English AVSR corpus; YouTube "in-the-wild"; "Wild" and "TedX" splits; speaker/gender/dialect metadata not given (Narayan et al., 2024) |
| MARC | 2,916 | 82 | Auto + Human (romanized, grapheme) | Super-set of LRS3, MuAViC, VoxCeleb2, AVSpeech; romanized & grapheme alignment (Yeo et al., 8 Mar 2025) |
| MuAViC | 1,200 | 9 | Human (+pseudo for ST) | TED/LRS3/mTEDx; English plus 8 non-English; AVSR & AVST splits (Anwar et al., 2023; Rouditchenko et al., 3 Feb 2025) |
| AVMuST-TED | 706 | 5 (EN→ES/FR/IT/PT) | Human (TED-tier) | Speech translation (lip/AV); paired EN source + target translations (Cheng et al., 2023) |
| VSR Low-Resource | ~1,287 | 4 | Auto (Whisper) | Large-scale "label bootstrapping" on VoxCeleb2/AVSpeech in FR/IT/ES/PT (Yeo et al., 2023) |

ViSpeR: Compiles 3,653 hours across Chinese, Spanish, Arabic, and French (plus a test-only English set), harvested from public YouTube with extensive filtering for lip-audio activity to produce 3.0–1.2M utterances per language. No explicit demographic, dialect, or per-speaker annotation is provided, but the scale surpasses previous non-English resources by up to 70× (Narayan et al., 2024).

MARC: Integrates 2,916 hours spanning 82 languages (11 language families), drawing from LRS3, MuAViC, VoxCeleb2, AVSpeech. Each utterance includes both language-specific grapheme and Romanized text outputs; Romanization employs LLM prompting rather than rigid rules (Yeo et al., 8 Mar 2025).

MuAViC: Comprises ≈1,200 hours, nine languages, with human-aligned utterance-level transcripts (LRS3 for English, mTEDx for others); designed for robustness and translation experiments, with balanced dev/test splits and stress testing under noise (Anwar et al., 2023, Rouditchenko et al., 3 Feb 2025).

AVMuST-TED: 706 hours of English TED/TEDx videos, each paired with native manual translations into Spanish, French, Italian, Portuguese—specialized for speech translation and robust cross-lingual lip reading benchmarks (Cheng et al., 2023).

Low-Resource VSR dataset: Leverages Whisper for high-precision language selection and automatic transcriptions in French, Italian, Spanish, Portuguese, achieving state-of-the-art VSR accuracy with minimal human supervision (Yeo et al., 2023).

3. Data Collection, Preprocessing, and Annotation Protocols

Modern multilingual AVSR corpora adopt highly structured, partially automated pipelines:

  • Source Acquisition: Mining from YouTube, TED, TEDx—the latter offering high-quality audio/video and (for English/translations) native human-reviewed texts.
  • Speech and Active Speaker Verification: Initial automatic video pre-filtering (binary classifiers, face detection/tracking, e.g., YOLOv5n0.5-Face, Dlib, RetinaFace), shot segmentation (3D histogram difference, SyncNet), and cross-modal verification ensure that the mouth ROI and speaker activity match.
  • Transcript Generation: Primary speech recognition annotation is achieved using automatic ASR (Whisper; Seamless-M4T) for both segment extraction and transcript labeling. For translation-oriented datasets, only TED talks with verified human translation are included.
  • Segment Alignment: Word-level alignments from ASR outputs (e.g., Whisper) synchronize audio and video; for test splits, dual verification via an additional ASR or MT system (e.g., Seamless-M4T) is used to guarantee transcript fidelity.
  • Noise, Diversity, and Data Augmentation: “In-the-wild” recordings encompass natural environmental variability; controlled noise is added in experimental protocols (MuAViC) to further increase noise robustness.
  • Romanization and Language Diversity: LLM-based romanization (MARC) increases linguistic inclusivity and bridges script gaps for cross-lingual or zero-shot schemes (Yeo et al., 8 Mar 2025).
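The segment-alignment step above — turning word-level ASR timestamps into utterance-length clips — can be sketched as follows. The `max_dur` and `max_gap` thresholds are illustrative assumptions, not the published pipelines' actual values:

```python
def group_words(words, max_dur=15.0, max_gap=1.0):
    """Group (word, start, end) tuples into utterance segments.

    Starts a new segment when the inter-word silence exceeds max_gap
    seconds or the segment would exceed max_dur seconds.
    """
    segments, current = [], []
    for word, start, end in words:
        if current:
            seg_start = current[0][1]
            gap = start - current[-1][2]
            if gap > max_gap or end - seg_start > max_dur:
                segments.append(current)
                current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    # collapse each group into (text, segment_start, segment_end)
    return [(" ".join(w for w, _, _ in seg), seg[0][1], seg[-1][2])
            for seg in segments]
```

For example, a long pause between words splits the stream into two clips, each of which is then paired with the matching span of lip-crop frames.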

Data selection pipelines additionally incorporate robust language identification via large-scale models (Whisper, MMS-LID-1024), out-of-alphabet filtering, and, where possible, subword tokenization (e.g., SentencePiece; vocabulary sizes up to 21,000 for ViSpeR).
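The out-of-alphabet filtering mentioned above can be sketched as a simple character-set check. The alphabet definitions here are illustrative, not the pipelines' actual per-language character inventories:

```python
# Illustrative per-language alphabets; real pipelines derive these
# from the target language's full character inventory.
ALPHABETS = {
    "es": set("abcdefghijklmnñopqrstuvwxyzáéíóúü"),
    "fr": set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ"),
}
PUNCT_OK = set(" '-")

def passes_alphabet_filter(text: str, lang: str) -> bool:
    """Reject transcripts containing characters outside the language's alphabet."""
    allowed = ALPHABETS[lang] | PUNCT_OK
    return all(ch in allowed for ch in text.lower())
```

Segments whose transcripts fail the check (e.g., digits, foreign scripts, ASR artifacts) are dropped before training.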

4. Dataset Organization, Splits, and Quality Control

Dataset organization is characterized by clearly defined train/validation/test splits, often aligned to source conventions:

| Dataset | Train Hours | Val Hours | Test Hours | #Speakers | Test Sets | Selection Mode |
|---|---|---|---|---|---|---|
| ViSpeR | All except test | — | ~6–8 h/lang | Not specified | TedX+Wild | YouTube-mined, auto splits |
| MARC | ~80–90% | ~5–10% | ~5–10% | Inferred from sources | Follows LRS3/MuAViC | Stratified, per source corpus |
| MuAViC | ≈1,028 | ≈57 | ≈57 | ≈8,000 (est.) | Standard | TED-talk-based |
| AVMuST-TED | ~670 | — | ~4K clips | Not specified | Random | Speaker/topic not enforced |
| VSR Low-Resource | 1,000+ | mTEDx dev | mTEDx test | Up to 150,000 | mTEDx | Test remains human-labeled |

Quality assurance leverages:

  • Stringent segment filtering and deduplication.
  • Secondary ASR/MT transcript agreement for test/validation splits.
  • Character-level and token-level filtering to match language-specific alphabets.
  • Reconstruction tests (for MARC) to verify romanization fidelity.
  • Benchmarks against held-out, ground-truth-labeled data for automatic transcriptions, reporting WERs of 7–12% for Whisper auto-labels on mTEDx subsets (Yeo et al., 2023).
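The secondary ASR/MT transcript-agreement check above can be sketched with a similarity ratio over two independent hypotheses; `difflib.SequenceMatcher` and the 0.9 threshold are illustrative substitutes for the edit-distance criteria the actual pipelines use:

```python
from difflib import SequenceMatcher

def transcripts_agree(hyp_a: str, hyp_b: str, threshold: float = 0.9) -> bool:
    """Keep a segment only if two independent systems produce
    near-identical transcripts (case-insensitive similarity)."""
    ratio = SequenceMatcher(None, hyp_a.lower(), hyp_b.lower()).ratio()
    return ratio >= threshold
```

Segments where, say, a Whisper label and a Seamless-M4T label diverge are excluded from validation and test splits, limiting label noise where it matters most.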

5. Linguistic and Technical Breadth

These datasets collectively advance the following axes:

  • Language Families and Dialect Spread: Coverage ranges from high-resource (EN, ES, FR, ZH, AR, PT, IT, DE, RU, EL) to minor, underrepresented languages (e.g., Swahili, Occitan, Kazakh, Somali in MARC). For MuAViC, the long-tailed resource distribution is intentionally preserved to facilitate cross-lingual learning and noise-robust modeling (Rouditchenko et al., 3 Feb 2025).
  • Recording Conditions: Unconstrained ("in-the-wild") conditions are prevalent, ensuring diversity in background environments, lighting, and speaker demographics (though explicit statistics are rarely reported for gender/dialect/source bias).
  • Annotation Modes: Manual (TED/mTEDx), automatic (Whisper, Seamless-M4T, MMS-ASR), and hybrid. Translation datasets (AVMuST-TED, MuAViC) provide human-verified translations; ViSpeR and the Low-Resource VSR dataset rely on automatic annotation for most languages, with only minimal manual validation of test sets.

6. Benchmarking and Downstream Tasks

Key supported AVSR tasks include:

  • Monolingual and Multilingual AVSR: Augments traditional ASR with visual cues, yielding substantial WER improvements in noise (e.g., ViSpeR: 4.4–15.4% WER/CER on AVSR tasks; MuAViC: 32% relative WER reduction in noise) (Narayan et al., 2024, Anwar et al., 2023, Rouditchenko et al., 3 Feb 2025).
  • Visual-Only Speech Recognition (Lip Reading): Assesses vision-alone accuracy; for instance, ViSpeR’s VSR error for Spanish is 39.4% vs. 4.4% for AV.
  • Audio-Visual Speech Translation: Direct mapping from AV signals to translated text—enabled in AVMuST-TED and MuAViC.
  • Cross-Modal and Zero-Shot Transfer: MARC enables inference in languages excluded from AV-model training by leveraging romanized representations and LLMs for “de-romanization,” establishing a path for genuinely universal AVSR (Yeo et al., 8 Mar 2025).

Systems are typically evaluated using WER (word error rate), CER (character error rate), and, for translation, BLEU metrics. State-of-the-art models demonstrate that multimodal fusion and multilingual joint training drastically improve robustness, noise tolerance, and transferability in AVSR benchmarks.
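As a concrete reference point, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length; a minimal dynamic-programming implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

CER is the same computation over characters instead of words; BLEU, used for the translation tasks, instead measures n-gram precision against reference translations.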

7. Distribution, Licensing, and Community Access

All described datasets release scripts, metadata, and splits under research-oriented licenses (e.g., CC BY-SA 4.0, non-commercial). Actual video or audio files derived from platforms such as YouTube remain subject to their original copyright terms; most datasets distribute lists of video IDs with download and preprocessing tools. Opt-out and takedown options (e.g., for ViSpeR, [email protected]) are standard. Data filtering, annotation, and even code for model reproduction are made publicly available via project repositories (Narayan et al., 2024, Yeo et al., 8 Mar 2025, Anwar et al., 2023, Yeo et al., 2023).


Multilingual audio-visual speech recognition datasets form a crucial foundation for research in noise-robust, generalizable, and linguistically equitable speech modeling. The continual expansion in language coverage, data diversity, robust automatic annotation, and novel downstream task enablement signals sustained evolution in this area. This direction enables unified AVSR systems, cross-modal transfer, and eventual universal speech understanding spanning a broad spectrum of the world’s languages (Narayan et al., 2024, Yeo et al., 8 Mar 2025, Anwar et al., 2023, Yeo et al., 2023, Rouditchenko et al., 3 Feb 2025, Cheng et al., 2023).
