
TidyVoice Dataset: Multilingual Speaker Verification

Updated 27 January 2026
  • TidyVoice is a large-scale multilingual read-speech dataset that aggregates over 212,000 speakers with rigorous speaker identity cleaning to support ASV research.
  • Baseline systems employ deep ResNet architectures on standardized 16 kHz WAV audio to extract robust speaker embeddings across 81 languages.
  • The dataset provides extensive evaluation conditions, including Tidy-M and Tidy-X benchmarks, to facilitate fair comparisons in both monolingual and cross-lingual speaker verification.

TidyVoice is a large-scale, multilingual, read-speech dataset curated for automatic speaker verification (ASV) research, derived from the Mozilla Common Voice (MCV) corpus. Through systematic speaker identity cleaning and the design of monolingual and cross-lingual evaluation conditions, TidyVoice addresses the lack of publicly available resources for robust and fair multilingual ASV. The dataset comprises over 212,000 monolingual and approximately 4,500 multilingual speakers across 81 and 40 languages respectively, offering both extensive training material and diverse evaluation trials. Baseline systems leveraging deep ResNet architectures highlight the improved performance and generalization enabled by TidyVoice, setting new standards for multilingual read-speech verification (Farhadipour et al., 22 Jan 2026).

1. Motivation and Context

Prevailing benchmarks for ASV, such as VoxCeleb and NIST SRE, predominantly feature in-the-wild spontaneous speech, often in English, and may be restricted by access or licensing costs. Real-world applications—particularly anti-spoofing and text-prompted authentication—demand large-scale, multilingual corpora of read speech with verifiable speaker identity. The TidyVoice dataset was created to fulfill this requirement, offering a publicly available, multilingual, and large-scale read-speech resource with high-integrity speaker labeling, suitable for training and evaluating ASV systems in multiple languages (Farhadipour et al., 22 Jan 2026).

2. Data Curation and Speaker Identity Cleaning

TidyVoice is sourced from the Mozilla Common Voice corpus, a crowd-sourced project where recordings are annotated with "client IDs" as proxies for speaker identity. Prior analyses demonstrated significant speaker heterogeneity: client IDs may aggregate recordings from multiple, distinct speakers, undermining downstream ASV reliability.

To mitigate this, TidyVoice employs a verification-based cleaning protocol:

  • A ResNet-293 model, pre-trained on VoxBlink2 and VoxCeleb2, is used to extract speaker embeddings.
  • Within each monolingual client ID, cosine similarities between a designated enrollment utterance and all others are computed. Recordings with similarity scores below 0.4 are excluded, as they are likely from different speakers.
  • For multilingual client IDs (spanning multiple languages), cross-language cosine similarities are computed; any client ID exhibiting substantial dissimilarity (many scores < 0.2) is removed entirely (433 out of 4,907 IDs, approximately 9%).
  • Only speakers with at least five valid utterances are retained. All audio is trimmed, normalized, and standardized to 16 kHz WAV format (Farhadipour et al., 22 Jan 2026).
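The per-client-ID similarity filter described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name `clean_client_id` and the choice of the first utterance as enrollment are assumptions, while the 0.4 cosine threshold and the five-utterance minimum come from the paper:

```python
import numpy as np

def clean_client_id(embeddings, enroll_idx=0, threshold=0.4):
    """Keep only recordings whose cosine similarity to the designated
    enrollment utterance meets the threshold (0.4 for monolingual IDs
    in TidyVoice). Returns the indices of retained recordings."""
    embs = np.asarray(embeddings, dtype=float)
    # L2-normalize so that the dot product equals cosine similarity.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs[enroll_idx]
    return [i for i, s in enumerate(sims)
            if i == enroll_idx or s >= threshold]

# Per the protocol, a speaker would then be retained only if at least
# five valid utterances survive this filter.
```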

3. Dataset Composition and Evaluation Conditions

TidyVoice is partitioned into two primary evaluation conditions:

Condition | #Speakers (Train/Test) | #Languages | #Utterances (Train/Test) | Duration (h, Train/Test)
Tidy-M    | 141,623 / 70,994       | 81         | 5.4 M / 218 K            | 7,800 / 350
Tidy-X    | 3,666 / 808            | 40         | 262 K / 60 K             | 370 / 87

Tidy-M (Monolingual Condition)

  • Comprises 212,617 monolingual speakers over 81 languages (train: 141,623; test: 70,994).
  • Contains 5.6 million utterances (train: 5.4M, test: 218K).
  • Test set consists of 2.8 million pre-defined trials, split equally into target (same speaker, same language) and non-target (different speaker, same language) pairs.

Tidy-X (Cross-lingual Condition)

  • Encompasses 4,474 multilingual speakers (each present in at least two languages) from 40 languages (train: 3,666; test: 808).
  • Contains 321,711 utterances (train: 262K, test: 60K).
  • Test condition includes 12 million trials: 2M target (same speaker, same language), 2M target (same speaker, cross-language), 4M non-target (different speaker, same language), and 4M non-target (different speaker, cross-language) (Farhadipour et al., 22 Jan 2026).

4. Evaluation Protocols and Baseline Results

Trials are pre-defined enrollment/test utterance pairs. Target trials involve the same speaker (with same/cross language as appropriate), while non-target trials pair different speakers (also stratified by language).
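Under this protocol, evaluating a system reduces to producing one score per enrollment/test pair. A minimal scoring sketch is shown below; cosine similarity is assumed as the scoring backend (common for ResNet embedding systems, though the paper's exact backend is not specified here), and the trial-list layout is illustrative:

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between two speaker embeddings."""
    a = enroll_emb / np.linalg.norm(enroll_emb)
    b = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(a, b))

def score_trials(trials, embeddings):
    """Score pre-defined trials.

    trials: iterable of (enroll_id, test_id, label) tuples,
            label 1 for target, 0 for non-target (assumed layout).
    embeddings: dict mapping utterance ID -> embedding vector.
    Returns a list of (score, label) pairs for metric computation.
    """
    return [(cosine_score(embeddings[e], embeddings[t]), label)
            for e, t, label in trials]
```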

Metrics

  • Equal Error Rate (EER): the error rate at the threshold \tau^* for which the false-accept rate (FAR) equals the false-reject rate (FRR):

\mathrm{EER} = \mathrm{FAR}(\tau^*), \qquad \tau^* = \arg\min_{\tau}\,\bigl|\mathrm{FAR}(\tau) - \mathrm{FRR}(\tau)\bigr|

  • Minimum Detection Cost Function (minDCF): assessed at P_{\mathrm{target}} = 0.01, C_{\mathrm{miss}} = 1, C_{\mathrm{fa}} = 1.
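Both metrics can be computed directly from the scored trials by sweeping a threshold over all observed scores. A minimal sketch (standard definitions; the empirical threshold sweep and the averaging of FAR and FRR at the crossing point are conventional implementation choices, not taken from the paper):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Empirical EER: sweep thresholds over all scores and return the
    error rate where |FAR - FRR| is minimal (FAR and FRR averaged)."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def compute_min_dcf(target_scores, nontarget_scores,
                    p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over thresholds, normalized by the cost
    of the best trivial (accept-all / reject-all) system."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = np.inf
    for t in thresholds:
        p_miss = (target_scores < t).mean()
        p_fa = (nontarget_scores >= t).mean()
        dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        best = min(best, dcf)
    return best / min(c_miss * p_target, c_fa * (1 - p_target))
```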

Baseline System Performance

Model      | Pre-training | Fine-tuning | Tidy-M EER (%) | Tidy-X EER (%)
ResNet-34  | VoxBlink2    | —           | 3.34           | 4.10
ResNet-34  | VoxBlink2    | Tidy-M      | 0.67           | 1.90
ResNet-293 | VoxBlink2    | —           | 2.74           | 3.89
ResNet-293 | VoxBlink2    | Tidy-M      | 0.35           | 1.65

Generalization Study

Out-of-domain evaluation on English conversational speech (CANDOR corpus, 93K trials):

  • ResNet-34 (VB2): 2.81% EER → 2.00% EER after Tidy-M fine-tuning.
  • ResNet-293 (VB2): 2.97% EER → 1.60% EER after Tidy-M fine-tuning.

These findings establish that fine-tuning on Tidy-M read-speech data significantly boosts model robustness for both in-domain and spontaneous conversational speech scenarios (Farhadipour et al., 22 Jan 2026).

5. Distribution, Licensing, and Resource Access

TidyVoice provides complete access to:

  • All standardized 16 kHz WAV audio files
  • Speaker metadata and cleaned client IDs
  • Predefined trial lists for Tidy-M and Tidy-X evaluation conditions
  • Pre-trained and fine-tuned ResNet-34 and ResNet-293 model checkpoints

Resources are distributed under a permissive open-source license, facilitating reproducibility and further research. Data and code are available at: https://github.com/areffarhadi/wespeaker/tree/master/examples/tidyvocie (Farhadipour et al., 22 Jan 2026).

6. Research Applications and Limitations

Applications

  • Multilingual speaker-verification system training and evaluation in read-speech contexts
  • Anti-spoofing and text-prompted verification benchmarking
  • Comprehensive investigation of intra- and cross-lingual speaker verification
  • Domain adaptation and fairness analyses across the 81 languages represented

Known Limitations

  • The cleaning process may not eliminate all speaker swaps; residual heterogeneity remains plausible.
  • Imbalances in speaker and utterance counts per language can affect statistical reliability for less represented languages.
  • Read-speech style may diverge acoustically and prosodically from spontaneous or telephony speech, potentially limiting real-world generalization.
  • Demographic and recording-condition biases are inherited from Common Voice contributors (Farhadipour et al., 22 Jan 2026).

7. Significance and Outlook

TidyVoice represents a critical advancement for the multilingual ASV research community by addressing the scarcity of curated, large-scale, read-speech datasets with clean speaker labeling. Its design facilitates rigorous benchmarking, supports the development of robust and equitable ASV technologies, and fosters future research on cross-lingual speaker identity analysis and domain transfer. A plausible implication is that, by enabling fair comparisons across a broad range of languages and evaluation conditions, TidyVoice will catalyze innovation in both model architecture and evaluation methodology for speaker recognition (Farhadipour et al., 22 Jan 2026).

