
TidyVoice Dataset: Multilingual Speaker Verification

Updated 27 January 2026
  • TidyVoice is a large-scale multilingual read-speech dataset that aggregates over 212,000 speakers with rigorous speaker identity cleaning to support ASV research.
  • Baseline systems employ deep ResNet architectures on standardized 16 kHz WAV audio to extract robust speaker embeddings across 81 languages.
  • The dataset provides extensive evaluation conditions, including Tidy-M and Tidy-X benchmarks, to facilitate fair comparisons in both monolingual and cross-lingual speaker verification.

TidyVoice is a large-scale, multilingual, read-speech dataset curated for automatic speaker verification (ASV) research, derived from the Mozilla Common Voice (MCV) corpus. Through systematic speaker identity cleaning and the design of monolingual and cross-lingual evaluation conditions, TidyVoice addresses the lack of publicly available resources for robust and fair multilingual ASV. The dataset comprises over 212,000 monolingual and approximately 4,500 multilingual speakers across 81 and 40 languages respectively, offering both extensive training material and diverse evaluation trials. Baseline systems leveraging deep ResNet architectures highlight the improved performance and generalization enabled by TidyVoice, setting new standards for multilingual read-speech verification (Farhadipour et al., 22 Jan 2026).

1. Motivation and Context

Prevailing benchmarks for ASV, such as VoxCeleb and NIST SRE, predominantly feature in-the-wild spontaneous speech, often in English, and may be restricted by access or licensing costs. Real-world applications—particularly anti-spoofing and text-prompted authentication—demand large-scale, multilingual corpora of read speech with verifiable speaker identity. The TidyVoice dataset was created to fulfill this requirement, offering a publicly available, multilingual, and large-scale read-speech resource with high-integrity speaker labeling, suitable for training and evaluating ASV systems in multiple languages (Farhadipour et al., 22 Jan 2026).

2. Data Curation and Speaker Identity Cleaning

TidyVoice is sourced from the Mozilla Common Voice corpus, a crowd-sourced project where recordings are annotated with "client IDs" as proxies for speaker identity. Prior analyses demonstrated significant speaker heterogeneity: client IDs may aggregate recordings from multiple, distinct speakers, undermining downstream ASV reliability.

To mitigate this, TidyVoice employs a verification-based cleaning protocol:

  • A ResNet-293 model, pre-trained on VoxBlink2 and VoxCeleb2, is used to extract speaker embeddings.
  • Within each monolingual client ID, cosine similarities between a designated enrollment utterance and all others are computed. Recordings with similarity scores below 0.4 are excluded, as they are likely from different speakers.
  • For multilingual client IDs (spanning multiple languages), cross-language cosine similarities are computed; any client ID exhibiting substantial dissimilarity (many scores < 0.2) is removed entirely (433 out of 4,907 IDs, approximately 9%).
  • Only speakers with at least five valid utterances are retained. All audio is trimmed, normalized, and standardized to 16 kHz WAV format (Farhadipour et al., 22 Jan 2026).
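The per-client-ID similarity filter described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name `clean_client_id` and the choice of the first utterance as enrollment are assumptions, while the 0.4 cosine threshold and the five-utterance minimum come from the paper:

```python
import numpy as np

def clean_client_id(embeddings, enroll_idx=0, threshold=0.4):
    """Keep only recordings whose cosine similarity to the designated
    enrollment utterance meets the threshold (0.4 for monolingual IDs
    in TidyVoice). Returns the indices of retained recordings."""
    embs = np.asarray(embeddings, dtype=float)
    # L2-normalize so that the dot product equals cosine similarity.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs[enroll_idx]
    return [i for i, s in enumerate(sims)
            if i == enroll_idx or s >= threshold]

# Per the protocol, a speaker would then be retained only if at least
# five valid utterances survive this filter.
```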

3. Dataset Composition and Evaluation Conditions

TidyVoice is partitioned into two primary evaluation conditions:

Condition | #Speakers (Train/Test) | #Languages | #Utterances (Train/Test) | Duration (h, Train/Test)
Tidy-M    | 141,623 / 70,994       | 81         | 5.4 M / 218 K            | 7,800 / 350
Tidy-X    | 3,666 / 808            | 40         | 262 K / 60 K             | 370 / 87

Tidy-M (Monolingual Condition)

  • Comprises 212,617 monolingual speakers over 81 languages (train: 141,623; test: 70,994).
  • Contains 5.6 million utterances (train: 5.4M, test: 218K).
  • Test set consists of 2.8 million pre-defined trials, split equally into target (same speaker, same language) and non-target (different speaker, same language) pairs.

Tidy-X (Cross-lingual Condition)

  • Encompasses 4,474 multilingual speakers (each present in at least two languages) from 40 languages (train: 3,666; test: 808).
  • Contains 321,711 utterances (train: 262K, test: 60K).
  • Test condition includes 12 million trials: 2M target (same speaker, same language), 2M target (same speaker, cross-language), 4M non-target (different speaker, same language), and 4M non-target (different speaker, cross-language) (Farhadipour et al., 22 Jan 2026).

4. Evaluation Protocols and Baseline Results

Trials are pre-defined enrollment/test utterance pairs. Target trials involve the same speaker (with same/cross language as appropriate), while non-target trials pair different speakers (also stratified by language).
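Under this protocol, evaluating a system reduces to producing one score per enrollment/test pair. A minimal scoring sketch is shown below; cosine similarity is assumed as the scoring backend (common for ResNet embedding systems, though the paper's exact backend is not specified here), and the trial-list layout is illustrative:

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between two speaker embeddings."""
    a = enroll_emb / np.linalg.norm(enroll_emb)
    b = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(a, b))

def score_trials(trials, embeddings):
    """Score pre-defined trials.

    trials: iterable of (enroll_id, test_id, label) tuples,
            label 1 for target, 0 for non-target (assumed layout).
    embeddings: dict mapping utterance ID -> embedding vector.
    Returns a list of (score, label) pairs for metric computation.
    """
    return [(cosine_score(embeddings[e], embeddings[t]), label)
            for e, t, label in trials]
```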

Metrics

  • Equal Error Rate (EER): the error rate at the threshold \tau^* for which the false-accept rate (FAR) equals the false-reject rate (FRR):

\mathrm{EER} = \mathrm{FAR}(\tau^*), \qquad \tau^* = \arg\min_{\tau}\,\bigl|\mathrm{FAR}(\tau) - \mathrm{FRR}(\tau)\bigr|

  • Minimum Detection Cost Function (minDCF): assessed at P_{\mathrm{target}} = 0.01, C_{\mathrm{miss}} = 1, C_{\mathrm{fa}} = 1.
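Both metrics can be computed directly from the scored trials by sweeping a threshold over all observed scores. A minimal sketch (standard definitions; the empirical threshold sweep and the averaging of FAR and FRR at the crossing point are conventional implementation choices, not taken from the paper):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Empirical EER: sweep thresholds over all scores and return the
    error rate where |FAR - FRR| is minimal (FAR and FRR averaged)."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def compute_min_dcf(target_scores, nontarget_scores,
                    p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over thresholds, normalized by the cost
    of the best trivial (accept-all / reject-all) system."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = np.inf
    for t in thresholds:
        p_miss = (target_scores < t).mean()
        p_fa = (nontarget_scores >= t).mean()
        dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        best = min(best, dcf)
    return best / min(c_miss * p_target, c_fa * (1 - p_target))
```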

Baseline System Performance

Model      | Pre-training | Fine-tuning | Tidy-M EER (%) | Tidy-X EER (%)
ResNet-34  | VoxBlink2    | —           | 3.34           | 4.10
ResNet-34  | VoxBlink2    | Tidy-M      | 0.67           | 1.90
ResNet-293 | VoxBlink2    | —           | 2.74           | 3.89
ResNet-293 | VoxBlink2    | Tidy-M      | 0.35           | 1.65

Generalization Study

Out-of-domain evaluation on English conversational speech (CANDOR corpus, 93K trials):

  • ResNet-34 (VB2): 2.81% EER → 2.00% EER after Tidy-M fine-tuning.
  • ResNet-293 (VB2): 2.97% EER → 1.60% EER after Tidy-M fine-tuning.

These findings establish that fine-tuning on Tidy-M read-speech data significantly boosts model robustness for both in-domain and spontaneous conversational speech scenarios (Farhadipour et al., 22 Jan 2026).

5. Distribution, Licensing, and Resource Access

TidyVoice provides complete access to:

  • All standardized 16 kHz WAV audio files
  • Speaker metadata and cleaned client IDs
  • Predefined trial lists for Tidy-M and Tidy-X evaluation conditions
  • Pre-trained and fine-tuned ResNet-34 and ResNet-293 model checkpoints

Resources are distributed under a permissive open-source license, facilitating reproducibility and further research. Data and code are available at: https://github.com/areffarhadi/wespeaker/tree/master/examples/tidyvocie (Farhadipour et al., 22 Jan 2026).

6. Research Applications and Limitations

Applications

  • Multilingual speaker-verification system training and evaluation in read-speech contexts
  • Anti-spoofing and text-prompted verification benchmarking
  • Comprehensive investigation of intra- and cross-lingual speaker verification
  • Domain adaptation and fairness analyses across the 81 languages represented

Known Limitations

  • The cleaning process may not eliminate all speaker swaps; residual heterogeneity remains plausible.
  • Imbalances in speaker and utterance counts per language can affect statistical reliability for less represented languages.
  • Read-speech style may diverge acoustically and prosodically from spontaneous or telephony speech, potentially limiting real-world generalization.
  • Demographic and recording-condition biases are inherited from Common Voice contributors (Farhadipour et al., 22 Jan 2026).

7. Significance and Outlook

TidyVoice represents a critical advancement for the multilingual ASV research community by addressing the scarcity of curated, large-scale, read-speech datasets with clean speaker labeling. Its design facilitates rigorous benchmarking, supports the development of robust and equitable ASV technologies, and fosters future research on cross-lingual speaker identity analysis and domain transfer. A plausible implication is that, by enabling fair comparisons across a broad range of languages and evaluation conditions, TidyVoice will catalyze innovation in both model architecture and evaluation methodology for speaker recognition (Farhadipour et al., 22 Jan 2026).

