
DiffSSD Dataset for Synthetic Speech Forensics

Updated 13 January 2026
  • DiffSSD is a benchmark corpus featuring nearly 200 hours of speech, balancing bona fide and advanced synthetic samples from diffusion-based TTS and VC systems.
  • It aggregates outputs from 10 diffusion-based synthesizers, including eight open-source and two commercial generators, ensuring diverse, high-quality synthetic audio.
  • Experiments with the dataset show that retraining detectors on DiffSSD significantly reduces Equal Error Rates and improves detection of unseen synthesis methods, enabling robust forensic evaluation.

DiffSSD (Diffusion-Based Synthetic Speech Dataset) is a benchmark corpus designed to facilitate the forensic analysis and detection of high-fidelity synthetic speech generated by state-of-the-art diffusion-based text-to-speech (TTS) and voice conversion (VC) technologies. Released in 2024, DiffSSD addresses a critical limitation of existing datasets, which predominantly feature synthetic speech produced by “conventional” models (e.g., RNNs, HMMs, GANs) and thereby fail to capture the subtle artifacts and exceptional audio quality characteristic of recent diffusion-based and commercial generators. DiffSSD comprises nearly 200 hours of English-language speech, balancing bona fide utterances and outputs from eight open-source and two commercial diffusion models, with rigorous train/validation/test splits for both closed-set and open-set research scenarios (Bhagtani et al., 2024).

1. Corpus Composition and Properties

DiffSSD contains a total of 94,226 utterances summing to 196.04 hours, with synthetic examples constituting approximately 74.3% of the dataset and bona fide (real) speech the remaining 25.7%. Real speech is drawn from 74 unique speakers sampled from LibriSpeech and LJ-Speech, encompassing both U.S. and U.K. accents. Synthetic utterances are generated using 11 distinct speaker profiles: five zero-shot voices from LibriSpeech, a single LJ-Speech speaker shared by all pre-trained methods, plus additional commercial profiles.

Data splits are structured as follows:

| Split      | Real Utterances | Synthetic Utterances | Total Files |
|------------|-----------------|----------------------|-------------|
| Training   | 9,690           | 22,000               | 31,690      |
| Validation | 2,423           | 5,500                | 7,423       |
| Testing    | 12,113          | 42,500               | 54,613      |

The average utterance length is 7.49 seconds (range: ≈1–15 sec), with a roughly normal distribution centered near 7.5 sec. The synthetic subset maintains speaker gender balance for zero-shot profiles (5 female, 5 male).
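
The headline percentages can be re-derived from the split counts above; the following is a minimal sanity-check script (the numbers come from the table, only the arithmetic is new):

```python
# Re-derive the dataset-level totals and synthetic share from the
# per-split counts quoted in the table above.
splits = {
    "training":   {"real": 9_690,  "synthetic": 22_000},
    "validation": {"real": 2_423,  "synthetic": 5_500},
    "testing":    {"real": 12_113, "synthetic": 42_500},
}

real_total = sum(s["real"] for s in splits.values())        # 24,226
synth_total = sum(s["synthetic"] for s in splits.values())  # 70,000
grand_total = real_total + synth_total                      # 94,226

synth_share = synth_total / grand_total  # ~0.743, i.e. the ~74.3% quoted
print(f"{grand_total} utterances, {synth_share:.1%} synthetic")
```

The derived totals (94,226 utterances, 74.3% synthetic) match the figures stated at the start of this section.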

2. Included Diffusion-Based Speech Generators

DiffSSD aggregates synthetic speech from ten diffusion-based systems, comprising eight open-source and two commercial generators. Each generator’s contribution is listed below by license type, mode (zero-shot, ZS, or pre-trained, PT), sampling rate, utterance count, and split allocation:

| Generator   | License/Mode    | Sampling Rate | Utterances | Split Contribution               |
|-------------|-----------------|---------------|------------|----------------------------------|
| ElevenLabs  | Commercial, ZS  | 44.1 kHz      | 5,000      | 2,000 train; 500 val; 2,500 test |
| GradTTS     | Open-source, PT | 22.05 kHz     | 5,000      | All splits                       |
| OpenVoice2  | Open-source, ZS | 22.05 kHz     | 25,000     | All splits                       |
| ProDiff     | Open-source, PT | 22.05 kHz     | 5,000      | All splits                       |
| WaveGrad2   | Open-source, PT | 22.05 kHz     | 5,000      | All splits                       |
| XttsV2      | Open-source, ZS | 24 kHz        | 5,000      | All splits                       |
| YourTTS     | Open-source, ZS | 16 kHz        | 5,000      | All splits                       |
| DiffGAN-TTS | Open-source, PT | 22.05 kHz     | 5,000      | Testing only                     |
| PlayHT      | Commercial, ZS  | 24 kHz        | 5,000      | Testing only                     |
| UnitSpeech  | Open-source, ZS | 22.05 kHz     | 5,000      | Testing only                     |

OpenVoice2 is the largest contributor (≈25,000 utterances), followed by the set of LJ-Speech-based pre-trained methods.
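
The per-generator counts are internally consistent with the split totals in Section 1; a short cross-check (counts taken from the table, only the summation is new):

```python
# Per-generator utterance counts from the table above: nine generators
# contribute 5,000 utterances each and OpenVoice2 contributes 25,000,
# which should sum to the 70,000 synthetic utterances across all splits.
contributions = {
    "ElevenLabs": 5_000, "GradTTS": 5_000, "OpenVoice2": 25_000,
    "ProDiff": 5_000, "WaveGrad2": 5_000, "XttsV2": 5_000,
    "YourTTS": 5_000, "DiffGAN-TTS": 5_000, "PlayHT": 5_000,
    "UnitSpeech": 5_000,
}
assert len(contributions) == 10
assert sum(contributions.values()) == 70_000  # matches Section 1's splits
```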

3. Dataset Statistics and Distributions

DiffSSD features 145.2 hours of synthetic speech and 50.8 hours of real speech. The utterance-length distribution is approximately normal, centered at 7.5 seconds, with most utterances between 2 and 15 seconds. Within the synthetic subset, generator contributions differ markedly, reflecting coverage of both the dominant open-source and emerging commercial platforms. Statistical analysis confirms balanced gender representation among synthetic zero-shot profiles.

4. Detector Evaluation Protocols and Metrics

Synthetic speech detector evaluation in DiffSSD utilizes Equal Error Rate (EER) as the principal metric:

  • EER = FAR(τ*) = FRR(τ*), where τ* is the decision threshold at which the false acceptance and false rejection rates coincide.
  • FAR(τ) = FA(τ) / [FA(τ) + TR(τ)] and FRR(τ) = FR(τ) / [FR(τ) + TC(τ)]
  • FA: false acceptances (synthetic classified as real); FR: false rejections (real classified as synthetic); TR: true rejections (synthetic correctly rejected); TC: true acceptances (real correctly accepted).

Lower EER values signal better separation between real and synthetic speech. The accompanying paper also mentions the tandem detection cost function (t-DCF) but does not provide its formula.
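
A minimal sketch of an EER computation over detector scores; the function name and the "higher score means more likely real" polarity are illustrative assumptions, not details from the paper:

```python
def compute_eer(real_scores, synthetic_scores):
    """Estimate the Equal Error Rate for a detector whose higher
    scores mean 'more likely real' (assumed polarity).

    FAR(t): share of synthetic utterances accepted as real (score >= t).
    FRR(t): share of real utterances rejected as synthetic (score < t).
    The EER is the common error rate where the two curves cross.
    """
    thresholds = sorted(set(real_scores) | set(synthetic_scores))
    eer, min_gap = 1.0, float("inf")
    for t in thresholds:
        far = sum(s >= t for s in synthetic_scores) / len(synthetic_scores)
        frr = sum(s < t for s in real_scores) / len(real_scores)
        if abs(far - frr) < min_gap:
            min_gap = abs(far - frr)
            eer = (far + frr) / 2  # midpoint at the closest crossing
    return eer

# Perfectly separated scores yield an EER of 0.
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]))  # -> 0.0
```

In practice, finite score sets rarely cross exactly, so the midpoint of FAR and FRR at the closest crossing is a common convention.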

5. Cross-Dataset Generalization and Closed-Set vs. Open-Set Experiments

Evaluation is structured along two principal regimes:

a) Generalization Test (ASVspoof2019-trained detectors, tested on DiffSSD):

  • Detectors: LFCC-GMM (handcrafted), MFCC-ResNet, Spec-ResNet, PaSST (patch-based transformer), Wav2Vec2 (self-supervised transformer).
  • Observed performance deterioration: EER rises from roughly 0–10% on ASVspoof2019 to 22–55% on DiffSSD. For example, LFCC-GMM increases from 3.7% to 36.7% EER and Wav2Vec2 from 0.3% to 48.5%, indicating poor generalization to high-fidelity diffusion-based synthetic speech.

b) In-Domain Test (trained and evaluated on DiffSSD splits):

  • Retraining detectors on DiffSSD restores accuracy: PaSST achieves 0.08% EER on validation, 3.5% on test; Wav2Vec2 reaches 1.5% (val), 3.0% (test).
  • Closed-set (seen generators): 99–100% per-generator accuracy at the EER threshold with PaSST.
  • Open-set (unseen generators, including ElevenLabs and PlayHT): >95% accuracy at the EER threshold.

Handcrafted features (LFCC-GMM) consistently lag behind deep-feature architectures (ResNet, transformers), especially for high-fidelity synthetic speech.
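
The closed- and open-set accuracies above are measured at the EER operating point; a minimal sketch of applying that fixed threshold to detector scores (function name, label convention, and score polarity are illustrative assumptions):

```python
def accuracy_at_threshold(scores, labels, tau):
    """Fraction of utterances classified correctly when score >= tau
    predicts 'real' (label 1) and score < tau predicts 'synthetic'
    (label 0). Per-generator accuracy is obtained by restricting the
    inputs to one generator's utterances plus the real speech, while
    keeping the globally chosen EER threshold tau fixed."""
    preds = [1 if s >= tau else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Scores cleanly separated around tau=0.5 give 100% accuracy.
print(accuracy_at_threshold([0.9, 0.2, 0.8, 0.4], [1, 0, 1, 0], 0.5))
```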

6. Significance and Recommendations for Future Research

DiffSSD demonstrates that detectors trained solely on conventional synthetic speech are ineffective for diffusion-based outputs. Inclusion of recent open-source and commercial diffusion syntheses in training sets is imperative for robust model performance. The dataset’s structure enables both closed-set and open-set evaluation, simulating real-world forensic scenarios. Recommendations include:

  • Routine incorporation of contemporary diffusion-based synthetic speech during training and validation to mitigate overfitting to legacy artifacts.
  • Multilingual and cross-dialect expansion beyond the current English-only coverage.
  • Investigation of advanced front-ends (learned spectral priors, phase-aware features) to improve detection of commercial-grade synthetic speech.
  • Use and public sharing of reproducible splits, text prompts, and generation logs (available via HuggingFace).
  • Development of ensemble and confidence-calibrated detectors to address ongoing advances in speech synthesis.
  • Ongoing adjustment of datasets in response to anti-detection mechanisms, such as smoothing or watermarking, adopted by commercial TTS platforms.

DiffSSD establishes a critical benchmark for synthetic speech forensics and future-proofs research against the rapidly evolving landscape of high-fidelity speech generation (Bhagtani et al., 2024).
