DiffSSD Dataset for Synthetic Speech Forensics
- DiffSSD is a benchmark corpus featuring nearly 200 hours of speech, balancing bona fide utterances with synthetic samples from advanced diffusion-based TTS and VC systems.
- It aggregates outputs from 10 diffusion-based synthesizers, eight open-source and two commercial, ensuring diverse, high-quality synthetic audio.
- The dataset enables robust forensic evaluation by demonstrating that retraining detectors on DiffSSD significantly reduces Equal Error Rates and improves detection of unseen synthesis methods.
DiffSSD (Diffusion-Based Synthetic Speech Dataset) is a benchmark corpus designed to facilitate the forensic analysis and detection of high-fidelity synthetic speech generated by state-of-the-art diffusion-based text-to-speech (TTS) and voice conversion (VC) technologies. Released in 2024, DiffSSD addresses a critical limitation of existing datasets, which predominantly feature synthetic speech produced by “conventional” models (e.g., RNNs, HMMs, GANs) and thereby fail to capture the subtle artifacts and exceptional audio quality characteristic of recent diffusion-based and commercial generators. DiffSSD comprises nearly 200 hours of English-language speech, balancing bona fide utterances and outputs from eight open-source and two commercial diffusion models, with rigorous train/validation/test splits for both closed-set and open-set research scenarios (Bhagtani et al., 2024).
1. Corpus Composition and Properties
DiffSSD contains a total of 94,226 utterances summing to 196.04 hours, with synthetic examples constituting approximately 74.3% of the dataset and bona fide (real) speech occupying about 25.7%. Real speech is drawn from 74 unique speakers sampled from LibriSpeech and LJ-Speech, encompassing both U.S. and U.K. accents. Synthetic utterances are generated using 11 distinct speaker profiles: five zero-shot voices from LibriSpeech, a single LJ-Speech speaker shared by all pre-trained methods, plus additional commercial profiles.
Data splits are structured as follows:
| Split | Real Utterances | Synthetic Utterances | Total Files |
|---|---|---|---|
| Training | 9,690 | 22,000 | 31,690 |
| Validation | 2,423 | 5,500 | 7,923 |
| Testing | 12,113 | 42,500 | 54,613 |
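The split sizes in the table can be sanity-checked against the totals quoted above (94,226 utterances, ≈74.3% synthetic) with a few lines of arithmetic; the counts below are copied from the table, and the variable names are my own:

```python
# Split counts from the DiffSSD split table (real vs. synthetic utterances).
SPLITS = {
    "training":   {"real": 9_690,  "synthetic": 22_000},
    "validation": {"real": 2_423,  "synthetic": 5_500},
    "testing":    {"real": 12_113, "synthetic": 42_500},
}

def split_total(split: dict) -> int:
    """Total number of files in one split."""
    return split["real"] + split["synthetic"]

total_files = sum(split_total(s) for s in SPLITS.values())
total_synth = sum(s["synthetic"] for s in SPLITS.values())
synthetic_ratio = total_synth / total_files

print(total_files)               # 94226
print(round(synthetic_ratio, 3)) # 0.743
```

The derived synthetic fraction (≈74.3%) and grand total (94,226) match the corpus-level statistics stated in the text.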
The average utterance length is 7.49 seconds (range: ≈1–15 sec), with a roughly normal distribution centered near 7.5 sec. The synthetic subset maintains speaker gender balance for zero-shot profiles (5 female, 5 male).
2. Included Diffusion-Based Speech Generators
DiffSSD aggregates synthetic speech from ten diffusion-based systems, comprising eight open-source and two commercial generators. Each generator’s contribution is detailed by license type, mode (“zero-shot” or “pre-trained”), and sampling rate:
| Generator | License/Mode | Sampling Rate | Utterances | Split Contribution |
|---|---|---|---|---|
| ElevenLabs | Commercial, ZS | 44.1 kHz | 5,000 | 2,000 train; 500 val; 2,500 test |
| GradTTS | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| OpenVoice2 | Open-source, ZS | 22.05 kHz | 25,000 | All splits |
| ProDiff | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| WaveGrad2 | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| XttsV2 | Open-source, ZS | 24 kHz | 5,000 | All splits |
| YourTTS | Open-source, ZS | 16 kHz | 5,000 | All splits |
| DiffGAN-TTS | Open-source, PT | 22.05 kHz | 5,000 | Testing only |
| PlayHT | Commercial, ZS | 24 kHz | 5,000 | Testing only |
| UnitSpeech | Open-source, ZS | 22.05 kHz | 5,000 | Testing only |
OpenVoice2 is the largest contributor (≈25,000 utterances), followed by the set of LJ-Speech-based pre-trained methods.
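For programmatic filtering, the generator roster above can be kept as a small lookup table; the sketch below uses only the license, mode, sampling-rate, and split facts from the table (the field names and dictionary layout are my own, not part of the dataset release):

```python
# Generator metadata transcribed from the DiffSSD generator table.
# "test_only" marks generators held out of training/validation (open-set).
GENERATORS = {
    "ElevenLabs":  {"license": "commercial",  "mode": "zero-shot",   "sr_hz": 44_100, "test_only": False},
    "GradTTS":     {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "OpenVoice2":  {"license": "open-source", "mode": "zero-shot",   "sr_hz": 22_050, "test_only": False},
    "ProDiff":     {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "WaveGrad2":   {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "XttsV2":      {"license": "open-source", "mode": "zero-shot",   "sr_hz": 24_000, "test_only": False},
    "YourTTS":     {"license": "open-source", "mode": "zero-shot",   "sr_hz": 16_000, "test_only": False},
    "DiffGAN-TTS": {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": True},
    "PlayHT":      {"license": "commercial",  "mode": "zero-shot",   "sr_hz": 24_000, "test_only": True},
    "UnitSpeech":  {"license": "open-source", "mode": "zero-shot",   "sr_hz": 22_050, "test_only": True},
}

open_set = sorted(n for n, g in GENERATORS.items() if g["test_only"])
commercial = sorted(n for n, g in GENERATORS.items() if g["license"] == "commercial")
print(open_set)    # ['DiffGAN-TTS', 'PlayHT', 'UnitSpeech']
print(commercial)  # ['ElevenLabs', 'PlayHT']
```

A structure like this makes the closed-set/open-set partition explicit: the three `test_only` generators are exactly those listed as "Testing only" in the table.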
3. Dataset Statistics and Distributions
DiffSSD features 145.2 hours of synthetic speech and 50.8 hours of real speech. The utterance-length distribution is approximately normal, centered at 7.5 seconds, with most utterances between 2 and 15 seconds. Within the synthetic subset, generator contributions differ markedly, reflecting coverage of both the dominant open-source and emerging commercial platforms. Statistical analysis confirms balanced gender representation among synthetic zero-shot profiles.
4. Detector Evaluation Protocols and Metrics
Synthetic speech detector evaluation in DiffSSD utilizes Equal Error Rate (EER) as the principal metric. The EER is the error rate at the decision threshold where the false acceptance rate (FAR) equals the false rejection rate (FRR):
- $\mathrm{FAR} = \dfrac{FA}{FA + TR}$ and $\mathrm{FRR} = \dfrac{FR}{FR + TA}$
- $FA$: false acceptances (synthetic classified as real); $FR$: false rejections (real classified as synthetic); $TR$: true rejections (synthetic correctly classified as synthetic); $TA$: true acceptances (real correctly classified as real).
Lower EER values signal superior separation between real and synthetic speech. The accompanying paper also references the tandem detection cost function (t-DCF), though without providing formula details.
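The EER definition above can be computed directly from detector scores by sweeping a decision threshold until FAR and FRR cross. A minimal sketch, assuming higher scores mean "more synthetic-like" (the function name and toy score distributions are illustrative, not from the dataset):

```python
import numpy as np

def compute_eer(real_scores: np.ndarray, synth_scores: np.ndarray) -> float:
    """Equal Error Rate: operating point where the false-acceptance rate
    (synthetic scored as real) equals the false-rejection rate
    (real scored as synthetic). Assumes higher score = more synthetic-like."""
    thresholds = np.sort(np.unique(np.concatenate([real_scores, synth_scores])))
    # FAR: fraction of synthetic utterances below threshold (accepted as real).
    far = np.array([(synth_scores < t).mean() for t in thresholds])
    # FRR: fraction of real utterances at/above threshold (rejected as synthetic).
    frr = np.array([(real_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))       # closest crossing point
    return float((far[idx] + frr[idx]) / 2)  # average the two rates at the crossing

# Toy example: well-separated Gaussian score distributions.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1_000)   # scores for real speech
synth = rng.normal(2.0, 1.0, 1_000)  # scores for synthetic speech
print(f"EER = {compute_eer(real, synth):.3f}")
```

With unit-variance Gaussians two standard deviations apart, the EER lands near the theoretical value of about 16%; a perfect detector would reach 0%.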
5. Cross-Dataset Generalization and Closed-Set vs. Open-Set Experiments
Evaluation is structured along two principal regimes:
a) Generalization Test (ASVspoof2019-trained detectors, tested on DiffSSD):
- Detectors: LFCC-GMM (handcrafted), MFCC-ResNet, Spec-ResNet, PaSST (patch-based transformer), Wav2Vec2 (self-supervised transformer).
- Observed performance deterioration: EER rises from roughly 0–10% on ASVspoof2019 to 22–55% on DiffSSD. For example, LFCC-GMM's EER increases from 3.7% to 36.7%, and Wav2Vec2's from 0.3% to 48.5%, indicating poor generalization to high-fidelity diffusion-based synthetic speech.
b) In-Domain Test (trained and evaluated on DiffSSD splits):
- Retraining detectors on DiffSSD restores accuracy: PaSST achieves 0.08% EER on validation, 3.5% on test; Wav2Vec2 reaches 1.5% (val), 3.0% (test).
- Closed-set (seen generators): 99–100% per-generator accuracy at EER threshold with PaSST.
- Open-set (generators held out to testing only: DiffGAN-TTS, PlayHT, UnitSpeech): >95% accuracy at EER threshold.
Handcrafted features (LFCC-GMM) consistently lag behind deep-feature architectures (ResNet, transformers), especially for high-fidelity synthetic speech.
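The per-generator accuracies quoted for the closed-set and open-set regimes follow a common recipe: fix the detector's decision threshold at the EER operating point (computed on real speech vs. all synthetic speech), then score each generator's utterances separately against that threshold. A self-contained sketch with illustrative score distributions (the generator scores below are simulated, not actual DiffSSD results):

```python
import numpy as np

def eer_threshold(real_scores: np.ndarray, synth_scores: np.ndarray) -> float:
    """Threshold where FAR (synthetic accepted as real) ~= FRR (real rejected)."""
    thresholds = np.sort(np.unique(np.concatenate([real_scores, synth_scores])))
    far = np.array([(synth_scores < t).mean() for t in thresholds])
    frr = np.array([(real_scores >= t).mean() for t in thresholds])
    return float(thresholds[np.argmin(np.abs(far - frr))])

def per_generator_accuracy(scores_by_gen: dict, threshold: float) -> dict:
    """Fraction of each generator's utterances correctly flagged as synthetic
    (score >= threshold), evaluated one generator at a time."""
    return {g: float((s >= threshold).mean()) for g, s in scores_by_gen.items()}

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, 2_000)  # real-speech scores
# Simulated detector scores: a "seen" generator scores higher (easier)
# than an "unseen" open-set one.
scores_by_gen = {
    "GradTTS": rng.normal(4.0, 1.0, 500),  # closed-set (seen in training)
    "PlayHT":  rng.normal(3.5, 1.0, 500),  # open-set (testing only)
}
t = eer_threshold(real, np.concatenate(list(scores_by_gen.values())))
print(per_generator_accuracy(scores_by_gen, t))
```

Reporting accuracy at a single shared threshold, rather than per-generator thresholds, mirrors the forensic setting: the analyst must commit to one operating point before knowing which synthesizer produced a given utterance.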
6. Significance and Recommendations for Future Research
DiffSSD demonstrates that detectors trained solely on conventional synthetic speech are ineffective for diffusion-based outputs. Inclusion of recent open-source and commercial diffusion syntheses in training sets is imperative for robust model performance. The dataset’s structure enables both closed-set and open-set evaluation, simulating real-world forensic scenarios. Recommendations include:
- Routine incorporation of contemporary diffusion-based synthetic speech during training and validation to mitigate overfitting to legacy artifacts.
- Multilingual and cross-dialect expansion beyond the current English-only coverage.
- Investigation of advanced front-ends (learned spectral priors, phase-aware features) to improve detection of commercial-grade synthetic speech.
- Use and public sharing of reproducible splits, text prompts, and generation logs (available via HuggingFace).
- Development of ensemble and confidence-calibrated detectors to address ongoing advances in speech synthesis.
- Ongoing adjustment of datasets in response to anti-detection mechanisms, such as smoothing or watermarking, adopted by commercial TTS platforms.
DiffSSD establishes a critical benchmark for synthetic speech forensics and future-proofs research against the rapidly evolving landscape of high-fidelity speech generation (Bhagtani et al., 2024).