DiffSSD Dataset for Synthetic Speech Forensics
- DiffSSD is a benchmark corpus featuring nearly 200 hours of speech, balancing bona fide utterances with synthetic samples from advanced diffusion-based TTS and VC systems.
- It aggregates outputs from 10 diffusion-based synthesizers, eight open-source and two commercial, ensuring diverse, high-quality synthetic audio.
- The dataset enables robust forensic evaluation by demonstrating that retraining detectors on DiffSSD significantly reduces Equal Error Rates and improves detection of unseen synthesis methods.
DiffSSD (Diffusion-Based Synthetic Speech Dataset) is a benchmark corpus designed to facilitate the forensic analysis and detection of high-fidelity synthetic speech generated by state-of-the-art diffusion-based text-to-speech (TTS) and voice conversion (VC) technologies. Released in 2024, DiffSSD addresses a critical limitation of existing datasets, which predominantly feature synthetic speech produced by “conventional” models (e.g., RNNs, HMMs, GANs) and thereby fail to capture the subtle artifacts and exceptional audio quality characteristic of recent diffusion-based and commercial generators. DiffSSD comprises nearly 200 hours of English-language speech, balancing bona fide utterances and outputs from eight open-source and two commercial diffusion models, with rigorous train/validation/test splits for both closed-set and open-set research scenarios (Bhagtani et al., 2024).
1. Corpus Composition and Properties
DiffSSD contains a total of 94,226 utterances summing to 196.04 hours, with synthetic examples constituting approximately 74.3% of the dataset and bona fide (real) speech occupying about 25.7%. Real speech is drawn from 74 unique speakers sampled from LibriSpeech and LJ-Speech, encompassing both U.S. and U.K. accents. Synthetic utterances are generated using 11 distinct speaker profiles: five zero-shot voices from LibriSpeech, a single LJ-Speech speaker shared by all pre-trained methods, plus additional commercial profiles.
Data splits are structured as follows:
| Split | Real Utterances | Synthetic Utterances | Total Files |
|---|---|---|---|
| Training | 9,690 | 22,000 | 31,690 |
| Validation | 2,423 | 5,500 | 7,923 |
| Testing | 12,113 | 42,500 | 54,613 |
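The split sizes in the table can be sanity-checked against the totals quoted above (94,226 utterances, ≈74.3% synthetic) with a few lines of arithmetic; the counts below are copied from the table, and the variable names are my own:

```python
# Split counts from the DiffSSD split table (real vs. synthetic utterances).
SPLITS = {
    "training":   {"real": 9_690,  "synthetic": 22_000},
    "validation": {"real": 2_423,  "synthetic": 5_500},
    "testing":    {"real": 12_113, "synthetic": 42_500},
}

def split_total(split: dict) -> int:
    """Total number of files in one split."""
    return split["real"] + split["synthetic"]

total_files = sum(split_total(s) for s in SPLITS.values())
total_synth = sum(s["synthetic"] for s in SPLITS.values())
synthetic_ratio = total_synth / total_files

print(total_files)               # 94226
print(round(synthetic_ratio, 3)) # 0.743
```

The derived synthetic fraction (≈74.3%) and grand total (94,226) match the corpus-level statistics stated in the text.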
The average utterance length is 7.49 seconds (range: ≈1–15 sec), with a roughly normal distribution centered near 7.5 sec. The synthetic subset maintains speaker gender balance for zero-shot profiles (5 female, 5 male).
2. Included Diffusion-Based Speech Generators
DiffSSD aggregates synthetic speech from ten diffusion-based systems, comprising eight open-source and two commercial generators. Each generator’s contribution is detailed by license type, mode (“zero-shot” or “pre-trained”), and sampling rate:
| Generator | License/Mode | Sampling Rate | Utterances | Split Contribution |
|---|---|---|---|---|
| ElevenLabs | Commercial, ZS | 44.1 kHz | 5,000 | 2,000 train; 500 val; 2,500 test |
| GradTTS | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| OpenVoice2 | Open-source, ZS | 22.05 kHz | 25,000 | All splits |
| ProDiff | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| WaveGrad2 | Open-source, PT | 22.05 kHz | 5,000 | All splits |
| XttsV2 | Open-source, ZS | 24 kHz | 5,000 | All splits |
| YourTTS | Open-source, ZS | 16 kHz | 5,000 | All splits |
| DiffGAN-TTS | Open-source, PT | 22.05 kHz | 5,000 | Testing only |
| PlayHT | Commercial, ZS | 24 kHz | 5,000 | Testing only |
| UnitSpeech | Open-source, ZS | 22.05 kHz | 5,000 | Testing only |
OpenVoice2 is the largest contributor (≈25,000 utterances), followed by the set of LJ-Speech-based pre-trained methods.
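For programmatic filtering, the generator roster above can be kept as a small lookup table; the sketch below uses only the license, mode, sampling-rate, and split facts from the table (the field names and dictionary layout are my own, not part of the dataset release):

```python
# Generator metadata transcribed from the DiffSSD generator table.
# "test_only" marks generators held out of training/validation (open-set).
GENERATORS = {
    "ElevenLabs":  {"license": "commercial",  "mode": "zero-shot",   "sr_hz": 44_100, "test_only": False},
    "GradTTS":     {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "OpenVoice2":  {"license": "open-source", "mode": "zero-shot",   "sr_hz": 22_050, "test_only": False},
    "ProDiff":     {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "WaveGrad2":   {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": False},
    "XttsV2":      {"license": "open-source", "mode": "zero-shot",   "sr_hz": 24_000, "test_only": False},
    "YourTTS":     {"license": "open-source", "mode": "zero-shot",   "sr_hz": 16_000, "test_only": False},
    "DiffGAN-TTS": {"license": "open-source", "mode": "pre-trained", "sr_hz": 22_050, "test_only": True},
    "PlayHT":      {"license": "commercial",  "mode": "zero-shot",   "sr_hz": 24_000, "test_only": True},
    "UnitSpeech":  {"license": "open-source", "mode": "zero-shot",   "sr_hz": 22_050, "test_only": True},
}

open_set = sorted(n for n, g in GENERATORS.items() if g["test_only"])
commercial = sorted(n for n, g in GENERATORS.items() if g["license"] == "commercial")
print(open_set)    # ['DiffGAN-TTS', 'PlayHT', 'UnitSpeech']
print(commercial)  # ['ElevenLabs', 'PlayHT']
```

A structure like this makes the closed-set/open-set partition explicit: the three `test_only` generators are exactly those listed as "Testing only" in the table.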
3. Dataset Statistics and Distributions
DiffSSD features 145.2 hours of synthetic speech and 50.8 hours of real speech. The utterance-length distribution is approximately normal, centered at 7.5 seconds, with most utterances between 2 and 15 seconds. Within the synthetic subset, generator contributions differ markedly, reflecting coverage of both the dominant open-source and emerging commercial platforms. Statistical analysis confirms balanced gender representation among synthetic zero-shot profiles.
4. Detector Evaluation Protocols and Metrics
Synthetic speech detector evaluation in DiffSSD utilizes Equal Error Rate (EER) as the principal metric. The EER is the error rate at the decision threshold where the false acceptance rate (FAR) equals the false rejection rate (FRR):
- $\mathrm{FAR} = \dfrac{FA}{FA + TR}$ and $\mathrm{FRR} = \dfrac{FR}{FR + TA}$
- $FA$: false acceptances (synthetic classified as real); $FR$: false rejections (real classified as synthetic); $TR$: true rejections (synthetic correctly classified as synthetic); $TA$: true acceptances (real correctly classified as real).
Lower EER values signal superior separation between real and synthetic speech. The accompanying paper also references the tandem detection cost function (t-DCF), though without providing formula details.
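The EER definition above can be computed directly from detector scores by sweeping a decision threshold until FAR and FRR cross. A minimal sketch, assuming higher scores mean "more synthetic-like" (the function name and toy score distributions are illustrative, not from the dataset):

```python
import numpy as np

def compute_eer(real_scores: np.ndarray, synth_scores: np.ndarray) -> float:
    """Equal Error Rate: operating point where the false-acceptance rate
    (synthetic scored as real) equals the false-rejection rate
    (real scored as synthetic). Assumes higher score = more synthetic-like."""
    thresholds = np.sort(np.unique(np.concatenate([real_scores, synth_scores])))
    # FAR: fraction of synthetic utterances below threshold (accepted as real).
    far = np.array([(synth_scores < t).mean() for t in thresholds])
    # FRR: fraction of real utterances at/above threshold (rejected as synthetic).
    frr = np.array([(real_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))       # closest crossing point
    return float((far[idx] + frr[idx]) / 2)  # average the two rates at the crossing

# Toy example: well-separated Gaussian score distributions.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1_000)   # scores for real speech
synth = rng.normal(2.0, 1.0, 1_000)  # scores for synthetic speech
print(f"EER = {compute_eer(real, synth):.3f}")
```

With unit-variance Gaussians two standard deviations apart, the EER lands near the theoretical value of about 16%; a perfect detector would reach 0%.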
5. Cross-Dataset Generalization and Closed-Set vs. Open-Set Experiments
Evaluation is structured along two principal regimes:
a) Generalization Test (ASVspoof2019-trained detectors, tested on DiffSSD):
- Detectors: LFCC-GMM (handcrafted), MFCC-ResNet, Spec-ResNet, PaSST (patch-based transformer), Wav2Vec2 (self-supervised transformer).
- Observed performance deterioration: EER rises from roughly 0–10% on ASVspoof2019 to 22–55% on DiffSSD. For example, LFCC-GMM's EER increases from 3.7% to 36.7%, and Wav2Vec2's from 0.3% to 48.5%, indicating poor generalization to high-fidelity diffusion-based synthetic speech.
b) In-Domain Test (trained and evaluated on DiffSSD splits):
- Retraining detectors on DiffSSD restores accuracy: PaSST achieves 0.08% EER on validation, 3.5% on test; Wav2Vec2 reaches 1.5% (val), 3.0% (test).
- Closed-set (seen generators): 99–100% per-generator accuracy at EER threshold with PaSST.
- Open-set (generators held out to testing only: DiffGAN-TTS, PlayHT, UnitSpeech): >95% accuracy at EER threshold.
Handcrafted features (LFCC-GMM) consistently lag behind deep-feature architectures (ResNet, transformers), especially for high-fidelity synthetic speech.
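The per-generator accuracies quoted for the closed-set and open-set regimes follow a common recipe: fix the detector's decision threshold at the EER operating point (computed on real speech vs. all synthetic speech), then score each generator's utterances separately against that threshold. A self-contained sketch with illustrative score distributions (the generator scores below are simulated, not actual DiffSSD results):

```python
import numpy as np

def eer_threshold(real_scores: np.ndarray, synth_scores: np.ndarray) -> float:
    """Threshold where FAR (synthetic accepted as real) ~= FRR (real rejected)."""
    thresholds = np.sort(np.unique(np.concatenate([real_scores, synth_scores])))
    far = np.array([(synth_scores < t).mean() for t in thresholds])
    frr = np.array([(real_scores >= t).mean() for t in thresholds])
    return float(thresholds[np.argmin(np.abs(far - frr))])

def per_generator_accuracy(scores_by_gen: dict, threshold: float) -> dict:
    """Fraction of each generator's utterances correctly flagged as synthetic
    (score >= threshold), evaluated one generator at a time."""
    return {g: float((s >= threshold).mean()) for g, s in scores_by_gen.items()}

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, 2_000)  # real-speech scores
# Simulated detector scores: a "seen" generator scores higher (easier)
# than an "unseen" open-set one.
scores_by_gen = {
    "GradTTS": rng.normal(4.0, 1.0, 500),  # closed-set (seen in training)
    "PlayHT":  rng.normal(3.5, 1.0, 500),  # open-set (testing only)
}
t = eer_threshold(real, np.concatenate(list(scores_by_gen.values())))
print(per_generator_accuracy(scores_by_gen, t))
```

Reporting accuracy at a single shared threshold, rather than per-generator thresholds, mirrors the forensic setting: the analyst must commit to one operating point before knowing which synthesizer produced a given utterance.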
6. Significance and Recommendations for Future Research
DiffSSD demonstrates that detectors trained solely on conventional synthetic speech are ineffective for diffusion-based outputs. Inclusion of recent open-source and commercial diffusion syntheses in training sets is imperative for robust model performance. The dataset’s structure enables both closed-set and open-set evaluation, simulating real-world forensic scenarios. Recommendations include:
- Routine incorporation of contemporary diffusion-based synthetic speech during training and validation to mitigate overfitting to legacy artifacts.
- Multilingual and cross-dialect expansion beyond the current English-only coverage.
- Investigation of advanced front-ends (learned spectral priors, phase-aware features) to improve detection of commercial-grade synthetic speech.
- Use and public sharing of reproducible splits, text prompts, and generation logs (available via HuggingFace).
- Development of ensemble and confidence-calibrated detectors to address ongoing advances in speech synthesis.
- Ongoing adjustment of datasets in response to anti-detection mechanisms, such as smoothing or watermarking, adopted by commercial TTS platforms.
DiffSSD establishes a critical benchmark for synthetic speech forensics and future-proofs research against the rapidly evolving landscape of high-fidelity speech generation (Bhagtani et al., 2024).