
ESDD2: Speech & Sound Deepfake Challenge

Updated 19 January 2026
  • ESDD2 is a benchmarking challenge addressing AI-manipulated speech and environmental sound deepfakes under realistic and adversarial conditions.
  • It leverages diverse datasets and protocols to simulate independent and joint spoofing scenarios, evaluating systems with metrics like EER and Macro-F1.
  • Advanced models in ESDD2 employ mid-layer fusion, attentive pooling, and class imbalance correction to enhance detection robustness.

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) is a competitive benchmarking initiative designed to advance the detection of AI-generated manipulations of speech and environmental sounds under realistic acoustic, adversarial, and generator-diverse conditions. Distinctive in its focus on both foreground speech and ambient environmental audio, ESDD2 evaluates detection systems on their ability to identify a range of spoofing manipulations, including independent and combined modifications to speech and background tracks, with protocols designed to model real-world and adversarial deployment scenarios (Zhang et al., 12 Jan 2026, Yin et al., 6 Aug 2025).

1. Motivations, Scope, and Challenge Definition

The rapid progress in generative audio—encompassing text-to-speech (TTS), voice conversion (VC), and environmental sound synthesis—has enabled the creation of highly realistic and immersive audio that is increasingly difficult to distinguish from authentic content. This has introduced acute security and trust challenges in applications ranging from media production to misinformation campaigns (Gao et al., 13 Aug 2025, Guo et al., 23 Dec 2025).

ESDD2 addresses two main detection scenarios:

  • Environment-aware deepfake detection, where both foreground speech and environmental sound may be manipulated independently.
  • Generalization to unseen generative methods and real-world backgrounds, emphasizing robustness to new TTS/VC models, environmental reverberation, background noise, and low-resource ("black-box") spoofing (Yin et al., 6 Aug 2025, Zhang et al., 12 Jan 2026).

The challenge is structured to require models that are sensitive to component-level artifacts, can operate under extreme class imbalance, and maintain performance in the presence of real-world perturbations.

2. Data Resources and Manipulation Taxonomy

2.1. Foundational Corpora

  • CompSpoofV2 Dataset: Over 250,000 audio clips (∼283 hours), with five strictly disjoint classes encoding combinations of bona fide and spoofed speech and environment. Speech sources are drawn from collections including CommonVoice, LibriTTS, ASV5, MLAAD, and background from AudioCaps, VGGSound, EnvSDD, UrbanSound, and VCapAV (Zhang et al., 12 Jan 2026).
  • EnvSDD Dataset: 45.25 hours real, 316.7 hours fake environmental audio, sourced from UrbanSound8K, TAU UAS 2019, TUT SED 2016/2017, DCASE 2023 Task7, and Clotho. All clips standardized to 4 s, 16 kHz, capturing both monophonic and polyphonic acoustic scenes (Yin et al., 6 Aug 2025, Yin et al., 25 May 2025).
  • Perturbed Public Voices (P²V): 257,440 utterances (∼96% deepfake), with systematic adversarial perturbations (background noise from ESC-50, Gaussian noise, air absorption, compression, pitch/time shift, band-pass filtering, impulse–response convolution, clipping) targeting voice cloning detection (Gao et al., 13 Aug 2025).

2.2. Deepfake Generation and Perturbation

  • Synthesis paradigms: Text-to-audio (TTA), audio-to-audio (ATA), and black-box paradigms using state-of-the-art models (e.g., AudioLDM, AudioGen, TangoFlux, AudioLCM, Diff-HierVC, FreeVC, XTTS-v2, VEVO, Zonos, IndexTTS) (Zhang et al., 12 Jan 2026, Gao et al., 13 Aug 2025, Yin et al., 6 Aug 2025).
  • Component-level manipulations: Speech and environment can be spoofed independently or jointly, yielding original, speech-only, environment-only, and mixed spoof conditions.
  • Perturbation Protocols: Randomized application of environmental/acoustic corruptions and adversarial-style noise with norm-bounded or deterministic transformations. P²V applies these as stress tests for model generalization (Gao et al., 13 Aug 2025).
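A minimal sketch of one such randomized perturbation step, assuming an illustrative SNR range, clipping threshold, and corruption menu rather than P²V's actual protocol:

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR (dB)."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

def random_perturb(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly apply one corruption: additive noise, hard clipping, or none.
    The choices and parameter ranges below are illustrative assumptions."""
    choice = rng.integers(0, 3)
    if choice == 0:
        return add_noise_at_snr(wave, snr_db=rng.uniform(5, 20), rng=rng)
    if choice == 1:
        return np.clip(wave, -0.5, 0.5)  # amplitude clipping
    return wave  # leave unperturbed

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 16 kHz tone
y = random_perturb(x, rng)
```

A real stress-test pipeline would add the remaining corruptions (reverberation, compression, pitch/time shift, band-pass filtering) as further branches of the same random choice.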

3. Challenge Tasks, Evaluation Metrics, and Protocol

3.1. Task Structure

| Track | Description | Primary Task |
|-------|-------------|--------------|
| 1 | Unseen Generators | Classification of test clips from generators unseen during training; systems are trained only on known generators (Yin et al., 6 Aug 2025, Guo et al., 23 Dec 2025) |
| 2 | Black-Box Low-Resource | Adaptation and detection on spoofed audio from unseen paradigms with only ~1% labeled data, simulating ad hoc or proprietary generators (Yin et al., 6 Aug 2025, Guo et al., 23 Dec 2025) |
| – | Component-level Spoofing | Assigning 5-way class labels (original, bona fide_bona fide, spoof_bona fide, bona fide_spoof, spoof_spoof) at clip level (Zhang et al., 12 Jan 2026) |
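The five-way component labeling above can be sketched as a simple mapping. The (speech, environment) ordering of the compound labels, and the reading of "original" as a never-decomposed recording versus "bona fide_bona fide" as a re-mixed clip of bona fide components, are assumptions here, so treat this as illustrative:

```python
# Hypothetical 5-way label mapping for component-level spoofing.
# Assumption: compound labels are ordered speech_environment, and
# "original" means the clip was never decomposed or re-mixed at all.
CLASSES = [
    "original",
    "bona fide_bona fide",
    "spoof_bona fide",
    "bona fide_spoof",
    "spoof_spoof",
]

def clip_label(remixed: bool, speech_spoofed: bool, env_spoofed: bool) -> str:
    """Map per-component ground truth to one of the five clip-level classes."""
    if not remixed and not speech_spoofed and not env_spoofed:
        return "original"
    speech = "spoof" if speech_spoofed else "bona fide"
    env = "spoof" if env_spoofed else "bona fide"
    return f"{speech}_{env}"
```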

3.2. Metrics

  • EER: The primary ranking metric for Tracks 1 and 2; the error rate at the operating point where the miss and false-alarm rates coincide (Yin et al., 6 Aug 2025).

\mathrm{EER} = P_{\rm miss}(\tau^*), \quad \text{where } \tau^* \text{ satisfies } P_{\rm miss}(\tau^*) = P_{\rm fa}(\tau^*)
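Given detector scores for bona fide and spoofed clips, the EER can be estimated with a simple threshold sweep (a minimal sketch, assuming higher scores indicate bona fide audio):

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """EER: rate at the threshold where P_miss (bona fide rejected)
    equals P_fa (spoof accepted). Higher score = more bona fide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    p_miss = np.array([(bonafide_scores < t).mean() for t in thresholds])
    p_fa = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))       # closest crossing point
    return float((p_miss[idx] + p_fa[idx]) / 2)  # average the two rates there
```

Production toolkits interpolate between thresholds for an exact crossing; this discrete sweep is sufficient to illustrate the definition.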

  • Macro-F1: For five-class/multicomponent tasks, mean of F1 over all classes; used for leaderboard ranking in CompSpoofV2/ESDD2 (Zhang et al., 12 Jan 2026).

\mathrm{Macro\text{-}F1} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{F1}_i
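A direct implementation of this macro average (matching scikit-learn's `f1_score(average='macro')` up to zero-division handling):

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    """Unweighted mean of per-class F1 over all K classes; a class with
    no true or predicted examples contributes F1 = 0."""
    f1s = []
    for k in range(num_classes):
        tp = np.sum((y_true == k) & (y_pred == k))
        fp = np.sum((y_true != k) & (y_pred == k))
        fn = np.sum((y_true == k) & (y_pred != k))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))
```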

  • Deepfake Detection Score (DDS): For P²V, a composite of F1, AUC, and 1 − EER on the deepfake class (Gao et al., 13 Aug 2025).

3.3. Submission and Evaluation

Submissions are typically TXT files with either clip-wise scores or predicted class IDs, uploaded to CodaBench or CodaLab platforms. Leaderboard ranking is determined by the relevant primary metric (EER or Macro-F1), using averaged performance over multiple runs, with hidden ground-truth labels for test splits (Yin et al., 6 Aug 2025, Zhang et al., 12 Jan 2026).
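A minimal writer for such a score file might look like this; the tab-separated `<clip_id> <score>` layout is a hypothetical schema, so the actual track README should be consulted for the required format:

```python
def write_submission(scores: dict[str, float], path: str) -> None:
    """Write one '<clip_id>\t<score>' line per test clip.
    Hypothetical schema: check the track's instructions for the exact format."""
    with open(path, "w") as f:
        for clip_id, score in sorted(scores.items()):
            f.write(f"{clip_id}\t{score:.6f}\n")
```

For the component-level track, the same layout would carry a predicted class ID per clip instead of a score.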

4. Baseline Detection Systems and Model Innovations

4.1. Baseline Architectures

  • BEATs+AASIST: BEATs foundation model pretrained on AudioSet-2M (patch-embedded mel spectrogram) plus AASIST spectro-temporal graph attention backend (Yin et al., 25 May 2025, Yin et al., 6 Aug 2025). Achieves 5.81% (TTA) and 1.33% (ATA) average EER (Yin et al., 25 May 2025).
  • AASIST: End-to-end model with graph attentional blocks over learned spectrogram features; 0.30M parameters (Yin et al., 6 Aug 2025).
  • Separation-Enhanced Joint Learning: (ESDD2 baseline) Mixture detection, speech/environment separation (e.g., Conv-TasNet), per-component anti-spoofing, and fusion for classification over five compound classes (Zhang et al., 12 Jan 2026).
  • RawNet3, LCNN, SpecRNet, MesoNet: Evaluated on P²V using combinations of time-domain, spectral, and Whisper embeddings to capture diverse artifacts (Gao et al., 13 Aug 2025).

4.2. Advanced Systems

  • EnvSSLAM-FFN: Frozen SSLAM transformer encoder with softmax-weighted fusion of layers 4–9, attentive statistics pooling, lightweight FFN classifier, and class-weighted loss to directly address real ≪ spoof imbalance. Achieves 1.20% and 1.05% EER on Tracks 1 and 2, respectively (Guo et al., 23 Dec 2025).
  • BEAT2AASIST: Dual-branch (frequency/channel split) AASIST with top-k transformer layer fusion (concatenation, CNN-gated, SE-gated). Incorporates vocoder-based augmentation for robustness to GAN artifacts (Chung et al., 17 Dec 2025).
  • BUT SSL Ensembles: Multi-front-end (BEATs, EAT, Dasheng, WavLM, HuBERT) with Multi-Head Factorized Attention backend and distribution-uncertainty augmentation (DSU); fusion of diverse SSLs lowers EER on Track 1 to 3.52%–4.38% (Peng et al., 9 Dec 2025).
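Two building blocks recur across these systems: softmax-weighted layer fusion and attentive statistics pooling. A NumPy sketch of both (shapes and the single-vector attention parameterization are illustrative assumptions, not the exact EnvSSLAM-FFN design):

```python
import numpy as np

def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_layer_fusion(layer_feats: np.ndarray, layer_logits: np.ndarray) -> np.ndarray:
    """Softmax-weighted sum over a band of encoder layers.
    layer_feats: (L, T, D) hidden states; layer_logits: (L,) learned weights."""
    w = softmax(layer_logits, axis=0)
    return np.tensordot(w, layer_feats, axes=1)  # (T, D)

def attentive_stats_pooling(x: np.ndarray, attn_w: np.ndarray) -> np.ndarray:
    """Attention-weighted mean and std over time.
    x: (T, D) frame features; attn_w: (D,) illustrative attention vector."""
    a = softmax(x @ attn_w, axis=0)[:, None]      # (T, 1) weights over frames
    mean = (a * x).sum(axis=0)
    std = np.sqrt(np.maximum((a * (x - mean) ** 2).sum(axis=0), 1e-8))
    return np.concatenate([mean, std])            # (2D,) utterance embedding
```

With zero logits the fusion degenerates to a plain mean over layers; training shifts the weights toward the most artifact-sensitive (mid) layers.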

Summary Table: Selected System EERs

| System | Track 1 EER | Track 2 EER | Dataset |
|--------|-------------|-------------|---------|
| BEATs+AASIST | 13.20% | 12.48% | EnvSDD (Yin et al., 6 Aug 2025) |
| EnvSSLAM-FFN | 1.20% | 1.05% | EnvSDD (Guo et al., 23 Dec 2025) |
| BEAT2AASIST | 1.70% | 0.46%* | EnvSDD (Chung et al., 17 Dec 2025) |
| BUT Ensemble | 3.52% | – | EnvSDD (Peng et al., 9 Dec 2025) |
| ESDD2 Baseline | – | 0.63 Macro-F1 | CompSpoofV2 (Zhang et al., 12 Jan 2026) |

*Best single-model result; ensemble yields further improvement

5. Key Findings, Ablations, and Generalization

  • Layer Fusion: Mid-level transformer layers (e.g., 4–9 in SSLAM/BEATs) are most predictive for deepfake detection, as shown by learnable softmax fusion. Restricting attention to these improves EER over full-layer fusion (Guo et al., 23 Dec 2025, Chung et al., 17 Dec 2025).
  • Pooling and Attention: Attentive statistics pooling, MHFA, and SE-gated fusion modules enhance discrimination by focusing on artifact-laden segments or layer-channel regions (Peng et al., 9 Dec 2025, Guo et al., 23 Dec 2025, Chung et al., 17 Dec 2025).
  • Class Weighting and Distribution Augmentation: Explicit correction for class imbalance via inverse-frequency weighting, as well as feature-domain augmentation (DSU), increase test robustness, especially under black-box or distribution-shift conditions (Peng et al., 9 Dec 2025, Guo et al., 23 Dec 2025).
  • Domain Adaptation: Fine-tuning on AudioSet-2M or including vocoder-based augmentations (HiFi-GAN, BigV-GAN, UnivNet) substantially reduces EER on unseen generator tasks (Chung et al., 17 Dec 2025, Peng et al., 9 Dec 2025).
  • Component Separation: For mixed speech–environment audio, joint modeling (with neural separation and per-component detectors) outperforms holistic detectors (Zhang et al., 12 Jan 2026).
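Inverse-frequency class weighting, as used to correct the real ≪ spoof imbalance, can be sketched as follows (the mean-one normalization is a common convention assumed here, not a prescribed choice):

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class loss weights proportional to 1/count, normalized to mean 1,
    so rare classes (e.g. bona fide) contribute as much gradient as common ones."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)       # guard against empty classes
    w = 1.0 / counts
    return w * num_classes / w.sum()       # normalize so the mean weight is 1
```

The resulting vector is passed as the per-class weight of the cross-entropy loss.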

A plausible implication is that representation learning strategies should emphasize mid-layer fusion, explicit class imbalance correction, and model uncertainty to close the generalization gap exposed by adversarially or environmentally perturbed test data.

6. Limitations, Open Problems, and Future Directions

  • Environmental Complexity: EER remains >40% for environmental-only spoofing in the multi-component setting, underscoring the need for more powerful background modeling (Zhang et al., 12 Jan 2026).
  • Adversarial Robustness: Simple additive noise and basic reverberation still erode detector performance by up to ~16% (DDS); advanced adversarial attacks (e.g., PGD, CW) remain a future integration point (Gao et al., 13 Aug 2025).
  • Imbalance and Representation Bias: Class compositions are often deepfake-heavy, necessitating further study into label-distributional shift, especially for real-world deployment (Gao et al., 13 Aug 2025, Guo et al., 23 Dec 2025).
  • Language and Domain Coverage: Datasets like P²V are monolingual and focus on deceased public figures, inviting extension to live, multilingual, and conversational corpora.
  • Detection–Attribution Distinction: Many current metrics measure class discrimination but not generator attribution or artifact localization; richer labeling and interpretability are open areas.

7. Significance and Community Impact

The ESDD2 challenge platform, through its diversified tasks, multi-environment test protocols, and open benchmarks, has established environment-aware deepfake detection as a technically distinct and urgent subfield of audio forensics. Its multi-component, real-world–motivated evaluation scheme, combined with the release of datasets (CompSpoofV2, EnvSDD, P²V), transparent benchmarking pipelines, and standardization of evaluation metrics, directly catalyzes progress toward robust, deployable, and generalizable deepfake detection in both speech and environmental sound contexts (Zhang et al., 12 Jan 2026, Yin et al., 6 Aug 2025, Gao et al., 13 Aug 2025, Yin et al., 25 May 2025).
