CompSpoofV2: Component-Level Audio Deepfake Dataset

Updated 19 January 2026
  • CompSpoofV2 is a large-scale curated dataset featuring 250,000 audio clips with dual labels for both speech and environment, enabling rigorous deepfake detection research.
  • The dataset uses a five-class taxonomy and state-of-the-art generative techniques to model nuanced component-level manipulations in mixed-signal recordings.
  • Baseline detection frameworks employing separation-enhanced joint learning reveal challenges in unseen conditions, guiding future research toward robust spoof detection.

CompSpoofV2 is a large-scale, curated dataset developed for the detection of component-level audio deepfakes, where either the speech or the environmental sound—or both—may be independently synthesized or manipulated. Designed to address the increasing complexity of audio forgeries in real-world, mixed-signal recordings, CompSpoofV2 underpins the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing specifically on scenarios where traditional whole-utterance anti-spoofing methods prove inadequate due to selective component manipulation (Zhang et al., 12 Jan 2026).

1. Dataset Structure and Class Taxonomy

CompSpoofV2 comprises approximately 250,000 audio clips, totaling nearly 283 hours of curated content. Each clip consists of two distinct components: foreground speech and background environmental sound, each potentially labeled as bona fide (genuine) or spoofed (synthetically generated or manipulated). The dataset is organized into five mutually exclusive classes representing all combinations of bona fide and spoofed status for these two components:

| ID | Mixed | Speech | Environment | Class Label | Description |
|----|-------|--------|-------------|-------------|-------------|
| 0 | no | bona fide | bona fide | original | Original audio, no mixing or manipulation |
| 1 | yes | bona fide | bona fide | bonafide_bonafide | Genuine speech mixed with genuine environment (different recordings) |
| 2 | yes | spoofed | bona fide | spoof_bonafide | Spoofed speech mixed with genuine environment |
| 3 | yes | bona fide | spoofed | bonafide_spoof | Genuine speech mixed with spoofed environment |
| 4 | yes | spoofed | spoofed | spoof_spoof | Both speech and environment are spoofed |

Training and validation splits constitute approximately 80% of the total data, sharing both data sources and class distributions, while evaluation and test splits utilize held-out or newly generated (“unseen”) content, comprising the remaining 20%. Precise split counts and durations are withheld for challenge protocol alignment (Zhang et al., 12 Jan 2026).
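The five-class taxonomy above is a simple function of the two per-component labels plus a mixed/unmixed flag. A minimal sketch of that mapping, with hypothetical helper and label names (not the official toolkit API):

```python
# Hypothetical helper mapping CompSpoofV2 per-component labels to the
# five-class taxonomy; names are illustrative, not an official API.

def class_id(speech: str, environment: str, mixed: bool = True) -> int:
    """Return the CompSpoofV2 class ID (0-4) for a clip.

    speech / environment: "bonafide" or "spoof"
    mixed: False only for unmixed original recordings (class 0).
    """
    if not mixed:
        if speech != "bonafide" or environment != "bonafide":
            raise ValueError("unmixed originals must be fully bona fide")
        return 0  # original
    table = {
        ("bonafide", "bonafide"): 1,  # bonafide_bonafide
        ("spoof", "bonafide"): 2,     # spoof_bonafide
        ("bonafide", "spoof"): 3,     # bonafide_spoof
        ("spoof", "spoof"): 4,        # spoof_spoof
    }
    return table[(speech, environment)]
```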

2. Data Generation and Annotation

Component-level spoofing in CompSpoofV2 is constructed through state-of-the-art generative and conversion techniques. Speech spoofing leverages contemporary TTS and voice-conversion systems, including those used in ASVspoof 5 (“ASV5”), MLAAD, and conversions between CommonVoice and LibriTTS. Environmental sound spoofing employs generative models of environmental tracks, as featured in EnvSDD and VCapAV. Evaluation and test sets incorporate both previously “seen” (held-out) and “newly generated” synthetic audio based on models or parameters not encountered during training.

Each audio file is annotated with:

  • Audio ID
  • Per-component labels for speech (bona fide/spoofed) and environment (bona fide/spoofed)
  • Overall class label (0–4)
  • Source dataset identifiers (for both speech and environment)
  • Partition tag (train/val/eval/test)

This annotation facilitates precise ground-truth alignment for both component-specific and overall detection objectives.
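An annotation record covering the fields listed above could look like the following sketch; the field names and the example values are assumptions, not the official metadata schema:

```python
from dataclasses import dataclass

# Illustrative record mirroring the annotation fields listed above;
# names and example values are assumptions, not the official schema.

@dataclass
class ClipAnnotation:
    audio_id: str
    speech_label: str   # "bonafide" or "spoof"
    env_label: str      # "bonafide" or "spoof"
    class_id: int       # overall class label, 0-4
    speech_source: str  # source dataset for the speech component
    env_source: str     # source dataset for the environment component
    partition: str      # "train" / "val" / "eval" / "test"

ann = ClipAnnotation(
    audio_id="example_000123",  # hypothetical ID
    speech_label="spoof",
    env_label="bonafide",
    class_id=2,                 # spoof_bonafide
    speech_source="MLAAD",
    env_source="EnvSDD",
    partition="train",
)
```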

3. Baseline Detection Framework

The baseline architecture for CompSpoofV2 and the ESDD2 challenge is a separation-enhanced joint learning framework. The core workflow is as follows:

  1. A binary mixture detector flags each clip as either “original” (ID 0) or “mixed” (IDs 1–4).
  2. For mixed clips, a separator module decomposes audio into speech and environment streams.
  3. Two dedicated subnetworks, one for speech anti-spoofing and another for environment anti-spoofing, produce spoof likelihood scores for their respective components.
  4. A fusion layer integrates these scores (along with the mixture indicator) and computes a five-way softmax corresponding to the official CompSpoofV2 classes.
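The four-step control flow above can be sketched as follows. The detector, separator, and per-component scorers are stand-in stubs (the real baseline implements each as a neural subnetwork), and a simple threshold replaces the learned softmax fusion for clarity:

```python
# Sketch of the separation-enhanced baseline's control flow. All callables
# are stand-in stubs; the actual baseline fuses scores with a learned
# five-way softmax rather than fixed 0.5 thresholds.

def classify(clip, mixture_detector, separator, speech_scorer, env_scorer):
    """Return the predicted CompSpoofV2 class ID for one clip."""
    if not mixture_detector(clip):                 # step 1: original vs. mixed
        return 0                                   # class 0: original
    speech, env = separator(clip)                  # step 2: decompose streams
    speech_spoofed = speech_scorer(speech) > 0.5   # step 3: per-component scores
    env_spoofed = env_scorer(env) > 0.5
    # step 4: fuse the two decisions into the four mixed classes (1-4)
    return 1 + (1 if speech_spoofed else 0) + (2 if env_spoofed else 0)
```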

The joint training objective,

$$L_{\mathrm{total}} = \alpha\,L_{\mathrm{sep}} + \beta\,L_{\mathrm{det}}$$

combines a separation loss ($L_{\mathrm{sep}}$, e.g., SI-SNR or L2 between estimated and true separated components) and a multi-class detection loss ($L_{\mathrm{det}}$, cross-entropy over the five class outputs), with weights $\alpha$ and $\beta$.

This approach enables simultaneous learning of both component separation and class discrimination, reflecting the challenge’s emphasis on component-level detection in mixed and adversarial audio settings.
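A minimal NumPy sketch of this joint objective, assuming SI-SNR for the separation term and cross-entropy for the detection term (the weights, shapes, and function names are illustrative, not the baseline's actual implementation):

```python
import numpy as np

# Minimal sketch of L_total = alpha * L_sep + beta * L_det, with SI-SNR as
# the separation loss and cross-entropy as the five-way detection loss.
# Weights and shapes are illustrative.

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR (dB) of an estimated vs. true signal."""
    ref_zm = ref - ref.mean()
    est_zm = est - est.mean()
    proj = (est_zm @ ref_zm) / (ref_zm @ ref_zm + eps) * ref_zm
    noise = est_zm - proj
    si_snr = 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)
    return -si_snr

def cross_entropy(logits, target):
    """Cross-entropy of one 5-way logit vector against an integer label."""
    z = logits - logits.max()                    # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def total_loss(est, ref, logits, target, alpha=1.0, beta=1.0):
    return alpha * si_snr_loss(est, ref) + beta * cross_entropy(logits, target)
```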

4. Evaluation Protocol and Baseline Metrics

Evaluation in ESDD2 employs the Overall Macro-F1 measure across all five classes as the primary performance criterion:

$$\textrm{Macro-F1} = \frac{1}{5} \sum_{i=1}^{5} \mathrm{F1}_i$$

with standard class-wise precision and recall definitions.
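The criterion can be computed directly from a confusion over the five classes; a self-contained sketch (equivalent to scikit-learn's `f1_score(..., average="macro")`, with zero-division handled as 0):

```python
import numpy as np

# Sketch of the Overall Macro-F1 criterion: per-class F1 from standard
# precision/recall definitions, averaged uniformly over the five classes.

def macro_f1(y_true, y_pred, n_classes=5):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes
```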

Auxiliary metrics include:

  • EER_original: Equal Error Rate between “original” (ID 0) and all “mixed” classes.
  • EER_speech: EER of spoofed speech detection.
  • EER_env: EER of spoofed environmental detection.

An optional tandem detection cost function (t-DCF) is also provided:

$$\textrm{t-DCF} = C_1\,P_{\mathrm{miss}} + C_2\,P_{\mathrm{fa}}$$

where $P_{\mathrm{miss}}$ and $P_{\mathrm{fa}}$ are miss and false alarm probabilities, and $C_1$, $C_2$ are user-specified costs.
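As a worked illustration of the cost formula (the function name and default costs are assumptions; the challenge specifies its own cost configuration):

```python
# Sketch of the optional tandem detection cost:
# t-DCF = C1 * P_miss + C2 * P_fa, with user-specified costs C1, C2.

def t_dcf(p_miss, p_fa, c_miss=1.0, c_fa=1.0):
    """Weighted sum of miss and false-alarm probabilities."""
    if not (0.0 <= p_miss <= 1.0 and 0.0 <= p_fa <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return c_miss * p_miss + c_fa * p_fa
```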

| Split | Original EER | Speech EER | Env EER | Macro-F1 |
|-------|--------------|------------|---------|----------|
| Val | 0.0031 | 0.0172 | 0.3766 | 0.9462 |
| Eval | 0.0174 | 0.1993 | 0.4336 | 0.6224 |
| Test | 0.0173 | 0.1978 | 0.4279 | 0.6327 |

These baseline figures highlight a distinct drop-off in F1 and increased error rates in the eval/test splits, reflecting the increased challenge posed by unseen generative conditions (Zhang et al., 12 Jan 2026).

5. Application Domains and Motivations

CompSpoofV2 and its evaluation protocols are motivated by threats in scenarios where attackers selectively manipulate only one component of an audio stream. Examples include:

  • Voice authentication spoofing via environmental replacement (e.g., synthetic office ambience)
  • Command injection through speech-only manipulation
  • Broadcast media tampering and forensic audio alteration
  • Adversarial attacks targeting voice-assistant platforms

Key downstream uses involve:

  • Enhanced anti-spoofing modules for speaker verification
  • Forensic localization and identification of manipulated segments
  • Development of watermarking or provenance-tracking tools for both speech and environmental tracks

This focus on component-level forensics is a direct response to evolving deepfake techniques that are not addressed by legacy whole-signal anti-spoofing approaches.

6. Open Challenges and Research Directions

Several outstanding challenges are emphasized in the CompSpoofV2 methodology and motivation:

  • Domain shift: Performance can degrade against generative models or manipulations not represented in the training set.
  • Imperfect separation: Separators may leave residual crosstalk between speech and ambient streams, potentially masking cues necessary for spoof detection.
  • Unseen environmental or linguistic conditions: New background types or speech accents may present detection blind spots.

A plausible implication is that robust generalization under novel synthesis conditions and improved component separation will remain significant research frontiers for component-level anti-spoofing.

7. Significance in the Audio Forensics Research Landscape

CompSpoofV2, coupled with its separation-enhanced baseline and granular evaluation plan, represents a paradigm shift toward fine-grained, multi-component audio forensics. By structuring the ESDD2 Challenge around these data and detection principles, the initiative aims to galvanize research into more realistic and targeted deepfake detection, advancing both dataset realism and methodological rigor (Zhang et al., 12 Jan 2026).
