VoicePrivacy 2024 Challenge Protocol
- VoicePrivacy 2024 Challenge Protocol is a standardized evaluation framework for assessing whether voice anonymization systems conceal speaker identity while preserving linguistic and emotional content.
- The protocol employs key metrics—Equal Error Rate (EER), Word Error Rate (WER), and Unweighted Average Recall (UAR)—to assess privacy, speech intelligibility, and emotional fidelity in both offline and real-time settings.
- It details data splits, permitted training resources, and attacker simulation models to enable reproducible, transparent benchmarking of privacy-preserving speech technologies.
The VoicePrivacy 2024 Challenge Protocol defines the standardized framework for evaluating voice anonymization systems with respect to privacy preservation (concealing speaker identity), linguistic intelligibility, and emotional expressiveness. This protocol is a product of the VoicePrivacy Initiative, providing datasets, baseline systems, and evaluation scripts to ensure fair and reproducible assessment across submissions. Systems are evaluated in both offline and real-time streaming settings, and the protocol’s rigor has anchored both the central anonymization challenge (Tomashenko et al., 2024, Tomashenko et al., 17 Jan 2026) and adversarial attacker challenges (Tomashenko et al., 2024), ultimately advancing the scientific study and deployment of privacy-preserving speech technologies.
1. Protocol Objectives and Task Definition
The principal task is to transform an input utterance into an output signal such that:
- Privacy: Speaker identity is concealed at the utterance level; attacker ASV systems should yield high error rates on the anonymized speech.
- Linguistic Utility: Automatic speech recognition (ASR) systems trained on clean data maintain low word error rates (WER) on the anonymized speech.
- Emotional Utility: Speech emotion recognition (SER) systems, also trained on clean data, must preserve emotion as measured by unweighted average recall (UAR) on the anonymized speech.
Protocol requirements stipulate the output must be a waveform, anonymization must occur independently for each utterance (randomized or pseudo-random per utterance allowed), and the process can never rely on true speaker labels. No human-in-the-loop adaptation or side-channel identity leakage is permitted (Tomashenko et al., 2024, Kuzmin et al., 20 Jan 2026).
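The per-utterance, label-free requirement can be made concrete with a small sketch. This is an illustrative scheme (the seed-derivation function and pseudo-speaker pool are my assumptions, not part of the protocol): the anonymizer draws a target voice deterministically from the utterance ID alone, so outputs are pseudo-random per utterance yet never depend on the true speaker label.

```python
import hashlib

import numpy as np


def utterance_seed(utterance_id: str, global_seed: int = 0) -> int:
    """Derive a per-utterance seed so anonymization is pseudo-random yet
    reproducible, without touching any true speaker label."""
    digest = hashlib.sha256(f"{global_seed}:{utterance_id}".encode()).hexdigest()
    return int(digest[:8], 16)


def pick_pseudo_speaker(utterance_id: str, pool_size: int = 100) -> int:
    """Select a pseudo-speaker index from a pool of artificial target voices.
    The choice depends only on the utterance ID, never on the source speaker."""
    rng = np.random.default_rng(utterance_seed(utterance_id))
    return int(rng.integers(0, pool_size))
```

Because the seed is a pure function of the utterance ID, repeated runs are reproducible while distinct utterances from the same speaker map to unrelated pseudo-speakers, which is the point of utterance-level anonymization.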
2. Datasets, Data Splits, and Training Resources
Evaluation Datasets
- LibriSpeech 16 kHz: The basis for privacy and ASR evaluation.
- Development (dev-clean): 29 enrollment speakers (15 F, 14 M), 343 enrollment utterances; 40 trial speakers (20 F, 20 M), 1978 trial utterances.
- Evaluation (test-clean): 29 enrollment speakers (16 F, 13 M), 438 enrollment utterances; 40 trial speakers (20 F, 20 M), 1496 trial utterances.
- IEMOCAP 16 kHz: The reference for SER/Emotion evaluation.
- 12 hours of acted conversational speech, 10 actors (5 F, 5 M), 4 emotion classes (neutral, sadness, anger, happiness/excitement). Evaluation uses leave-one-conversation-out cross-validation (Tomashenko et al., 17 Jan 2026, Yao et al., 2024).
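The leave-one-conversation-out scheme can be sketched in a few lines; the fold generator below is a generic illustration (conversation identifiers are placeholders, not the IEMOCAP file layout):

```python
def leave_one_out_folds(conversations):
    """Leave-one-conversation-out CV: each conversation serves once as the
    held-out test set while the remaining ones train the SER evaluator."""
    for i, held_out in enumerate(conversations):
        train = conversations[:i] + conversations[i + 1:]
        yield train, held_out
```

Each emotion-recognition score is then averaged over all folds, so every conversation contributes exactly once to the test side.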
Permitted Training Resources
The protocol includes a published, immutable list of allowed corpora (LibriSpeech, VoxCeleb, LibriTTS, ESD, CREMA-D, RAVDESS, VCTK, etc.) and pretrained models (WavLM, HuBERT, wav2vec 2.0, ECAPA-TDNN, EnCodec, NaturalSpeech 3, HiFi-GAN, VITS, etc.). Only public resources listed in the finalized whitelist may be used (Tomashenko et al., 2024, Tomashenko et al., 2024).
Model Training and Inference
- Anonymizer Training: May use all permitted resources, but never the dev/eval splits.
- Evaluation Models: ASV, ASR, and SER evaluation models are trained exclusively on unmodified training partitions, as prescribed by the organizers.
3. Attacker Models and Threat Scenarios
The protocol formalizes attacker knowledge through several models (Tomashenko et al., 2024, Tomashenko et al., 2024, Kuzmin et al., 20 Jan 2026):
- Lazy-informed: Attacker knows the anonymization algorithm but uses an ASV model trained only on original speech data. Trials: anonymized test utterances against original enrollments.
- Semi-informed: Attacker both knows and can simulate the anonymization system, re-training or fine-tuning their ASV model on anonymized speech. Trials: anonymized test utterances against anonymized enrollments.
- Uninformed/Other: Attacker does not know or ignores anonymization; these scenarios are mentioned but not used for main result reporting.
For all attacker conditions, the recommended backbone is ECAPA-TDNN, trained as specified in the protocols (details: 512 convolutional channels, 80-dim Mel-filterbank input, 192-dim embeddings, Adam optimizer) (Tomashenko et al., 2024).
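The quoted backbone settings can be summarized as a configuration fragment; the key names below are mine for illustration, not the organizers' actual configuration schema:

```python
# Illustrative summary of the attacker ASV settings described above;
# key names are hypothetical, only the values come from the protocol.
attacker_asv_config = {
    "backbone": "ECAPA-TDNN",
    "conv_channels": 512,                      # convolutional channels
    "input_features": "80-dim Mel-filterbank",
    "embedding_dim": 192,
    "optimizer": "Adam",
}
```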
Table 1. Attacker Knowledge and Trial Organization
| Attacker | Knows Algorithm | ASV Training Data | Trials | Metric |
|---|---|---|---|---|
| Lazy-informed | Yes | Original LibriSpeech 360h | Anon-test vs. Orig-enroll | EER |
| Semi-informed | Yes | Anonymized LibriSpeech 360h | Anon-test vs. Anon-enroll | EER |
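The trial organization in Table 1 reduces to one switch on the enrollment side. The sketch below assumes a hypothetical data layout (dicts mapping utterance IDs to original/anonymized paths); it is not the official trial-list format:

```python
def build_trials(enrollment, trial_utts, attacker="semi-informed"):
    """Pair enrollment and trial utterances per attacker model.

    Each dict maps an utterance ID to {'orig': path, 'anon': path}
    (an illustrative layout). Test utterances are always anonymized;
    only the enrollment side differs between attacker conditions.
    """
    enroll_key = "anon" if attacker == "semi-informed" else "orig"
    return [
        (e[enroll_key], t["anon"])
        for e in enrollment.values()
        for t in trial_utts.values()
    ]
```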
4. Evaluation Metrics and Composite Ranking
The protocol defines three core, challenge-wide metrics:
- Equal Error Rate (EER)
For an ASV system outputting similarity scores $s$ compared against a decision threshold $\theta$, define
- $P_{\text{fa}}(\theta)$: the fraction of non-target (impostor) trials with $s \ge \theta$,
- $P_{\text{miss}}(\theta)$: the fraction of target trials with $s < \theta$.
The EER is the common error rate at the threshold $\theta_{\text{EER}}$ where $P_{\text{fa}}(\theta_{\text{EER}}) = P_{\text{miss}}(\theta_{\text{EER}})$. High EER indicates strong privacy.
- Word Error Rate (WER)
$\text{WER} = \dfrac{S + D + I}{N}$, where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the number of words in the reference transcript. Lower WER represents better linguistic fidelity.
- Unweighted Average Recall (UAR)
Over $K$ emotion classes, $\text{UAR} = \dfrac{1}{K}\sum_{k=1}^{K} \text{Recall}_k$, the unweighted mean of per-class recalls. Higher UAR denotes better emotion preservation.
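To make the three definitions concrete, here is a minimal reference sketch of each metric. This is an illustration of the formulas above, not the official evaluation scripts, which remain authoritative for ranking:

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """EER: error rate at the threshold where the false-alarm rate P_fa
    meets the miss rate P_miss (nearest operating point on a sweep)."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(p_fa - p_miss)))
    return (p_fa[idx] + p_miss[idx]) / 2.0


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level edit distance."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)


def uar(y_true, y_pred, num_classes: int) -> float:
    """UAR: unweighted mean of per-class recalls."""
    recalls = []
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():
            recalls.append(float((y_pred[mask] == k).mean()))
    return float(np.mean(recalls))
```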
Systems are ranked within four operating points, each defined by a minimum target EER threshold. Within each EER condition, eligible systems are ranked separately by lowest WER and highest UAR, exposing the privacy-utility trade-off envelope (Tomashenko et al., 2024, Tomashenko et al., 17 Jan 2026).
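The per-condition ranking rule can be sketched as follows; the record fields (`eer`, `wer`) are illustrative names, and the UAR ranking works the same way with the sort reversed:

```python
def rank_within_condition(systems, min_eer):
    """Keep systems that meet one operating point's minimum target EER,
    then order them by lowest WER. A UAR ranking is analogous, sorting
    by highest UAR instead."""
    eligible = [s for s in systems if s["eer"] >= min_eer]
    return sorted(eligible, key=lambda s: s["wer"])
```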
5. Streaming and Latency-Aware Configurations
The protocol supports evaluation in both offline and streaming (real-time) scenarios:
- Fixed Delay: A constant frame lookahead, yielding a fixed measured latency.
- Dynamic Delay: During training, the lookahead is sampled from a range; at inference it may be fixed or re-sampled, spanning latencies up to $440$ ms (Kuzmin et al., 20 Jan 2026).
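The dynamic-delay scheme amounts to resampling the lookahead during training so one model covers a range of inference latencies. A minimal sketch, assuming a hypothetical 20 ms frame shift (the protocol's actual frame parameters may differ):

```python
import numpy as np

FRAME_SHIFT_MS = 20  # hypothetical frame shift, for illustration only


def sample_lookahead(rng, min_frames: int, max_frames: int) -> int:
    """Dynamic-delay training: draw the lookahead uniformly per step,
    inclusive of both bounds."""
    return int(rng.integers(min_frames, max_frames + 1))


def latency_ms(lookahead_frames: int) -> int:
    """Approximate latency contributed by a given frame lookahead."""
    return lookahead_frames * FRAME_SHIFT_MS
```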
Latency, privacy, and utility trade-offs are profiled jointly. For example, under Stream-Voice-Anon, privacy (EER) remains robust across latencies, while intelligibility degrades (WER rises) as latency decreases, plateauing at moderate delays:
| Latency (ms) | EER (%) | WER (%) |
|---|---|---|
| 130 | ~47 | ~15 |
| 180 | ~47 | ~4.7 |
| 200 (baseline) | ~47 | ~4.7 |
This enables explicit comparison of real-time constraints and privacy conditions (Kuzmin et al., 20 Jan 2026).
6. Submission Protocols and Compliance Requirements
All submissions must comply with the following:
- Automatic Processing: No manual or speaker-specific adaptation is permitted at any stage.
- Waveform Output: All anonymized data must be submitted as 16 kHz, 16-bit PCM WAV files, following the directory and filename conventions prescribed by the evaluation scripts (typically under `exp/`).
- Documentation: A system description (2–6 pages) detailing architecture, hyperparameters, anonymization strategy, randomization details, and parameters/codebooks must accompany submissions.
- Randomization and Determinism: If the anonymization function is randomized, the system description must specify the randomization protocol; repeated utterance passes may produce variable outputs, but all sources of randomness must be disclosed.
- Evaluation Pipeline: The official scripts automatically generate EER, WER, UAR and summary logs. Only scores from these scripts are considered for ranking and reporting.
- Compliance Checks: The challenge organizers reserve the right to re-run submitted systems to verify results, and to disqualify systems that misuse data or violate utterance-level anonymization principles (Tomashenko et al., 2024, Kuzmin et al., 20 Jan 2026).
7. Relation to Specialized Attacker Challenges
The protocol forms the exclusive testbed for the VoicePrivacy 2024 Attacker Challenge, where participants design adversarial ASV systems to minimize EER on anonymized speech. Attacker submissions must follow the official score file format, use only permitted models and data, and report EERs separately for gender and evaluation phase. The ECAPA-TDNN semi-informed baseline and official trial organization are shared between the main anonymization and attacker protocols, ensuring commensurability of privacy assessments (Tomashenko et al., 2024).
Significance and Impact
The VoicePrivacy 2024 Challenge Protocol establishes a transparent, reproducible, and rigorous standard for evaluating utterance-level voice anonymization systems. Its combination of privacy, linguistic, and emotional preservation metrics, along with strict data usage controls and support for real-time testing, sets a high bar for both system innovation and security assessment. The protocol’s comprehensive attacker modeling represents the current state of the art in privacy threat simulation for speaker anonymization research (Tomashenko et al., 17 Jan 2026, Tomashenko et al., 2024, Tomashenko et al., 2024).