ProsodyEval: Prosodic Prominence Benchmark
- The paper introduces a benchmark for precise measurement and control of word-level prosody using multi-speaker corpora, acoustic feature extraction, and perceptual verification.
- BERT-based models achieve up to 83.2% accuracy in binary classification, demonstrating robust performance in predicting prosodic prominence.
- The benchmark facilitates TTS conditioning by integrating explicit prosodic labels with phoneme embeddings and diverse evaluation protocols.
ProsodyEval is a standardized benchmark suite for evaluating prosodic prominence and prosody diversity in text-to-speech (TTS) and NLP scenarios. It centers on the precise measurement, prediction, and control of prosodic adaptation at the word and utterance levels, using both large annotated corpora and perceptual experiments. The benchmark features multi-level acoustic annotation, word-level prominence quantization, contextual modeling, and robust task protocols for model training, synthesis, and human evaluation. It serves as the reference resource for prosodic prominence prediction and for quantifying diversity of prosody in zero-shot and controllable TTS, incorporating both acoustic and perceptual metrics.
1. Corpus Design and Annotation Protocol
ProsodyEval datasets use extensive multi-speaker speech corpora with rigorous acoustic and temporal alignment, leveraging forced-aligners (Montreal Forced Aligner) for precise segmentation at the word and phoneme levels. Major datasets include:
- LibriTTS "clean" partitions: 262.5 hours, 1,230 speakers, over 2.8 million word tokens (Talman et al., 2019).
- Multi-novel, multi-speaker readings from Librivox and Blizzard 2013, >40k utterances per corpus (Stephenson et al., 2022).
- Synthetic utterance sets: 1,000 samples from 7 mainstream zero-shot TTS systems (Yang et al., 24 Sep 2025).
Automatic word-level prominence is annotated by composite acoustic feature extraction (F0, energy, duration) and hierarchical wavelet-based quantization:
- F0, energy, duration are extracted frame-wise, smoothed, interpolated (z-normalized), and fused (canonical weights: F0=1.0, energy=0.5, duration=1.0) (Talman et al., 2019).
- Continuous wavelet transform (CWT) operationalizes prominence as the maxima across wavelet scales, yielding a continuous score for each word.
- Discretization uses tertile or corpus-specific thresholds, typically producing three classes:
- 0 = non-prominent
- 1 = somewhat prominent
- 2 = very prominent
- In test sets, manual verification with native raters provides quality control, achieving Cohen’s κ up to 0.90 (Stephenson et al., 2022).
2. Prosodic Prominence Prediction Task
ProsodyEval frames prominence prediction as a sequence-labeling or token-classification problem, evaluating for both binary and three-class settings:
- Input: Sentence $w_1, w_2, \dots, w_n$ of $n$ word tokens.
- Output: Label vector $y = (y_1, \dots, y_n)$, with $y_i \in \{0, 1\}$ (binary) or $y_i \in \{0, 1, 2\}$ (ternary) (Talman et al., 2019).
Primary metrics:
- Classification accuracy and macro-averaged precision, recall, and $F_1$.
Benchmark results on the large LibriTTS test set (Talman et al., 2019):

| Model                  | 2-way Accuracy | 3-way Accuracy |
|------------------------|----------------|----------------|
| BERT-base (fine-tuned) | 83.2%          | 68.6%          |
| BiLSTM (3×600D)        | 82.1%          | 66.4%          |
| CRF (MarMoT)           | 81.8%          | 66.4%          |
| SVM+GloVe              | 80.8%          | 65.4%          |
| Majority per word      | 80.2%          | 62.4%          |
BERT-based models outperform other baselines, maintaining performance with limited training data (accuracy drop ≤ 1% with only 10% of the data). Label 2 (“very prominent”) is detected with highest recall by BERT.
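The metrics above are standard; a from-scratch sketch of accuracy and macro-averaged precision/recall/$F_1$ for the ternary setting (illustrative code, not the benchmark's evaluation script):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of word tokens with the correct prominence label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_prf(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class precision/recall/F1, averaged with equal class weight."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p, r, f = [], [], []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        p.append(prec)
        r.append(rec)
        f.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(p)), float(np.mean(r)), float(np.mean(f))
```

Macro averaging matters here because class 0 dominates (the majority baseline already reaches 80.2% accuracy), so per-class scores expose performance on the rarer prominent labels.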
3. TTS Conditioning and Prominence Control
ProsodyEval provides protocols for evaluating controllability of prosodic prominence in synthesis:
- TTS models: FastSpeech 2 (non-auto-regressive transformer), Parallel WaveGAN vocoder (Stephenson et al., 2022).
- Conditioning: Word-level prominence labels (usually quantized as {p0, p1, p2}) are embedded and concatenated with phoneme embeddings, optionally augmented with clustered speaker embeddings.
- Controllability assessment: Listening tests with native raters score prominence on a discrete scale; e.g., possessive pronouns synthesized with c ∈ {p0, p1, p2} receive median prominence rankings of 0, 0.5, and 1, respectively.
- Synthesis protocol: Controlled perturbation of single word-labels with ground-truth context; perceptual ranking and pairwise comparison collect subjective and objective control data.
A plausible implication is that such explicit conditioning enables partial and context-sensitive control of prosodic focus, though generalization to subjective pronoun types remains limited (Stephenson et al., 2022).
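The conditioning step above reduces to a per-word embedding lookup broadcast over the word's phonemes. A hypothetical sketch (table sizes, dimensions, and function names are illustrative, not from the papers' implementations):

```python
import numpy as np

rng = np.random.default_rng(0)
PHONE_DIM, PROM_DIM = 8, 4
phone_table = rng.normal(size=(40, PHONE_DIM))  # 40 phoneme types (assumed inventory)
prom_table = rng.normal(size=(3, PROM_DIM))     # one vector per label p0, p1, p2

def condition(phoneme_ids, word_spans, word_labels):
    """Concatenate each phoneme embedding with its word's prominence embedding.

    phoneme_ids: flat phoneme index sequence
    word_spans:  (start, end) phoneme indices per word
    word_labels: quantized prominence class per word (0, 1, or 2)
    """
    out = []
    for (a, b), lab in zip(word_spans, word_labels):
        for pid in phoneme_ids[a:b]:
            out.append(np.concatenate([phone_table[pid], prom_table[lab]]))
    return np.stack(out)

# Two words of two phonemes each, labeled p0 and p2
seq = condition([3, 7, 7, 12], [(0, 2), (2, 4)], [0, 2])
```

In a real system this sequence would feed the FastSpeech 2 encoder, optionally with speaker embeddings appended in the same way.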
4. Human-Centric Prosody Evaluation
Perceptual evaluation in ProsodyEval distinguishes between overall quality (MOS) and fine-grained prominence judgment:
- Prosody Mean Opinion Score (PMOS): 1–5 Likert scale, aggregated over 2,000 ratings by expert researchers (Yang et al., 24 Sep 2025).
- Rapid Prosody Transcription (RPT): Listeners mark perceived error locations at the word level in real time; aggregating marks across listeners yields per-word error densities.
Words at prosodic boundaries, especially preceding punctuation, consistently attract error markings, revealing local failures in synthetic prominence (Gutierrez et al., 2021).
Mapping RPT error metrics to PMOS yields a strong negative correlation across system means, substantiating that localized error annotation provides more discriminative evidence than MOS alone (Gutierrez et al., 2021).
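The aggregation described above amounts to averaging a listeners-by-words mark matrix and correlating the result against opinion scores. A minimal sketch under those assumed shapes:

```python
import numpy as np

def error_density(marks):
    """marks: (n_listeners, n_words) binary matrix -> per-word mark rate."""
    return np.asarray(marks, dtype=float).mean(axis=0)

def pearson_r(x, y):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Three listeners marking a three-word utterance: words 2 and 3 attract errors
marks = [[0, 1, 1],
         [0, 1, 0],
         [0, 0, 1]]
density = error_density(marks)
```

At the system level, one would correlate each system's mean error density against its mean PMOS; a strongly negative $r$ indicates that densely marked systems are also the ones rated poorly.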
5. Diversity Metrics and Benchmark Comparisons
ProsodyEval supports comparison of TTS modeling paradigms for prosody diversity using acoustic and semantic token-based metrics:
- Baselines: log F0 RMSE and Mel-cepstral distortion (MCD) capture only partial acoustic variation.
- Discretized Speech Weighted Edit Distance (DS-WED): Measures semantic token sequence differences via weighted edit distance after k-means clustering of HuBERT- or WavLM-based speech embeddings (correlation peaks at layer 8) (Yang et al., 24 Sep 2025). DS-WED outperforms acoustic baselines in correlation with PMOS:

| Metric      | Pearson $r$ |
|-------------|-------------|
| log F0 RMSE | 0.30        |
| MCD         | 0.66        |
| DS-WED      | 0.77        |

(Yang et al., 24 Sep 2025)
DS-WED reveals AR and masked generative modeling (MGM) systems as substantially more prosodically diverse than NAR flow-matching systems, highlighting paradigm effects. Duration perturbation and direct preference optimization modulate prosodic diversity (DP: +14% to +27%; DPO: –3% to –19%).
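The core of DS-WED is a weighted edit distance over discretized token sequences. A sketch of that dynamic program, with placeholder operation weights (the paper tunes its weights heuristically) and an assumed length normalization:

```python
import numpy as np

W_SUB, W_INS, W_DEL = 1.0, 1.0, 1.0  # placeholder operation weights

def weighted_edit_distance(a, b):
    """Weighted Levenshtein distance over two token sequences, length-normalized."""
    n, m = len(a), len(b)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1) * W_DEL
    d[0, :] = np.arange(m + 1) * W_INS
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1, j - 1] + (0.0 if a[i - 1] == b[j - 1] else W_SUB)
            d[i, j] = min(sub, d[i - 1, j] + W_DEL, d[i, j - 1] + W_INS)
    # Normalizing by the longer sequence (an assumption here) makes scores
    # comparable across utterances of different lengths
    return float(d[n, m]) / max(n, m, 1)
```

In the full metric, `a` and `b` would be k-means cluster IDs of HuBERT or WavLM frame embeddings from two renditions of the same text, so larger distances indicate more divergent prosody.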
6. Limitations, Variability, and Future Directions
Scope and consistency remain active challenges:
- Corpora are currently English-only; cross-lingual generalization is untested (Yang et al., 24 Sep 2025).
- Inter-annotator agreement for word-level error marking is modest (around $0.3$), suggesting variable subjective interpretation (Gutierrez et al., 2021).
- DS-WED tokenization may underspecify microprosody; operation weights are tuned heuristically (Yang et al., 24 Sep 2025).
Future refinements recommended:
- Extend benchmarks to additional languages and spontaneous speech data.
- Augment perceptual annotations with stress, boundary tones, rhythm, and multi-dimensional prosodic metadata.
- Integrate acoustic and semantic metrics for comprehensive evaluation.
- Release full error-density maps, standardized annotation protocols, and open-source benchmark platforms.
7. Significance and Benchmarking Impact
ProsodyEval establishes the reference protocol for evaluating prosodic prominence both in prediction and synthesis:
- Validated datasets with robust multi-speaker, multi-text coverage facilitate model training and cross-system comparison (Talman et al., 2019, Stephenson et al., 2022).
- Human benchmarking (PMOS and RPT) enables system-level and token-level calibration against perception (Yang et al., 24 Sep 2025, Gutierrez et al., 2021).
- DS-WED metric is the first to consistently surpass acoustic baselines for correlation with perceptual diversity (Yang et al., 24 Sep 2025).
- The combined framework supports rigorous quantification of prominence and prosody diversity, standardizing evaluation for both research and applied TTS/NLP deployments.