TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Published 24 Jun 2025 in cs.SD, cs.CL, and eess.AS | (2506.19441v1)

Abstract: Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.

Summary

The paper proposes TTSDS2, a novel metric that uses the 2-Wasserstein distance to measure distribution similarity between synthetic and real speech.
It combines four perceptual factors (generic, speaker, prosody, and intelligibility) and is accompanied by a benchmark covering 14 languages.
The metric achieves an average Spearman correlation of 0.67 with human evaluations, making it a robust objective complement to subjective MOS tests.

Introduction

Evaluating Text to Speech (TTS) systems is difficult, especially now that some systems generate speech that is indistinguishable from human speech. Subjective metrics such as the Mean Opinion Score (MOS) are widely used but are not easily comparable between studies, while objective metrics are rarely validated against subjective ones. This study introduces Text to Speech Distribution Score 2 (TTSDS2), a more robust version of TTSDS that is designed to maintain a high correlation with human evaluations across diverse domains and languages.

Figure 1: Overview of TTSDS2, utilizing a diverse collection of public and academic datasets for validation across multiple domains.

TTSDS2 Methodology and Features

TTSDS2 measures how similar the distribution of synthetic speech is to that of real speech, rather than comparing individual samples. This suits the one-to-many nature of TTS, where a single text has many valid spoken realizations. The score comprises four perceptual factors: Generic, Speaker, Prosody, and Intelligibility. Each factor is scored from one or more features: self-supervised learning (SSL) embeddings for the generic factor, speaker-identity features for the speaker factor, pitch and rhythm for prosody, and ASR-derived features for intelligibility.
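The factor structure described above can be illustrated with a minimal sketch. It assumes that each feature already has a score on a 0 to 100 scale, that a factor score is the unweighted mean of its feature scores, and that the overall score is the unweighted mean of the four factor scores. This summary does not state the aggregation rule, so the averaging scheme, the feature names, and the numbers below are all hypothetical.

```python
from statistics import mean

# Hypothetical per-feature scores (0-100). Feature names and values are
# illustrative placeholders, not results from the paper.
feature_scores = {
    "generic": {"ssl_embedding_a": 92.0, "ssl_embedding_b": 88.0},
    "speaker": {"speaker_identity": 85.0},
    "prosody": {"pitch": 90.0, "rhythm": 70.0},
    "intelligibility": {"asr_feature": 95.0},
}


def factor_scores(scores):
    """Average the feature scores within each factor (assumed rule)."""
    return {factor: mean(feats.values()) for factor, feats in scores.items()}


def overall_score(scores):
    """Average the factor scores into one number (assumed rule)."""
    return mean(factor_scores(scores).values())


if __name__ == "__main__":
    print(factor_scores(feature_scores))
    print(overall_score(feature_scores))
```

Averaging within a factor before averaging across factors keeps a factor with many features from dominating the overall score.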

Figure 2 illustrates this for one feature, the fundamental frequency $F_0$, by comparing its distribution across ground-truth, synthetic, and noise datasets.

Figure 2: Distribution of $F_0$ in TTSDS for ground-truth, synthetic, and noise datasets.

The distance between feature distributions is computed with the 2-Wasserstein distance, which compares whole distributions rather than individual samples.
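As a rough illustration of the 2-Wasserstein distance, the sketch below implements two standard estimators. For one-dimensional features with equally many samples, the distance is the root mean squared difference between the sorted samples. For multivariate features modelled as Gaussians, there is a closed form: $W_2^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\right)$. Whether TTSDS2 uses exactly these estimators is not stated in this summary; the function names are hypothetical.

```python
import numpy as np


def w2_1d(x, y):
    """Empirical 2-Wasserstein distance between equal-sized 1-D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1-D the optimal transport plan matches sorted samples pairwise.
    return float(np.sqrt(np.mean((x - y) ** 2)))


def _psd_sqrt(matrix):
    """Square root of a symmetric positive semi-definite matrix."""
    sym = (matrix + matrix.T) / 2.0  # remove floating-point asymmetry
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def w2_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form 2-Wasserstein distance between two Gaussians."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    root1 = _psd_sqrt(cov1)
    cross = _psd_sqrt(root1 @ cov2 @ root1)
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))


if __name__ == "__main__":
    # Shifting every sample by 1 moves the distribution by exactly 1.
    print(w2_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
    # Equal covariances: the distance reduces to the distance of the means.
    print(w2_gaussian([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2)))
```

A distance of zero means the two feature distributions coincide under the chosen estimator; larger values mean the synthetic features lie further from the real ones.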

Multilingual Application and Continuous Evaluation

To make TTSDS2 applicable beyond English, the methodology was extended to 14 languages and paired with a benchmark that is updated quarterly, so that results keep pace with new TTS systems and data. A pipeline continually recreates the multilingual test dataset, which reduces the risk of data leakage and keeps the evaluation closer to real-world conditions.

Figure 3: TTSDS2 scores across 14 languages.

Correlation with Human Evaluations

TTSDS2 was compared against human listening tests and achieves a Spearman correlation above 0.50 for every domain and subjective score evaluated; it is the only one of the 16 compared metrics to do so. Its average correlation is about 0.67. This suggests that TTSDS2 tracks human judgments of TTS output quality consistently across domains.
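For readers unfamiliar with the statistic, the following sketch shows how a Spearman correlation between an objective metric and MOS could be computed at the system level: both lists are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is returned. The system scores are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical scores for five TTS systems (illustrative only).
metric_scores = [91.2, 88.5, 79.0, 93.1, 60.4]
mos_scores = [4.1, 4.3, 3.6, 4.4, 2.9]

if __name__ == "__main__":
    print(spearman(metric_scores, mos_scores))
```

Because only the rank order matters, the metric and MOS do not need to share a scale; a correlation of 1.0 means the metric orders the systems exactly as listeners do.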

Figure 4: Correlation of three representative objective metrics with human MOS across the four datasets.

Limitations and Future Directions

Despite its advantages, TTSDS2 is computationally intensive compared with simpler metrics, which may limit its use. It also does not fully replace subjective listening tests, although it serves as a reliable proxy for them. Future work may reduce the computational overhead and explore whether TTSDS2 can detect more complex failure modes of TTS systems, such as nuanced contextual errors.

Conclusion

TTSDS2 is a step forward in the objective evaluation of TTS systems: it correlates more consistently with human subjective evaluation than the other compared metrics across a wide range of domains and languages. Because the benchmarking framework is continually updated and expanded, TTSDS2 can support the development and refinement of future TTS systems as expectations for their linguistic and perceptual quality rise.
