
Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track

Published 19 Dec 2025 in cs.SD and cs.AI | (2512.17293v1)

Abstract: This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text–speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.

Summary

  • The paper demonstrates that integrating self-purifying flow matching into a lightweight TTS architecture effectively mitigates the adverse effects of noisy text-speech alignments.
  • It achieves superior intelligibility with WER as low as 3.26% and maintains robust perceptual quality across multiple evaluation metrics in challenging real-world conditions.
  • The study evidences that incorporating explicit noise-handling mechanisms enables scalable and resilient TTS deployment without needing large, pristine datasets.

Robust TTS Training via Self-Purifying Flow Matching: Summary and Analysis

Background and Motivation

Text-to-speech (TTS) synthesis research has conventionally operated in controlled domains with curated, clean datasets, enabling reliable text–speech alignment and natural output. However, these ideal conditions constrain the scalability and applicability of TTS systems in diverse real-world scenarios, where background noise, device variability, reverberation, and annotation inconsistencies are prevalent. The WildSpoof Challenge 2026 provides a benchmark for evaluating models capable of robust adaptation to such in-the-wild data, emphasizing intelligibility, perceptual quality, and speaker faithfulness under adverse conditions.

Architectural Foundation: SupertonicTTS

The core architecture for this submission is SupertonicTTS (Kim et al., 29 Mar 2025), a lightweight, open-weight TTS framework composed of three principal modules: a speech autoencoder for compact continuous latent representation, a flow matching text-to-latent generator, and an utterance-level duration predictor utilizing cross-attention mechanisms. This design aims for efficient synthesis with reduced computational overhead, providing a suitable foundation for large-scale, resource-constrained deployment in noisy environments.
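To make the flow matching text-to-latent generator concrete, here is a minimal sketch of a conditional flow matching training loss. This assumes a linear (rectified) interpolation path and a hypothetical `model` callable that predicts the velocity field; the paper does not spell out these implementation details, so treat this as an illustration of the objective, not the authors' code.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """Per-sample conditional flow matching loss (illustrative sketch).

    model : hypothetical callable (x_t, t, cond) -> predicted velocity.
    x1    : target latents from the speech autoencoder, shape (B, D).
    cond  : text conditioning (None for unconditional training).
    """
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))   # noise endpoint of the flow
    t = rng.uniform(size=(B, 1))       # random times in [0, 1]
    xt = (1.0 - t) * x0 + t * x1       # point on the linear path
    v_target = x1 - x0                 # constant velocity along that path
    v_pred = model(xt, t, cond)        # model's velocity prediction
    return np.mean((v_pred - v_target) ** 2, axis=1)  # per-sample loss
```

Returning the loss per sample (rather than averaged) is what later allows SPFM to compare conditional and unconditional signals on each text–speech pair.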

Self-Purifying Flow Matching (SPFM)

To address the critical issue of label noise and misalignment in in-the-wild speech, the authors integrate Self-Purifying Flow Matching (SPFM) (Kim et al., 23 Sep 2025) into the SupertonicTTS training regime. SPFM operates within the classifier-free guidance paradigm for conditional flow matching, leveraging the model’s conditional and unconditional loss signals to automatically detect unreliable text–speech pairs. Specifically:

  • Conditional Loss (L_cond): Measures alignment when text conditioning is present.
  • Unconditional Loss (L_uncond): Assesses reconstruction without text conditioning.

If L_cond > L_uncond, the model interprets the pair as potentially mislabeled and re-routes it to unconditional training, preventing error propagation from questionable labels while still retaining the acoustic information in noisy samples. SPFM is activated only after an initial warm-up phase, to avoid unstable detections early in training.
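The routing rule above can be sketched as a single training step. This is an illustrative reconstruction, not the authors' code: `model` is a hypothetical velocity predictor where `cond=None` denotes the unconditional (classifier-free) branch, and both losses share the same noise and time draws so the per-sample comparison is fair.

```python
import numpy as np

def spfm_step(model, x1, cond, rng, warmed_up=True):
    """One SPFM training step (illustrative sketch).

    model : hypothetical callable (x_t, t, cond) -> predicted velocity.
    x1    : target speech latents, shape (B, D).
    cond  : text conditioning for the batch.
    """
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))      # noise endpoint of the flow
    t = rng.uniform(size=(B, 1))          # random interpolation times
    xt = (1.0 - t) * x0 + t * x1          # point on the linear path
    v_target = x1 - x0                    # flow matching regression target
    # per-sample losses, sharing noise and time for a fair comparison
    l_cond = np.mean((model(xt, t, cond) - v_target) ** 2, axis=1)
    l_uncond = np.mean((model(xt, t, None) - v_target) ** 2, axis=1)
    if not warmed_up:                     # plain conditional training at first
        return float(l_cond.mean())
    suspicious = l_cond > l_uncond        # text likely mismatched to the audio
    # suspicious pairs still contribute acoustics via the unconditional loss;
    # trusted pairs train conditionally as usual
    return float(np.where(suspicious, l_uncond, l_cond).mean())
```

The key design point is that no sample is discarded: a suspicious pair loses only its text supervision, not its contribution to the acoustic model.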

Experimental Setup and Results

Training Protocol

The SupertonicTTS model was fine-tuned on the WildSpoof-provided TITW-easy and TITW-hard subsets, employing a balanced sampling strategy and 10,000 training iterations on four high-performance GPUs. SPFM was applied during fine-tuning to maximize reliable conditional generation and resilient unconditional learning.

Evaluation Metrics

The system was evaluated on four subsets differentiating speaker and text familiarity (KS/US × KT/UT), measuring:

  • Intelligibility: Word Error Rate (WER), Character Error Rate (CER)
  • Perceptual Quality: UTMOS, DNSMOS
  • Speaker Faithfulness: Speaker similarity (Spk-sim), Mel Cepstral Distance (MCD; for subsets with original recordings)
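For reference, the primary intelligibility metric, WER, is the word-level edit distance between an ASR transcript of the synthesized speech and the reference text, normalized by reference length. A minimal implementation (text normalization and the ASR system used for scoring are outside this sketch):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

CER is computed the same way over characters instead of words.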

Performance Highlights

On internal validation sets, the model demonstrated superior intelligibility (WER as low as 3.26% for KSKT), stable perceptual scores (UTMOS 3.57–4.03; DNSMOS 2.96–3.19), and consistent speaker similarity, with low spectral distortion (MCD of 8.59 dB for KSKT). Official challenge results confirmed these findings with the lowest WER for both seen (5.50%) and unseen (5.88%) speaker conditions among all teams. The system also achieved the highest UTMOS for unseen speakers, though ranking second overall in perceptual metrics by a small margin.

Claims, Implications, and Comparisons

The authors assert that:

  • SPFM substantially mitigates the degradation from noisy text–speech alignment, directly contributing to improved pronunciation accuracy and consistent perceptual quality.
  • A compact architecture augmented by explicit noise-handling mechanisms achieves competitive performance without requiring large-scale, high-fidelity datasets or ensemble methods.

These claims are supported by strong quantitative results, demonstrating that lightweight open-weight models combined with dynamic data purification can rival, and in intelligibility surpass, more complex or diffusion-based approaches under unconstrained conditions.

Practical and Theoretical Impact

Practically, the approach offers:

  • Enhanced adaptability for TTS deployments in real-world settings, where training data curation is infeasible.
  • Reduced data engineering costs by enabling robust learning from mixed-quality corpora.

Theoretically, SPFM provides a probabilistic, model-driven strategy for selectively routing supervision in conditional generative training, potentially generalizable to modalities and architectures beyond TTS (e.g., conditional generative modeling in vision, translation, or multimodal tasks).

Future Directions

Possible extensions include:

  • Integrating more sophisticated confidence estimation schemes for label reliability.
  • Combining SPFM with contrastive or self-supervised objectives to further leverage noisy data.
  • Scaling to multilingual, cross-domain adaptation and pursuing real-time synthesis for edge deployment.

Conclusion

This study demonstrates that noise-aware training via Self-Purifying Flow Matching, layered on an efficient SupertonicTTS backbone, yields state-of-the-art intelligibility and robust perceptual quality in challenging real-world conditions. The method presents a compelling path forward for scalable, resilient TTS synthesis well-suited to unconstrained, heterogeneous speech datasets.

Reference: "Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track" (2512.17293).
