- The paper presents a framework that uses a native TTS model as a synthetic ground-truth generator, paired with a two-stage training strategy, for improved accent conversion.
- It builds on the VITS framework for latent alignment, mitigating non-native pronunciation errors while preserving speaker identity.
- Evaluations show notable reductions in word error rate and gains in accent accuracy over baseline methods.
Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS
The paper addresses accent conversion (AC): making non-native speech more closely resemble native speech in pronunciation while retaining the speaker's identity. It advances prior AC methods by combining pronunciation correction with knowledge distillation in a framework that uses a native Text-to-Speech (TTS) model as a synthetic ground-truth generator.
The approach is built on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). This choice exploits VITS's end-to-end architecture, which removes the need for a separate vocoder and aligns text with latent acoustic representations efficiently via monotonic alignment search. The framework improves both pronunciation and accent conversion through a two-stage training strategy.
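To make the alignment step concrete, below is a minimal NumPy sketch of the dynamic program behind monotonic alignment search. This is our illustration, not the paper's code (VITS ships an optimized Cython implementation); the variable names are ours.

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """Find the most likely monotonic alignment between text tokens and
    acoustic frames via dynamic programming, as used in VITS.

    log_p: (n_tokens, n_frames) matrix of log-likelihoods of each frame's
    latent under each token's prior. Returns a 0/1 matrix of the same
    shape with exactly one active token per frame.
    """
    n_tokens, n_frames = log_p.shape
    # Q[i, j]: best cumulative score aligning tokens 0..i to frames 0..j.
    Q = np.full((n_tokens, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = Q[i, j - 1]                               # token i also covers frame j
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move on to the next token
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the last token/frame to recover the hard alignment.
    attn = np.zeros_like(log_p)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        attn[i, j] = 1.0
        if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return attn
```

Because the alignment is constrained to be monotonic, the search is a simple two-choice recurrence per cell, which is what makes it cheap enough to run inside every training step.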
Methodological Contributions
- Native TTS and Pre-trained AC Model: In the first training stage, the AC model is pre-trained jointly with a native TTS system under the VITS framework, capturing the distribution of native audio through the native-like synthetic ground-truth produced by the TTS.
- Synthetic Ground-Truth Generation: The native TTS model generates idealized ground-truth audio for non-native inputs by aligning latent variables while maintaining the speaker's identity and prosody, thereby correcting pronunciation errors.
- Fine-tuning with Ground-Truth Data and Knowledge Distillation: The AC model is then fine-tuned on the generated ground-truth audio with native knowledge distillation, focusing on accent-independent linguistic features without degrading speaker identity (a hypothetical sketch of this stage follows the list).
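The paper's exact objectives are not reproduced here; the following PyTorch sketch only illustrates how the second stage could combine a reconstruction loss against the synthetic ground-truth with a feature-level distillation loss. All module names (`student_ac`, `teacher_tts`), signatures, and loss weights are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_ac, teacher_tts, non_native_wav, text, spk_emb,
                      alpha: float = 1.0, beta: float = 0.5) -> torch.Tensor:
    """One illustrative stage-2 fine-tuning step.

    teacher_tts: stage-1-pretrained native TTS; given the transcript and a
        speaker embedding, it returns a synthetic ground-truth mel and its
        intermediate (accent-independent) linguistic features.
    student_ac: the accent conversion model being fine-tuned; it returns
        the converted mel and its own intermediate features.
    Assumes both feature sequences are time-aligned (e.g., via the shared
    monotonic alignment) so they can be compared frame by frame.
    """
    with torch.no_grad():
        # Teacher provides the synthetic target and the distillation features.
        gt_mel, teacher_feat = teacher_tts(text, spk_emb)

    conv_mel, student_feat = student_ac(non_native_wav, spk_emb)

    loss_recon = F.l1_loss(conv_mel, gt_mel)               # match synthetic ground-truth
    loss_distill = F.mse_loss(student_feat, teacher_feat)  # native knowledge distillation
    loss = alpha * loss_recon + beta * loss_distill
    loss.backward()
    return loss.detach()
```

The key design point is that the teacher is frozen: the student is pulled toward native pronunciation through the teacher's features and synthetic targets, while speaker identity is carried by the conditioning embedding rather than the loss.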
The authors evaluate with both subjective (Nativeness, Sim-MOS) and objective (WER, Accent Accuracy, SECS) metrics. These evaluations show that the system produces native-like pronunciation, with clear WER reductions over baseline methods while maintaining high speaker similarity; two of the objective metrics are sketched below.
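For concreteness, here is a hedged sketch of how WER and SECS are typically computed. Whisper, jiwer, and Resemblyzer are illustrative stand-ins; the summary does not specify which ASR system or speaker encoder the authors actually used.

```python
# pip install openai-whisper jiwer resemblyzer
import numpy as np
import whisper
from jiwer import wer
from resemblyzer import VoiceEncoder, preprocess_wav

asr = whisper.load_model("base.en")
encoder = VoiceEncoder()

def word_error_rate(wav_path: str, reference_text: str) -> float:
    """WER of the converted speech against the reference transcript;
    lower means more intelligible pronunciation."""
    hypothesis = asr.transcribe(wav_path)["text"]
    return wer(reference_text.lower(), hypothesis.lower())

def secs(source_wav_path: str, converted_wav_path: str) -> float:
    """Speaker Embedding Cosine Similarity between the source speaker and
    the converted utterance; higher means identity is better preserved."""
    e1 = encoder.embed_utterance(preprocess_wav(source_wav_path))
    e2 = encoder.embed_utterance(preprocess_wav(converted_wav_path))
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

Together the two scores capture the central trade-off in AC: WER measures how native-like the pronunciation has become, while SECS checks that the conversion has not erased the speaker's identity.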
Significance and Implications
The proposed approach addresses the scarcity of parallel native/non-native corpora and compensates for non-native pronunciation errors. By deriving accent-independent linguistic representations from a native TTS model, the method trains the AC model to improve speech intelligibility and perceived nativeness without compromising speaker identity.
Future Directions
This research opens avenues for further advancement in the field of AC and TTS. Future work could explore aspects such as emotional tone preservation, prosodic variations, and integrating diverse linguistic features that could further enhance the naturalness and expressiveness of converted speech.
In summary, this paper presents a substantial advancement in the field of accent conversion and pronunciation enhancement, illustrating a pathway forward for researchers looking to refine speech synthesis systems. The methodological robustness and effectiveness of the proposed framework hold promise for significant practical applications in multilingual communication systems.