- The paper presents a framework that uses a native TTS model as a synthetic ground-truth generator, paired with a two-stage training strategy, for improved accent conversion.
- It builds on the VITS framework for latent alignment, mitigating non-native pronunciation errors while preserving speaker identity.
- Evaluations show notable reductions in word error rate and gains in accent accuracy over baseline methods.
Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS
The paper addresses accent conversion (AC): making non-native speech more closely resemble native speech in pronunciation while retaining the speaker's identity. It advances prior AC methods by combining pronunciation correction with knowledge distillation in a framework that uses a native Text-to-Speech (TTS) model as a synthetic ground-truth generator.
The approach is built on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). This choice exploits VITS's end-to-end architecture, which removes the need for a separate vocoder and aligns text with latent acoustic representations efficiently via monotonic alignment search. The framework improves both pronunciation and accent conversion through a two-stage training strategy.
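To make the alignment step concrete, below is a minimal NumPy sketch of the dynamic program behind monotonic alignment search. This is our illustration, not the paper's code (VITS ships an optimized Cython implementation); the variable names are ours.

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """Find the most likely monotonic alignment between text tokens and
    acoustic frames via dynamic programming, as used in VITS.

    log_p: (n_tokens, n_frames) matrix of log-likelihoods of each frame's
    latent under each token's prior. Returns a 0/1 matrix of the same
    shape with exactly one active token per frame.
    """
    n_tokens, n_frames = log_p.shape
    # Q[i, j]: best cumulative score aligning tokens 0..i to frames 0..j.
    Q = np.full((n_tokens, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = Q[i, j - 1]                               # token i also covers frame j
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move on to the next token
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the last token/frame to recover the hard alignment.
    attn = np.zeros_like(log_p)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        attn[i, j] = 1.0
        if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return attn
```

Because the alignment is constrained to be monotonic, the search is a simple two-choice recurrence per cell, which is what makes it cheap enough to run inside every training step.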
Methodological Contributions
- Native TTS and Pre-trained AC Model: In the first training stage, the AC model is pre-trained jointly with a native TTS system under the VITS framework, capturing the distribution of native audio through the native-like synthetic ground-truth produced by the TTS.
- Synthetic Ground-Truth Generation: The native TTS model generates idealized ground-truth audio for non-native inputs by aligning latent variables while maintaining the speaker's identity and prosody, thereby correcting pronunciation errors.
- Fine-tuning with Ground-Truth Data and Knowledge Distillation: The AC model is then fine-tuned on the generated ground-truth audio with native knowledge distillation, focusing on accent-independent linguistic features without degrading speaker identity (a hypothetical sketch of this stage follows the list).
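The paper's exact objectives are not reproduced here; the following PyTorch sketch only illustrates how the second stage could combine a reconstruction loss against the synthetic ground-truth with a feature-level distillation loss. All module names (`student_ac`, `teacher_tts`), signatures, and loss weights are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_ac, teacher_tts, non_native_wav, text, spk_emb,
                      alpha: float = 1.0, beta: float = 0.5) -> torch.Tensor:
    """One illustrative stage-2 fine-tuning step.

    teacher_tts: stage-1-pretrained native TTS; given the transcript and a
        speaker embedding, it returns a synthetic ground-truth mel and its
        intermediate (accent-independent) linguistic features.
    student_ac: the accent conversion model being fine-tuned; it returns
        the converted mel and its own intermediate features.
    Assumes both feature sequences are time-aligned (e.g., via the shared
    monotonic alignment) so they can be compared frame by frame.
    """
    with torch.no_grad():
        # Teacher provides the synthetic target and the distillation features.
        gt_mel, teacher_feat = teacher_tts(text, spk_emb)

    conv_mel, student_feat = student_ac(non_native_wav, spk_emb)

    loss_recon = F.l1_loss(conv_mel, gt_mel)               # match synthetic ground-truth
    loss_distill = F.mse_loss(student_feat, teacher_feat)  # native knowledge distillation
    loss = alpha * loss_recon + beta * loss_distill
    loss.backward()
    return loss.detach()
```

The key design point is that the teacher is frozen: the student is pulled toward native pronunciation through the teacher's features and synthetic targets, while speaker identity is carried by the conditioning embedding rather than the loss.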
The authors evaluate with both subjective (Nativeness, Sim-MOS) and objective (WER, Accent Accuracy, SECS) metrics. These evaluations show that the system produces native-like pronunciation, with clear WER reductions over baseline methods while maintaining high speaker similarity; two of the objective metrics are sketched below.
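For concreteness, here is a hedged sketch of how WER and SECS are typically computed. Whisper, jiwer, and Resemblyzer are illustrative stand-ins; the summary does not specify which ASR system or speaker encoder the authors actually used.

```python
# pip install openai-whisper jiwer resemblyzer
import numpy as np
import whisper
from jiwer import wer
from resemblyzer import VoiceEncoder, preprocess_wav

asr = whisper.load_model("base.en")
encoder = VoiceEncoder()

def word_error_rate(wav_path: str, reference_text: str) -> float:
    """WER of the converted speech against the reference transcript;
    lower means more intelligible pronunciation."""
    hypothesis = asr.transcribe(wav_path)["text"]
    return wer(reference_text.lower(), hypothesis.lower())

def secs(source_wav_path: str, converted_wav_path: str) -> float:
    """Speaker Embedding Cosine Similarity between the source speaker and
    the converted utterance; higher means identity is better preserved."""
    e1 = encoder.embed_utterance(preprocess_wav(source_wav_path))
    e2 = encoder.embed_utterance(preprocess_wav(converted_wav_path))
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

Together the two scores capture the central trade-off in AC: WER measures how native-like the pronunciation has become, while SECS checks that the conversion has not erased the speaker's identity.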
Significance and Implications
The proposed approach addresses the scarcity of parallel native/non-native corpora and compensates for non-native pronunciation errors. By deriving accent-independent linguistic representations from a native TTS model, the method trains the AC model to improve speech intelligibility and perceived nativeness without compromising speaker identity.
Future Directions
This research opens avenues for further advancement in the field of AC and TTS. Future work could explore aspects such as emotional tone preservation, prosodic variations, and integrating diverse linguistic features that could further enhance the naturalness and expressiveness of converted speech.
In summary, this paper presents a substantial advancement in the field of accent conversion and pronunciation enhancement, illustrating a pathway forward for researchers looking to refine speech synthesis systems. The methodological robustness and effectiveness of the proposed framework hold promise for significant practical applications in multilingual communication systems.