Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Published 7 Mar 2023 in cs.CL, cs.AI, cs.SD, and eess.AS | (2303.03926v1)

Abstract: We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}.

Citations (148)

Summary

  • The paper introduces VALL-E X, a cross-lingual neural codec model that achieves zero-shot TTS and S2ST while preserving the original speaker’s voice.
  • It leverages both autoregressive and non-autoregressive models trained on large-scale multilingual datasets like LibriLight and WenetSpeech.
  • Experimental results show significant improvements in speaker similarity, reduced WER, and enhanced speech naturalness compared to state-of-the-art methods.

Cross-Lingual Neural Codec Language Modeling for Speech Synthesis

The paper addresses a significant challenge in speech synthesis: extending neural models to cross-lingual text-to-speech (TTS) and speech-to-speech translation (S2ST) while retaining the original speaker's voice characteristics. The authors introduce VALL-E X, a cross-lingual neural codec language model that builds on the previously established VALL-E framework, which was designed for monolingual TTS.

Framework and Methodology

The VALL-E X framework comprises two primary components: a multilingual autoregressive codec language model and a multilingual non-autoregressive codec language model. Both are trained on large-scale multilingual speech-transcription datasets, such as LibriLight and WenetSpeech, enabling the generation of high-quality speech that preserves the source speaker's voice, emotion, and acoustic environment across languages.
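The two-stage autoregressive/non-autoregressive decoding described above can be sketched in toy form. The stubs below stand in for the actual Transformer models, and the codebook and vocabulary sizes (8 codebooks of 1024 tokens) follow EnCodec's typical configuration, which we assume here rather than take from this summary.

```python
import random

# Toy sketch of two-stage codec-token decoding. All model internals
# are random stand-ins for the paper's Transformer decoders.

NUM_CODEBOOKS = 8   # residual vector-quantizer layers (EnCodec-style)
VOCAB = 1024        # acoustic token vocabulary per codebook

def ar_model(prompt_tokens, num_frames):
    """Autoregressive stage: predicts the FIRST codebook's tokens
    one frame at a time, conditioned on the prompt."""
    out = []
    for _ in range(num_frames):
        # a real model would sample from a learned distribution
        out.append(random.randrange(VOCAB))
    return out

def nar_model(first_codebook, prompt_tokens):
    """Non-autoregressive stage: predicts the remaining codebooks
    in parallel, conditioned on previously generated layers."""
    frames = len(first_codebook)
    layers = [first_codebook]
    for _ in range(NUM_CODEBOOKS - 1):
        layers.append([random.randrange(VOCAB) for _ in range(frames)])
    return layers  # shape: [NUM_CODEBOOKS][frames]

prompt = [1, 2, 3]          # phoneme + acoustic prompt tokens (toy)
coarse = ar_model(prompt, num_frames=50)
codes = nar_model(coarse, prompt)
print(len(codes), len(codes[0]))  # 8 50
```

A codec decoder (not shown) would then convert the 8x50 token grid back into a waveform; the split matters because only the first codebook needs slow autoregressive sampling.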

A core capability of VALL-E X is zero-shot cross-lingual TTS and S2ST. For cross-lingual TTS, the model predicts the target language's acoustic tokens using the source-language speech as an acoustic prompt. This allows VALL-E X to generate target-language speech in the voice of a speaker heard only in the source language, without requiring parallel multilingual recordings from the same speaker.
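As a hypothetical illustration of the prompting scheme just described, the sketch below concatenates source phonemes, target phonemes, and source acoustic tokens, prefixed by a language ID. The token values, the separator symbol, and the function name `build_prompt` are invented for this example; they are not the paper's API.

```python
# Hypothetical assembly of a cross-lingual conditioning sequence:
# source-language phonemes, target-language phonemes, and the source
# utterance's acoustic tokens, with a language ID controlling accent.

LANG_ID = {"en": 0, "zh": 1}
SEP = -1  # invented separator token

def build_prompt(src_phonemes, tgt_phonemes, src_acoustic, tgt_lang):
    """Concatenate the conditioning sequences for the AR model."""
    return (
        [LANG_ID[tgt_lang]]     # language ID controls accent
        + src_phonemes + [SEP]  # phonemes of the source utterance
        + tgt_phonemes + [SEP]  # phonemes of the text to synthesize
        + src_acoustic          # acoustic tokens carry voice/emotion
    )

prompt = build_prompt([10, 11], [20, 21, 22], [900, 901], "zh")
print(prompt)  # [1, 10, 11, -1, 20, 21, 22, -1, 900, 901]
```

The key design point is that the speaker's identity enters only through the acoustic tokens at the end of the prompt, which is what makes the voice transfer zero-shot.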

Speech-to-speech translation is handled by a speech recognition and translation model based on an improved version of SpeechUT. This stage translates the semantic content of the source speech into target-language phoneme sequences, which, together with the source acoustic tokens, condition the codec language models to produce the translated speech.
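The two stages of that pipeline can be sketched with stubs. Neither function below reflects the actual SpeechUT or VALL-E X interfaces; the lookup table and return values are placeholders for illustration only.

```python
# Minimal stub sketch of the S2ST pipeline: source speech is first
# mapped to target-language phonemes, which then drive synthesis in
# the source speaker's voice.

def speech_to_target_phonemes(src_audio):
    """Stub for the recognition + translation stage (SpeechUT-based
    in the paper): source audio -> target-language phonemes."""
    toy_table = {"hello.wav": ["n", "i", "h", "ao"]}
    return toy_table[src_audio]

def vall_e_x_tts(tgt_phonemes, src_audio):
    """Stub for the synthesis stage: returns a placeholder label
    instead of a waveform."""
    return f"speech({''.join(tgt_phonemes)})"

phonemes = speech_to_target_phonemes("hello.wav")
print(vall_e_x_tts(phonemes, "hello.wav"))  # speech(nihao)
```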

Experimental Evaluation

In an extensive set of experiments, VALL-E X was evaluated on tasks that include zero-shot cross-lingual TTS and zero-shot S2ST, using datasets like LibriSpeech and the EMIME corpus. The evaluation was based on various performance metrics including speaker similarity, ASR-Word Error Rate (WER), ASR-BLEU, and overall speech naturalness.
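ASR-WER, one of the metrics above, is the word-level edit distance between an ASR transcript of the synthesized speech and the reference text, divided by the reference length. A minimal sketch:

```python
# Word error rate via dynamic-programming edit distance
# (substitutions, insertions, and deletions all cost 1).

def wer(ref, hyp):
    """WER between a reference and a hypothesis transcript."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# one substitution + one deletion over 5 reference words
print(round(wer("speak with your own voice", "speak in your voice"), 2))  # 0.4
```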

The results demonstrate that VALL-E X delivers substantial improvements over existing methods, particularly in maintaining speaker similarity and reducing WER and accent problems. For instance, VALL-E X achieved a significant reduction in WER in cross-lingual English TTS tasks and outperformed state-of-the-art models in S2ST tasks in terms of BLEU scores and speech naturalness.

Implications and Future Directions

The implications of VALL-E X are profound in both practical and theoretical aspects of speech synthesis. Practically, the model facilitates applications like personalized virtual assistants and multilingual communication aids, enhancing their ability to convey messages across languages without losing individual speaker identity. Theoretically, this work pushes the boundaries of what neural models can achieve in the field of multilingual text and speech synthesis.

Moreover, this research sets the stage for future work that might expand VALL-E X to additional languages and explore more intricate tasks such as emotion expression and code-switching in synthesized speech. These extensions can lead to even more robust and versatile applications in AI-driven communication systems.

Overall, VALL-E X embodies a significant advancement in cross-lingual speech processing, providing a robust framework to achieve seamless and high-fidelity speech synthesis across languages while retaining speaker individuality. The potential for future exploration and application is vast, heralding new opportunities for research and development within the field.
