- The paper introduces a Transformer encoder-decoder discriminator and GAN-style adversarial training that improve zero-shot voice cloning.
- It integrates adversarial training into a FastSpeech2 acoustic model to capture both acoustic and prosodic features, resulting in improved speech quality.
- The model, trained on the diverse Libriheavy dataset, significantly boosts speaker similarity and handles varied vocal styles compared to baselines.
The paper "Multi-modal Adversarial Training for Zero-Shot Voice Cloning" explores challenges and innovations in zero-shot voice cloning, focusing on a key limitation of text-to-speech (TTS) systems: they tend to produce average-sounding speech that loses the natural variation of the human voice. The problem is more pronounced in zero-shot scenarios, where replicating unseen voices requires high variance in speaking styles.
Key Contributions
- Transformer Encoder-Decoder Architecture: The authors propose a Transformer-based encoder-decoder discriminator designed to distinguish real from synthesized speech features, which is crucial for capturing nuanced variation in voice characteristics.
- Adversarial Training Technique: Building on recent advances using Generative Adversarial Networks (GANs), the paper introduces an adversarial training pipeline that targets both acoustic and prosodic features. This technique helps the model generate speech that is not only higher in quality but also more closely resembles the target speaker's style.
- Application to FastSpeech2: The proposed adversarial training methodology is applied to a FastSpeech2 acoustic model. FastSpeech2 is known for its efficiency in generating high-quality speech, and the introduction of adversarial elements enhances its capacity for zero-shot voice cloning.
- Training on Libriheavy Dataset: The model is trained on Libriheavy, a large multi-speaker dataset whose diversity lets the model learn the varied speaking styles that zero-shot applications demand.
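To make the adversarial setup above concrete, here is a minimal sketch of GAN-style losses applied to both an acoustic stream (e.g. mel-spectrogram frames) and a prosodic stream (e.g. a pitch contour). The least-squares (LSGAN) objective and the toy scoring function are illustrative assumptions, not the paper's exact formulation; in the paper, the scores would come from the Transformer encoder-decoder discriminator.

```python
def toy_discriminator(features):
    """Stand-in for the Transformer discriminator: maps a feature
    sequence to a scalar 'realness' score in (0, 1]. Assumed here
    purely for illustration."""
    mean = sum(features) / len(features)
    return 1.0 / (1.0 + abs(mean))

def lsgan_d_loss(real_score, fake_score):
    """Discriminator loss: push real scores toward 1, fake toward 0."""
    return (real_score - 1.0) ** 2 + fake_score ** 2

def lsgan_g_loss(fake_score):
    """Generator (acoustic model) loss: push fake scores toward 1."""
    return (fake_score - 1.0) ** 2

# Made-up acoustic (mel) and prosodic (pitch) feature sequences.
real_mel, fake_mel = [0.1, -0.2, 0.05], [0.9, 1.1, 0.8]
real_f0, fake_f0 = [0.0, 0.1, -0.1], [0.5, 0.6, 0.4]

# Multi-modal losses: sum the adversarial terms over both streams.
d_loss = (lsgan_d_loss(toy_discriminator(real_mel), toy_discriminator(fake_mel))
          + lsgan_d_loss(toy_discriminator(real_f0), toy_discriminator(fake_f0)))
g_loss = (lsgan_g_loss(toy_discriminator(fake_mel))
          + lsgan_g_loss(toy_discriminator(fake_f0)))
```

In practice each stream would have its own discriminator (or discriminator head), and these adversarial terms would be added to FastSpeech2's usual reconstruction and variance-predictor losses.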
Results and Implications
- Improvement in Speech Quality and Speaker Similarity: The model shows significant improvements over baseline systems in terms of both speech quality and the ability to mimic speaker characteristics. This suggests that the multi-modal adversarial approach effectively addresses the deficiencies of conventional TTS models in zero-shot contexts.
- Practical Demonstrations: The paper provides audio examples to demonstrate the efficacy of their system, allowing researchers and practitioners to assess the quality and speaker similarity enhancements achieved through the proposed method.
This work highlights the importance of integrating advanced adversarial techniques into TTS models to overcome traditional limitations, especially in complex tasks like zero-shot voice cloning where high variability and speaker fidelity are crucial. The contributions pave the way for more realistic and versatile voice synthesis systems.