
OpenVoice: Versatile Instant Voice Cloning

Published 3 Dec 2023 in cs.SD, cs.LG, and eess.AS | (2312.01479v6)

Abstract: We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from, nor constrained by, the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset covering all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results on our demo website. OpenVoice has been used by more than 2M users worldwide as the voice engine of MyShell.ai.


Summary

  • The paper presents a decoupled framework that separates tone color from language and style, enabling accurate instant voice cloning.
  • The methodology integrates a base TTS model with a tone color converter using normalizing flow layers to ensure natural sound and cross-lingual adaptability.
  • Experimental results show up to 12x real-time performance and effective manipulation of diverse speech styles across multiple languages and accents.

OpenVoice: Versatile Instant Voice Cloning

OpenVoice proposes a method to achieve instant voice cloning with granular control over voice styles and zero-shot cross-lingual capabilities. The paper presents a framework that enables both tone color cloning and the control of various style parameters, such as emotion and intonation, independent of the reference speaker's original speech characteristics.

Introduction

The paper introduces OpenVoice as a response to the limitations seen in existing instant voice cloning (IVC) and text-to-speech (TTS) methods. Current approaches often struggle with flexible voice style manipulation and require extensive multilingual datasets for cross-lingual voice cloning. OpenVoice addresses these issues by decoupling the components of voice—such as language, tone color, and style—allowing for flexible manipulation and cross-lingual capabilities without large training datasets.

Approach

OpenVoice's technical approach is based on decomposing IVC into manageable subtasks. The method employs a base speaker TTS model to handle language and style controls, while a separate tone color converter ensures the tone color of the reference speaker is transferred to the generated speech.
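This two-stage decomposition can be sketched as follows. The function names and data structures below are illustrative placeholders, not the actual OpenVoice API; each stage is stubbed out to show only how responsibilities are divided.

```python
# Illustrative sketch of the two-stage pipeline described above.
# All names here are hypothetical, not the actual OpenVoice API.

def base_tts(text, style, language):
    """Stand-in for the base speaker TTS model: renders `text` in the
    requested style and language, using the base speaker's tone color."""
    return {"text": text, "style": style, "language": language,
            "tone_color": "base_speaker"}

def extract_tone_color(reference_clip):
    """Stand-in for extracting tone color from the short reference clip."""
    return {"speaker": reference_clip["speaker"]}

def convert_tone_color(speech, tone_color):
    """Stand-in for the tone color converter: swaps in the reference
    speaker's tone color while leaving style and language untouched."""
    converted = dict(speech)
    converted["tone_color"] = tone_color["speaker"]
    return converted

def clone(text, style, language, reference_clip):
    speech = base_tts(text, style, language)          # stage 1: style + language
    tone = extract_tone_color(reference_clip)         # reference embedding
    return convert_tone_color(speech, tone)           # stage 2: tone color

result = clone("Hello", style="cheerful", language="es",
               reference_clip={"speaker": "alice"})
```

The key design point is that style and language never pass through the converter, so changing the reference speaker cannot disturb them.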

Model Structure

The OpenVoice framework consists of two primary components:

  • Base Speaker TTS Model: This component handles style and language manipulation using a single-speaker or multi-speaker model, such as VITS, to produce speech in desired styles and languages.
  • Tone Color Converter: This component utilizes an encoder-decoder structure with normalizing flow layers to map the features of the base speaker’s speech to those of a reference speaker, thus transferring tone color without altering other style characteristics. Figure 1

    Figure 1: Illustration of the OpenVoice framework. We use a base speaker model to control the styles and languages, and a converter to embody the tone color of the reference speaker into the speech.
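The converter's use of invertible (flow) layers can be illustrated with a deliberately tiny toy model, in which each "speaker" is reduced to per-dimension mean/scale statistics and the flow is a single conditional affine transform. This is a sketch of the invertibility idea only, not the actual normalizing-flow architecture:

```python
# Toy illustration (not the actual model) of how an invertible transform
# can strip one speaker's tone color and inject another's.

def forward(features, speaker):
    """Map speaker-dependent features into a speaker-neutral space."""
    mu, sigma = speaker
    return [(x - m) / s for x, m, s in zip(features, mu, sigma)]

def inverse(latent, speaker):
    """Map the neutral representation back, conditioned on a new speaker."""
    mu, sigma = speaker
    return [z * s + m for z, m, s in zip(latent, mu, sigma)]

# Hypothetical per-dimension (mean, scale) statistics for two speakers.
base = ([0.1, -0.3], [1.0, 2.0])
ref  = ([0.5,  0.2], [0.8, 1.5])

speech    = [0.9, 1.7]                  # features with the base tone color
neutral   = forward(speech, base)       # tone color removed
converted = inverse(neutral, ref)       # reference tone color injected
restored  = inverse(neutral, base)      # round trip recovers the input
```

Because the transform is exactly invertible, no information other than the conditioning (the tone color) is lost in the conversion, which is the property the flow layers provide.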

The inclusion of a well-structured phoneme representation, such as the International Phonetic Alphabet (IPA), facilitates seamless cross-lingual generalization by promoting language-neutral processing.
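The role of IPA as a shared symbol space can be seen in a miniature grapheme-to-phoneme lookup. The mappings below are simplified examples for illustration, not a real G2P front end:

```python
# Hedged illustration: different orthographies in different languages can
# map onto the same IPA symbol, giving the model a language-neutral input.
# These mappings are simplified examples, not a full G2P system.

IPA = {
    ("english", "sh"): "ʃ",
    ("french",  "ch"): "ʃ",    # same sound, different spelling
    ("english", "ee"): "iː",
    ("german",  "ie"): "iː",
}

def to_ipa(language, grapheme):
    return IPA[(language, grapheme)]

# Two languages converge on one phonetic symbol:
assert to_ipa("english", "sh") == to_ipa("french", "ch")
```

Because the downstream model consumes only the IPA symbols, a new language needs only a G2P front end, not massive-speaker audio data.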

Training Methodology

The training of OpenVoice involves collecting substantial audio datasets encompassing various languages and speakers. The base speaker TTS model is trained on a mixed-language dataset, while the tone color converter is trained using a diverse, multi-speaker multilingual dataset. The integration of IPA encoding ensures phonetic consistency across languages, critical for zero-shot cross-lingual capabilities.

The training objectives focus on ensuring natural sound production, eliminating tone color information, and minimizing KL-divergence losses between phonetic and feature representations. This structured approach ensures effective tone color conversion while preserving other vocal characteristics.
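For intuition about the KL-divergence term, the closed-form KL between two univariate Gaussians is a standard building block for such losses (the paper's actual objective operates on learned distributions; this is a generic sketch):

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# Identical distributions incur zero loss; mismatched ones are penalized.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # → 0.0
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # → 0.5
```

Minimizing such a term pulls the two representations' distributions together, which is how the phonetic and feature representations are kept consistent.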

Experimental Results

The evaluation of OpenVoice presents strong qualitative performance across multiple aspects:

  • Tone Color Cloning: OpenVoice accurately clones the tone color from a wide range of voices, including those not present in the training dataset.
  • Style Flexibility: The system successfully preserves and manipulates various speech styles, such as emotion and intonation, across different languages and accents.
  • Cross-Lingual Capabilities: OpenVoice facilitates voice cloning in new languages with little to no specific language data, indicating robust cross-lingual generalization.

The model demonstrates efficient inference speeds, achieving up to 12× real-time performance on GPUs, with headroom for further optimization.
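The real-time factor above means 12 seconds of audio are synthesized per second of wall-clock compute. A quick arithmetic check (the numbers below are illustrative, not measurements from the paper):

```python
# Real-time speedup: seconds of audio produced per second of compute.

def speedup(audio_seconds, compute_seconds):
    return audio_seconds / compute_seconds

# At 12x real time, a 60-second clip takes about 5 seconds to generate.
clip_seconds = 60.0
generation_time = clip_seconds / speedup(12.0, 1.0)
```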

Discussion and Conclusions

OpenVoice exemplifies a notable advancement in voice cloning through its decoupled framework that separates tone color from other voice parameters. This strategic separation allows for granular style control, efficient cross-lingual processes, and computational efficiency. With public access to the source code and model weights, OpenVoice provides a valuable resource for further research in voice cloning technologies.

The paper concludes that by decoupling critical aspects of speech generation, OpenVoice extends the possibilities of voice cloning beyond existing constraints, offering a robust tool for diverse applications in multimedia and AI communications.
