
OpenVoice: Versatile Instant Voice Cloning

Published 3 Dec 2023 in cs.SD, cs.LG, and eess.AS | (2312.01479v6)

Abstract: We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from, nor constrained by, the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset covering all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results on our demo website. OpenVoice has been used by more than 2M users worldwide as the voice engine of MyShell.ai.


Summary

  • The paper presents a decoupled framework that separates tone color from language and style, enabling accurate instant voice cloning.
  • The methodology integrates a base TTS model with a tone color converter using normalizing flow layers to ensure natural sound and cross-lingual adaptability.
  • Experimental results show up to 12x real-time performance and effective manipulation of diverse speech styles across multiple languages and accents.

OpenVoice: Versatile Instant Voice Cloning

OpenVoice proposes a method to achieve instant voice cloning with granular control over voice styles and zero-shot cross-lingual capabilities. The paper presents a framework that enables both tone color cloning and the control of various style parameters, such as emotion and intonation, independent of the reference speaker's original speech characteristics.

Introduction

The paper introduces OpenVoice as a response to the limitations seen in existing instant voice cloning (IVC) and text-to-speech (TTS) methods. Current approaches often struggle with flexible voice style manipulation and require extensive multilingual datasets for cross-lingual voice cloning. OpenVoice addresses these issues by decoupling the components of voice—such as language, tone color, and style—allowing for flexible manipulation and cross-lingual capabilities without large training datasets.

Approach

OpenVoice's technical approach is based on decomposing IVC into manageable subtasks. The method employs a base speaker TTS model to handle language and style controls, while a separate tone color converter ensures the tone color of the reference speaker is transferred to the generated speech.
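This two-stage decomposition can be sketched as follows. The function names and data structures below are illustrative placeholders, not the actual OpenVoice API; each stage is stubbed out to show only how responsibilities are divided.

```python
# Illustrative sketch of the two-stage pipeline described above.
# All names here are hypothetical, not the actual OpenVoice API.

def base_tts(text, style, language):
    """Stand-in for the base speaker TTS model: renders `text` in the
    requested style and language, using the base speaker's tone color."""
    return {"text": text, "style": style, "language": language,
            "tone_color": "base_speaker"}

def extract_tone_color(reference_clip):
    """Stand-in for extracting tone color from the short reference clip."""
    return {"speaker": reference_clip["speaker"]}

def convert_tone_color(speech, tone_color):
    """Stand-in for the tone color converter: swaps in the reference
    speaker's tone color while leaving style and language untouched."""
    converted = dict(speech)
    converted["tone_color"] = tone_color["speaker"]
    return converted

def clone(text, style, language, reference_clip):
    speech = base_tts(text, style, language)          # stage 1: style + language
    tone = extract_tone_color(reference_clip)         # reference embedding
    return convert_tone_color(speech, tone)           # stage 2: tone color

result = clone("Hello", style="cheerful", language="es",
               reference_clip={"speaker": "alice"})
```

The key design point is that style and language never pass through the converter, so changing the reference speaker cannot disturb them.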

Model Structure

The OpenVoice framework consists of two primary components:

  • Base Speaker TTS Model: This component handles style and language manipulation using a single-speaker or multi-speaker model, such as VITS, to produce speech in desired styles and languages.
  • Tone Color Converter: This component utilizes an encoder-decoder structure with normalizing flow layers to map the features of the base speaker’s speech to those of a reference speaker, thus transferring tone color without altering other style characteristics. Figure 1

    Figure 1: Illustration of the OpenVoice framework. We use a base speaker model to control the styles and languages, and a converter to embody the tone color of the reference speaker into the speech.
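The converter's use of invertible (flow) layers can be illustrated with a deliberately tiny toy model, in which each "speaker" is reduced to per-dimension mean/scale statistics and the flow is a single conditional affine transform. This is a sketch of the invertibility idea only, not the actual normalizing-flow architecture:

```python
# Toy illustration (not the actual model) of how an invertible transform
# can strip one speaker's tone color and inject another's.

def forward(features, speaker):
    """Map speaker-dependent features into a speaker-neutral space."""
    mu, sigma = speaker
    return [(x - m) / s for x, m, s in zip(features, mu, sigma)]

def inverse(latent, speaker):
    """Map the neutral representation back, conditioned on a new speaker."""
    mu, sigma = speaker
    return [z * s + m for z, m, s in zip(latent, mu, sigma)]

# Hypothetical per-dimension (mean, scale) statistics for two speakers.
base = ([0.1, -0.3], [1.0, 2.0])
ref  = ([0.5,  0.2], [0.8, 1.5])

speech    = [0.9, 1.7]                  # features with the base tone color
neutral   = forward(speech, base)       # tone color removed
converted = inverse(neutral, ref)       # reference tone color injected
restored  = inverse(neutral, base)      # round trip recovers the input
```

Because the transform is exactly invertible, no information other than the conditioning (the tone color) is lost in the conversion, which is the property the flow layers provide.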

The inclusion of a well-structured phoneme representation, such as the International Phonetic Alphabet (IPA), facilitates seamless cross-lingual generalization by promoting language-neutral processing.
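The role of IPA as a shared symbol space can be seen in a miniature grapheme-to-phoneme lookup. The mappings below are simplified examples for illustration, not a real G2P front end:

```python
# Hedged illustration: different orthographies in different languages can
# map onto the same IPA symbol, giving the model a language-neutral input.
# These mappings are simplified examples, not a full G2P system.

IPA = {
    ("english", "sh"): "ʃ",
    ("french",  "ch"): "ʃ",    # same sound, different spelling
    ("english", "ee"): "iː",
    ("german",  "ie"): "iː",
}

def to_ipa(language, grapheme):
    return IPA[(language, grapheme)]

# Two languages converge on one phonetic symbol:
assert to_ipa("english", "sh") == to_ipa("french", "ch")
```

Because the downstream model consumes only the IPA symbols, a new language needs only a G2P front end, not massive-speaker audio data.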

Training Methodology

The training of OpenVoice involves collecting substantial audio datasets encompassing various languages and speakers. The base speaker TTS model is trained on a mixed-language dataset, while the tone color converter is trained using a diverse, multi-speaker multilingual dataset. The integration of IPA encoding ensures phonetic consistency across languages, critical for zero-shot cross-lingual capabilities.

The training objectives focus on ensuring natural sound production, eliminating tone color information, and minimizing KL-divergence losses between phonetic and feature representations. This structured approach ensures effective tone color conversion while preserving other vocal characteristics.
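For intuition about the KL-divergence term, the closed-form KL between two univariate Gaussians is a standard building block for such losses (the paper's actual objective operates on learned distributions; this is a generic sketch):

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# Identical distributions incur zero loss; mismatched ones are penalized.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # → 0.0
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # → 0.5
```

Minimizing such a term pulls the two representations' distributions together, which is how the phonetic and feature representations are kept consistent.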

Experimental Results

The evaluation of OpenVoice presents strong qualitative performance across multiple aspects:

  • Tone Color Cloning: OpenVoice accurately clones the tone color from a wide range of voices, including those not present in the training dataset.
  • Style Flexibility: The system successfully preserves and manipulates various speech styles, such as emotion and intonation, across different languages and accents.
  • Cross-Lingual Capabilities: OpenVoice facilitates voice cloning in new languages with little to no specific language data, indicating robust cross-lingual generalization.

The model demonstrates efficient inference speeds, achieving up to 12× real-time performance on GPUs, with headroom for further optimization.
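The real-time factor above means 12 seconds of audio are synthesized per second of wall-clock compute. A quick arithmetic check (the numbers below are illustrative, not measurements from the paper):

```python
# Real-time speedup: seconds of audio produced per second of compute.

def speedup(audio_seconds, compute_seconds):
    return audio_seconds / compute_seconds

# At 12x real time, a 60-second clip takes about 5 seconds to generate.
clip_seconds = 60.0
generation_time = clip_seconds / speedup(12.0, 1.0)
```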

Discussion and Conclusions

OpenVoice exemplifies a notable advancement in voice cloning through its decoupled framework that separates tone color from other voice parameters. This strategic separation allows for granular style control, efficient cross-lingual processes, and computational efficiency. With public access to the source code and model weights, OpenVoice provides a valuable resource for further research in voice cloning technologies.

The paper concludes that by decoupling critical aspects of speech generation, OpenVoice extends the possibilities of voice cloning beyond existing constraints, offering a robust tool for diverse applications in multimedia and AI communications.
