Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

Published 28 Sep 2021 in cs.SD, cs.LG, and stat.ML | (2109.13821v2)

Abstract: Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario. The most challenging one often referred to as one-shot many-to-many voice conversion consists in copying the target voice from only one reference utterance in the most general case when both source and target speakers do not belong to the training dataset. We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis.

Citations (106)

Summary

The paper introduces a diffusion probabilistic model that achieves one-shot many-to-many voice conversion using a single target utterance.
It reports higher quality than state-of-the-art one-shot voice conversion approaches and introduces a novel stochastic differential equations (SDE) solver to speed up sampling.
The research optimizes conversion speed for real-time applications while maintaining high-fidelity voice outputs.

The paper "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme" addresses the challenge of one-shot many-to-many voice conversion. In this task, a source speaker's utterance is converted to sound like a target speaker, using only a single reference utterance from the target speaker, in the most general case where neither the source nor the target speaker appears in the training dataset.

Key Contributions

Diffusion Probabilistic Modeling: The authors use diffusion probabilistic models to perform high-quality voice conversion and present the approach as a scalable solution to the one-shot setting.
Comparison to State-of-the-Art: The authors report that the proposed method achieves higher quality than existing state-of-the-art one-shot voice conversion approaches, which suggests that diffusion models can produce natural and convincing voice outputs.
Real-Time Applications and Efficiency: A significant focus of this paper is on optimizing the diffusion model for real-time applications. The authors explore various strategies to enhance the speed of these models without compromising quality.
Novel Stochastic Differential Equations Solver: The authors develop a new solver for Stochastic Differential Equations (SDEs), aimed at making diffusion sampling faster. They describe it as suitable for various diffusion model types and generative tasks, show this through empirical studies, and justify it by theoretical analysis.
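
To make the speed and quality trade-off concrete, the following is a minimal sketch of reverse-time SDE sampling with a plain Euler-Maruyama discretization. It is not the paper's solver, which is not reproduced here. Everything in it is an assumption chosen for illustration: a 1-D Gaussian stands in for the data distribution so that the score is known in closed form, the noise rate `BETA` is constant, and the names `marginal_stats`, `score`, and `reverse_euler_maruyama` are hypothetical.

```python
import numpy as np

BETA = 10.0            # constant noise rate of the forward SDE (assumed)
MU, SIGMA = 2.0, 0.5   # toy 1-D Gaussian "data" distribution (assumed)


def marginal_stats(t):
    """Mean and variance of x_t under dx = -0.5*BETA*x dt + sqrt(BETA) dW."""
    decay = np.exp(-BETA * t)
    mean = MU * np.sqrt(decay)
    var = SIGMA**2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Exact score (gradient of the log-density) of the Gaussian marginal at time t.

    In a real model this would be a trained neural network.
    """
    mean, var = marginal_stats(t)
    return -(x - mean) / var


def reverse_euler_maruyama(n_steps, n_samples, seed=0, t_end=1.0):
    """Integrate the reverse-time SDE from t_end down to 0 with Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    h = t_end / n_steps
    x = rng.standard_normal(n_samples)  # prior: approximately N(0, 1) at t_end
    for k in range(n_steps):
        t = t_end - k * h
        # Reverse-time drift: f(x, t) - g(t)^2 * score(x, t)
        drift = -0.5 * BETA * x - BETA * score(x, t)
        noise = rng.standard_normal(n_samples)
        # Step backward in time by h
        x = x - h * drift + np.sqrt(BETA * h) * noise
    return x


for n in (10, 500):
    samples = reverse_euler_maruyama(n_steps=n, n_samples=20_000)
    print(f"{n:4d} steps: mean={samples.mean():.3f}, std={samples.std():.3f}")
```

In this toy setting, the many-step run recovers the target mean and standard deviation closely, while the 10-step run drifts away from them. That gap is the kind of degradation a solver designed for few-step sampling is meant to reduce.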

The work showcases that diffusion models, when appropriately optimized, are not only capable of delivering high-fidelity voice conversion but are also adaptable for real-time applications, making them suitable for practical deployment in various speech synthesis scenarios. This research bridges the gap between theoretical model development and real-world application needs, particularly in scenarios lacking extensive speaker data.