Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Published 23 May 2024 in eess.AS (arXiv:2405.15093v2)

Abstract: Singing voice conversion converts a source singing voice into the voice of a target singer while preserving the linguistic content. Flow-based models can perform voice conversion, but they struggle to extract effective latent variables for the more rhythmically rich and emotionally expressive task of singing voice conversion, and they also suffer from low efficiency in speech processing. In this paper, we propose RASVC, a high-fidelity flow-based model with multi-decoupling feature constraints that enhances the capture of vocal details by integrating multiple latent attribute encoders. We also use a multi-stream inverse short-time Fourier transform (MS-iSTFT) to speed up speech processing by skipping some complicated decoder steps. We compare the synthesized singing voice with other models along multiple dimensions, and the proposed model is highly consistent with the current state of the art. A demo is available at https://lazycat1119.github.io/RASVC-demo/.


Summary

  • The paper introduces RASVC, a multi-condition flow-based SVC model that separates pitch, emotion, and content for high-fidelity singing voice conversion in real-time.
  • It employs specialized encoders and an MS-iSTFT decoder to optimize feature extraction and synthesis efficiency, enhancing naturalness and voice similarity.
  • Experimental evaluations on multiple datasets show competitive performance with a MOS of 4.14 and efficient real-time processing.

Introduction

The paper "Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis" introduces a method for singing voice conversion (SVC) that achieves high fidelity and efficiency by leveraging a flow-based model architecture enhanced with multi-decoupling feature constraints. The researchers focus on overcoming the hurdles associated with traditional SVC methodologies, primarily the challenges in capturing rhythmically rich expressions and processing speech efficiently.

Methodology

Model Architecture

The proposed model, termed RASVC, builds upon the foundational VITS model architecture by introducing multiple specialized encoders and an MS-iSTFT decoder, specifically tailored for the SVC task. The architecture, depicted in Figure 1, comprises the following components:

  • F0 Encoder: Determines the fundamental frequency for voice modulation.
  • Emotion Encoder: Captures the emotive aspects of the singing voice.
  • Speaker Encoder: Learns and preserves speaker identity.
  • Content Encoder: Utilizes the HuBERT-Soft model to extract rich content features from the input voice.

The integration of a multi-condition flow model allows additional attributes such as pitch and emotion to be included as conditions, maximizing the naturalness and expressiveness of the converted output.

Figure 1: Overview of model structure.
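To make the decoupled conditioning concrete, the following is a minimal sketch of how the four encoder outputs might be projected to a shared hidden size and fused into one conditioning sequence for the flow and decoder. It is illustrative only, not the authors' released code; the dimensions, the quantization of F0 into bins, and the simple additive fusion are assumptions.

```python
import torch
import torch.nn as nn

class MultiConditionEncoder(nn.Module):
    """Illustrative multi-condition fusion in the spirit of RASVC (not the paper's code).

    Content (HuBERT-Soft units), quantized F0, emotion, and speaker features are
    each projected to the same hidden size and summed into one conditioning
    sequence that would drive the flow and the decoder.
    """

    def __init__(self, content_dim=256, emo_dim=128, spk_dim=256,
                 n_f0_bins=256, hidden=192):
        super().__init__()
        self.f0_emb = nn.Embedding(n_f0_bins, hidden)        # frame-level, quantized F0
        self.content_proj = nn.Linear(content_dim, hidden)   # frame-level content units
        self.emo_proj = nn.Linear(emo_dim, hidden)            # utterance-level emotion
        self.spk_proj = nn.Linear(spk_dim, hidden)             # utterance-level speaker identity

    def forward(self, content, f0_bins, emotion, speaker):
        # content: (B, T, content_dim), f0_bins: (B, T) long,
        # emotion: (B, emo_dim), speaker: (B, spk_dim)
        h = self.content_proj(content) + self.f0_emb(f0_bins)
        # broadcast the global (utterance-level) conditions over the time axis
        h = h + self.emo_proj(emotion).unsqueeze(1) + self.spk_proj(speaker).unsqueeze(1)
        return h  # (B, T, hidden) conditioning sequence


# toy usage
enc = MultiConditionEncoder()
cond = enc(torch.randn(2, 100, 256), torch.randint(0, 256, (2, 100)),
           torch.randn(2, 128), torch.randn(2, 256))
print(cond.shape)  # torch.Size([2, 100, 192])
```

Keeping every condition in a shared space with additive fusion makes each attribute independently swappable, which is what allows zero-shot conversion by replacing only the speaker (and, if desired, emotion) embeddings at inference time.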

Multi-Condition Synthesis

RASVC employs a multi-condition encoder system to transform input features into a highly expressive latent representation. This feature-extraction scheme supports a more nuanced representation of voice characteristics by separating content, timbre, pitch, and emotion. The transformation uses a normalizing flow trained to maximize the conditional log-likelihood directly.
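For reference, the objective such a conditional normalizing flow maximizes can be written with the standard change-of-variables formula; this is textbook notation rather than an equation reproduced from the paper, with x the latent acoustic representation, c the stacked conditions (content, pitch, emotion, speaker), and f_theta the invertible flow mapping to a simple prior p_Z:

```latex
\log p_\theta(x \mid c)
  = \log p_Z\!\bigl(f_\theta(x; c)\bigr)
  + \log \left| \det \frac{\partial f_\theta(x; c)}{\partial x} \right|
```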

Moreover, the model simplifies the synthesis pipeline with an MS-iSTFT decoder, which converts frequency-domain features directly into time-domain waveforms, thereby reducing processing time without sacrificing fidelity.
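The sketch below illustrates the general idea behind iSTFT-based decoding: the network predicts magnitude and phase spectra for several sub-streams, each stream is inverted with a standard inverse STFT, and the streams are mixed into the final waveform. This is a simplified stand-in for the paper's MS-iSTFT decoder; the number of streams, FFT size, and the plain summation mixer are assumptions.

```python
import torch
import torch.nn as nn

class MSiSTFTHead(nn.Module):
    """Simplified multi-stream iSTFT output head (illustrative, not the paper's code)."""

    def __init__(self, hidden=192, n_streams=4, n_fft=256, hop=64):
        super().__init__()
        self.n_streams, self.n_fft, self.hop = n_streams, n_fft, hop
        bins = n_fft // 2 + 1
        self.mag = nn.Conv1d(hidden, n_streams * bins, kernel_size=1)
        self.phase = nn.Conv1d(hidden, n_streams * bins, kernel_size=1)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, h):                        # h: (B, hidden, T) frame-level features
        B, _, T = h.shape
        bins = self.n_fft // 2 + 1
        mag = torch.exp(self.mag(h)).view(B * self.n_streams, bins, T)
        phase = self.phase(h).view(B * self.n_streams, bins, T)
        spec = torch.polar(mag, phase)           # complex spectrogram per stream
        wav = torch.istft(spec, self.n_fft, hop_length=self.hop, window=self.window)
        return wav.view(B, self.n_streams, -1).sum(dim=1)  # mix streams into one waveform


head = MSiSTFTHead()
audio = head(torch.randn(2, 192, 100))
print(audio.shape)  # (2, (T - 1) * hop) samples with the default centered iSTFT
```

Because the head emits spectra and lets a fixed iSTFT handle waveform reconstruction, the heavy upsampling stack of a conventional neural vocoder is avoided, which is where the runtime savings come from.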

Experimental Setup

Datasets and Pre-training

The model was trained in stages: pre-training on VCTK to initialize the weights, followed by fine-tuning on OpenSinger. It was then evaluated on the M4Singer and NUS-48E datasets for zero-shot and cross-domain SVC. This regime helps the model generalize across different singing styles and languages.

Evaluation Metrics

Performance was measured with Mean Opinion Scores (MOS) for naturalness and speaker similarity, Perceptual Evaluation of Speech Quality (PESQ), an objective voice-similarity score, and the real-time factor (RTF). Together, these metrics cover the fidelity, expressiveness, and computational efficiency of the conversion process.
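As a concrete example of the two fully objective measures, the snippet below computes a real-time factor and a cosine speaker similarity from precomputed embeddings. This is a generic illustration of how such metrics are usually computed, not the paper's evaluation code; the conversion function and the speaker-embedding extractor are left abstract.

```python
import time
import torch
import torch.nn.functional as F

def real_time_factor(convert_fn, source_audio, sample_rate):
    """RTF = processing time / audio duration; RTF < 1 means faster than real time."""
    start = time.perf_counter()
    converted = convert_fn(source_audio)
    elapsed = time.perf_counter() - start
    return elapsed / (source_audio.shape[-1] / sample_rate), converted

def speaker_similarity(emb_converted, emb_target):
    """Cosine similarity between speaker embeddings of converted and target audio."""
    return F.cosine_similarity(emb_converted, emb_target, dim=-1).item()

# toy usage with a pass-through "converter" and random embeddings
rtf, _ = real_time_factor(lambda x: x, torch.randn(1, 16000 * 5), 16000)
print(f"RTF = {rtf:.4f}")
print(f"similarity = {speaker_similarity(torch.randn(256), torch.randn(256)):.3f}")
```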

Results

The empirical results show that RASVC surpasses several baseline models in voice similarity and similarity MOS across different domains and languages. The PESQ scores, while favorable, indicate room for improvement in content retention under varied conditions.

Notably, RASVC achieved competitive scores, including a naturalness MOS of 4.14 and a voice similarity of 0.603, demonstrating the effectiveness of the multi-condition embedding strategy. The RTF measurements further underline its efficiency, making it practical for real-time applications.

Figure 2: t-SNE visualization of converted songs.

Figure 3: Visualization analysis of the spectrogram.

Conclusion

The research contributes to the SVC field a multi-condition flow-based model that integrates multiple decoupling constraints, significantly enhancing the expressiveness and fidelity of singing voice conversion. With its efficient processing strategy and robust experimental results, RASVC holds promise for diverse singing voice conversion applications and may influence future work in voice synthesis and processing. The release of a demo and a PyTorch training framework further supports ongoing research in this area.
