- The paper presents the LLVC model that achieves real-time any-to-one voice conversion on CPUs with latency under 20 milliseconds.
- It employs a GAN-based architecture with knowledge distillation, adapting the Waveformer framework with dilated causal convolutions and a masked transformer.
- Experimental results show superior latency and quality performance, enabling real-time voice conversion on low-resource devices.
Low-latency Real-time Voice Conversion on CPU: An Overview
The paper "Low-latency Real-time Voice Conversion on CPU" presents a novel approach to real-time any-to-one voice conversion, building on architectures from established neural audio models. The central contribution is the LLVC model, designed for low latency and low resource usage and able to run efficiently on consumer-grade CPUs.
Methodology
The LLVC model combines a generative adversarial network (GAN) architecture with knowledge distillation to achieve efficient performance. The model converts audio with under 20 milliseconds of latency at a 16 kHz sample rate, running 2.8 times faster than real time on a typical CPU. The architecture is based on the Waveformer framework, adapted to minimize the perceptible difference between the converted output and the target speaker's voice.
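The latency and throughput figures above can be related by simple arithmetic. The sketch below (the chunk and lookahead sizes are illustrative assumptions, not values from the paper) shows how chunk duration at 16 kHz maps to algorithmic latency, and how a real-time factor below 1 corresponds to faster-than-real-time processing.

```python
SAMPLE_RATE = 16_000  # Hz, the rate reported in the paper

def chunk_latency_ms(chunk_samples: int, lookahead_samples: int = 0) -> float:
    """Algorithmic latency: the model must buffer chunk + lookahead samples."""
    return (chunk_samples + lookahead_samples) / SAMPLE_RATE * 1000

def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF < 1 means faster than real time; 1/2.8 ~= 0.36 matches the paper's claim."""
    return processing_time_s / audio_duration_s

# Illustrative: a 256-sample chunk plus a 16-sample lookahead
latency = chunk_latency_ms(256, 16)   # 17.0 ms, inside a 20 ms budget
rtf = real_time_factor(processing_time_s=1.0, audio_duration_s=2.8)
```

In a streaming system the end-to-end latency also includes the per-chunk inference time, which is why a low real-time factor matters as much as a small chunk size.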
The core technical contribution is the adaptation of Waveformer's encoder-decoder architecture, originally developed for target sound extraction, to the task of voice conversion. The model uses dilated causal convolutions (DCC) and a masked transformer so that inference depends only minimally on future input context, which is what makes streaming operation possible. Knowledge distillation further improves computational efficiency: a larger teacher model informs the training of a streamlined student model.
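The causality property that enables streaming can be illustrated with a minimal NumPy sketch of a dilated causal convolution. The kernel size and dilation below are illustrative, not the paper's hyperparameters; the point is that left-padding by `(k - 1) * dilation` guarantees each output sample depends only on past input.

```python
import numpy as np

def dilated_causal_conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int) -> np.ndarray:
    """Causal 1-D convolution: out[t] depends only on x[t], x[t-d], x[t-2d], ...

    Left-padding by (k - 1) * dilation keeps the output the same length as
    the input and ensures no future samples are used, so the layer can run
    on a live audio stream.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        # Taps x[t - pad], ..., x[t - dilation], x[t] via the padded buffer;
        # kernel[0] multiplies the oldest tap, kernel[-1] the current sample.
        taps = x_padded[t : t + pad + 1 : dilation]
        out[t] = np.dot(taps, kernel)
    return out
```

Because the receptive field grows as `(k - 1) * dilation` per layer, stacking layers with increasing dilation gives a long past context without ever waiting on future audio.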
Experimental Setup
The research employs the LibriSpeech dataset, focusing on any-to-one conversion to a single target speaker. An artificial parallel dataset is generated with an RVC v2 model, simulating the target speaker's voice across diverse inputs. Training runs on a single RTX 3090 GPU over many steps, with hyperparameters tuned for stability.
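The teacher-student training described above can be sketched as a loss function: the student is pushed toward the teacher-generated parallel targets while also fooling a discriminator. Both the L1 reconstruction term, the least-squares adversarial term, and the weight `lambda_recon` are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def l1_distillation_loss(student_out: np.ndarray, teacher_out: np.ndarray) -> float:
    """Reconstruction term: mean absolute error against the teacher's waveform."""
    return float(np.mean(np.abs(student_out - teacher_out)))

def generator_loss(student_out: np.ndarray, teacher_out: np.ndarray,
                   disc_score_fake: np.ndarray, lambda_recon: float = 45.0) -> float:
    """GAN-style total: least-squares adversarial term (D(fake) - 1)^2
    plus a weighted distillation term. lambda_recon is an assumed weight."""
    adv = float(np.mean((disc_score_fake - 1.0) ** 2))
    return adv + lambda_recon * l1_distillation_loss(student_out, teacher_out)
```

The appeal of this setup is that the expensive teacher (here, RVC v2) is only needed offline to build the parallel dataset; the student alone runs at inference time.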
Results
The LLVC model demonstrates superior latency and real-time factor compared to baseline models such as No-F0 RVC and QuickVC. The results indicate that LLVC achieves a significant reduction in end-to-end latency while maintaining high-quality output, as evidenced by subjective evaluations such as Mean Opinion Scores (MOS) for naturalness and similarity. Ablation variants of LLVC with different network architectures corroborate these findings and offer insight into architectural trade-offs.
Implications and Future Work
The practical implications of this study are significant, particularly for applications demanding real-time voice conversion on devices lacking powerful computational hardware, such as mobile phones and laptops. The open-source nature of the model provides a valuable resource for practitioners aiming to deploy low-latency voice conversion solutions in diverse settings.
The study opens multiple avenues for future exploration. Incorporating multilingual and noisy datasets could enhance the model’s generalization capabilities, enabling robust performance across diverse linguistic contexts. Additionally, focusing on personalized voice conversion through fine-tuning on specific speaker datasets could tailor the model to individual needs.
This research contributes a significant advancement in voice conversion, particularly through its alignment with real-time and low-resource operational constraints, fostering applications across personal, creative, and professional domains.