StreamVC: Real-Time Low-Latency Voice Conversion
Abstract: We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre of any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios such as calls and video conferencing, and to use cases such as voice anonymization in those scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
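To illustrate the whitening idea mentioned above: normalizing the fundamental-frequency (F0) contour per utterance removes the speaker's absolute pitch range (a timbre cue) while keeping the relative contour that carries prosody. The sketch below is a minimal, hypothetical implementation assuming an F0 contour array (e.g. from a YIN-style estimator, with 0 marking unvoiced frames); the log-domain choice, per-utterance statistics, and voiced-frame masking are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def whiten_f0(f0_hz, eps=1e-8):
    """Whiten an F0 contour per utterance (illustrative sketch).

    Log-F0 over voiced frames is normalized to zero mean and unit
    variance, discarding the speaker-specific pitch range while
    preserving the relative pitch movement. Unvoiced frames
    (f0 == 0) pass through as zeros.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0_hz > 0
    out = np.zeros_like(f0_hz)
    if voiced.any():
        log_f0 = np.log(f0_hz[voiced])
        out[voiced] = (log_f0 - log_f0.mean()) / (log_f0.std() + eps)
    return out

# The same melodic contour spoken one octave apart whitens to the
# same values, since the octave shift is a constant offset in log-F0.
low  = np.array([0.0, 100.0, 110.0, 120.0, 0.0])
high = np.array([0.0, 200.0, 220.0, 240.0, 0.0])
print(np.allclose(whiten_f0(low), whiten_f0(high)))  # → True
```

Because whitening is computed over the utterance, a truly streaming system would need running (causal) statistics instead of utterance-level ones; this sketch only conveys why the whitened contour no longer leaks the source speaker's pitch range.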