VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually, and they suffer from one or more of the following pitfalls: the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work addresses each of these issues in a simple modular framework built on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to handle a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.