Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high-fidelity outputs but are also generalists that can solve tasks they were not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech given audio context and text, using over 50K hours of speech that is neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but it is more flexible because it can also condition on future context. Voicebox can be used for mono-lingual or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (word error rate of 1.9% vs. VALL-E's 5.9%) and audio similarity (0.681 vs. VALL-E's 0.580) while being up to 20 times faster. Audio samples can be found at \url{https://voicebox.metademolab.com}.
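The flow-matching objective mentioned in the abstract (conditional flow matching with an optimal-transport path, as in Lipman et al., 2023) can be sketched in a few lines. The following is an illustrative NumPy sketch of how one training example is constructed, not the authors' implementation; the function name, the shapes, and the `sigma_min` value are assumptions. In the full model, a Transformer would regress the target velocity `u` from `(xt, t)` together with the masked audio context and the phone-aligned text conditioning, with the loss computed only over masked frames.

```python
import numpy as np

def cfm_training_example(x1, sigma_min=1e-5, rng=None):
    """Build one conditional-flow-matching training example.

    x1: target mel-spectrogram frames, shape (batch, frames, dims).
    Returns (xt, t, u): a point on the conditional probability path,
    the sampled flow time, and the regression target (velocity field).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x0 = rng.standard_normal(x1.shape)           # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))    # flow time t ~ U(0, 1), one per example
    # Optimal-transport path: linear interpolation between (scaled) noise and data.
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Target velocity of that path; the model v_theta(xt, t, context) regresses this.
    u = x1 - (1.0 - sigma_min) * x0
    return xt, t, u
```

At inference time, an ODE solver integrates the learned velocity field from `t = 0` (noise) to `t = 1` (speech); the small, fixed number of solver steps is what makes this non-autoregressive approach fast relative to token-by-token decoding.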