Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Abstract: Large language models (LLMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text or images across a wide range of tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that unifies multiple speech generation tasks while producing expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors carry the acoustic details needed for high-fidelity speech reconstruction, while semantic tokens capture the linguistic content of speech, facilitating language modeling. Building on this codec, Vec-Tok Speech leverages an LM as the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce token sequence length and bit rate, lowering exposure bias and extending context coverage, which improves the performance of LMs. Vec-Tok Speech supports intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking-style-transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, outperforms other state-of-the-art (SOTA) models. Code will be available at https://github.com/BakerBunker/VecTok .
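To make the BPE step concrete: the idea is to treat the discrete semantic-token stream like text and repeatedly merge the most frequent adjacent token pair into a new token, shortening the sequence the LM must model. The sketch below is an illustrative, minimal version of this pair-merging idea (following Gage-style BPE); `learn_bpe_merges` is a hypothetical helper name and is not taken from the paper's implementation.

```python
from collections import Counter

def learn_bpe_merges(seq, num_merges):
    """Greedily merge the most frequent adjacent token pair, up to
    num_merges times, assigning each merged pair a fresh token id.
    Returns the compressed sequence and the learned merge table."""
    merges = []
    next_id = max(seq) + 1  # new ids start above the base vocabulary
    seq = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress
        merges.append(((a, b), next_id))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(next_id)  # replace the pair with its new id
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return seq, merges

# Toy semantic-token stream: the pair (3, 5) recurs and is merged first.
tokens = [3, 5, 7, 3, 5, 2, 3, 5]
compressed, merges = learn_bpe_merges(tokens, num_merges=2)
print(len(tokens), "->", len(compressed))  # 8 -> 5
```

Each merge shortens the sequence, which is the mechanism the abstract appeals to: fewer autoregressive steps means less exposure bias, and a fixed LM context window then spans more seconds of audio.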