Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Abstract: Large language models (LLMs) have recently flourished in natural language processing and computer vision, generating high-fidelity text or images across a wide range of tasks. In contrast, current speech generative models still struggle with speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that unifies multiple speech generation tasks while producing expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors carry the acoustic details needed for high-fidelity speech reconstruction, while semantic tokens capture the linguistic content of speech, facilitating language modeling. Building on this codec, Vec-Tok Speech leverages an LM as the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce token sequence length and bit rate, lowering exposure bias and extending context coverage, which improves the performance of LMs. Vec-Tok Speech supports intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking-style-transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, outperforms other state-of-the-art (SOTA) models. Code will be available at https://github.com/BakerBunker/VecTok .
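To make the BPE step concrete: the idea is to treat the discrete semantic-token stream like text and repeatedly merge the most frequent adjacent token pair into a new token, shortening the sequence the LM must model. The sketch below is an illustrative, minimal version of this pair-merging idea (following Gage-style BPE); `learn_bpe_merges` is a hypothetical helper name and is not taken from the paper's implementation.

```python
from collections import Counter

def learn_bpe_merges(seq, num_merges):
    """Greedily merge the most frequent adjacent token pair, up to
    num_merges times, assigning each merged pair a fresh token id.
    Returns the compressed sequence and the learned merge table."""
    merges = []
    next_id = max(seq) + 1  # new ids start above the base vocabulary
    seq = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress
        merges.append(((a, b), next_id))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(next_id)  # replace the pair with its new id
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return seq, merges

# Toy semantic-token stream: the pair (3, 5) recurs and is merged first.
tokens = [3, 5, 7, 3, 5, 2, 3, 5]
compressed, merges = learn_bpe_merges(tokens, num_merges=2)
print(len(tokens), "->", len(compressed))  # 8 -> 5
```

Each merge shortens the sequence, which is the mechanism the abstract appeals to: fewer autoregressive steps means less exposure bias, and a fixed LM context window then spans more seconds of audio.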