Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Published 14 Sep 2023 in eess.AS and cs.SD | (2309.07377v2)

Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available at https://github.com/k2-fsa/icefall.

Abstract PDF HTML Upgrade to Chat

Citations (22)

View on Semantic Scholar

Summary

The paper demonstrates that SSL-derived discrete tokens can bridge ASR and TTS tasks, achieving competitive recognition on LibriSpeech and superior resynthesis quality in TTS.
The methodology employs vector quantization and k-means clustering alongside tailored augmentation techniques to optimize token granularity and robustness.
Experimental results reveal that while discrete tokens excel in English low-resource settings, challenges persist in noisy and non-English environments, suggesting avenues for further research.

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Introduction to Speech Discrete Tokens

The paper "Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS" explores the challenges and potential of utilizing discrete tokens in speech processing tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). Discrete tokens allow for lower storage requirements and the application of NLP techniques. The study compares and optimizes discrete tokens generated by leading self-supervised learning (SSL) models. This research aims to uncover the universality of speech discrete tokens across different speech tasks, providing evidence for their potential comparable performance in recognition tasks and superiority in synthesis tasks.

Figure 1: Illustration of the pipeline for speech discrete tokens.

Discrete Speech Tokens from SSL Models

The use of discrete tokens derived from SSL models like vq-wav2vec, HuBERT, and WavLM is central to this study. These models provide semantic tokens trained for linguistic information. Discretization can be achieved using Vector Quantization (VQ) or k-means clustering on embeddings, with the choice of clustering parameters impacting the granularity of the discretization. Discrete tokens balance semantic and acoustic features, aiming to serve a universal function across speech tasks.

Speech Recognition with Discrete Tokens

ASR with Discretized Input

The study utilizes discrete tokens from different SSL models for training end-to-end ASR systems. Discrete tokens are embedded and processed to a uniform frame rate before input into Transducer-based ASR models. Customized augmentation techniques like time and embedding masking are employed to mitigate overfitting and enhance robustness in ASR training.

Results and Analysis

Experiments show that discrete tokens generated from WavLM and HuBERT models reveal promising ASR performance on English corpora like LibriSpeech, especially for low-resource scenarios. While discrete tokens achieve competitive results compared to traditional features on clean data, their performance lags in noisy or linguistically different settings, such as the Mandarin AISHELL-1 corpus. It highlights the challenge of generalization across varied acoustic conditions and languages.

Speech Synthesis with Discrete Tokens

In TTS tasks, discrete tokens' performance is benchmarked against conventional mel-spectrogram features through speech resynthesis experiments using CTX-vec2wav. These tokens have shown to outperform mel-spectrogram in subjective and objective evaluations, demonstrating their capability to produce high-quality audio irrespective of additional acoustic context modeling.

Experimental Insights

Comparison Across Tasks

The universality of discrete tokens is evaluated, showing their competitive performance in recognition and superiority in synthesis tasks. Discrete tokens from models like Encodec and DAC offer alternatives for efficient audio processing and compression, with variable performance in synthesis quality, revealing potential for further exploitation.

Ablation Studies

Ablation studies detail the impact of various augmentation strategies on ASR performance. Findings suggest that SpecAugment-style augmentations adapted for token sequences can significantly improve model robustness.

Conclusions

The exploration into universal discrete tokens underscores their promising adaptability across speech-related tasks. The findings support the thesis that discrete tokens could serve as a valuable bridging representation between speech and text tasks. Future research could benefit from refining token generation techniques and improving cross-linguistic generalization.

The research provides a critical step in understanding the application of SSL-derived discrete tokens and sets the stage for further explorations into bridging speech and text through a unified token representation.