
Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Published 19 Sep 2023 in eess.AS and cs.SD | (2309.10922v1)

Abstract: Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain. To this end, various compression and representation-learning based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker and speech related tasks. In this paper, we evaluate compression based audio tokens on three tasks: Speaker Verification, Diarization and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within $1\%$ of mel-spectrogram features for all the tasks considered, and do not surpass them yet. (ii) these models exhibit robustness for out-of-domain narrowband data, particularly in speaker tasks. (iii) audio tokens allow for compression to 20x compared to mel-spectrogram features with minimal loss of performance in speech and speaker related tasks, which is crucial for low bit-rate applications, and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results, and providing insight for future tokenizer designs.

References (33)
  1. Tom Brown et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
  2. Hugo Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023.
  3. “Robust speech recognition via large-scale weak supervision,” 2022.
  4. “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  5. “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE ASRU. IEEE, 2021, pp. 244–250.
  6. “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  7. “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  8. “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  9. Ziqiang Zhang et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
  10. “VioLA: Unified codec language models for speech recognition, synthesis, and translation,” arXiv preprint arXiv:2305.16107, 2023.
  11. “MusicLM: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
  12. “AudioLM: A language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  13. “AudioPaLM: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.
  14. “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  15. “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  16. “High-fidelity audio compression with improved RVQGAN,” arXiv preprint arXiv:2306.06546, 2023.
  17. “TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context,” in ICASSP 2022. IEEE, 2022, pp. 8102–8106.
  18. “Fast Conformer with linearly scalable attention for efficient speech recognition,” arXiv preprint arXiv:2305.05084, 2023.
  19. “ArcFace: Additive angular margin loss for deep face recognition,” in CVPR 2019, 2019, pp. 4685–4694.
  20. “Conformer: Convolution-augmented transformer for speech recognition,” Interspeech, 2020.
  21. “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” 2018.
  22. “VoxCeleb2: Deep speaker recognition,” in Interspeech, 2018.
  23. “LibriSpeech: An ASR corpus based on public domain audio books,” in ICASSP 2015. IEEE, 2015, pp. 5206–5210.
  24. “MLS: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020.
  25. “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
  26. “pyannote.audio: Neural building blocks for speaker diarization,” in ICASSP 2020. IEEE, 2020, pp. 7124–7128.
  27. “The NIST speaker recognition evaluations: 1996–2001,” in 2001: A Speaker Odyssey - The Speaker Recognition Workshop, 2001.
  28. “CALLHOME American English speech,” Linguistic Data Consortium, 1997.
  29. “Common Voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
  30. “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” arXiv preprint arXiv:2101.00390, 2021.
  31. “Fisher Spanish speech (LDC2010S01),” Web download. Philadelphia: Linguistic Data Consortium, 2010.
  32. “SpecAugment: A simple data augmentation method for automatic speech recognition,” Interspeech, 2019.
  33. “NeMo: A toolkit for conversational AI and large language models.”

Summary

  • The paper demonstrates that discrete audio tokens via EnCodec perform comparably to mel-spectrograms across speaker verification, diarization, and ASR tasks.
  • It employs neural compression techniques like Residual Vector Quantization on datasets such as VoxCeleb and LibriSpeech, achieving robust results even at low bit-rates.
  • The findings indicate that discrete representations can offer efficiency and robustness gains, making them promising for low-bandwidth audio applications.

Introduction

Discrete audio representation, also known as audio tokenization, has gained traction because it allows NLP techniques, in particular text language modeling, to be applied in the audio domain. This paper evaluates audio tokens produced by a compression-based tokenizer, EnCodec, on three tasks: Speaker Verification, Speaker Diarization, and (multi-lingual) Speech Recognition, and compares their performance against traditional mel-spectrogram features across these use cases.

Audio Tokenization Approach

The research builds on state-of-the-art neural compression models based on Residual Vector Quantization (RVQ), such as SoundStream, EnCodec, and DAC. These codecs encode continuous audio into short frames and quantize each frame into a sequence of discrete tokens, with each RVQ stage quantizing the residual left by the previous stage; the resulting token streams are well suited to modeling with transformer-based architectures. EnCodec is selected for this study because its architecture, with up to 32 residual codebooks, offers substantial token-level compression.
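The RVQ scheme described above can be illustrated with a minimal NumPy sketch (a toy illustration of the quantization idea, not the actual EnCodec implementation; codebook sizes and dimensions here are arbitrary):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual Vector Quantization: each stage quantizes the residual
    left over by the previous stages, emitting one token per stage."""
    tokens, residual = [], frame.astype(float)
    for cb in codebooks:          # cb has shape (codebook_size, dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest code
        tokens.append(idx)
        residual = residual - cb[idx]  # pass the residual to the next stage
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is the sum of the selected codes from all stages."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

# toy demo: 4 stages of 16 codes each over an 8-dim frame
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
frame = rng.normal(size=8)
tokens = rvq_encode(frame, codebooks)   # one token per codebook stage
recon = rvq_decode(tokens, codebooks)
```

Using more codebook stages (EnCodec allows up to 32) refines the reconstruction at the cost of more tokens, which is the bit-rate knob examined later in the paper.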

Experimental Setup

Models were trained and evaluated on established datasets including VoxCeleb, NIST-SRE, and LibriSpeech, using TitaNet for speaker tasks and Fast Conformer for ASR. The evaluation focused on the efficiency and robustness of audio tokens under varying conditions, including both in-domain and out-of-domain data.

Results and Findings

Speaker Verification

Evaluation on datasets such as VoxCeleb1-Clean and NIST-SRE 18 showed that models trained on EnCodec tokens perform comparably to those trained on mel-spectrograms, with minimal degradation (within a 1% difference). Notably, the token-based models were more robust on out-of-domain narrowband data, where they surpassed the mel-spectrogram baselines by clear margins.

Speaker Diarization

In diarization, models trained on EnCodec tokens achieved an average Diarization Error Rate (DER) close to that of mel-spectrogram baselines. Part of this robustness is attributed to the tokenizer's inherent low-pass filtering behavior, observed empirically by examining the spectral response of EnCodec-compressed audio (Figure 1).

Figure 1: Estimated transfer function of EnCodec, computed by comparing the spectral content of original audio with its compressed-then-decompressed version over 1000 random samples from the VoxCeleb dataset.
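The transfer-function estimate behind Figure 1 can be sketched as the ratio of average magnitude spectra before and after the codec round-trip. The sketch below substitutes a simple moving-average low-pass filter for the real EnCodec encode/decode step (an assumption made only so the example runs standalone); with the actual codec, the measured ratio is what reveals its low-pass characteristic:

```python
import numpy as np

def avg_magnitude_spectrum(signals, n_fft=512):
    """Average |FFT| over a batch of signals (one signal per row)."""
    return np.abs(np.fft.rfft(signals, n=n_fft, axis=1)).mean(axis=0)

def estimated_transfer_function(originals, roundtripped, n_fft=512):
    """Ratio of average output-to-input magnitude spectra, per frequency bin."""
    eps = 1e-12  # guard against division by zero in empty bins
    return (avg_magnitude_spectrum(roundtripped, n_fft) + eps) / \
           (avg_magnitude_spectrum(originals, n_fft) + eps)

# stand-in "codec": a 5-tap moving average (a crude low-pass filter),
# used here in place of an EnCodec compress-decompress round trip
rng = np.random.default_rng(0)
originals = rng.normal(size=(1000, 512))  # 1000 random white-noise "clips"
kernel = np.ones(5) / 5
roundtripped = np.stack([np.convolve(x, kernel, mode="same") for x in originals])

H = estimated_transfer_function(originals, roundtripped)
# H is near 1 at low frequencies and attenuated near Nyquist: a low-pass shape
```

Averaging over many clips (the paper uses 1000 random VoxCeleb samples) smooths out per-clip spectral variation so the codec's systematic frequency response stands out.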

Automatic Speech Recognition

For ASR, models trained on EnCodec tokens achieved word error rates (WERs) close to those of models trained on mel-spectrogram features, particularly on clean test sets, while lagging slightly on the more challenging 'other' sets. This demonstrates the potential of audio tokens for efficient speech recognition, especially in resource-constrained settings (Figure 2).

Figure 2: EnCodec vs. DAC: WER (%) comparison when trained on the LS-960h dataset. EnCodec tokens show better overall performance than DAC tokens.
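For reference, the WER metric compared in these experiments is the word-level edit distance between hypothesis and reference, normalized by the reference length. A minimal self-contained implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sit"))  # one substitution out of 3 words
```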

Bit-Rate and Tokenizer Analysis

The study also examined performance across tokenization bit-rates. Even at a substantially reduced bit-rate (3 kbps, using only 4 codebooks), models retained most of their performance. A comparison with the alternative DAC tokenizer showed EnCodec to be slightly better at maintaining task performance at comparable bit-rates.
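The bit-rate of an RVQ token stream follows directly from the frame rate, the number of codebooks, and the bits per token. Assuming the commonly cited EnCodec 24 kHz configuration (75 token frames per second, 1024-entry codebooks, i.e. 10 bits per token), the 4-codebook setting above works out to 3 kbps:

```python
import math

def token_bitrate_kbps(frame_rate_hz, num_codebooks, codebook_size):
    """Bit-rate of an RVQ token stream: frames/s x codebooks x bits/token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token / 1000.0

# assuming EnCodec's 24 kHz setup: 75 frames/s, 1024-entry codebooks
print(token_bitrate_kbps(75, 4, 1024))   # 4 codebooks  -> 3.0 kbps
print(token_bitrate_kbps(75, 32, 1024))  # 32 codebooks -> 24.0 kbps
```

This linear relationship between codebook count and bit-rate is what makes RVQ tokenizers convenient for the low bit-rate trade-off the paper studies.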

Conclusion

The research affirms the viability of discrete audio tokens as substitutes for mel-spectrogram features. The tokens deliver comparable performance across multiple tasks while offering improved robustness on narrowband data and a significant reduction in data size, making them well suited to low-bandwidth applications. Future work could refine RVQ-based tokenizer design to close the remaining performance gap and extend these representations to broader auditory understanding tasks.
