
Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Published 19 Sep 2023 in eess.AS and cs.SD | (2309.10922v1)

Abstract: Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain. To this end, various compression and representation-learning based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker and speech related tasks. In this paper, we evaluate compression based audio tokens on three tasks: Speaker Verification, Diarization and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within $1\%$ of mel-spectrogram features for all the tasks considered, and do not surpass them yet. (ii) these models exhibit robustness for out-of-domain narrowband data, particularly in speaker tasks. (iii) audio tokens allow for compression to 20x compared to mel-spectrogram features with minimal loss of performance in speech and speaker related tasks, which is crucial for low bit-rate applications, and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results, and providing insight for future tokenizer designs.

References (33)
  1. Tom Brown et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
  2. Hugo Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023.
  3. “Robust speech recognition via large-scale weak supervision,” 2022.
  4. “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  5. “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE ASRU. IEEE, 2021, pp. 244–250.
  6. “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  7. “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  8. “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  9. Ziqiang Zhang et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
  10. “VioLA: Unified codec language models for speech recognition, synthesis, and translation,” arXiv preprint arXiv:2305.16107, 2023.
  11. “MusicLM: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
  12. “AudioLM: A language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  13. “AudioPaLM: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.
  14. “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  15. “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  16. “High-fidelity audio compression with improved RVQGAN,” arXiv preprint arXiv:2306.06546, 2023.
  17. “TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context,” in ICASSP 2022. IEEE, 2022, pp. 8102–8106.
  18. “Fast Conformer with linearly scalable attention for efficient speech recognition,” arXiv preprint arXiv:2305.05084, 2023.
  19. “ArcFace: Additive angular margin loss for deep face recognition,” in CVPR 2019, 2019, pp. 4685–4694.
  20. “Conformer: Convolution-augmented transformer for speech recognition,” Interspeech, 2020.
  21. “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” 2018.
  22. “VoxCeleb2: Deep speaker recognition,” in Interspeech, 2018.
  23. “LibriSpeech: An ASR corpus based on public domain audio books,” in ICASSP 2015. IEEE, 2015, pp. 5206–5210.
  24. “MLS: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020.
  25. “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
  26. “pyannote.audio: Neural building blocks for speaker diarization,” in ICASSP 2020. IEEE, 2020, pp. 7124–7128.
  27. “The NIST speaker recognition evaluations: 1996–2001,” in 2001: A Speaker Odyssey - The Speaker Recognition Workshop, 2001.
  28. “CALLHOME American English speech,” Linguistic Data Consortium, 1997.
  29. “Common Voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
  30. “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” arXiv preprint arXiv:2101.00390, 2021.
  31. “Fisher Spanish speech (LDC2010S01),” Web download. Philadelphia: Linguistic Data Consortium, 2010.
  32. “SpecAugment: A simple data augmentation method for automatic speech recognition,” Interspeech, 2019.
  33. “NeMo: A toolkit for conversational AI and large language models.”

Summary

  • The paper demonstrates that discrete audio tokens via EnCodec perform comparably to mel-spectrograms across speaker verification, diarization, and ASR tasks.
  • It employs neural compression techniques like Residual Vector Quantization on datasets such as VoxCeleb and LibriSpeech, achieving robust results even at low bit-rates.
  • The findings indicate that discrete representations can offer efficiency and robustness gains, making them promising for low-bandwidth audio applications.

Introduction

Discrete audio representation, also known as audio tokenization, has gained traction because it allows NLP techniques, in particular text language modeling, to be applied in the audio domain. This paper evaluates audio tokens produced by a compression-based tokenizer, EnCodec, on three tasks: Speaker Verification, Speaker Diarization, and (multi-lingual) Speech Recognition, and compares their performance against traditional mel-spectrogram features across these use cases.

Audio Tokenization Approach

The research builds on state-of-the-art neural compression models based on Residual Vector Quantization (RVQ), such as SoundStream, EnCodec, and DAC. These codecs encode continuous audio into short frames and quantize each frame into a sequence of discrete tokens, with each RVQ stage quantizing the residual left by the previous stage; the resulting token streams are well suited to modeling with transformer-based architectures. EnCodec is selected for this study because its architecture, with up to 32 residual codebooks, offers substantial token-level compression.
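The RVQ scheme described above can be illustrated with a minimal NumPy sketch (a toy illustration of the quantization idea, not the actual EnCodec implementation; codebook sizes and dimensions here are arbitrary):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual Vector Quantization: each stage quantizes the residual
    left over by the previous stages, emitting one token per stage."""
    tokens, residual = [], frame.astype(float)
    for cb in codebooks:          # cb has shape (codebook_size, dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest code
        tokens.append(idx)
        residual = residual - cb[idx]  # pass the residual to the next stage
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is the sum of the selected codes from all stages."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

# toy demo: 4 stages of 16 codes each over an 8-dim frame
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
frame = rng.normal(size=8)
tokens = rvq_encode(frame, codebooks)   # one token per codebook stage
recon = rvq_decode(tokens, codebooks)
```

Using more codebook stages (EnCodec allows up to 32) refines the reconstruction at the cost of more tokens, which is the bit-rate knob examined later in the paper.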

Experimental Setup

Models were trained and evaluated on established datasets including VoxCeleb, NIST-SRE, and LibriSpeech, using TitaNet for speaker tasks and Fast Conformer for ASR. The evaluation focused on the efficiency and robustness of audio tokens under varying conditions, including both in-domain and out-of-domain data.

Results and Findings

Speaker Verification

Evaluation on datasets such as VoxCeleb1-Clean and NIST-SRE 18 showed that models trained on EnCodec tokens perform comparably to those trained on mel-spectrograms, with minimal degradation (within a 1% difference). Notably, the token-based models were more robust on out-of-domain narrowband data, where they surpassed the mel-spectrogram baselines by clear margins.

Speaker Diarization

In diarization, models trained on EnCodec tokens achieved an average Diarization Error Rate (DER) close to that of mel-spectrogram baselines. Part of this robustness is attributed to the tokenizer's inherent low-pass filtering behavior, observed empirically by examining the spectral response of EnCodec-compressed audio (Figure 1).

Figure 1: Estimated transfer function of EnCodec, computed by comparing the spectral content of original audio with its compressed-then-decompressed version over 1000 random samples from the VoxCeleb dataset.
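The transfer-function estimate behind Figure 1 can be sketched as the ratio of average magnitude spectra before and after the codec round-trip. The sketch below substitutes a simple moving-average low-pass filter for the real EnCodec encode/decode step (an assumption made only so the example runs standalone); with the actual codec, the measured ratio is what reveals its low-pass characteristic:

```python
import numpy as np

def avg_magnitude_spectrum(signals, n_fft=512):
    """Average |FFT| over a batch of signals (one signal per row)."""
    return np.abs(np.fft.rfft(signals, n=n_fft, axis=1)).mean(axis=0)

def estimated_transfer_function(originals, roundtripped, n_fft=512):
    """Ratio of average output-to-input magnitude spectra, per frequency bin."""
    eps = 1e-12  # guard against division by zero in empty bins
    return (avg_magnitude_spectrum(roundtripped, n_fft) + eps) / \
           (avg_magnitude_spectrum(originals, n_fft) + eps)

# stand-in "codec": a 5-tap moving average (a crude low-pass filter),
# used here in place of an EnCodec compress-decompress round trip
rng = np.random.default_rng(0)
originals = rng.normal(size=(1000, 512))  # 1000 random white-noise "clips"
kernel = np.ones(5) / 5
roundtripped = np.stack([np.convolve(x, kernel, mode="same") for x in originals])

H = estimated_transfer_function(originals, roundtripped)
# H is near 1 at low frequencies and attenuated near Nyquist: a low-pass shape
```

Averaging over many clips (the paper uses 1000 random VoxCeleb samples) smooths out per-clip spectral variation so the codec's systematic frequency response stands out.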

Automatic Speech Recognition

For ASR, models trained on EnCodec tokens achieved word error rates (WERs) close to those of models trained on mel-spectrogram features, particularly on clean test sets, while lagging slightly on the more challenging 'other' sets. This demonstrates the potential of audio tokens for efficient speech recognition, especially in resource-constrained settings (Figure 2).

Figure 2: EnCodec vs. DAC: WER (%) comparison when trained on the LS-960h dataset. EnCodec tokens show better overall performance than DAC tokens.
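For reference, the WER metric compared in these experiments is the word-level edit distance between hypothesis and reference, normalized by the reference length. A minimal self-contained implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sit"))  # one substitution out of 3 words
```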

Bit-Rate and Tokenizer Analysis

The study also examined performance across tokenization bit-rates. Even at a substantially reduced bit-rate (3 kbps, using only 4 codebooks), models retained most of their performance. A comparison with the alternative DAC tokenizer showed EnCodec to be slightly better at maintaining task performance at comparable bit-rates.
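The bit-rate of an RVQ token stream follows directly from the frame rate, the number of codebooks, and the bits per token. Assuming the commonly cited EnCodec 24 kHz configuration (75 token frames per second, 1024-entry codebooks, i.e. 10 bits per token), the 4-codebook setting above works out to 3 kbps:

```python
import math

def token_bitrate_kbps(frame_rate_hz, num_codebooks, codebook_size):
    """Bit-rate of an RVQ token stream: frames/s x codebooks x bits/token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token / 1000.0

# assuming EnCodec's 24 kHz setup: 75 frames/s, 1024-entry codebooks
print(token_bitrate_kbps(75, 4, 1024))   # 4 codebooks  -> 3.0 kbps
print(token_bitrate_kbps(75, 32, 1024))  # 32 codebooks -> 24.0 kbps
```

This linear relationship between codebook count and bit-rate is what makes RVQ tokenizers convenient for the low bit-rate trade-off the paper studies.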

Conclusion

The research affirms the viability of discrete audio tokens as substitutes for mel-spectrogram features. The tokens deliver comparable performance across multiple tasks while offering improved robustness on narrowband data and a significant reduction in data size, making them well suited to low-bandwidth applications. Future work could refine RVQ-based tokenizer design to close the remaining performance gap and extend these representations to broader auditory understanding tasks.
