UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models

Published 14 Feb 2024 in eess.AS, cs.CL, and cs.SD | arXiv:2402.08898v1

Abstract: Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from a speech foundation model (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, like the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, that combines the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists only of an encoder as its major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder via two forward passes. The first pass takes the speech signal as input, while the second pass takes the concatenation of the speech signal and the token-level acoustic embeddings as input. Evaluated on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and performs comparably to or better than CASS-NAT while using only an encoder and hence fewer model parameters. Our code is publicly available.
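The two-pass mechanism described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the "encoder" is a toy single layer standing in for the SFM stack, greedy argmax decoding stands in for the CTC alignment, and the per-token averaging of encoder frames is a simplified stand-in for the paper's token-level acoustic embedding extraction. All dimensions and weight names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 5  # toy feature dim and vocab size (assumed; blank token = 0)

# Toy "shared encoder": one tanh layer standing in for the SFM stack.
W_enc = rng.standard_normal((D, D)) * 0.1
def encoder(x):
    # A single forward pass of the shared encoder.
    return np.tanh(x @ W_enc)

W_ctc = rng.standard_normal((D, V)) * 0.1  # CTC / token classification head
speech = rng.standard_normal((20, D))      # 20 frames of speech features

# Pass 1: encode the speech alone; the CTC head gives a frame-level alignment.
h1 = encoder(speech)
alignment = (h1 @ W_ctc).argmax(axis=1)  # greedy CTC alignment

# Token-level acoustic embeddings: average the encoder frames belonging to
# each non-blank token run (simplified stand-in for the paper's extraction).
tokens, starts, prev = [], [], -1
for t, a in enumerate(alignment):
    if a != 0 and a != prev:
        tokens.append(a)
        starts.append(t)
    prev = a
if not starts:  # degenerate all-blank case
    tokens, starts = [0], [0]
bounds = starts + [len(alignment)]
tok_emb = np.stack([h1[bounds[i]:bounds[i + 1]].mean(axis=0)
                    for i in range(len(tokens))])

# Pass 2: the SAME encoder, fed the concatenation of the speech frames and
# the token-level acoustic embeddings, so it also plays the decoder's role.
h2 = encoder(np.concatenate([speech, tok_emb], axis=0))

# The trailing positions act as decoder states for final token prediction.
dec_states = h2[len(speech):]
token_logits = dec_states @ W_ctc  # shape: (num_tokens, V)
```

The key design point the sketch captures is parameter sharing: both passes run through the identical `encoder`, so the model needs no separate decoder stack, which is why UniEnc-CASSNAT can use fewer parameters than an encoder-decoder CASS-NAT.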

