UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
Abstract: Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g. connectionist temporal classification (CTC), can be initialized from a speech foundation model (SFM) but does not account for dependencies among intermediate tokens. Encoder-decoder-based NASR, such as the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the recent success of speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, which combines the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as its major module, which can be an SFM. The encoder plays the roles of both the CASS-NAT encoder and decoder through two forward passes. The first pass takes the speech signal as input, while the second pass takes the concatenation of the speech signal and token-level acoustic embeddings. Evaluated on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT while using only an encoder and hence fewer model parameters. Our code is publicly available.
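The two-pass use of a single shared encoder can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the real encoder is a Transformer SFM (e.g. a self-supervised speech model), and the token-level acoustic embeddings are extracted with a CTC alignment rather than the equal-span pooling assumed here; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative)

# Stand-in "shared encoder": one linear layer + ReLU applied per frame.
# In UniEnc-CASSNAT this would be a Transformer SFM; a single weight
# matrix is used here only to sketch the shared-parameter idea.
W = rng.normal(size=(D, D)) * 0.1

def shared_encoder(x):
    """x: (T, D) frame features -> (T, D) hidden states."""
    return np.maximum(x @ W, 0.0)

speech = rng.normal(size=(50, D))  # 50 speech frames

# Pass 1: the encoder acts as the CASS-NAT encoder on speech alone.
h1 = shared_encoder(speech)

# CASS-NAT extracts token-level acoustic embeddings from the first-pass
# output via a CTC alignment; here we assume a fake alignment mapping
# 5 tokens to equal spans of frames (mean-pooling each span).
n_tokens = 5
token_acoustic_emb = h1.reshape(n_tokens, -1, D).mean(axis=1)  # (5, D)

# Pass 2: the same encoder acts as the decoder, on the concatenation of
# the speech features and the token-level acoustic embeddings.
h2 = shared_encoder(np.concatenate([speech, token_acoustic_emb], axis=0))

# The token positions of the second-pass output are used for prediction.
token_states = h2[-n_tokens:]
print(token_states.shape)  # (5, 8)
```

Because both passes share one set of weights, the model keeps CASS-NAT's token-dependency modeling while retaining an encoder-only architecture that an SFM can initialize directly.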