Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
Abstract: In the era of large models, the autoregressive nature of decoding often makes latency a significant serving bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving average relative WER improvements across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size, ranging from 128M to 340B parameters, on ASR performance. This study provides valuable insights into the factors that influence the effectiveness of practical large-scale LM-fused speech recognition systems.
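To make the fusion idea concrete, the sketch below illustrates what per-segment LLM scoring can look like in a second-pass setup: the ASR model emits an N-best list for each segment, an external LLM assigns a log-probability to each hypothesis, and the two scores are combined log-linearly to re-rank the candidates. This is an illustrative assumption of the general technique, not the paper's implementation; the function names `fuse_and_rerank` and `llm_score_fn` and the weight value are hypothetical.

```python
# Minimal sketch of per-segment LM-fusion rescoring (illustrative only).
# Because hypotheses are scored independently, the LLM calls for a segment
# can be batched and run in parallel on accelerator hardware, which is the
# non-autoregressive property the abstract highlights.

def fuse_and_rerank(nbest, llm_score_fn, lm_weight=0.3):
    """Re-rank ASR hypotheses by combining ASR and LLM log-probabilities.

    nbest        -- list of (hypothesis_text, asr_log_prob) pairs
    llm_score_fn -- callable returning an LLM log-probability for a text
    lm_weight    -- interpolation weight for the LLM score (assumed value)
    """
    rescored = []
    for text, asr_logp in nbest:
        llm_logp = llm_score_fn(text)            # one LLM scoring call per hypothesis
        fused = asr_logp + lm_weight * llm_logp  # log-linear combination
        rescored.append((fused, text))
    rescored.sort(reverse=True)                  # highest fused score first
    return rescored[0][1]


# Toy usage with a dummy LLM scorer that prefers the grammatical hypothesis.
if __name__ == "__main__":
    nbest = [("the cat sat on the mat", -4.2),
             ("the cat sad on the mat", -4.0)]
    dummy_llm = lambda t: -1.0 if "sat" in t else -6.0
    print(fuse_and_rerank(nbest, dummy_llm))
```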