Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Abstract: Language models (LMs) have long been used to improve the results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses; these are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and which even match self-supervised methods that do use external audio data. Furthermore, a single DLM is applicable to different ASRs, greatly surpassing the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
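The data-generation recipe described above (synthesize audio with multi-speaker TTS, transcribe it with the ASR system, then pair each noisy hypothesis with its clean reference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `synthesize` and `transcribe` are hypothetical placeholders standing in for a real TTS and ASR system, with `transcribe` injecting simulated word deletions to imitate recognition noise.

```python
import random


def synthesize(text: str, speaker: int) -> str:
    """Placeholder for a multi-speaker TTS system: returns a fake 'audio' token."""
    return f"audio[{speaker}]:{text}"


def transcribe(audio: str) -> str:
    """Placeholder ASR: recovers the text and injects word-level errors
    (here, random deletions) to imitate recognition noise."""
    text = audio.split(":", 1)[1]
    words = text.split()
    rng = random.Random(len(audio))  # deterministic for this sketch
    noisy = [w for w in words if rng.random() > 0.1]
    return " ".join(noisy) if noisy else text


def make_dlm_pairs(corpus, n_speakers=4):
    """Build (noisy hypothesis, clean reference) pairs for DLM training."""
    pairs = []
    for text in corpus:
        for spk in range(n_speakers):  # multiple TTS speakers per sentence
            hyp = transcribe(synthesize(text, spk))
            pairs.append((hyp, text))  # the DLM learns the mapping hyp -> text
    return pairs


pairs = make_dlm_pairs(["the quick brown fox jumps over the lazy dog"])
```

In a real pipeline, the DLM (a sequence-to-sequence model) would then be trained on these pairs; varying the speaker per sentence is what makes the noise distribution diverse.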