The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning
Abstract: Automatic speech recognition (ASR) models are typically trained on large corpora of transcribed speech. As language evolves and new terms enter common use, these models become stale. For models trained on the server but deployed on edge devices, recognition errors can also arise from the mismatch between server-side training data and actual on-device usage. In this work, we continually learn from on-device user corrections through Federated Learning (FL) to address these issues. We explore techniques for targeting fresh terms the model has not previously encountered, learning long-tail words, and mitigating catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve recognition of fresh terms while preserving quality on the overall language distribution.
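To make the training setup concrete, below is a minimal sketch of one round of federated averaging (FedAvg, McMahan et al., 2017), the aggregation scheme that FL deployments of this kind typically build on: each device refines the current global model on its local correction data, and the server combines the resulting models, weighted by the number of local examples. This is an illustration under simplifying assumptions, not the paper's implementation; a toy least-squares model stands in for the on-device ASR model, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def client_sgd(weights, features, targets, lr=0.1, steps=5):
    """A few local SGD steps on one client's correction data.
    A toy least-squares loss stands in for the real ASR training loss."""
    w = weights.copy()
    for _ in range(steps):
        grad = features.T @ (features @ w - targets) / len(targets)
        w -= lr * grad
    return w

def federated_averaging(global_w, clients):
    """One FedAvg round: each client trains locally from the global model,
    and the server averages the resulting models, weighted by the number
    of examples each client contributed."""
    new_w = np.zeros_like(global_w)
    total = sum(len(y) for _, y in clients)
    for x, y in clients:
        new_w += (len(y) / total) * client_sgd(global_w, x, y)
    return new_w

# Toy usage: three "devices", each holding a handful of corrected examples.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    x = rng.normal(size=(8, 2))
    clients.append((x, x @ true_w + 0.01 * rng.normal(size=8)))

w = np.zeros(2)
for _ in range(20):
    w = federated_averaging(w, clients)
print(w)  # approaches true_w as the rounds proceed
```

In the on-device setting the paper targets, the local data would be user-corrected transcripts rather than the synthetic pairs above, and the averaged update would be folded back into the deployed ASR model each round.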