Partial Rewriting for Multi-Stage ASR

Published 8 Dec 2023 in cs.CL (arXiv:2312.09463v1)

Abstract: For many streaming automatic speech recognition tasks, it is important to provide timely intermediate streaming results, while refining a high quality final result. This can be done using a multi-stage architecture, where a small left-context only model creates streaming results and a larger left- and right-context model produces a final result at the end. While this significantly improves the quality of the final results without compromising the streaming emission latency of the system, streaming results do not benefit from the quality improvements. Here, we propose using a text manipulation algorithm that merges the streaming outputs of both models. We improve the quality of streaming results by around 10%, without altering the final results. Our approach introduces no additional latency and reduces flickering. It is also lightweight, does not require retraining the model, and it can be applied to a wide variety of multi-stage architectures.
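The abstract only sketches the merging step in prose. Below is a minimal illustrative sketch of one plausible text-manipulation merge: splice the higher-quality (but lagging) cascaded hypothesis onto the tail of the fast causal hypothesis at the longest word-level overlap. The function name, the overlap heuristic, and the fallback behavior are all assumptions for illustration, not the paper's actual algorithm.

```python
def merge_partials(causal, cascaded):
    """Merge two partial hypotheses (lists of words).

    `cascaded` covers an earlier, already-refined span of audio;
    `causal` covers everything decoded so far. We keep the cascaded
    prefix and append the causal words beyond the alignment point,
    found as the longest suffix of `cascaded` that occurs in `causal`.
    """
    if not cascaded:
        return list(causal)
    max_k = min(len(cascaded), len(causal))
    for k in range(max_k, 0, -1):
        suffix = cascaded[-k:]
        # Scan from the end of the causal hypothesis for this suffix,
        # preferring the latest (rightmost) matching span.
        for i in range(len(causal) - k, -1, -1):
            if causal[i:i + k] == suffix:
                return list(cascaded) + list(causal[i + k:])
    # No overlap: the hypotheses diverge entirely; keep the causal
    # stream unchanged rather than guessing a splice point.
    return list(causal)
```

For example, with `causal = ["the", "cat", "sat", "on", "the", "mat"]` and `cascaded = ["the", "cat", "sat"]`, the merged output keeps the cascaded prefix and the causal tail. A splice like this changes only the stable prefix of the displayed text, which is consistent with the abstract's claims of no added latency and reduced flickering.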

