
Streaming End-to-end Speech Recognition For Mobile Devices

Published 15 Nov 2018 in cs.CL (arXiv:1811.06621v1)

Abstract: End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.

Summary

  • The paper introduces an RNN-T model that improves latency and accuracy over CTC models for on-device speech recognition.
  • It implements uni-directional LSTM layers with projection and a time-reduction layer to accelerate training and inference.
  • The study demonstrates a relative word error rate reduction of more than 20%, combined with techniques such as parameter quantization and contextual biasing for real-time mobile use.

Streaming End-to-End Speech Recognition for Mobile Devices: A Comprehensive Analysis

This paper presents advancements in on-device end-to-end (E2E) speech recognition by employing a Recurrent Neural Network Transducer (RNN-T) approach. The motivation is to replace server-based systems with models that run entirely on mobile devices, thereby addressing concerns about reliability, latency, and privacy.

The authors highlight the challenges of developing an E2E model that meets the demands of mobile applications, emphasizing the necessity for streaming capabilities, user-context adaptability, and high accuracy. The proposed RNN-T model surpasses conventional systems based on Connectionist Temporal Classification (CTC) models in both latency and accuracy, making it suitable for mobile platforms.

Architectural Innovations

The RNN-T model's architecture relies on uni-directional Long Short-Term Memory (LSTM) layers augmented with a projection layer to minimize computational demands. A notable addition is the time-reduction layer, which reduces the input frame rate, significantly speeding up the training and inference processes.
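The time-reduction idea can be sketched as simple frame stacking: consecutive feature frames are concatenated so the encoder processes fewer time steps. This is a minimal illustration, not the paper's exact implementation; the reduction factor and its placement inside the encoder stack are assumptions here.

```python
import numpy as np

def time_reduce(frames, factor=2):
    """Stack consecutive feature frames to reduce the frame rate.

    frames: (T, D) array of acoustic features.
    Returns a (ceil(T/factor), D*factor) array, so downstream LSTM
    layers run over 1/factor as many time steps.
    """
    T, D = frames.shape
    pad = (-T) % factor                      # zero-pad so T divides evenly
    if pad:
        frames = np.vstack([frames, np.zeros((pad, D))])
    return frames.reshape(-1, factor * D)

feats = np.random.randn(100, 80)             # 100 frames of 80-dim features
reduced = time_reduce(feats, factor=2)
print(reduced.shape)                         # (50, 160)
```

Halving the frame rate roughly halves the sequential work in every encoder layer above the reduction, which is where most of the training and inference speedup comes from.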

Training and Inference Optimizations

The authors apply several training optimizations, including layer normalization to stabilize hidden-state dynamics and large-batch training on Tensor Processing Units (TPUs). For inference, optimizations such as prediction-network state caching and multi-threaded execution of model components yield decoding faster than real time on mobile devices.
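The caching idea can be illustrated with a hypothetical memoization wrapper: beam-search hypotheses that share the same label-history prefix reuse a single prediction-network forward pass. `PredictionNetworkCache` and `dummy_pred_net` are invented names for this sketch, not the paper's API.

```python
class PredictionNetworkCache:
    """Memoize prediction-network outputs keyed on the label history,
    so identical prefixes across beam hypotheses are computed once."""

    def __init__(self, pred_net):
        self.pred_net = pred_net
        self.cache = {}

    def __call__(self, label_prefix):
        key = tuple(label_prefix)
        if key not in self.cache:
            self.cache[key] = self.pred_net(label_prefix)
        return self.cache[key]

calls = []
def dummy_pred_net(prefix):          # stand-in for the real LSTM prediction network
    calls.append(tuple(prefix))
    return len(prefix)               # dummy "output"

cached = PredictionNetworkCache(dummy_pred_net)
for hyp in ([1, 2], [1, 2], [1, 2, 3]):   # two hypotheses share a prefix
    cached(hyp)
print(len(calls))                    # 2 distinct prefixes -> 2 forward passes
```

Because the prediction network depends only on the label history (not the audio), this kind of caching is exact, not an approximation.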

Contextual Biasing and Text Normalization

RNN-T models incorporate a shallow-fusion approach for contextual biasing, allowing the model to leverage user-specific contexts, such as contact lists or song titles. This biasing approach outperforms conventional methods in most scenarios, demonstrating the model's capability to integrate domain-specific knowledge effectively.
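Shallow fusion can be sketched as a log-domain bonus applied during beam search whenever extending a hypothesis keeps it on the prefix of a biasing phrase. The function name `fused_score` and the fixed `bias_weight` are illustrative assumptions, not the paper's exact formulation.

```python
def fused_score(model_logp, token, history, bias_phrases, bias_weight=2.0):
    """Hypothetical shallow-fusion step: boost a hypothesis's score when
    extending it by `token` keeps it on the prefix of a biasing phrase."""
    extended = history + [token]
    on_phrase = any(p[:len(extended)] == extended for p in bias_phrases)
    return model_logp + (bias_weight if on_phrase else 0.0)

contacts = [["call", "alice"], ["call", "bob", "smith"]]
print(fused_score(-1.0, "alice", ["call"], contacts))   # 1.0 (boosted)
print(fused_score(-1.0, "carol", ["call"], contacts))   # -1.0 (no boost)
```

In practice the biasing phrases are compiled into a weighted finite-state transducer rather than matched by prefix scan, but the scoring effect is the same: contextually likely continuations get a bonus at decode time without retraining the model.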

Furthermore, the text normalization challenges faced by E2E models, particularly with numeric sequences, are addressed by augmenting the training dataset with synthetically generated examples. This adaptation enhances the model's performance, reducing error rates significantly for numeric data.
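The augmentation idea can be sketched as a generator of matched (spoken, written-domain) numeric pairs; the alarm-time template and number range below are invented for illustration, and the paper produces such data at far larger scale.

```python
import random

ONES = ["zero", "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven", "twelve"]
MINUTES = {15: "fifteen", 30: "thirty", 45: "forty five"}

def synth_time_pair(rng):
    """Generate one (spoken, written-domain) training pair, e.g.
    ('set an alarm for seven thirty', 'set an alarm for 7:30')."""
    h = rng.randint(1, 12)
    m = rng.choice(sorted(MINUTES))
    spoken = f"set an alarm for {ONES[h]} {MINUTES[m]}"
    written = f"set an alarm for {h}:{m:02d}"
    return spoken, written

print(synth_time_pair(random.Random(0)))
```

Training on such pairs teaches the model to emit written-domain targets (digits, colons) directly, sidestepping a separate text-normalization stage for the covered numeric patterns.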

Quantization for Efficiency

Parameter quantization reduces memory usage and execution time by converting weights from 32-bit floating point to 8-bit fixed-point representations, shrinking the model to roughly a quarter of its original size. The symmetric quantization scheme, which fixes the zero point at zero and so avoids per-inference offset computation, roughly doubles execution speed over the floating-point model and allows decoding faster than real time, making the system well suited to mobile devices.
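Symmetric quantization maps a weight tensor to int8 using a single scale derived from its maximum absolute value. The following is a minimal per-tensor sketch of the idea; the production implementation may differ in granularity and rounding details.

```python
import numpy as np

def quantize_symmetric(w):
    """Symmetric per-tensor quantization: one scale, zero point fixed at 0,
    so dequantization is a single multiply."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_symmetric(w)
err = float(np.abs(dequantize(q, scale) - w).max())
print(err <= scale / 2 + 1e-6)   # rounding error is at most half a step
```

Keeping the zero point at zero is what makes the scheme cheap at run time: integer matrix multiplies need no offset correction, only one floating-point rescale per output.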

Performance Evaluation

The RNN-T model demonstrates a substantial reduction in word error rate (WER) on both voice search and dictation tasks, achieving over 20% relative improvement compared to a strong CTC-based embedded model. The advancements in streaming and decoding speed without sacrificing accuracy underscore the model's suitability for real-time applications.
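As a quick sanity check on what "relative improvement" means here, the arithmetic is a ratio of the WER change to the baseline WER. The numbers below are illustrative only, not the paper's reported figures.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative (not absolute) reduction in word error rate, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative WERs only -- not the paper's reported figures.
print(round(relative_wer_reduction(8.3, 6.5), 1))   # 21.7
```

Note that a 1.8-point absolute drop from an 8.3% baseline already exceeds a 20% relative improvement; the smaller the baseline WER, the more impressive a fixed absolute gain becomes.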

Implications and Future Directions

This research demonstrates the viability of end-to-end models for on-device speech recognition, particularly in applications requiring real-time response and user-context adaptability. Future research avenues may explore further optimization of the model architecture, the integration of richer user-specific contextual information, and the expansion of training datasets through unsupervised or semi-supervised methods.

The techniques and findings presented in this paper lay the groundwork for more sophisticated and efficient on-device speech recognition systems, suggesting a promising direction for AI advancements in mobile technology. These innovations have the potential to redefine user interaction paradigms across diverse applications and devices.
