Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
Abstract: In the era of large models, the autoregressive nature of decoding often makes latency a significant serving bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving average relative WER improvements across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size, ranging from 128M to 340B parameters, on ASR performance. This study provides valuable insights into the factors that influence the effectiveness of practical large-scale LM-fused speech recognition systems.
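To make the fusion idea concrete, the sketch below illustrates what per-segment LLM scoring can look like in a second-pass setup: the ASR model emits an N-best list for each segment, an external LLM assigns a log-probability to each hypothesis, and the two scores are combined log-linearly to re-rank the candidates. This is an illustrative assumption of the general technique, not the paper's implementation; the function names `fuse_and_rerank` and `llm_score_fn` and the weight value are hypothetical.

```python
# Minimal sketch of per-segment LM-fusion rescoring (illustrative only).
# Because hypotheses are scored independently, the LLM calls for a segment
# can be batched and run in parallel on accelerator hardware, which is the
# non-autoregressive property the abstract highlights.

def fuse_and_rerank(nbest, llm_score_fn, lm_weight=0.3):
    """Re-rank ASR hypotheses by combining ASR and LLM log-probabilities.

    nbest        -- list of (hypothesis_text, asr_log_prob) pairs
    llm_score_fn -- callable returning an LLM log-probability for a text
    lm_weight    -- interpolation weight for the LLM score (assumed value)
    """
    rescored = []
    for text, asr_logp in nbest:
        llm_logp = llm_score_fn(text)            # one LLM scoring call per hypothesis
        fused = asr_logp + lm_weight * llm_logp  # log-linear combination
        rescored.append((fused, text))
    rescored.sort(reverse=True)                  # highest fused score first
    return rescored[0][1]


# Toy usage with a dummy LLM scorer that prefers the grammatical hypothesis.
if __name__ == "__main__":
    nbest = [("the cat sat on the mat", -4.2),
             ("the cat sad on the mat", -4.0)]
    dummy_llm = lambda t: -1.0 if "sat" in t else -6.0
    print(fuse_and_rerank(nbest, dummy_llm))
```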