- The paper presents a novel method using LLMs to expand and rescore ASR n-best hypotheses, significantly reducing Word Error Rate.
- It integrates instruction-tuned LLMs with traditional ASR systems to generate additional candidates and combine linguistic plausibility with confidence scores.
- Empirical results show relative WER reductions of up to 25%, highlighting the potential for high-fidelity transcriptions in diverse applications.
ProGRes: Prompted Generative Rescoring on ASR N-Best
The paper "ProGRes: Prompted Generative Rescoring on ASR N-Best", authored by Ada Defne Tur, Adel Moumen, and Mirco Ravanelli, explores an innovative method for enhancing Automatic Speech Recognition (ASR) using large language models (LLMs). The proposed methodology leverages instruction-tuned generative LLMs to improve the quality of ASR transcriptions by dynamically expanding and rescoring the n-best hypothesis lists generated during beam search.
Overview of the ProGRes Methodology
The core of the ProGRes methodology revolves around integrating LLMs with traditional ASR systems to address common transcription challenges, such as noise, reverberation, and the inclusion of named entities. The proposed system, PROmpted Generative REScoring (ProGRes), operates as follows:
- Generation of n-Best Hypotheses: The initial step involves generating a set of n-best candidate transcriptions using a pretrained ASR model.
- Prompted Hypothesis Expansion: The n-best list is fed into an instruction-tuned LLM (e.g., GPT-4 Turbo, GPT-3.5 Turbo, or Llama-3), which generates additional hypotheses based on carefully crafted prompts.
- Rescoring of Hypotheses: A state-of-the-art open-weight LLM (e.g., Llama-3) assigns linguistic plausibility scores to the extended set of hypotheses, which are then interpolated with the ASR confidence scores to select the most accurate transcription.
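The interpolation in the rescoring step can be sketched as follows. This is a minimal illustration with dummy scores; the exact scoring and weighting scheme used in the paper is not reproduced here, and the helper name `rescore` is an assumption.

```python
# Minimal sketch of ProGRes-style rescoring: interpolate an ASR confidence
# score with an LLM plausibility score and keep the best hypothesis.
# The scores below are dummy values; in the paper, plausibility scores come
# from an open-weight LLM such as Llama-3.

def rescore(hypotheses, weight=0.5):
    """hypotheses: list of (text, asr_log_score, llm_log_score) tuples.
    Returns the text whose interpolated score is highest."""
    best_text, best_score = None, float("-inf")
    for text, asr_score, llm_score in hypotheses:
        combined = weight * asr_score + (1.0 - weight) * llm_score
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text

n_best = [
    ("recognize speech", -1.2, -0.8),    # ASR hypothesis
    ("wreck a nice beach", -1.0, -3.5),  # acoustically close, implausible
    ("recognise speech", -1.5, -0.9),    # LLM-generated variant
]
print(rescore(n_best, weight=0.5))  # -> "recognize speech"
```

The interpolation weight balances acoustic evidence against linguistic plausibility: a weight near 1.0 trusts the ASR model, while a weight near 0.0 trusts the LLM.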
Key Contributions and Results
The paper provides several innovative contributions to the field:
- Dynamic Hypothesis Expansion: By using LLMs to generate additional hypotheses, the method significantly improves the likelihood that a correct transcription appears within the final set, thereby reducing the overall Word Error Rate (WER).
- Effective Combination of Scores: The novel approach of interpolating ASR confidence scores with LLM-derived linguistic plausibility scores leads to more reliable final transcriptions.
- Practical Performance Gains: The method demonstrated significant WER improvements in both mismatched conditions (ASR1 trained on SPGISpeech, evaluated on CommonVoice) and matched conditions (ASR2 on CommonVoice), with relative reductions ranging from 5% to 25%.
Evaluation and Analysis
The paper presents comprehensive experimental results using different ASR models and LLM configurations:
- ASR1 and ASR2 Performance: The baseline WER for ASR1 was 42.94%, which improved to 40.84% using GPT-4 within the ProGRes system. For ASR2, the improvement was more pronounced, reducing WER from 12.38% to 9.32%.
- LLM Comparative Analysis: Among the LLMs tested, GPT-4 yielded the best performance, followed by GPT-3.5 and Llama-3, highlighting the potential of utilizing more sophisticated LLMs for ASR rescoring.
- Oracle Comparison: The oracle WER (obtained by choosing the best hypothesis from the extended set) shows that room for improvement remains, underscoring the potential of further optimizing the scoring and combination strategies.
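The oracle comparison above can be made concrete with a standard word-level edit distance. The sketch below is a generic implementation of WER and oracle selection, not the paper's evaluation code.

```python
# Word-level Levenshtein distance, normalized by reference length, gives WER.
# The "oracle" picks the hypothesis with the lowest WER against the reference,
# i.e., the best achievable choice from an n-best list.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

def oracle(reference, hypotheses):
    """Best achievable hypothesis from the list (lowest WER)."""
    return min(hypotheses, key=lambda h: wer(reference, h))

ref = "the quick brown fox"
hyps = ["a quick brown fox jumped", "the quick brown box", "the quick brown fox"]
print(oracle(ref, hyps))  # -> "the quick brown fox"
```

The gap between the system's actual WER and this oracle WER quantifies how much headroom better scoring could still recover.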
Practical and Theoretical Implications
Practical Implications:
- Enhanced Transcription Accuracy: The method holds significant promise for domains requiring high transcription fidelity, such as medical dictation, legal transcriptions, and technical domains involving unique terminology.
- Flexibility and Adaptability: ProGRes can be integrated with various existing ASR systems, making it a versatile tool for improving transcription quality across different applications.
Theoretical Implications:
- Advancements in LLM-ASR Integration: The paper advances the theoretical understanding of how generative LLMs can complement traditional ASR models, potentially guiding future research on hybrid LLM-ASR systems.
- Prompts and Linguistic Plausibility: The research highlights the importance of prompt engineering and the ability of LLMs to perform zero-shot rescoring, paving the way for further exploration of prompt-based AI systems.
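To make the prompt-engineering point concrete, a hypothesis-expansion prompt might look like the sketch below. The wording and the function name `expansion_prompt` are illustrative assumptions, not the paper's actual prompt.

```python
# Illustrative prompt for prompted hypothesis expansion: the n-best list is
# numbered and the LLM is asked to propose corrected variants.
# (The phrasing here is an assumption, not taken from the paper.)

def expansion_prompt(n_best):
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(n_best))
    return (
        "The following are candidate transcriptions of the same utterance, "
        "produced by a speech recognizer:\n"
        f"{numbered}\n"
        "Propose additional plausible transcriptions that correct likely "
        "recognition errors. Return one candidate per line."
    )

print(expansion_prompt(["recognize speech", "wreck a nice beach"]))
```

Because the LLM sees several acoustically plausible variants at once, it can triangulate what the speaker likely said without any task-specific fine-tuning.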
Future Directions
The authors acknowledge the computational complexity associated with integrating LLMs and propose several future research avenues:
- Optimizing Computational Efficiency: Future work could focus on reducing the resource demands of LLMs, potentially involving techniques such as distillation or quantization.
- Domain-Specific Fine-Tuning: Fine-tuning LLMs on specific domains could further enhance transcription accuracy, especially in specialized fields requiring domain-specific knowledge.
- Expanding Dataset and ASR Coverage: Evaluating ProGRes on a broader set of ASR systems and datasets will help generalize its applicability and verify its effectiveness across different speech recognition scenarios.
In conclusion, the ProGRes method presents a significant step forward in improving ASR performance by effectively harnessing the capabilities of modern LLMs. The promising results indicate substantial potential for practical applications and theoretical advancements in the domain of speech recognition.