- The paper shows that non-speech audio triggers hallucinations in Whisper ASR, with a 40.3% occurrence rate.
- The paper introduces a Bag of Hallucinations (BoH) that reduces erroneous outputs by 67.08% when used with voice activity detection.
- The paper highlights that combining BoH with VAD and model adjustments significantly improves transcription accuracy and ASR robustness.
Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio
Introduction
This paper addresses hallucinations in Automatic Speech Recognition (ASR) systems, focusing on the Whisper ASR model. Hallucinations are erroneous outputs that an ASR system generates when it attempts to transcribe audio containing no speech. The phenomenon is critical to address because such outputs can lead to inaccurate interpretations of audio inputs, especially in real-world applications where accuracy is paramount.
The study begins by exploring how Whisper ASR is prone to hallucinations when fed non-speech audio. These hallucinations are outputs with no phonetic or semantic connection to the actual audio input. The paper further investigates the types and frequencies of these hallucinations and proposes a strategy to mitigate their influence on transcription accuracy through post-processing.
Methodology
Data Collection and Experimental Setup
The study employs the Whisper large-v3 model, setting up experiments with non-speech audio files to induce hallucinations. The dataset comprises a variety of sounds drawn from public datasets such as AudioSet, MUSAN, UrbanSound8K, and FSD50K, curated to exclude any form of human vocalization. The experiments also vary the duration and volume of these sound files to assess their influence on hallucination frequency.
The authors explore hallucinations through three main experimental phases:
- Hallucinations from Non-Speech Audio: Analyzing outputs from non-speech inputs to generate an exhaustive list of hallucinations.
- Bag of Hallucinations (BoH): Creating a filtered list from the hallucinations to be used in suppression and identification tasks.
- Speech Augmentation Studies: Understanding the effect of augmenting speech with non-speech sounds to evaluate hallucination occurrences.
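The BoH-construction phase can be illustrated with a small sketch. The idea is that any text produced for speech-free audio is by definition hallucinated, so outputs that recur across many clips are good candidates for the list. The function name, the `min_count` threshold, and the sample strings below are hypothetical illustrations, not the paper's exact filtering criteria:

```python
from collections import Counter

def build_boh(transcripts, min_count=2):
    """Build a Bag of Hallucinations from transcripts of non-speech clips.

    Keeps only outputs that recur across clips, filtering out one-off noise.
    Hypothetical sketch: the paper's actual filtering criteria may differ.
    """
    counts = Counter(t.strip().lower() for t in transcripts if t.strip())
    return {text for text, n in counts.items() if n >= min_count}

# Hypothetical outputs collected by transcribing speech-free audio:
outputs = [
    "Thanks for watching!",
    "Thanks for watching!",
    "thanks for watching!",
    "Subtitles by the community",
    "",
]
boh = build_boh(outputs)
```

Case-folding before counting merges variants of the same phrase, so the recurring output survives the threshold while singletons are dropped.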
Analysis of Results
Non-Speech Audio Hallucinations
The analysis reveals that a significant proportion (40.3%) of non-speech audio inputs lead to hallucinations. Looping, where previously recognized text repeats, is classified as a distinct type of hallucination. The results indicate that simple sounds can prompt the ASR model to output text that may appear misleading or offensive in real-world applications.
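Looping lends itself to a simple detection heuristic: scan the transcript for a phrase that repeats consecutively. This is a hypothetical sketch of such a detector, not the classification procedure used in the paper:

```python
def detect_looping(text, min_repeats=3):
    """Return True if some phrase repeats consecutively at least
    `min_repeats` times. Hypothetical heuristic for loop-type
    hallucinations; not the paper's classifier."""
    words = text.split()
    for size in range(1, len(words) // min_repeats + 1):
        for start in range(len(words) - size * min_repeats + 1):
            chunk = words[start:start + size]
            if all(words[start + k * size: start + (k + 1) * size] == chunk
                   for k in range(min_repeats)):
                return True
    return False
```

Checking all phrase lengths up to `len(words) // min_repeats` catches both single-word stutters and longer repeated sentences.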
Creation and Utilization of BoH
The Bag of Hallucinations, a carefully filtered set of frequent hallucinated outputs, is demonstrated to be an effective post-processing strategy. It enables systematic removal of matching hallucinations from transcriptions, decreasing Word Error Rate (WER). The BoH shows significant potential in reducing false outputs, neutralizing 67.08% of hallucinations when applied in conjunction with other detection strategies.
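The effect of BoH filtering on WER can be sketched directly: drop any transcript segment whose text matches a BoH entry, then score against the reference. The helper names and example strings here are illustrative assumptions, and the WER routine is a standard word-level edit distance rather than the paper's evaluation code:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def filter_boh(segments, boh):
    """Drop transcript segments whose text exactly matches a BoH entry."""
    return [s for s in segments if s.strip().lower() not in boh]

boh = {"thanks for watching!"}
segments = ["hello world", "Thanks for watching!"]
reference = "hello world"
raw_wer = wer(reference, " ".join(segments))
filtered_wer = wer(reference, " ".join(filter_boh(segments, boh)))
```

Removing the hallucinated segment eliminates the insertion errors it would otherwise contribute, which is the mechanism behind the WER reduction the paper reports.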
Application of BoH and Hallucination Mitigation
The practical utility of the Bag of Hallucinations is tested on transcriptions with mixed speech and non-speech content. Post-processing with the BoH effectively curtails hallucinations, especially when coupled with Voice Activity Detection (VAD). While adjusting Whisper model parameters offers only limited relief, combining VAD with BoH post-processing delivers the largest accuracy gains, markedly improving transcription quality and reliability.
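The VAD stage can be sketched with a crude frame-energy gate: only frames whose mean power exceeds a threshold are passed on for transcription, and the remaining output is then filtered against the BoH. The paper does not specify this particular detector; the threshold, frame length, and synthetic signals below are illustrative assumptions:

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Crude frame-energy VAD: one boolean per frame, True where mean
    power exceeds the threshold. A stand-in for a real VAD, used here
    only to illustrate gating non-speech audio before transcription."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        power = sum(x * x for x in frame) / frame_len
        flags.append(power > threshold)
    return flags

# Synthetic check: a loud 440 Hz tone vs. near-silence at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
near_silence = [0.001] * 1600
```

In a full pipeline, frames rejected by the gate would never reach the recognizer, so the BoH only has to catch hallucinations from the audio that slips through.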
Conclusion
The investigation underscores the importance of addressing hallucinations in ASR systems like Whisper. Establishing a BoH as a mechanism for post-processing provides a pragmatic approach to suppressing potentially catastrophic transcription errors. This research not only delineates procedures to diagnose and ameliorate the hallucination problem in ASR but also highlights the importance of model robustness against adversarial audio signals. The findings point towards future work in enhancing ASR systems' discrimination capabilities and refining post-processing techniques to ensure consistent and reliable real-world application.