Careless Whisper: Speech-to-Text Hallucination Harms

Published 12 Feb 2024 in cs.CL and cs.CY (arXiv:2402.08021v2)

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations -- a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.


Summary

  • The paper quantifies Whisper's hallucination phenomenon, reporting a 1.4% occurrence rate with 38% of these cases being harmful.
  • The study uses a comparative analysis of speakers with aphasia and controls, linking longer non-vocal durations to increased transcription errors.
  • The findings underscore the urgency for robust safeguards in ASR systems, especially in high-stakes areas like legal and medical contexts.

Introduction

Automated speech recognition (ASR) systems, such as OpenAI's Whisper, offer significant improvements over market competitors in transcription accuracy. Despite these advancements, however, Whisper faces a critical issue with hallucinations: transcriptions that contain nonsensical or fabricated text not present in the audio input. This paper investigates these hallucinations, focusing in particular on their prevalence among speakers with aphasia, and categorizes the types of harms they introduce.

Whisper’s Hallucination Phenomenon

Whisper's hallucinations are predominantly non-deterministic: the same input may produce different hallucinated output across repeated runs. The study found that 1.4% of tested transcriptions contained hallucinated sections, with 38% of those hallucinations deemed harmful (Figure 1).


Figure 1: While some hallucinated text could be considered innocuous despite being incorrect, a concerning 38% of the hallucinated text falls under one of three identified harmful categories.
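
This non-determinism can be probed by transcribing the same clip several times and measuring how often the runs agree. The following is an illustrative sketch only: `transcript_agreement`, the normalization, and the sample run outputs are hypothetical, not part of the study's actual pipeline.

```python
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so trivial
    # formatting differences are not counted as disagreement.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def transcript_agreement(transcripts: list[str]) -> float:
    # Fraction of runs matching the most common (majority) transcript.
    counts = Counter(normalize(t) for t in transcripts)
    return counts.most_common(1)[0][1] / len(transcripts)

# Hypothetical outputs from five runs over the same audio clip:
runs = [
    "thank you for calling",
    "thank you for calling",
    "Thank you for calling.",                     # case/punctuation only
    "thank you for calling",
    "thank you for calling the fire department",  # hallucinated tail
]
print(transcript_agreement(runs))  # 0.8
```

In practice one would feed the same audio to the ASR service several times; low agreement across runs is a cheap signal that part of the output may be hallucinated rather than transcribed.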

These hallucinations introduce three primary categories of harms:

  • Perpetuation of Violence: Includes text that suggests violence or demographic-based stereotyping, which can misrepresent the speaker.
  • Inaccurate Associations: Involves fictitious relationships or health statuses, leading to potential miscommunication.
  • False Authority: Includes language mimicking authoritative sources, potentially facilitating phishing.

Methodology

The research employs a comparison between audio data from speakers with aphasia and a control group. Whisper was used to transcribe 13,140 audio segments sourced from TalkBank’s AphasiaBank, focusing on English-speaking participants. Hallucinations were detected when discrepancies arose between Whisper transcriptions and the audio ground truth across multiple iterations.
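
The detection step can be approximated by aligning each transcription against the human ground-truth transcript and flagging inserted material. This is a simplified sketch using Python's standard `difflib`; the function name and example strings are invented, and the paper's actual protocol (multiple Whisper runs plus human review) is more involved.

```python
from difflib import SequenceMatcher

def hallucinated_spans(ground_truth: str, transcript: str) -> list[str]:
    # Align the two word sequences; contiguous words that the transcript
    # inserts with no counterpart in the ground truth are flagged.
    gt_words = ground_truth.lower().split()
    hyp_words = transcript.lower().split()
    matcher = SequenceMatcher(a=gt_words, b=hyp_words, autojunk=False)
    return [
        " ".join(hyp_words[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op == "insert"
    ]

truth = "he picked up the phone and dialed"
output = "he picked up the phone and dialed the emergency hotline"
print(hallucinated_spans(truth, output))  # ['the emergency hotline']
```

A real pipeline would also distinguish ordinary substitution errors from whole inserted phrases, e.g. by thresholding on span length.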

Results and Analysis

Results indicate that hallucinations occurred more frequently in audio from speakers with aphasia than in audio from the control group (1.7% vs. 1.2%, respectively) (Figure 2). Speakers with aphasia frequently exhibit longer non-vocal durations, which are strongly correlated with hallucinations, suggesting that Whisper is more prone to hallucinate when processing audio with longer pauses.

Figure 2: Speakers with aphasia had audio files with significantly longer shares of non-vocal sounds, and higher hallucination rates.
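
The non-vocal share driving this disparity can be computed from voice activity detector (VAD) output. A minimal sketch, assuming non-overlapping `(start, end)` speech segments in seconds, the kind of output a VAD such as Silero VAD can provide; the function name and sample segments are hypothetical:

```python
def non_vocal_share(speech_segments, total_duration):
    # speech_segments: non-overlapping (start, end) times in seconds
    # from a voice activity detector; total_duration: clip length.
    voiced = sum(end - start for start, end in speech_segments)
    return 1.0 - voiced / total_duration

# A 10-second clip with two voiced stretches (3.5 s of speech in total):
segments = [(0.0, 2.0), (3.5, 5.0)]
print(non_vocal_share(segments, 10.0))  # 0.65
```

Comparing this per-file share against whether a file was hallucinated is one way to reproduce the correlation the study reports.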

Furthermore, a comparison with Google's ASR services and other market offerings revealed no similar hallucinations, suggesting unique issues within Whisper due to its integration with generative LLM technologies.

Implications and Future Work

The presence of hallucinations raises practical concerns about the reliability of Whisper in high-stakes applications, such as legal or medical settings. Given the risk of consequential misrepresentations, particularly for vulnerable populations like those with aphasia, addressing these inaccuracies is critical. Future research should prioritize improvements in generative AI systems to reduce randomness and explore mitigation strategies that include extensive participation from affected communities. Additionally, further exploration into the intersectionality of hallucination biases could provide deeper insights into demographic-specific impacts.

Conclusion

Whisper’s hallucinations highlight key challenges in the integration of generative AI models in speech-to-text applications. Tackling these challenges is essential for ensuring fair and accurate transcriptions across diverse speaker populations. For future advancements, organizations should incorporate more inclusive design practices and perform rigorous validations to safeguard against these potential harms.
