Careless Whisper: Speech-to-Text Hallucination Harms

Published 12 Feb 2024 in cs.CL and cs.CY (arXiv:2402.08021v2)

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations -- a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.


Summary

  • The paper quantifies Whisper's hallucination phenomenon, reporting a 1.4% occurrence rate with 38% of these cases being harmful.
  • The study uses a comparative analysis of speakers with aphasia and controls, linking longer non-vocal durations to increased transcription errors.
  • The findings underscore the urgency for robust safeguards in ASR systems, especially in high-stakes areas like legal and medical contexts.

Introduction

Automated speech recognition (ASR) systems, such as OpenAI's Whisper, offer significant improvements over market competitors in transcription accuracy. Despite these advancements, however, Whisper faces a critical issue with hallucinations: transcriptions that contain nonsensical or fabricated text not present in the audio input. This paper investigates these hallucinations, focusing in particular on their prevalence among speakers with aphasia, and categorizes the types of harms they introduce.

Whisper’s Hallucination Phenomenon

Whisper's hallucinations are predominantly non-deterministic: the same input may produce different hallucinated output across repeated runs. The study found that 1.4% of tested transcriptions contained hallucinated sections, with 38% of those hallucinations deemed harmful (Figure 1).


Figure 1: While some hallucinated text could be considered innocuous despite being incorrect, a concerning 38% of the hallucinated text falls under one of three identified harmful categories.
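
This non-determinism can be probed by transcribing the same clip several times and measuring how often the runs agree. The following is an illustrative sketch only: `transcript_agreement`, the normalization, and the sample run outputs are hypothetical, not part of the study's actual pipeline.

```python
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so trivial
    # formatting differences are not counted as disagreement.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def transcript_agreement(transcripts: list[str]) -> float:
    # Fraction of runs matching the most common (majority) transcript.
    counts = Counter(normalize(t) for t in transcripts)
    return counts.most_common(1)[0][1] / len(transcripts)

# Hypothetical outputs from five runs over the same audio clip:
runs = [
    "thank you for calling",
    "thank you for calling",
    "Thank you for calling.",                     # case/punctuation only
    "thank you for calling",
    "thank you for calling the fire department",  # hallucinated tail
]
print(transcript_agreement(runs))  # 0.8
```

In practice one would feed the same audio to the ASR service several times; low agreement across runs is a cheap signal that part of the output may be hallucinated rather than transcribed.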

These hallucinations introduce three primary categories of harms:

  • Perpetuation of Violence: Includes text that suggests violence or demographic-based stereotyping, which can misrepresent the speaker.
  • Inaccurate Associations: Involves fictitious relationships or health statuses, leading to potential miscommunication.
  • False Authority: Includes language mimicking authoritative sources, potentially facilitating phishing.

Methodology

The research employs a comparison between audio data from speakers with aphasia and a control group. Whisper was used to transcribe 13,140 audio segments sourced from TalkBank’s AphasiaBank, focusing on English-speaking participants. Hallucinations were detected when discrepancies arose between Whisper transcriptions and the audio ground truth across multiple iterations.
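
The detection step can be approximated by aligning each transcription against the human ground-truth transcript and flagging inserted material. This is a simplified sketch using Python's standard `difflib`; the function name and example strings are invented, and the paper's actual protocol (multiple Whisper runs plus human review) is more involved.

```python
from difflib import SequenceMatcher

def hallucinated_spans(ground_truth: str, transcript: str) -> list[str]:
    # Align the two word sequences; contiguous words that the transcript
    # inserts with no counterpart in the ground truth are flagged.
    gt_words = ground_truth.lower().split()
    hyp_words = transcript.lower().split()
    matcher = SequenceMatcher(a=gt_words, b=hyp_words, autojunk=False)
    return [
        " ".join(hyp_words[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op == "insert"
    ]

truth = "he picked up the phone and dialed"
output = "he picked up the phone and dialed the emergency hotline"
print(hallucinated_spans(truth, output))  # ['the emergency hotline']
```

A real pipeline would also distinguish ordinary substitution errors from whole inserted phrases, e.g. by thresholding on span length.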

Results and Analysis

Results indicate that hallucinations occurred more frequently in audio from speakers with aphasia than in audio from the control group (1.7% vs. 1.2%, respectively) (Figure 2). Speakers with aphasia frequently exhibit longer non-vocal durations, which are strongly correlated with hallucinations, suggesting that Whisper is more prone to hallucinate when processing audio with longer pauses.

Figure 2: Speakers with aphasia had audio files with significantly longer shares of non-vocal sounds, and higher hallucination rates.
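
The non-vocal share driving this disparity can be computed from voice activity detector (VAD) output. A minimal sketch, assuming non-overlapping `(start, end)` speech segments in seconds, the kind of output a VAD such as Silero VAD can provide; the function name and sample segments are hypothetical:

```python
def non_vocal_share(speech_segments, total_duration):
    # speech_segments: non-overlapping (start, end) times in seconds
    # from a voice activity detector; total_duration: clip length.
    voiced = sum(end - start for start, end in speech_segments)
    return 1.0 - voiced / total_duration

# A 10-second clip with two voiced stretches (3.5 s of speech in total):
segments = [(0.0, 2.0), (3.5, 5.0)]
print(non_vocal_share(segments, 10.0))  # 0.65
```

Comparing this per-file share against whether a file was hallucinated is one way to reproduce the correlation the study reports.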

Furthermore, a comparison with Google's ASR services and other market offerings revealed no similar hallucinations, suggesting unique issues within Whisper due to its integration with generative LLM technologies.

Implications and Future Work

The presence of hallucinations raises practical concerns about the reliability of Whisper in high-stakes applications, such as legal or medical settings. Given the risk of consequential misrepresentations, particularly for vulnerable populations like those with aphasia, addressing these inaccuracies is critical. Future research should prioritize improvements in generative AI systems to reduce randomness and explore mitigation strategies that include extensive participation from affected communities. Additionally, further exploration into the intersectionality of hallucination biases could provide deeper insights into demographic-specific impacts.

Conclusion

Whisper’s hallucinations highlight key challenges in the integration of generative AI models in speech-to-text applications. Tackling these challenges is essential for ensuring fair and accurate transcriptions across diverse speaker populations. For future advancements, organizations should incorporate more inclusive design practices and perform rigorous validations to safeguard against these potential harms.
