
BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

Published 30 Apr 2025 in cs.CL, cs.SD, and eess.AS | (2505.00059v1)

Abstract: Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks, and continued work is needed to improve the robustness of such systems for more accurate real-world use.

Summary

An Examination of the BERSt Dataset for Distanced, Emotional, and Shouted Speech Recognition

The manuscript titled "BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition" introduces the BERSt dataset, presenting it as a novel and comprehensive resource for benchmarking speech recognition challenges in real-world environments. This summary examines the dataset's composition, the benchmarks provided for ASR and SER tasks, and the implications of the findings.

The BERSt dataset bridges several gaps in existing resources for Automatic Speech Recognition (ASR). It features approximately 3.75 hours of English speech recorded by 98 actors in diverse and authentic home settings. Notably, the data encompasses a range of environments with smartphones positioned in 19 different locations, addressing the dearth of datasets for distanced and varied acoustic environments commonly encountered in real-world ASR applications.

The dataset is characterized by its inclusion of emotional and shouted speech. Spanning seven emotion prompts and including shouted utterances, BERSt presents a challenging benchmark for both ASR and Speech Emotion Recognition (SER). Historical datasets have often focused narrowly on distance, typically employing multi-microphone arrays. In contrast, the BERSt dataset is collected with single smartphone microphones, thus aligning more closely with typical consumer setups.

Benchmarking results provided in the paper illustrate the challenges posed by the dataset. State-of-the-art ASR models, including Whisper-medium.en and Whisper-turbo, show notable performance declines with increased distance and vocal effort, with Word Error Rates (WER) reaching up to 76.21% in some conditions. Performance also varies across the different emotion prompts, indicating further obstacles to robust recognition. Notably, ASR effectiveness decreases with higher shouting levels and varies by emotional context, underscoring the dataset's complexity and its potential for fostering more resilient ASR systems.
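The ASR benchmarks above are reported as Word Error Rate (WER). As a minimal, self-contained sketch of how that metric is typically computed (not the authors' evaluation code, and using made-up example strings), WER is the word-level Levenshtein edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substitution + one deletion over a 4-word reference.
print(f"{wer('call for help now', 'fall for help') * 100:.2f}%")  # 50.00%
```

In practice, reported WER scores usually also apply text normalization (casing, punctuation) before scoring, which this sketch omits.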

The implications of the BERSt dataset extend to practical advancements in ASR and SER. By focusing on realistic recording conditions, BERSt can promote the development of models better suited for real-life applications, such as enhancing smart device interaction in uncontrolled environments. Additionally, its emphasis on varied conditions encourages more comprehensive approaches in addressing the nuanced role of nonverbal cues in ASR and SER.

For future research, this dataset not only sets a new standard in evaluating ASR systems' robustness but also prompts the development of advanced algorithms capable of managing the inherent variability in spontaneous human speech. As the field progresses, insights gleaned from working with BERSt could drive innovations in AI systems that require nuanced understanding and processing of human emotions and speech under duress or at a distance.

The authors' effort in creating and benchmarking against this dataset paves the way for addressing the persistent difficulties in speech recognition systems, thus facilitating broader AI development in diverse real-world applications. This dataset's utility in refining models for public safety, user interaction, and beyond highlights its significant contribution to ongoing research in speech processing.
