
A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition

Published 28 Jun 2025 in cs.SD and eess.AS | (2506.22810v1)

Abstract: Dysarthric speech recognition (DSR) enhances the accessibility of smart devices for dysarthric speakers with limited mobility. Previously, DSR research was constrained by the fact that existing datasets typically consisted of isolated words, command phrases, and a limited number of sentences spoken by a few individuals. This constrained research to command-interaction systems and speaker adaptation. The Speech Accessibility Project (SAP) changed this by releasing a large and diverse English dysarthric dataset, leading to the SAP Challenge to build speaker- and text-independent DSR systems. We enhanced the Whisper model's performance on long dysarthric speech via a novel self-training method. This method increased training data and adapted the model to handle potentially incomplete speech segments encountered during inference. Our system achieved second place in both Word Error Rate and Semantic Score in the SAP Challenge.

Summary

  • The paper introduces a novel self-training method in which Whisper acts as a teacher to generate accurate pseudo-labels, improving recognition of long dysarthric speech.
  • It employs an effective segmentation strategy with heuristic filtering to handle incomplete speech segments and significantly reduce word error rates.
  • Experimental results using the SAP dataset and optimal beam decoding validate enhanced ASR performance and robust speaker-independent dysarthric speech recognition.


Introduction

Automatic Speech Recognition (ASR) systems encounter challenges when dealing with dysarthric speech, characterized by impaired vocal control due to neurological conditions such as Parkinson’s disease and cerebral palsy. Traditional ASR models generally perform poorly on dysarthric speech, primarily because these models are trained on typical speech data, which lacks the diverse acoustic and prosodic variations present in dysarthric speech. Dysarthric speech recognition (DSR) has historically been constrained to command-level tasks due to the limited diversity of available datasets. With the emergence of the Speech Accessibility Project (SAP), a shift toward speaker-independent and text-independent DSR systems has been initiated, thereby promoting more advanced research in this domain.

The paper presents an innovative self-training approach designed to enhance Whisper's performance on long dysarthric speech. By leveraging the large and diverse SAP dataset, the researchers aimed to mitigate the data scarcity in DSR that typically hinders robust system development. This new methodology involves augmenting the training data, particularly catering to the handling of potentially incomplete speech segments during inference (Figure 1).

Figure 1: The overview of our self-training approach to segment long dysarthric speech (ST-SLDS).

Self-Training Optimization

The proposed self-training approach addresses the challenge of segmenting long dysarthric speech (ST-SLDS). Initially, short dysarthric speech samples are used to fine-tune the Whisper model as a teacher model. This teacher model then generates pseudo-labels for the unused long dysarthric speech through a segmentation-based inference approach. The segmentation occurs prior to inference, where long speech is divided into segments that fit within the model’s positional encoding limits without leading to incomplete predictions.
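The even-segmentation idea referenced here can be sketched in a few lines of Python. This is a hypothetical helper, not the authors' code: it splits a long recording into the fewest equal-length chunks that each fit within Whisper's roughly 30-second input window.

```python
import math

def even_segments(total_sec: float, max_sec: float = 30.0):
    """Split a recording of `total_sec` seconds into equal-length
    chunks, each no longer than `max_sec` (Whisper's input window).
    Returns a list of (start, end) times in seconds."""
    n = max(1, math.ceil(total_sec / max_sec))  # fewest chunks that fit
    step = total_sec / n                        # equal chunk duration
    return [(i * step, (i + 1) * step) for i in range(n)]

# A 95-second recording needs 4 chunks of 23.75 s each.
print(even_segments(95.0))
```

Keeping every chunk the same length (rather than cutting fixed 30-second pieces with a short remainder) avoids producing one very short final segment, which is where incomplete predictions tend to occur.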

Filtered data, meeting specific heuristic criteria (e.g., zero WER, zero insertion, and deletion errors), are selected for further processing. This involves annotation via segmentation algorithms, where precise timestamps are used for segment division, and corresponding predictions are used as labels. Through iterative training, the segmented data is merged into the training pool, progressively advancing the model's robustness and adaptability (Figure 2).
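The filtering heuristic above can be sketched in plain Python (a hypothetical illustration, not the authors' implementation): align the teacher's hypothesis against the reference transcript with a standard word-level Levenshtein alignment, count substitutions, insertions, and deletions, and keep only segments with no insertion or deletion errors.

```python
def edit_counts(ref, hyp):
    """Word-level Levenshtein alignment of `hyp` against `ref`.
    Returns (substitutions, insertions, deletions)."""
    R, H = len(ref), len(hyp)
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i                      # all deletions
    for j in range(1, H + 1):
        cost[0][j] = j                      # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to count each error type.
    i, j, S, I, D = R, H, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1; i -= 1
        else:
            I += 1; j -= 1
    return S, I, D

def keep_pseudo_label(ref_words, hyp_words):
    """Heuristic as summarized in the text: keep a segment if its WER
    is zero, or if it contains only substitutions (no insertions or
    deletions), since boundaries then still line up with the reference."""
    _, I, D = edit_counts(ref_words, hyp_words)
    return I == 0 and D == 0
```

Requiring zero insertions and deletions is what ties the prediction's word boundaries to the reference, so the predicted timestamps can safely be reused as segment labels.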

Figure 2: Segmentation algorithm (a) and (b).

Experimental Results

Empirical evaluations were conducted using various advanced ASR models to form speaker-independent and text-independent DSR systems. The Whisper large-v3 model demonstrated superior performance, especially when employing an even segmentation (E-S) strategy during inference, which significantly improved the Word Error Rate (WER). The model was further refined through systematic experimentation with inference settings, including various segmentation and decoding strategies.

In evaluating different settings, a beam size of 10 was identified as optimal for decoding, with the VAD-based segmentation (VAD-S) strategy yielding better segmentation points than E-S. The proposed self-training method markedly enhanced performance by incrementally incorporating pseudo-labeled data filtered through a robust heuristic strategy. The inclusion of the SAP0430_processed dev partition in training brought additional improvement, although speaker complexity introduced challenges in aligning performance on new test datasets (Figure 3).

Figure 3: The duration distribution of the training data in SAP0430_processed.

Implications and Future Directions

The findings underscore the potential of self-training methodologies in addressing the gaps in dysarthric speech datasets, particularly for long-segment speech recognition. Enhancing Whisper models through iterative self-training presents a viable solution to the mismatch between training and inference scenarios. This approach not only improves recognition accuracy for long dysarthric speech but also demonstrates resilience in adapting to the varied acoustic features inherent in dysarthric speech.

Looking ahead, further refinement of segmentation strategies and heuristic filtering could yield additional performance enhancements. Continued exploration into adapting VAD tools specifically for dysarthric speech may also facilitate more accurate segmentation, thereby optimizing the model’s efficacy.

Conclusion

The second-place system presented in this paper represents a significant step toward refining dysarthric speech recognition systems. By systematically increasing training data through self-training and addressing training-inference mismatches, the proposed methodology achieves substantial gains in model performance. These insights and techniques contribute to the broader field of ASR, offering avenues for future research in adaptive speech recognition systems for speech disorders.


Explain it Like I'm 14

Overview

This paper is about helping computers understand speech from people with dysarthria, a condition that makes speaking clearly difficult. The researchers focus on long recordings (not just short words or commands) and improve a well-known speech recognition model called Whisper so it can better handle this challenging speech.

Goals

The paper asks two main questions:

  • How can we make Whisper recognize long, dysarthric speech more accurately?
  • Can we create a system that works well for many different speakers and sentences (not just for one person or one set of phrases)?

Methods (explained simply)

Think of teaching a computer like training a student:

  • First, you teach the student using clear, short examples with answers (short audio clips with correct text).
  • Then, you let the student try longer, harder examples. The student writes down what they think was said (these are “pseudo-labels”).
  • You keep only the parts the student seems confident about, and you add these to the training set to teach the student again.
  • You repeat this a few times so the student gets better.

That’s the idea behind the paper’s “self-training” approach. Here are the key parts:

  • Why long speech is hard: Whisper is designed to handle up to about 30 seconds at a time. Long recordings get cut off, which can make it miss words or finish sentences badly.
  • Segmentation (cutting audio into pieces): The team slices long recordings into shorter chunks before recognition.
    • Even Segmentation: Cut the audio into equal pieces (like slicing a long baguette into equal slices).
    • VAD Segmentation: Use a tool called Voice Activity Detection (VAD) to find parts where someone is actually speaking, then group those parts into chunks.
  • Decoding strategies (how the model chooses words):
    • Greedy Search: Pick the best next word each time, quickly.
    • Beam Search: Try several possible sentences at once and pick the best one (like exploring multiple paths before deciding).
    • Prompting: Feed the text from the previous chunk to help the next chunk stay consistent. This sometimes helps meaning but can hurt accuracy.
  • Teacher–Student Self-Training:
    • If a chunk has WER = 0 (a perfect match), they trust it.
    • If WER isn’t zero but there are no extra or missing words (only substitutions), they still consider it useful.
    • Label those good chunks and add them to the training set.
    • Train a “student” model on the bigger dataset, then make it the new teacher.
    • Repeat a few times.
  • Matching the training format to Whisper: Whisper was originally trained on text where only the first letter is uppercase and the rest are lowercase (like “Hello world”). Training with the same format (“First-letter uppercase,” called F-U) gave better results than using ALL UPPERCASE (A-U).
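The F-U text format mentioned above is easy to illustrate. This is a toy sketch (the function name is made up, and it ignores proper nouns): lowercase the transcript, then capitalize only the first letter, matching the casing style of Whisper's original training text.

```python
def to_first_upper(text: str) -> str:
    """Normalize a transcript to the F-U convention: all lowercase
    except a capital first letter ("Hello world" style)."""
    t = text.strip().lower()
    return t[:1].upper() + t[1:] if t else t

print(to_first_upper("HELLO WORLD THIS IS A TEST"))
# → "Hello world this is a test"
```

Keeping the fine-tuning labels in the same casing format the model saw during pretraining means the model does not have to relearn an orthographic convention on top of the acoustic task.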

Main Findings and Why They Matter

  • Whisper performs best among several speech models when long audio is split into chunks before recognition.
  • Using VAD-based segmentation + beam search (with a beam size around 10) + the F-U text format gave strong results.
  • The self-training approach steadily improved accuracy over several rounds. After about three iterations, performance was best; too many iterations started to cause overfitting (the model focuses too much on the training data and gets worse on new data).
  • The system ranked second in both Word Error Rate (WER) and Semantic Score (SemScore) in the SAP Challenge, a competition focused on dysarthric speech recognition.
    • WER: Lower is better; it’s like counting spelling mistakes compared to the true answer.
    • SemScore: Higher is better; it measures how similar the meaning is to the correct sentence, not just exact words.

These results matter because they show a practical way to make speech technology more accurate for people with speech disorders, especially for longer, real-world conversations.

Implications and Impact

  • Accessibility: Better recognition of dysarthric speech helps people control devices, write messages, and use smart assistants more easily.
  • General approach: The self-training plus smart segmentation strategy can be used to improve other models on long, difficult audio—not just Whisper.
  • Real-world readiness: By adapting training to match how the system actually works at test time (short chunks from long audio), the model becomes more reliable in everyday use.

In short, this research provides a clear, effective recipe to make speech recognition fairer and more usable for people with dysarthria, moving beyond short commands toward long, natural speech.
