SAP Corpus: English Impaired Speech Recordings

Updated 20 February 2026
  • SAP Corpus is a large-scale collection of over 400 hours of impaired English speech recordings with transcriptions from 524 speakers, covering multiple etiologies such as PD, ALS, DS, CP, and stroke.
  • It features diverse speech materials including digital assistant commands, novel read sentences, and spontaneous utterances, ensuring both realistic and linguistically challenging scenarios for ASR development.
  • The dataset supports robust, speaker-independent ASR evaluation through standardized splits, remote evaluation protocols, and benchmarking using both WER and composite semantic metrics (SemScore).

The Speech Accessibility Project (SAP) Corpus is a large-scale, speaker-diverse set of English speech recordings and transcriptions from individuals with a range of speech disabilities, notably hypokinetic dysarthria secondary to Parkinson’s disease (PD), as well as Down Syndrome (DS), amyotrophic lateral sclerosis (ALS), cerebral palsy (CP), and post-stroke conditions. SAP addresses the historical lack of publicly available impaired-speech datasets suitable for training and benchmarking robust, speaker-independent automatic speech recognition (ASR) models for the disabled-speech community (Singh et al., 25 Jan 2025, Zheng et al., 29 Jul 2025, Zheng et al., 2024).

1. Corpus Composition and Demographics

The SAP Corpus comprises more than 400 hours of impaired English speech and transcriptions from 524 unique speakers, making it substantially larger and more clinically diverse than prior publicly released corpora for dysarthric speech modeling (Zheng et al., 29 Jul 2025). Its primary constituency (by recording hours) consists of speakers with Parkinson’s disease—representing 73–76% of total duration across training and evaluation splits. Additional etiologies—ALS, DS, CP, and stroke—are explicitly included for multi-etiology research. Collected between April and December 2023 under IRB protocols and in collaboration with advocacy groups, the SAP corpus encompasses U.S., Canadian, and Puerto Rican English dialects.

Recordings use both remote and in-clinic modalities, standardized to 16 kHz mono audio. All data splits are strictly speaker-independent, with test sets (“unshared”) further restricting overlap by holding out utterances and prompts not seen in training (Zheng et al., 29 Jul 2025).

Table: SAP Corpus Summary (Post-Processed) (Zheng et al., 29 Jul 2025)

Split   Speakers   Utterances   Duration (h)
Train        369      131,420         290.35
Dev           55       19,275          43.56
Test1         50       18,397          42.16
Test2         50       17,752          38.77
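The split totals can be sanity-checked against the headline figures (524 speakers, more than 400 hours):

```python
# Per-split figures taken directly from the SAP Corpus summary table.
splits = {
    "Train": (369, 131_420, 290.35),
    "Dev":   (55,  19_275,  43.56),
    "Test1": (50,  18_397,  42.16),
    "Test2": (50,  17_752,  38.77),
}

# Splits are speaker-independent, so speaker counts sum to the corpus total.
total_speakers = sum(s for s, _, _ in splits.values())
total_hours = sum(h for _, _, h in splits.values())
print(total_speakers)         # 524
print(round(total_hours, 2))  # 414.84, i.e. "more than 400 hours"
```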

Within earlier SAP-1005 releases, detailed for PD, 253 speakers provide 174.79 hours of audio, reflecting severity-stratified subgroups and maintaining balanced gender distribution within splits (Zheng et al., 2024, Singh et al., 25 Jan 2025).

2. Speech Material and Annotation Protocol

Speech materials within SAP cover three core types: (a) digital assistant commands (task-oriented, 67%), (b) lexically novel read sentences (22%), and (c) spontaneous prompted and conversational utterances (11%). This composition ensures both real-world and linguistically challenging scenarios (Singh et al., 25 Jan 2025).

Transcripts are produced under standardized guidelines: case normalization to all-uppercase, removal of all non-internal apostrophe punctuation, exclusion of bracketed PII/annotations, and generation of two versions—one retaining disfluencies and reparanda, one cleansed. Numbers and abbreviations undergo text normalization with NeMo followed by manual review. Per-utterance metadata encompasses etiology, location, and demographic details (where permitted) (Zheng et al., 29 Jul 2025).
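A minimal sketch of these transcript conventions; the regexes are illustrative, and NeMo number/abbreviation normalization is not reproduced:

```python
import re

def normalize_transcript(text: str) -> str:
    """Sketch of the SAP transcript guidelines: uppercase, strip bracketed
    PII/annotations, and drop punctuation except word-internal apostrophes.
    (NeMo-based number/abbreviation expansion is assumed to happen upstream.)"""
    text = re.sub(r"\[[^\]]*\]", " ", text)        # remove bracketed annotations
    text = text.upper()                            # case normalization
    text = re.sub(r"'(?!\w)|(?<!\w)'", " ", text)  # apostrophes not inside a word
    text = re.sub(r"[^\w' ]", " ", text)           # remove remaining punctuation
    return " ".join(text.split())                  # collapse whitespace

print(normalize_transcript("Turn [cough] on the lights, it's 5 o'clock."))
# → TURN ON THE LIGHTS IT'S 5 O'CLOCK
```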

Impairment severity for PD speakers in SAP-1005 is operationalized as the character error rate (CER) of an external baseline ASR model (wav2vec 2.0 fine-tuned on LibriSpeech-960h) on each speaker’s data. Severity bands are as follows: very low (VL, CER < 10%), low (L, 10–20%), medium (M, 20–40%), and high (H, ≥40%) (Zheng et al., 2024, Singh et al., 25 Jan 2025).
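The banding is a direct threshold map over baseline CER; a minimal sketch:

```python
def severity_band(cer_percent: float) -> str:
    """Map a speaker's baseline-ASR CER (%) to the SAP-1005 severity band."""
    if cer_percent < 10:
        return "VL"   # very low
    if cer_percent < 20:
        return "L"    # low
    if cer_percent < 40:
        return "M"    # medium
    return "H"        # high (CER >= 40%)

print([severity_band(c) for c in (4.2, 15.0, 33.3, 55.0)])
# → ['VL', 'L', 'M', 'H']
```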

3. Access, Licensing, and Benchmarking

SAP is made available for noncommercial research under a Data User Agreement (DUA) requiring signed application and compliance with use restrictions (no redistribution). Remote evaluation protocols are enforced using EvalAI infrastructure, with protected test sets—researchers submit model outputs for central scoring (Zheng et al., 29 Jul 2025). The official portal is https://speechaccessibilityproject.beckman.illinois.edu/.

Benchmarking employs both literal and semantic ASR metrics:

  • Word Error Rate (WER): WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions against a reference of N words, with test-level aggregation and utterance-level clipping at 100%.
  • Semantic Score (SemScore): SemScore = α·Score_NLI + β·Score_BERT + γ·Score_Soundex, with empirically fitted weights and components measuring logical entailment (MENLI / RoBERTa-large), semantic similarity (BERTScore F1), and phonetic distance (Jaro–Winkler on Soundex codes) (Zheng et al., 29 Jul 2025).

Reference selection for each test utterance is automatic: WER and SemScore are computed between the hypothesis and both reference variants (with/without disfluencies), with the minimum WER and maximum SemScore retained.
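A minimal sketch of this scoring rule: word-level edit distance, clipping at 100%, and the minimum WER over the two reference variants:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate (S + D + I) / N via Levenshtein distance over words,
    clipped at 1.0 as in the challenge formulation."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))       # DP row: edit distance to hyp prefixes
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return min(d[len(h)] / max(len(r), 1), 1.0)

def best_wer(refs, hyp):
    """Keep the minimum WER over the with/without-disfluency references."""
    return min(wer(ref, hyp) for ref in refs)

print(best_wer(["I UM WANT COFFEE", "I WANT COFFEE"], "I WANT COFFEE"))  # 0.0
```

SemScore is handled symmetrically, except that the maximum over the two references is retained.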

4. ASR Modeling Methodologies and Results

State-of-the-art baselines and challenge submissions leverage large pretrained encoder–decoder Transformer architectures (e.g., OpenAI Whisper, NVIDIA Parakeet), adapted to SAP via strategies such as full fine-tuning and severity-stratified loss weighting or multi-task training.

Performance on SAP-1005 (dev_unshared) achieves CER 6.99% and WER 10.71% using fine-tuned Whisper models; this reflects a ≈60% relative WER reduction compared to the wav2vec 2.0 baseline (Singh et al., 25 Jan 2025). The top 2025 SAP Challenge entry yielded WER 8.11% and SemScore 88.44% on the official test set, outperforming the Whisper-large-v2 baseline (WER 17.82%, SemScore 75.85%) (Zheng et al., 29 Jul 2025).
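The ≈60% figure is consistent with simple arithmetic against the fine-tuned SAP-1005 wav2vec 2.0 WER of 26.92% reported in the table below; treating that as the comparison point is an assumption:

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction: fraction of baseline error eliminated."""
    return (baseline - new) / baseline

# Fine-tuned Whisper WER 10.71% vs. an assumed 26.92% wav2vec 2.0 baseline.
print(round(relative_reduction(26.92, 10.71), 3))  # 0.602, i.e. ~60% relative
```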

Cross-etiology transfer is limited: fine-tuning exclusively on PD (SAP-1005) results in WER increase on the TORGO corpus (spastic/flaccid/ataxic dysarthria) to 39.56% (Singh et al., 25 Jan 2025).

Corpus/Model             WER (All)   WER (Male)        WER (Female)
LibriSpeech 100h FT          42.53        47.96             34.85
LibriSpeech 960h FT          36.33        41.59             28.85
SAP-1005 (Standard FT)       26.92        30.08             22.45
SAP-1005 Weighted/MTL        26.53     13.36–13.28 (VL)   (lower for L, F)

Severity-stratified MTL and loss-weighted fine-tuning further reduce WER, with auxiliary severity heads conferring largest benefits for high-impaired speakers.

5. Evaluation Metrics and Error Analysis

SAP evaluations rely on WER and CER for literal accuracy and SemScore for semantic fidelity. The challenge formulation clips utterance-level WER at 1.0 to robustly handle high-error speakers. SemScore offers a weighted ensemble of NLI-based entailment, semantic similarity, and phonetic distance, with linear weights (α = 0.40, β = 0.28, γ = 0.32) determined by cross-validation against human ratings (Zheng et al., 29 Jul 2025).
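The SemScore combination is a simple linear blend; a minimal sketch with the reported weights, where the component scorers (MENLI entailment, BERTScore F1, Soundex Jaro–Winkler) are assumed to be computed elsewhere:

```python
def semscore(score_nli: float, score_bert: float, score_soundex: float,
             alpha: float = 0.40, beta: float = 0.28, gamma: float = 0.32) -> float:
    """Linear SemScore ensemble with the cross-validated weights.
    Component scores are assumed to be precomputed on [0, 1]."""
    return alpha * score_nli + beta * score_bert + gamma * score_soundex

# Hypothetical component scores for illustration.
print(round(semscore(0.9, 0.8, 0.7), 3))  # 0.808
```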

Common ASR error patterns on SAP include deletion of weak consonants in hypokinetic speech, vowel substitution due to centralization, and over-segmentation (fragmentation or repetitive hallucinations) on long spontaneous utterances, particularly in high-severity speakers and unstructured prompts (Singh et al., 25 Jan 2025). WER rises steeply from 7–8% in very low/low severity to 30% in high-severity PD. Similarly, spontaneous prompts yield WERs >19%, compared to 7–8% for task-oriented and novel-sentence reading.

Test set construction isolates “unshared” text cases (absent from training) to ensure authentic evaluation. A strong negative correlation (ρ ≈ −0.965) between WER and SemScore was observed, validating literal accuracy as a necessary (though not sufficient) condition for semantic equivalence (Zheng et al., 29 Jul 2025).
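The reported correlation is Spearman's ρ; a tie-free rank-correlation sketch (illustrative only; production code would typically use scipy.stats.spearmanr):

```python
def spearman_rho(x, y):
    """Spearman rank correlation for paired scores (ties not handled)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly inverse rankings, as with WER (error) vs. SemScore (quality):
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```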

6. Research Impact and Future Directions

The SAP Corpus constitutes the largest and most methodologically rigorous English impaired-speech ASR benchmark to date, facilitating both robustness and generalizability studies. It enables:

  • Development and evaluation of genuinely speaker-independent, etiology-agnostic ASR solutions for impaired speech.
  • Direct benchmarking of cross-etiology transfer and zero-shot adaptation between dysarthria subtypes (PD, ALS, CP, DS, stroke).
  • Severity-aware and multi-modal modeling paradigms, including ASR+impairment prediction and integration of visual or articulatory signals (Singh et al., 25 Jan 2025, Zheng et al., 2024).

Recommended future work includes collection of richer metadata (recording hardware, precise demographics), expansion to healthy controls for contrastive learning, systematic use of augmentation (SpecAugment, vocal-tract length perturbation), application of word-importance-based error metrics, and multi-task architectures coupling ASR with etiology and severity prediction (Singh et al., 25 Jan 2025, Zheng et al., 2024, Zheng et al., 29 Jul 2025, Kafle et al., 2018).
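Among the recommended augmentations, SpecAugment is straightforward to sketch; the mask counts and widths below are illustrative defaults, not values from the SAP papers, and a production pipeline would use an existing implementation (e.g., in torchaudio):

```python
import random

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, F=27, T=40):
    """Minimal SpecAugment sketch (frequency and time masking only, no time
    warping). `spec` is a [frames][mel_bins] list of lists; masked bins are
    zeroed. Mask parameters are illustrative assumptions."""
    n_frames, n_bins = len(spec), len(spec[0])
    out = [row[:] for row in spec]            # leave the input untouched
    for _ in range(n_freq_masks):             # mask a random band of mel bins
        f = random.randint(0, min(F, n_bins))
        f0 = random.randint(0, n_bins - f)
        for row in out:
            for b in range(f0, f0 + f):
                row[b] = 0.0
    for _ in range(n_time_masks):             # mask a random span of frames
        t = random.randint(0, min(T, n_frames))
        t0 = random.randint(0, n_frames - t)
        for i in range(t0, t0 + t):
            out[i] = [0.0] * n_bins
    return out

augmented = spec_augment([[1.0] * 80 for _ in range(100)])
```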

7. Connections to Broader Evaluation and Accessibility Research

The methodological rigor of SAP’s evaluation aligns with advances in ASR metrics from the spoken dialogue field, which advocate importance-weighted WER based on user-centered criteria (Kafle et al., 2018). The use of SemScore, a composite metric integrating logical, semantic, and phonetic similarity, responds to the argument that standard WER alone fails to capture communicative utility for DHH and impaired-speech users. Complementary annotation resources (e.g., Switchboard with human-assigned word importances) facilitate next-generation ASR evaluation protocols, furthering SAP’s central aim of robust, accessible, and clinically valid speech technology.
