SAP Corpus: English Impaired Speech Recordings
- SAP Corpus is a large-scale collection of over 400 hours of impaired English speech recordings with transcriptions from 524 speakers, covering multiple etiologies such as PD, ALS, DS, CP, and stroke.
- It features diverse speech materials including digital assistant commands, novel read sentences, and spontaneous utterances, ensuring both realistic and linguistically challenging scenarios for ASR development.
- The dataset supports robust, speaker-independent ASR evaluation through standardized splits, remote evaluation protocols, and benchmarking using both WER and composite semantic metrics (SemScore).
The Speech Accessibility Project (SAP) Corpus is a large-scale, speaker-diverse set of English speech recordings and transcriptions from individuals with a range of speech disabilities, notably hypokinetic dysarthria secondary to Parkinson’s disease (PD), as well as Down Syndrome (DS), amyotrophic lateral sclerosis (ALS), cerebral palsy (CP), and post-stroke conditions. SAP addresses the historical lack of publicly available impaired-speech datasets suitable for training and benchmarking robust, speaker-independent automatic speech recognition (ASR) models for the disabled-speech community (Singh et al., 25 Jan 2025, Zheng et al., 29 Jul 2025, Zheng et al., 2024).
1. Corpus Composition and Demographics
The SAP Corpus comprises more than 400 hours of impaired English speech and transcriptions from 524 unique speakers, making it substantially larger and more clinically diverse than prior publicly released corpora for dysarthric speech modeling (Zheng et al., 29 Jul 2025). Speakers with Parkinson's disease account for the bulk of the material, representing 73–76% of total duration across training and evaluation splits; additional etiologies (ALS, DS, CP, and stroke) are explicitly included for multi-etiology research. Collected between April and December 2023 under IRB protocols and in collaboration with advocacy groups, the corpus encompasses U.S., Canadian, and Puerto Rican English dialects.
Recordings use both remote and in-clinic modalities, standardized to 16 kHz mono audio. All data splits are strictly speaker-independent, with test sets (“unshared”) further restricting overlap by holding out utterances and prompts not seen in training (Zheng et al., 29 Jul 2025).
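The speaker-independence constraint can be enforced by partitioning at the speaker level rather than the utterance level. A minimal sketch (the split fractions and record format here are illustrative, not the SAP release procedure):

```python
import random

def speaker_independent_split(utterances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole speakers (never individual utterances) to splits,
    so no speaker in dev/test appears in train."""
    speakers = sorted({u["speaker"] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n = len(speakers)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev_ids = set(speakers[:n_dev])
    test_ids = set(speakers[n_dev:n_dev + n_test])
    split = {"train": [], "dev": [], "test": []}
    for u in utterances:
        if u["speaker"] in dev_ids:
            split["dev"].append(u)
        elif u["speaker"] in test_ids:
            split["test"].append(u)
        else:
            split["train"].append(u)
    return split
```

The SAP "unshared" test sets additionally hold out prompts unseen in training, which this sketch does not model.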
Table: SAP Corpus Summary (Post-Processed) (Zheng et al., 29 Jul 2025)
| Split | Speakers | Utterances | Duration (h) |
|---|---|---|---|
| Train | 369 | 131,420 | 290.35 |
| Dev | 55 | 19,275 | 43.56 |
| Test1 | 50 | 18,397 | 42.16 |
| Test2 | 50 | 17,752 | 38.77 |
The earlier SAP-1005 release, which covers PD in detail, contains 174.79 hours of audio from 253 speakers, stratified into severity subgroups and maintaining balanced gender distribution within splits (Zheng et al., 2024, Singh et al., 25 Jan 2025).
2. Speech Material and Annotation Protocol
Speech materials within SAP cover three core types: (a) digital assistant commands (task-oriented, 67%), (b) lexically novel read sentences (22%), and (c) spontaneous prompted and conversational utterances (11%). This composition ensures both real-world and linguistically challenging scenarios (Singh et al., 25 Jan 2025).
Transcripts are produced under standardized guidelines: case normalization to all-uppercase, removal of all non-internal apostrophe punctuation, exclusion of bracketed PII/annotations, and generation of two versions—one retaining disfluencies and reparanda, one cleansed. Numbers and abbreviations undergo text normalization with NeMo followed by manual review. Per-utterance metadata encompasses etiology, location, and demographic details (where permitted) (Zheng et al., 29 Jul 2025).
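The normalization guidelines above can be approximated in a few lines. This is a sketch only: the bracket and disfluency-marker conventions are assumed, and the real pipeline additionally runs NeMo number/abbreviation normalization and manual review:

```python
import re

def normalize_transcript(text, keep_disfluencies=True):
    """Sketch of SAP-style transcript normalization (marker conventions assumed)."""
    # Drop bracketed PII placeholders/annotations, e.g. "[NAME]" or "[laughter]".
    text = re.sub(r"\[[^\]]*\]", " ", text)
    if not keep_disfluencies:
        # Illustrative filler list; actual disfluency/reparandum handling is richer.
        text = re.sub(r"\b(uh|um|er)\b", " ", text, flags=re.IGNORECASE)
    text = text.upper()
    # Keep only letters, digits, whitespace, and apostrophes...
    text = re.sub(r"[^\w\s']", " ", text)
    # ...then strip apostrophes that are not word-internal.
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    return " ".join(text.split())
```

Both transcript variants (with and without disfluencies) fall out of the `keep_disfluencies` flag in this toy version.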
Impairment severity for PD speakers in SAP-1005 is operationalized as the character error rate (CER) of an external baseline ASR model (wav2vec 2.0 fine-tuned on LibriSpeech-960h) on each speaker’s data. Severity bands are as follows: very low (VL, CER < 10%), low (L, 10–20%), medium (M, 20–40%), and high (H, ≥40%) (Zheng et al., 2024, Singh et al., 25 Jan 2025).
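Using the CER thresholds stated above, severity banding reduces to computing a character-level edit distance against the external baseline's hypotheses and bucketing the per-speaker rate. A self-contained sketch (the wav2vec 2.0 baseline itself is external and not reproduced):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein edit operations over reference length."""
    r, h = list(ref), list(hyp)
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution (or match)
        prev = cur
    return prev[-1] / max(len(r), 1)

def severity_band(speaker_cer):
    """Map a speaker-level baseline CER to the SAP-1005 severity bands."""
    if speaker_cer < 0.10:
        return "VL"   # very low
    if speaker_cer < 0.20:
        return "L"    # low
    if speaker_cer < 0.40:
        return "M"    # medium
    return "H"        # high
```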
3. Access, Licensing, and Benchmarking
SAP is made available for noncommercial research under a Data User Agreement (DUA) requiring a signed application and compliance with use restrictions (no redistribution). Remote evaluation is enforced through EvalAI infrastructure with protected test sets: researchers submit model outputs for central scoring (Zheng et al., 29 Jul 2025). The official portal is https://speechaccessibilityproject.beckman.illinois.edu/.
Benchmarking employs both literal and semantic ASR metrics:
- Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ count word substitutions, deletions, and insertions against a reference of $N$ words, with test-level aggregation and utterance-level clipping at 100%.
- Semantic Score (SemScore): $\mathrm{SemScore} = w_1 \cdot s_{\mathrm{NLI}} + w_2 \cdot s_{\mathrm{sem}} + w_3 \cdot s_{\mathrm{phon}}$, with empirically fitted weights $w_1, w_2, w_3$ and components measuring logical entailment (MENLI / RoBERTa-large), semantic similarity (BERTScore F1), and phonetic distance (Jaro-Winkler on Soundex codes) (Zheng et al., 29 Jul 2025).
Reference selection for each test utterance is automatic: WER and SemScore are computed between the hypothesis and both reference variants (with/without disfluencies), with the minimum WER and maximum SemScore retained.
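The dual-reference scoring rule can be sketched as follows, using a word-level edit distance for WER and a caller-supplied semantic scorer standing in for the SemScore ensemble (which is not reproduced here):

```python
def wer(ref, hyp):
    """Word error rate, clipped at 1.0 per the challenge protocol."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution (or match)
        prev = cur
    return min(prev[-1] / max(len(r), 1), 1.0)

def score_against_variants(hyp, ref_verbatim, ref_clean, sem_score):
    """Score a hypothesis against both reference variants (with/without
    disfluencies); keep the minimum WER and maximum SemScore."""
    refs = (ref_verbatim, ref_clean)
    best_wer = min(wer(r, hyp) for r in refs)
    best_sem = max(sem_score(r, hyp) for r in refs)
    return best_wer, best_sem
```

This per-utterance "best of both references" rule keeps a system from being penalized for either transcribing or omitting disfluencies.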
4. ASR Modeling Methodologies and Results
State-of-the-art baselines and challenge submissions leverage large pretrained encoder–decoder Transformer architectures (e.g., OpenAI Whisper, NVIDIA Parakeet), with successful adaptation strategies including:
- Fine-tuning on SAP with speaker-independent splits, typically with SpecAugment and no speaker-dependent adapters (Singh et al., 25 Jan 2025, Zheng et al., 29 Jul 2025)
- Audio segmentation (sentence-level, VAD-based) and hallucination mitigation via concatenation of overlapping segments (Singh et al., 25 Jan 2025)
- Parameter-efficient adaptation (LoRA, AdaLoRA), model merging/checkpoint averaging, curriculum learning, and personalized decoding (speaker-vector conditioning) (Zheng et al., 29 Jul 2025)
- Multi-task learning setups incorporating auxiliary severity or etiology prediction heads, yielding gains particularly in severe impairment subgroups (Zheng et al., 2024)
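Among the adaptation strategies above, checkpoint averaging is simple enough to sketch in a few lines. Parameters are represented here as plain name-to-list dicts for illustration; real systems average framework weight tensors:

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameter values across a list of checkpoints.

    Each checkpoint is a dict mapping parameter names to equal-length
    lists of floats (a stand-in for real weight tensors).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        stacked = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*stacked)]
    return averaged
```

Averaging the last few fine-tuning checkpoints (or merged LoRA variants) tends to smooth out run-to-run variance at essentially no inference cost.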
Performance on SAP-1005 (dev_unshared) achieves CER 6.99% and WER 10.71% using fine-tuned Whisper models; this reflects a ≈60% relative WER reduction compared to the wav2vec 2.0 baseline (Singh et al., 25 Jan 2025). The top 2025 SAP Challenge entry yielded WER 8.11% and SemScore 88.44% on the official test set, outperforming the Whisper-large-v2 baseline (WER 17.82%, SemScore 75.85%) (Zheng et al., 29 Jul 2025).
Cross-etiology transfer is limited: fine-tuning exclusively on PD (SAP-1005) results in WER increase on the TORGO corpus (spastic/flaccid/ataxic dysarthria) to 39.56% (Singh et al., 25 Jan 2025).
Table: WER by Training Corpus and Model (Zheng et al., 2024)
| Corpus/Model | WER (All) | WER (Male) | WER (Female) |
|---|---|---|---|
| LibriSpeech 100h FT | 42.53 | 47.96 | 34.85 |
| LibriSpeech 960h FT | 36.33 | 41.59 | 28.85 |
| SAP-1005 (Standard FT) | 26.92 | 30.08 | 22.45 |
| SAP-1005 Weighted/MTL | 26.53 | n/a | n/a |
Severity-stratified MTL and loss-weighted fine-tuning further reduce overall WER (26.53 vs. 26.92 for standard fine-tuning), with the weighted/MTL configuration reaching WERs near 13.3 for the very-low-severity subgroup and auxiliary severity heads conferring the largest benefits for high-severity speakers.
5. Evaluation Metrics and Error Analysis
SAP evaluations rely on WER and CER for literal accuracy and SemScore for semantic fidelity. The challenge formulation clips utterance-level WER at 1.0 to robustly handle high-error speakers. SemScore offers a weighted ensemble of NLI-based entailment, semantic similarity, and phonetic distance, with linear weights $(w_1, w_2, w_3)$ determined by cross-validation against human ratings (Zheng et al., 29 Jul 2025).
Common ASR error patterns on SAP include deletion of weak consonants in hypokinetic speech, vowel substitution due to centralization, and over-segmentation (fragmentation or repetitive hallucinations) on long spontaneous utterances, particularly in high-severity speakers and unstructured prompts (Singh et al., 25 Jan 2025). WER rises steeply from 7–8% in very low/low severity to 30% in high-severity PD. Similarly, spontaneous prompts yield WERs >19%, compared to 7–8% for task-oriented and novel-sentence reading.
Test set construction isolates “unshared” text cases (absent from training) to ensure authentic evaluation. A strong negative correlation between WER and SemScore was observed, validating literal accuracy as a necessary (though not sufficient) condition for semantic equivalence (Zheng et al., 29 Jul 2025).
6. Research Impact and Future Directions
The SAP Corpus constitutes the largest and most methodologically rigorous English impaired-speech ASR benchmark to date, facilitating both robustness and generalizability studies. It enables:
- Development and evaluation of genuinely speaker-independent, etiology-agnostic ASR solutions for impaired speech.
- Direct benchmarking of cross-etiology transfer and zero-shot adaptation between dysarthria subtypes (PD, ALS, CP, DS, stroke).
- Severity-aware and multi-modal modeling paradigms, including ASR+impairment prediction and integration of visual or articulatory signals (Singh et al., 25 Jan 2025, Zheng et al., 2024).
Recommended future work includes collection of richer metadata (recording hardware, precise demographics), expansion to healthy controls for contrastive learning, systematic use of augmentation (SpecAugment, vocal-tract length perturbation), application of word-importance-based error metrics, and multi-task architectures coupling ASR with etiology and severity prediction (Singh et al., 25 Jan 2025, Zheng et al., 2024, Zheng et al., 29 Jul 2025, Kafle et al., 2018).
7. Connections to Broader Evaluation and Accessibility Research
The methodological rigor of SAP’s evaluation aligns with advances in ASR metrics from the spoken dialogue field, which advocate importance-weighted WER based on user-centered criteria (Kafle et al., 2018). The use of SemScore, a composite metric integrating logical, semantic, and phonetic similarity, responds to the argument that standard WER alone fails to capture communicative utility for DHH and impaired-speech users. Complementary annotation resources (e.g., Switchboard with human-assigned word importances) facilitate next-generation ASR evaluation protocols, furthering SAP’s central aim of robust, accessible, and clinically valid speech technology.