
FluencyBank: Multimodal Stuttering Speech Dataset

Updated 22 January 2026
  • FluencyBank is a multimodal speech dataset featuring comprehensive, expert-annotated disfluency and secondary behavior labels for stutter analysis.
  • It provides precise frame-level annotations and standard evaluation metrics like WER, MER, and segment-based F1 scores to benchmark speech models.
  • The dataset supports applications in inclusive ASR, real-time dysfluency monitoring, and automatic stuttering severity assessment for clinical research.

FluencyBank is a multimodal speech dataset designed to support empirical research on stuttering, with a particular focus on comprehensive, standardized annotation of speech disfluencies, secondary behaviors, and clinically relevant metadata. Initially developed by Ratner & MacWhinney (2018), FluencyBank provides a foundational corpus for tasks including automatic speech recognition (ASR) benchmarking on stuttered speech, dysfluency segmentation, and automatic stuttering severity assessment. Over time, the resource has been expanded and refined through the addition of expert-level annotations, frame-level boundary information, and nuanced label schemes based on clinical standards. FluencyBank and its derived extensions are widely used as evaluation benchmarks and baselines for machine learning systems aimed at inclusive speech technology and objective stuttering analysis.

1. Corpus Composition and Speaker Demographics

The core English FluencyBank corpus is composed of spontaneous and semi-structured speech samples from individuals who stutter (PWS) across a broad age range and stuttering severity spectrum. According to Xu et al. (15 Jan 2026), the English subset used in recent ASR studies comprises:

  • 48 PWS speakers (21 female, 27 male), aged 10–70 years
  • 245 audio recordings (approximately 25 hours prior to filtering)
  • After exclusion of 128 highly fragmented child–adult dialogues, 117 recordings (approx. 13 hours) remain for analysis

FluencyBank++ (Ghosh et al., 4 Aug 2025), an expert-curated extension, draws from 4,144 three-second clips sampled from 33 spontaneous interview sessions (23 male, 10 female). After filtering, 3,017 five-second clips containing at least one dysfluency (as confirmed by expert annotators) constitute the evaluation subset, ensuring a high yield of target dysfluency events. The compilation process applies downsampling to 16 kHz mono and standard normalization to ensure comparability across recordings.
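The filtering step described above (retaining only clips with at least one expert-confirmed dysfluency) can be sketched as follows. The `Clip` structure and its field names are illustrative assumptions, not the released data format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clip:
    """A fixed-length audio excerpt with its expert annotations (hypothetical schema)."""
    clip_id: str
    labels: List[str] = field(default_factory=list)  # expert-confirmed dysfluency labels

def filter_dysfluent(clips: List[Clip]) -> List[Clip]:
    """Keep only clips containing at least one annotated dysfluency event."""
    return [c for c in clips if len(c.labels) > 0]

clips = [
    Clip("c1", ["Repetition"]),
    Clip("c2", []),                       # fully fluent clip -> excluded
    Clip("c3", ["Block", "Interjection"]),
]
kept = filter_dysfluent(clips)
# kept retains c1 and c3 only
```

This kind of filtering is what yields a high density of target events in the evaluation subset, at the cost of excluding fully fluent clips.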

2. Annotation Protocols and Disfluency Labels

FluencyBank provides detailed manual, time-aligned annotations of diverse disfluency events. The original protocol includes the following types:

  • Sound repetitions (e.g., “p-p-passenger”)
  • Word/phrase repetitions
  • Prolongations (e.g., “ssssssand”)
  • Blocks (silent pauses or stuck syllables)
  • Interjections/fillers

Subsequent extensions, notably the clinical annotation scheme described by Valente et al. (Valente et al., 31 May 2025), introduce a clinically standardized taxonomy derived from the Lidcombe Behavioral Data Language (LBDL) and SSI-4 for secondary behaviors:

  • Primary stuttering labels:
    • SR (Syllable Repetition)
    • ISR (Incomplete Syllable Repetition)
    • MUR (Multisyllable Unit Repetition)
    • P (Prolongation)
    • B (Block)
  • Secondary behaviors:
    • V (Verbal, e.g., noisy breathing)
    • FG (Facial Grimaces)
    • HM (Head Movements)
    • ME (Movements of Extremities)
  • Tension scores: ordinal variable $T \in \{0, 1, 2, 3\}$ from “no tension” to “severe tension” (Boey et al., 2007)

Annotation is performed using ELAN, with all events marked as temporal spans synchronized to both waveform and transcript views. Strict marking rules and multi-tuple consensus review ensure temporal accuracy and reproducibility.
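A minimal stand-in for such time-aligned events, with an overlap check for co-occurring behaviors, might look like the following (this is a sketch, not ELAN's actual EAF format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A time-aligned annotation event, in seconds (hypothetical stand-in for an ELAN tier entry)."""
    start: float
    end: float
    label: str  # e.g. "SR", "P", "B", "FG", or a tension-score tier value

    def overlaps(self, other: "Span") -> bool:
        """True if the two spans share any nonzero time interval."""
        return self.start < other.end and other.start < self.end

a = Span(1.2, 2.0, "P")   # prolongation
b = Span(1.8, 2.5, "FG")  # co-occurring facial grimace
assert a.overlaps(b)
```

Keeping primary labels, secondary behaviors, and tension scores on separate tiers (as separate `Span` lists) mirrors how ELAN synchronizes multiple annotation layers against one waveform.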

3. Frame-Level and Strong Supervision Advances

The FluencyBank++ resource advances the annotation granularity by providing frame-level, expert-validated boundaries for four principal dysfluency types: prolongations, repetitions (collapsing sound/word/syllable/phrase), blocks, and interjections (Ghosh et al., 4 Aug 2025). The annotation process follows three successive stages:

  1. Utterance-level correction and ±1 s extension to accommodate boundary truncations
  2. Frame-level start/end marking per event and per class (co-occurring events annotated separately)
  3. Consensus adjudication by three certified speech-language pathologists using a voting rule (agreement ≥2/3 per 0.1 s frame)
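The per-frame voting rule in stage 3 can be sketched as below. The per-annotator frame grids are hypothetical inputs; the published pipeline's exact data structures may differ:

```python
def consensus(frame_labels, min_votes=2):
    """Per-frame majority vote across annotators.

    frame_labels: one label sequence per annotator, one label per 0.1 s frame
    (None means 'no event marked'). A frame keeps a label only if at least
    `min_votes` annotators agree on it (the >=2/3 rule for three annotators).
    """
    n_frames = len(frame_labels[0])
    out = []
    for i in range(n_frames):
        votes = {}
        for seq in frame_labels:
            lab = seq[i]
            if lab is not None:
                votes[lab] = votes.get(lab, 0) + 1
        best = max(votes, key=votes.get) if votes else None
        out.append(best if best is not None and votes[best] >= min_votes else None)
    return out

# three annotators, five 0.1 s frames
ann = [
    ["B", "B", None, "P", "P"],
    ["B", None, None, "P", None],
    ["B", "B", "P", None, "P"],
]
print(consensus(ann))  # ['B', 'B', None, 'P', 'P']
```

Note how the single-annotator "P" in frame 3 is discarded: agreement below the 2/3 threshold yields no consensus label for that frame.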

This pipeline results in a Fleiss’ κ of 0.71 post-consensus, demonstrating strong annotator agreement. The dataset comprises:

Dysfluency class    # Clips   Min Duration (s)   Max Duration (s)
Interjection          1,130               0.12               1.88
Repetition              921               0.20               4.99
Block                   530               0.23               4.20
Prolongation            436               0.41               3.95

All 3,017 clips contain at least one dysfluency with fluent segments marked between events. These strong labels enable windowed benchmarking (0.75 s window, 0.1 s stride) for segmentation tasks.
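The windowed benchmarking setup (0.75 s window, 0.1 s stride) can be sketched as a simple window enumerator; the exact boundary handling in the published benchmark may differ:

```python
def windows(duration, win=0.75, stride=0.1):
    """Enumerate (start, end) analysis windows, in seconds, over a clip of
    `duration` seconds, using a 0.75 s window and 0.1 s stride by default."""
    out = []
    t = 0.0
    while t + win <= duration + 1e-9:  # small epsilon guards float drift
        out.append((round(t, 2), round(t + win, 2)))
        t += stride
    return out

print(windows(1.0))  # [(0.0, 0.75), (0.1, 0.85), (0.2, 0.95)]
```

Under this scheme a five-second clip yields 43 overlapping windows, each of which can be labeled from the frame-level ground truth for segmentation scoring.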

4. Benchmarking Methodologies and Evaluation Metrics

FluencyBank’s annotation depth supports a range of benchmarking paradigms. For ASR and dysfluency segmentation evaluations, standard and clinical metrics include:

  • Word Error Rate (WER): $\mathrm{WER} = \frac{S + I + D}{R}$, with S = substitutions, I = insertions, D = deletions, R = reference words (Xu et al., 15 Jan 2026)
  • Match Error Rate (MER): $\mathrm{MER} = \frac{S + I + D}{S + I + D + C}$, the same error numerator as WER normalized over all aligned words, with C = correctly recognized words
  • Word Information Lost (WIL): $\mathrm{WIL} = \frac{S + D}{R + S}$
  • Sentence-BERT cosine similarity: semantic similarity of ASR outputs
  • Time-F1 (t-F1) and time-recall (t-recall): segment overlap via Intersection-over-Union (IoU > 0.5) (Ghosh et al., 4 Aug 2025)
  • Onset error: absolute difference (in seconds) between predicted and ground-truth dysfluency onset (Ghosh et al., 4 Aug 2025)
  • Segment-based F1 scores for primary/secondary behaviors and tension, using sliding 5 s windows and majority voting across annotators (Valente et al., 31 May 2025)

Pretraining on foundation models (Whisper, WavLM Large, wav2vec 2.0) is a common feature-extraction strategy, with chosen layers providing the input embeddings for each window.

5. Observed Performance, Data Splits, and Model Use

The dataset is primarily reserved as a held-out evaluation benchmark. Notably, Xu et al. (15 Jan 2026) used the 117 cleaned recordings in five repeated evaluation runs (different random seeds, no train/validation/test split); for fine-tuning, all repaired outputs were added to the training set. FluencyBank++ (Ghosh et al., 4 Aug 2025) is strictly held out and used exclusively for evaluation of segmentation models.

Performance benchmarks on FluencyBank (117 recordings, 48 speakers) include the following (average over four ASR models, original vs. Steamroller-repaired input) (Xu et al., 15 Jan 2026):

Metric             Original   Repaired
WER (%)                31.6       18.7
MER (%)                27.4       17.6
WIL (%)                35.5       22.8
SemSim (cosine)        0.81       0.91

WER, MER, and WIL are also reported by stuttering severity (mild, moderate, severe). For example, severe stuttering yields a WER drop from 33.5% to 22.6% after repair.

For segmentation, FluencyBank++ supports the benchmarking of graph-based models such as StutterCut, using windowed acoustic embeddings and strong frame-level ground truth. Multi-modal stuttering severity assessment (audio+video) demonstrates segment-F1 up to 0.95 (“Any” disfluency), with modality-specific F1 maxima for different behavior types (Valente et al., 31 May 2025).
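Segment-level scoring against the frame-level ground truth reduces to matching predicted segments to reference segments by temporal IoU. A sketch follows; the greedy one-to-one matching is an assumption here, as the cited papers may use a different matching procedure:

```python
def iou(a, b):
    """Temporal Intersection-over-Union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def t_f1(pred, gold, thr=0.5):
    """Segment-level F1: a prediction is a true positive if it overlaps a
    not-yet-matched ground-truth segment with IoU > thr (greedy matching)."""
    matched = set()
    tp = 0
    for p in pred:
        for k, g in enumerate(gold):
            if k not in matched and iou(p, g) > thr:
                matched.add(k)
                tp += 1
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [(0.5, 1.5), (2.0, 3.0)]
pred = [(0.6, 1.5), (2.6, 3.4)]   # first segment matches; second drifts too far
print(t_f1(pred, gold))          # 0.5
```

The second predicted segment overlaps the reference but with IoU below 0.5, so it counts as both a false positive and a missed detection, which is exactly the behavior the onset-error metric complements.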

6. Limitations, Comparison to Prior Versions, and Best Practices

FluencyBank and its extensions represent significant advances relative to previous corpora, but known limitations persist (Ghosh et al., 4 Aug 2025; Valente et al., 31 May 2025):

  • Original FluencyBank: only utterance-level disfluency annotation (five classes)
  • Frame-level annotations limited to 3,017 clips in FluencyBank++
  • Class imbalance: e.g., interjections ≈2× more frequent than prolongations
  • Fuzzy transitions between dysfluency types can lead to boundary ambiguity
  • Annotator reliability for tension scoring remains low (Krippendorff’s α for tension = 0.18; higher for temporal and primary-type labels)
  • Best practice: data-level class balancing or cost-sensitive loss for robust model training; augment with synthetic/weakly labeled data when feasible
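One simple option for the cost-sensitive-loss recommendation is inverse-frequency class weighting, sketched below using the FluencyBank++ clip counts reported above. The normalization scheme is an illustrative choice, not one prescribed by the cited papers:

```python
# Clip counts per dysfluency class, from the FluencyBank++ statistics above.
counts = {"Interjection": 1130, "Repetition": 921, "Block": 530, "Prolongation": 436}

def inverse_freq_weights(counts):
    """Inverse-frequency class weights, normalized so they average to 1.
    Rare classes receive weights > 1; frequent classes receive weights < 1."""
    total = sum(counts.values())
    raw = {k: total / v for k, v in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {k: w / mean for k, w in raw.items()}

weights = inverse_freq_weights(counts)
# Prolongation (436 clips) gets roughly 2.6x the weight of Interjection (1,130 clips),
# so misclassifying the rarer class costs the model proportionally more.
```

Such weights can be passed directly to a weighted cross-entropy loss; the alternative mentioned above (data-level balancing via resampling or synthetic augmentation) trades training-set fidelity for a uniform class distribution instead.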

A plausible implication is that although FluencyBank++ and clinical augmentation provide unprecedented granularity, additional work is required for cross-dataset/real-world generalization, particularly for rare classes and co-occurring events.

7. Research Applications and Future Directions

FluencyBank resources enable a spectrum of downstream applications:

  • Inclusive ASR: benchmarking and fine-tuning speech systems explicitly for PWS populations (Xu et al., 15 Jan 2026)
  • Real-time dysfluency monitoring: low-latency, frame-resolved stuttering detection in assistive or therapist-facing tools (Ghosh et al., 4 Aug 2025)
  • Automatic severity assessment: clinically grounded multi-label and multi-modal evaluations with direct alignment to clinical practice (Valente et al., 31 May 2025)
  • Synthesis and augmentation: leveraging real dysfluency boundaries and behaviors for training accessibility-focused models or multilingual transfer
  • Transfer studies: evaluating gold-standard, expert-validated boundaries as a testbed for domain adaptation and cross-language dysfluency segmentation

Extensions such as FluencyBank++ and enriched clinical annotation further support objective comparison of segmentation algorithms, alignment with clinical gold standards, and the iterative refinement of annotation pipelines for both research and clinical deployment. Recommended future work includes iterative annotation cycles for reliability, diverse demographic sampling, and continued open-source sharing of annotation manuals and tools (Valente et al., 31 May 2025).
