StutterFuse: Retrieval-Augmented Disfluency Detection
- StutterFuse is a retrieval-augmented classifier that leverages a non-parametric memory bank to detect complex, overlapping stuttering events with enhanced precision and recall.
- It employs SetCon, a Jaccard-weighted metric learning loss that structures the embedding space based on label overlap, effectively mitigating modality collapse.
- A gated fusion mechanism dynamically combines audio and retrieval expert outputs, achieving state-of-the-art multi-label detection and robust cross-domain, cross-lingual generalization.
StutterFuse is a retrieval-augmented classification framework for multi-label stuttering and disfluency detection in speech, integrating memory-augmented deep learning, Jaccard-weighted metric learning, and dynamic mixture-of-experts fusion. In contrast to conventional parametric approaches, StutterFuse incorporates a non-parametric memory bank of clinical audio exemplars, enabling classification by reference and mitigating the challenge of detecting overlapping and complex disfluencies. The architecture addresses "modality collapse"—where naive reliance on retrieval increases recall but erodes precision—by introducing a Jaccard-weighted metric loss (SetCon) and a gated expert fusion mechanism. StutterFuse achieves state-of-the-art multi-label detection and exhibits strong zero-shot cross-dataset and cross-lingual generalization (Singh et al., 15 Dec 2025).
1. Architectural Framework and Retrieval-Augmented Pipeline
StutterFuse comprises a three-stage inference and learning pipeline:
- Wav2Vec 2.0 Feature Extraction: Each 3 s audio segment is processed by a frozen Wav2Vec2-large-960h model, whose transformer emits 1024-dimensional hidden states at a 20 ms stride (roughly 150 frames per 3 s clip), producing a feature matrix $X \in \mathbb{R}^{T \times 1024}$.
- SetCon Embedder and Memory Bank: The features are mapped to a unit-normalized 1024-dimensional embedding via a BiGRU (256 units per direction) with attention and a ReLU projection. These embeddings populate a Faiss IndexFlatIP memory bank (with unit-normalized vectors, inner-product search is equivalent to cosine similarity) holding the clinical and augmented exemplar vectors.
- Retrieval-Augmented Classifier (RAC) and Gated Fusion:
- At inference, the query embedding retrieves its nearest neighbors from the memory bank, along with their similarity scores and ground-truth label vectors.
- Two fusion paradigms are implemented:
- Mid-Fusion (Cross-Attention): The query and neighbor representations are fused via Conformer-based cross-attention and MLP.
- Late-Fusion ("StutterFuse" configuration): Independent "audio" and "retrieval" experts are fused using a gating network $\alpha = \sigma(W_g[z_{\text{audio}}; z_{\text{ret}}])$, with the fused vector $z = \alpha \odot z_{\text{audio}} + (1-\alpha) \odot z_{\text{ret}}$ fed to the final classifier.
A schematic overview is shown below:
| Stage | Input | Output |
|---|---|---|
| Wav2Vec2 | 3 s audio ($16$ kHz) | Frame features $X \in \mathbb{R}^{T \times 1024}$ |
| SetCon + Faiss | $X$ | Unit embedding $z \in \mathbb{R}^{1024}$ (stored in memory bank) |
| Retrieval + Fusion | $z$ and retrieved neighbors | Multi-label stutter probabilities |
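The retrieval stage relies on the fact that inner-product search over unit-normalized vectors equals cosine-similarity search, which is exactly what Faiss's `IndexFlatIP` exploits at scale. A minimal numpy sketch of the same operation (function names and sizes are illustrative):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """Unit-normalize rows so inner-product search equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index: np.ndarray, query: np.ndarray, k: int = 5):
    """Return indices and similarities of the k nearest neighbors."""
    q = query / max(np.linalg.norm(query), 1e-12)
    sims = index @ q             # inner products on unit vectors = cosine
    nn = np.argsort(-sims)[:k]   # top-k by similarity, descending
    return nn, sims[nn]

rng = np.random.default_rng(0)
bank = build_index(rng.normal(size=(1000, 1024)))  # toy memory bank
idx, sims = search(bank, bank[42], k=5)            # querying a stored vector
```

Searching with a vector that is itself in the bank returns that vector first with similarity 1, which is a useful sanity check when wiring up the real index.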
2. SetCon: Jaccard-Weighted Metric Learning
StutterFuse employs SetCon, a set-similarity contrastive loss that leverages continuous Jaccard overlap between multi-label targets to structure the embedding space. For an anchor $i$ with embedding $z_i$, label set $y_i$, and in-batch candidates $A(i)$, the SetCon loss is defined as

$$\mathcal{L}_{\text{SetCon}} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{\sum_{p \in A(i)} w_{ip}} \sum_{p \in A(i)} w_{ip} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},$$

where $w_{ip} = J(y_i, y_p) = |y_i \cap y_p| / |y_i \cup y_p|$ is the Jaccard overlap between label sets and $\tau$ is the temperature. This facilitates semantic structuring such that embeddings from samples with larger label-set overlap cluster more closely, improving retrieval for complex, overlapping stuttering events.
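A toy implementation of a Jaccard-weighted contrastive loss of this form (plain numpy, names illustrative) makes the weighting concrete: every in-batch candidate contributes to the anchor's loss in proportion to its label-set overlap:

```python
import numpy as np

def jaccard(y_a: np.ndarray, y_b: np.ndarray) -> float:
    """Jaccard overlap between binary multi-label vectors."""
    inter = np.minimum(y_a, y_b).sum()
    union = np.maximum(y_a, y_b).sum()
    return float(inter / union) if union > 0 else 0.0

def setcon_loss(z: np.ndarray, y: np.ndarray, tau: float = 0.1) -> float:
    """Jaccard-weighted contrastive loss over a batch of unit embeddings z."""
    n = len(z)
    sims = z @ z.T / tau
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        w = np.array([jaccard(y[i], y[j]) for j in others])
        if w.sum() == 0:          # anchor has no overlapping candidate
            continue
        logits = sims[i, others]
        log_prob = logits - np.log(np.exp(logits).sum())  # log-softmax
        loss += -(w * log_prob).sum() / w.sum()
    return loss / n
```

Batches where same-label samples sit close in embedding space yield a lower loss than batches where they are scattered, which is exactly the clustering pressure described above.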
3. Gated Mixture-of-Experts Fusion
The StutterFuse late-fusion classifier integrates two specialized experts:
- Audio Expert: Processes the query audio via a 2-block Conformer backbone to produce $z_{\text{audio}}$.
- Retrieval Expert: Processes retrieved neighbor embeddings via MLP → GlobalAvgPool → Dense, outputting $z_{\text{ret}}$.
- Gating Network: Computes a gate $\alpha = \sigma(W_g[z_{\text{audio}}; z_{\text{ret}}])$ and the fused representation $z = \alpha \odot z_{\text{audio}} + (1-\alpha) \odot z_{\text{ret}}$.
The fused representation allows the model to dynamically arbitrate the contributions of acoustic evidence vs. retrieval context, mitigating error propagation from over-reliance on non-parametric neighbors ("echo chamber" or modality collapse).
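A minimal sketch of the gating step, assuming the gate is a single linear layer over the concatenated expert outputs followed by an element-wise sigmoid (the exact gating-network architecture is an assumption here):

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(z_audio: np.ndarray, z_ret: np.ndarray,
               W_g: np.ndarray, b_g: np.ndarray):
    """Element-wise gate over concatenated expert outputs.

    W_g has shape (d, 2d); alpha -> 1 favors the audio expert,
    alpha -> 0 favors the retrieval expert.
    """
    alpha = sigmoid(W_g @ np.concatenate([z_audio, z_ret]) + b_g)
    fused = alpha * z_audio + (1.0 - alpha) * z_ret
    return fused, alpha
```

Driving the gate's pre-activation strongly positive recovers the audio expert alone, which is the mechanism that lets the model suppress retrieval when acoustic evidence is decisive.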
4. Training Regimen and Hyperparameterization
The training pipeline proceeds in two distinct phases:
- Phase 1 (SetCon Embedder): Optimized with Adam, batch size $4096$, for 20 epochs with early stopping on Recall@5 (a retrieved neighbor counts as relevant when its label-set Jaccard overlap with the query exceeds a fixed threshold). Recall@5 improves from 0.32 (mean-pooled Wav2Vec2) to 0.47 with SetCon.
- Phase 2 (Classifier): Both mid- and late-fusion classifiers use AdamW with weight decay, batch size $128$, and binary cross-entropy loss with 0.1 label smoothing. Conformer details: 2 blocks, feed-forward dim 512, dropout 0.3 (Conformer) and 0.5 (MLP).
The Faiss memory bank is instance-balanced across the original and augmented examples.
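The Recall@5 early-stopping criterion of Phase 1 can be sketched as below; the Jaccard threshold is a hypothetical parameter (the paper's exact value is not reproduced in this summary), and self-matches are excluded from retrieval:

```python
import numpy as np

def recall_at_k(emb: np.ndarray, labels: np.ndarray,
                k: int = 5, jaccard_thresh: float = 0.5) -> float:
    """Fraction of queries whose top-k neighbors include at least one
    example with label-set Jaccard overlap >= jaccard_thresh."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)   # never retrieve the query itself
    hits = 0
    for i in range(len(emb)):
        for j in np.argsort(-sims[i])[:k]:
            inter = np.minimum(labels[i], labels[j]).sum()
            union = np.maximum(labels[i], labels[j]).sum()
            if union > 0 and inter / union >= jaccard_thresh:
                hits += 1
                break
    return hits / len(emb)
```

An embedder that clusters same-label clips perfectly scores 1.0 under this metric, which is the behavior SetCon pushes toward.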
5. Empirical Performance and Cross-Domain Robustness
StutterFuse was evaluated on multi-label disfluency detection with several configurations:
- SEP-28k (Speaker-Independent)
- Audio-Only Conformer baseline: weighted F1 0.60 (precision 0.66, recall 0.56)
- Mid-Fusion RAC: precision 0.52, recall 0.82
- Late-Fusion StutterFuse: weighted F1 0.65 (precision 0.60, recall 0.72)
- StutterFuse per-class F1: Prolongation 0.61, Block 0.66, SoundRep 0.54, WordRep 0.55, Interjection 0.77
- Zero-Shot Cross-Dataset: FluencyBank
- StutterFuse and the mid-fusion RAC achieve identical weighted F1.
- Relative gain over Audio-Only: SoundRep +7.5%, WordRep +6.6%.
- Zero-Shot Cross-Lingual: KSoF (German)
- Both the mid-fusion RAC and StutterFuse improve Block F1 and weighted F1 over the English-to-German direct baseline, narrowing the gap to the German-trained supervised topline.
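The weighted-F1 figures quoted throughout this section follow the standard support-weighted average of per-class binary F1 (sklearn's `average="weighted"` convention); a minimal numpy sketch:

```python
import numpy as np

def f1_per_class(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Binary F1 for each label column of 0/1 matrices (samples x classes)."""
    tp = (y_true * y_pred).sum(axis=0)
    fp = ((1 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1 - y_pred)).sum(axis=0)
    return 2 * tp / np.clip(2 * tp + fp + fn, 1, None)

def weighted_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Support-weighted average of per-class F1."""
    support = y_true.sum(axis=0)
    return float((f1_per_class(y_true, y_pred) * support).sum() / support.sum())
```

Because weighting is by class support, a high-frequency class such as Interjection dominates the aggregate score more than a rare class such as SoundRep.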
In ablation studies, removing retrieval degraded F1 from 0.65 to 0.60. Disabling SetCon or neighbor metadata similarly reduced performance, establishing the necessity of each component.
6. Modality Collapse: Definition and Remediation
"Modality collapse" or the "echo chamber" effect arises when retrieval-enhanced classifiers overfit to the label structure of nearest-neighbor samples, boosting recall (from 0.56 to 0.82) but sacrificing precision (from 0.66 to 0.52). This occurs because retrieved neighbors, being stutter-rich, bias the model toward overpredicting disfluencies shared among them, even when the query differs.
Mitigations within StutterFuse include:
- SetCon: Constructs an embedding space that reflects partial set-overlap, supporting retrieval diversity; Recall@5 increases from 0.32 to 0.47.
- Gated Fusion: Enables dynamic attenuation of retrieval in high-certainty acoustic conditions, recovering precision to 0.60 (F1 improves to 0.65).
Qualitative diagnostics illustrate that StutterFuse recovers complex overlapping labels when retrieval context is label-diverse, but can propagate false positives if all neighbors share the same unrepresentative label.
7. Relation to Prior Disfluency Detection and Fusion Pipelines
StutterFuse builds on prior work such as FluentNet (Kourkounakis et al., 2020), which applies a Squeeze-and-Excitation ResNet with BLSTM and attention for frame-level disfluency classification over STFT spectrograms. FluentNet leverages SE blocks for channel-wise spectral weighting, BLSTM for temporal structure, and attention for segment-level focus, achieving strong mean accuracy and a low miss rate on UCLASS.
Despite achieving state-of-the-art results in single-label tasks, FluentNet and similar purely parametric approaches struggle with high-order label co-occurrence, label imbalance, and infrequent complex overlaps. StutterFuse extends these capabilities by:
- Utilizing retrieval-augmented fused representations to reason about rare and complex stutter combinations.
- Structuring the latent space with SetCon for multi-label compatibility.
- Introducing dynamic expert gating to avoid modality-specific bias.
Recommendations derived from FluentNet, such as multi-modal fusion and streaming-friendly modifications, are compatible extensions for future StutterFuse designs.
StutterFuse defines a new class of Retrieval-Augmented Classifiers for multi-label stuttering detection, demonstrating that explicit retrieval, label-aware metric learning, and dynamic fusion jointly resolve critical limitations of previous methods and enable robust, cross-domain disfluency identification (Singh et al., 15 Dec 2025).