Diarization-Guided Silence Suppression
- Diarization-guided silence suppression is a decoding strategy that uses frame-level silence estimates to block spurious timestamp tokens in ASR models.
- It applies a threshold and guard band to accurately identify silence segments, reducing over-segmentation and improving utterance boundary placement.
- Empirical results demonstrate improved mtWER, AER, and DER, highlighting its effectiveness in enhancing joint ASR-diarization performance.
Diarization-guided silence suppression is an inference-time decoding strategy for joint end-to-end automatic speech recognition (ASR) and speaker diarization systems, particularly those built on serialized output training (SOT) with Whisper-style encoder-decoder architectures. The method uses frame-level silence/activity estimates from a diarization head to mask out spurious timestamp emissions in silence regions, improving utterance boundary placement and temporal segmentation accuracy and reducing over-segmentation, without altering the core ASR training objectives or loss landscape (Xu et al., 25 Jan 2026).
1. Motivation and Problem Statement
The placement of timestamp tokens is critical in serialized output ASR+diarization models, where the decoder emits explicit tokens (e.g., "<|t_start|>", "<|t_end|>") denoting spoken segment boundaries. In Whisper-style SOT architectures, erroneous insertion of timestamps into silence intervals often leads to timestamp drift, over-segmentation, and degraded temporal accuracy. Silence intervals are void of lexical content or speaker turns and are, therefore, inappropriate points for utterance boundaries or diarization changes.
To address this, diarization-guided silence suppression constrains the decoder: when the frame-level diarization module predicts a high probability of silence, the system explicitly blocks emission of timestamp tokens in those regions. The suppression is guided solely by the silence posterior, even though the diarization head outputs probabilities for all speaker-role classes ("child," "adult," "silence") (Xu et al., 25 Jan 2026).
2. Algorithmic Formulation
The diarization-guided silence suppression procedure operates as follows. Let $T$ denote the total number of encoder frames and, for each frame $t \in \{1, \dots, T\}$, let the diarization head produce a posterior vector $\mathbf{p}_t = \big(p_t^{\text{child}}, p_t^{\text{adult}}, p_t^{\text{sil}}\big)$. A frame is deemed "silence" if $p_t^{\text{sil}} > \theta$, where the threshold $\theta$ is tuned on a development set. Contiguous spans of successive silence frames $[f_i^{\text{start}}, f_i^{\text{end}}]$ are located and mapped to time using a 20 ms/frame hop: $t_i^{\text{start}} = 0.02 \cdot f_i^{\text{start}}$, $t_i^{\text{end}} = 0.02 \cdot f_i^{\text{end}}$. Each silence window is then shrunk to $[t_i^{\text{start}} + \delta,\, t_i^{\text{end}} - \delta]$ with guard band $\delta > 0$, reducing the risk of suppressing genuine boundary tokens around speech transitions.
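As an illustrative sketch (not the authors' implementation), the window-extraction step above can be written in plain Python; `silence_posteriors`, `theta`, and `delta` are assumed names, and the 20 ms hop follows the formulation:

```python
FRAME_HOP_S = 0.02  # 20 ms per encoder frame, as stated in the text

def silence_windows(silence_posteriors, theta, delta):
    """Find contiguous runs of frames with silence posterior > theta,
    map them to seconds, and shrink each window by the guard band delta."""
    windows, start = [], None
    flags = [p > theta for p in silence_posteriors] + [False]  # sentinel flushes last run
    for frame, is_silence in enumerate(flags):
        if is_silence and start is None:
            start = frame                      # a silence run begins here
        elif not is_silence and start is not None:
            t0 = start * FRAME_HOP_S + delta   # guard band at the left edge
            t1 = frame * FRAME_HOP_S - delta   # guard band at the right edge
            if t1 > t0:                        # discard windows shorter than 2*delta
                windows.append((t0, t1))
            start = None
    return windows
```

For example, 50 consecutive silence frames starting at frame 10 yield a single window of roughly (0.3 s, 1.1 s) when `delta = 0.1`.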
During beam search decoding, whenever a token candidate is a timestamp with numerical value falling inside any current shrunken silence window, its probability is set to zero:
```
for each candidate token tok in beam:
    if tok is a timestamp t_x and ∃ i : t_i^start + δ ≤ t_x ≤ t_i^end − δ:
        P(tok) ← 0
```
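A runnable version of this check, under the same assumptions (candidate timestamps as float seconds, windows already guard-banded), might look like:

```python
def suppress_silence_timestamps(candidates, windows):
    """Zero the probability of any timestamp candidate whose time value
    falls inside a (guard-banded) silence window.
    `candidates` maps timestamp values (seconds) to probabilities;
    both names are illustrative, not from the paper."""
    suppressed = {}
    for t_x, prob in candidates.items():
        in_silence = any(t0 <= t_x <= t1 for t0, t1 in windows)
        suppressed[t_x] = 0.0 if in_silence else prob
    return suppressed
```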
3. Model Integration and Decoding Dynamics
Silence suppression is invoked exclusively at inference, directly after the frame-level diarization head (which is attached to the final encoder layer) computes the silence posteriors. As the SOT decoder proposes tokens, the suppression mask is reapplied at each beam search step to candidate timestamp tokens.
The method is fully compatible with a state-machine-based forced decoding framework. Specifically, within the constrained state space (S₂ and S₅ in the finite-state automaton), timestamp tokens are enumerated, but suppression operates orthogonally by zeroing out the tokens that are temporally misaligned with the silence mask, minimizing spurious state transitions and non-structural outputs (Xu et al., 25 Jan 2026).
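In log-probability space, zeroing a token's probability corresponds to setting its logit to negative infinity before the softmax at each beam step. A minimal sketch, assuming a hypothetical vocabulary-aligned list `token_times` holding the time value of each timestamp token (and `None` for text and structural tokens):

```python
import math

def mask_timestamp_logits(logits, token_times, windows):
    """Reapply the silence mask at one beam-search step: timestamp tokens
    whose time falls inside any silence window get logit -inf (prob. 0)."""
    masked = list(logits)
    for i, t in enumerate(token_times):
        if t is not None and any(t0 <= t <= t1 for t0, t1 in windows):
            masked[i] = -math.inf
    return masked
```

Because only temporally misaligned timestamp tokens are touched, text tokens and permitted state transitions pass through unchanged.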
4. Practical Implementation and Hyperparameters
Key implementation details include:
- Silence posterior threshold $\theta$: tuned on a development set.
- Guard band $\delta$ (seconds): applied at both ends of every silence segment.
- Frame rate: 20 ms per frame.
- The diarization head is pretrained for up to 10 epochs on frame-level labels (Adam optimizer, weight decay 0.01), followed by joint fine-tuning of the full Whisper-small model with a weighted diarization loss.
- No extra training loss is introduced; the approach is strictly a decoding-time heuristic. The threshold $\theta$ and guard band $\delta$ are tuned carefully to suppress false alarms in silence without over-suppressing legitimate transitions at true speech boundaries (Xu et al., 25 Jan 2026).
5. Experimental Assessment and Effects
Ablation results for diarization-guided silence suppression on the Playlogue and ADOS datasets (Whisper-small) show the following impact on key metrics:
Playlogue:

| Method | mtWER | WER | AER | DER |
|---|---|---|---|---|
| Pretrained only | 37.8 % | 35.8 % | 2.0 % | 41.4 % |
| + silence suppression | 37.4 % | 35.5 % | 1.9 % | 40.6 % |
ADOS:

| Method | mtWER | WER | AER | DER |
|---|---|---|---|---|
| Pretrained only | 29.3 % | 28.3 % | 1.1 % | 23.6 % |
| + silence suppression | 28.8 % | 27.8 % | 1.0 % | 21.8 % |
Key empirical observations:
- Multi-talker word error rate (mtWER) decreases by 0.4–0.5 percentage points.
- Attributed error rate (AER) decreases by 0.1 percentage points.
- Diarization error rate (DER) improves by 0.8–1.8 percentage points, reflecting a substantial reduction in false-alarm errors in silence regions and improved boundary alignment (Xu et al., 25 Jan 2026).
6. Broader Significance and Outlook
Diarization-guided silence suppression demonstrates a robust approach for leveraging model-internal acoustic structure (i.e., frame-level silence probability) to guide emission constraints during sequence decoding. This approach delivers measurable improvements in segmentation precision and multi-talker performance without necessitating modifications to the underlying ASR loss or model architecture. A plausible implication is that similar strategies could enhance other sequence labeling architectures where non-lexical states (e.g., silence, noise) must be rigorously decoupled from output tokenization. The method reinforces the practical viability of unified ASR-diarization models for scalable, speaker-attributed transcript generation in multi-party spoken interaction analysis (Xu et al., 25 Jan 2026).