Speaker Diarization & Best-Speaker Selection
- Speaker diarization with best-speaker selection is the process of grouping speech segments by speaker identity and choosing the most representative segment for robust downstream processing.
- Techniques employ deep learning, self-attention, and bounded optimization methods to accurately handle distant recordings, speaker overlaps, and real-time constraints.
- Empirical results show significant improvements in diarization error rates and computational efficiency, setting strong baselines for offline and online systems.
Speaker diarization with best-speaker selection concerns the automated grouping of speech segments by speaker identity, with the explicit goal of selecting, for each speaker, the most representative or highest-quality segment for downstream processing. This paradigm is important for robust multi-speaker transcription, speaker-adaptive modeling, and enhancing the interpretability and utility of diarization outputs. State-of-the-art systems deploy deep learning and optimization-based methods to address the challenges posed by distant recordings, extensive speaker overlap, and the need for real-time inference in meetings of substantial duration and participant count.
1. Distant Speaker Diarization Architectures
Recent diarization systems integrate spatial, spectral, and explicit speaker representations to improve performance in far-field, multi-speaker environments. The ASoBO system processes audio captured by an 8-element uniform circular array using a bank of super-directive beamformers. Each beamformer is steered to a pre-specified azimuth direction, with the number of spatial filters selected empirically per corpus (AMI and AISHELL-4) to maximize diarization efficacy. Under a spherically isotropic (diffuse) noise assumption, the weights for the $k$-th look direction take the standard super-directive form

$$\mathbf{w}_k(f) = \frac{\boldsymbol{\Gamma}^{-1}(f)\,\mathbf{d}_k(f)}{\mathbf{d}_k^{\mathsf{H}}(f)\,\boldsymbol{\Gamma}^{-1}(f)\,\mathbf{d}_k(f)},$$

where $\mathbf{d}_k(f)$ is the frequency-dependent steering vector for the $k$-th direction and $\boldsymbol{\Gamma}(f)$ is the spatial coherence matrix of the isotropic noise field.
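The super-directive weight computation can be sketched in NumPy as follows. The array geometry, frequency grid, and diagonal-loading constant here are illustrative choices, not values from the papers:

```python
import numpy as np

def superdirective_weights(mic_xy, azimuth, freqs, c=343.0, diag_load=1e-3):
    """Super-directive (MVDR under diffuse noise) weights for one look direction.

    mic_xy : (M, 2) microphone positions in metres (e.g. a uniform circular array)
    azimuth: look direction in radians
    freqs  : (F,) analysis frequencies in Hz
    Returns (F, M) complex weights satisfying the distortionless constraint.
    """
    M = mic_xy.shape[0]
    # pairwise inter-microphone distances
    d_ij = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)  # (M, M)
    # far-field propagation delays toward the look direction
    u = np.array([np.cos(azimuth), np.sin(azimuth)])
    tau = mic_xy @ u / c  # (M,)
    W = np.empty((len(freqs), M), dtype=complex)
    for fi, f in enumerate(freqs):
        # diffuse-field coherence: sin(2*pi*f*d/c) / (2*pi*f*d/c);
        # np.sinc(x) = sin(pi*x)/(pi*x), hence the factor 2*f*d/c
        Gamma = np.sinc(2.0 * f * d_ij / c)
        Gamma += diag_load * np.eye(M)     # diagonal loading for robustness
        d = np.exp(-2j * np.pi * f * tau)  # steering vector toward azimuth
        Gi_d = np.linalg.solve(Gamma, d)
        W[fi] = Gi_d / (d.conj() @ Gi_d)   # normalize so that w^H d = 1
    return W
```

The distortionless constraint (unit gain toward the steered direction) holds by construction, which is a quick sanity check for any implementation.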
ASoBO feeds the resulting beamformed spectra into neural selection modules, while SCDiar operates in the streaming setting, working with VAD-detected segments bounded to a fixed maximum duration, and extracts both lexical and speaker-embedding features for segment processing (Mariotte et al., 2024, Zheng et al., 28 Jan 2025).
2. Best-Speaker Selection via Self-Attention or Optimization
The key methodological distinction in “best-speaker” selection is the move from exhaustive clustering or voting over all segments to a principled extraction of the most representative segment per speaker. In the ASoBO framework, a self-attention channel combinator (SACC) learns, from the magnitudes of the beamformed spectra, per-beamformer weights such that the resulting weighted sum emphasizes the dominant speaker(s) in each time frame. The normalized attention weights thereby softly select the best spatial channel(s) as a function of time and spectral features.
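A minimal sketch of the channel-combination idea, using a hypothetical scalar-score parameterization in place of the learned self-attention layers of SACC:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sacc_combine(mag_spec, w_key, b_key):
    """Attention-style combination over K beamformer outputs (illustrative).

    mag_spec : (K, T, F) magnitude spectra from K beamformers
    w_key    : (F,) learned spectral projection (hypothetical parameterization)
    b_key    : scalar bias
    Returns the combined (T, F) spectrum and the (K, T) channel weights.
    """
    scores = mag_spec @ w_key + b_key                     # one score per channel/frame
    alpha = softmax(scores, axis=0)                       # normalize across channels
    combined = (alpha[..., None] * mag_spec).sum(axis=0)  # soft channel selection
    return combined, alpha
```

The weights `alpha` sum to one across channels at every frame, which is what makes them interpretable as a soft spatial selection (and, per the ASoBO analysis, as an unsupervised DoA estimate).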
Conversely, SCDiar explicitly constructs a segment-token affinity matrix $A \in \mathbb{R}^{N \times M}$, whose entries measure the fit between the $N$ token-level and the $M$ candidate segment-level speaker embeddings. The optimal subset of representative segments is found by solving a bounded nonnegative least-squares problem of the form

$$\min_{\mathbf{x}} \;\lVert A\mathbf{x} - \mathbf{1}\rVert_2^2 \quad \text{s.t.} \quad 0 \le x_m \le 1,$$

with thresholding ($x_m > \tau$) selecting the final set. This enforces full token coverage and avoids redundancy or overlap in speaker modeling (Zheng et al., 28 Jan 2025).
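An illustrative implementation of the bounded least-squares selection, solved here by projected gradient descent rather than any particular solver from the paper; the matrix and threshold below are toy values:

```python
import numpy as np

def select_segments(A, tau=0.5, n_iter=500, lr=None):
    """Bounded least-squares segment selection (illustrative reconstruction).

    A : (N, M) token-segment affinity matrix (e.g. embedding similarities)
    Solves  min_x ||A x - 1||^2  s.t.  0 <= x <= 1  by projected gradient
    descent, then keeps segments whose coefficient exceeds tau.
    """
    N, M = A.shape
    x = np.full(M, 0.5)
    if lr is None:
        # step size from the Lipschitz constant of the gradient
        lr = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    ones = np.ones(N)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - ones)
        x = np.clip(x - lr * grad, 0.0, 1.0)  # project onto the box [0, 1]
    return np.flatnonzero(x > tau), x
```

The residual $\lVert A\mathbf{x} - \mathbf{1}\rVert$ going to zero corresponds to every token being covered by some selected segment, matching the coverage property described above.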
3. Integration with Speech Activity and Speaker Change Detection
Both systems tightly couple best-speaker selection with activity and segmentation networks. In ASoBO, Mel-spectrogram features (dimension 64) generated from the attended beamformer combination are fed into a temporal convolutional network (TCN) that jointly predicts voice activity detection (VAD) and overlapped speech detection (OSD) on a per-frame basis. The TCN produces three classes (no speaker, single speaker, overlap), facilitating precise segmentation even with significant overlap.
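The mapping from the TCN's three-class posteriors to VAD and OSD decisions can be sketched as follows; the majority-vote smoothing is a common post-processing step, not necessarily the one used in ASoBO:

```python
import numpy as np

def frame_decisions(posteriors):
    """Map per-frame 3-class posteriors (no speaker / single / overlap)
    to joint VAD and OSD decisions.

    posteriors : (T, 3) softmax outputs
    Returns vad (T,) bool and osd (T,) bool.
    """
    cls = posteriors.argmax(axis=1)
    vad = cls != 0   # any speech at all
    osd = cls == 2   # overlapped speech
    return vad, osd

def median_smooth(binary, win=11):
    """Majority vote over a sliding window, removing isolated spurious frames."""
    pad = win // 2
    padded = np.pad(binary.astype(int), pad, mode="edge")
    return np.array([padded[i:i + win].sum() > win // 2
                     for i in range(len(binary))])
```

Treating VAD and OSD as a single three-way frame classification, as ASoBO does, guarantees the two decisions are mutually consistent (an overlapped frame is always a speech frame).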
SCDiar introduces a speaker change detection (SCD) module operating at the token level, leveraging ASR encoder features, speaker-feature extractors (CAM++), and continuous integrate-and-fire (CIF) token embeddings. The SCD module identifies change points via a BiLSTM-CNN network with softmax output and focal loss. Identified boundaries yield segments, which are then filtered for length and quality before best-speaker selection (Mariotte et al., 2024, Zheng et al., 28 Jan 2025).
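A simple baseline for token-level change detection thresholds the cosine distance between adjacent token embeddings; the actual SCD module is a learned BiLSTM-CNN classifier, so this is only a stand-in that illustrates the segmentation-then-filtering flow:

```python
import numpy as np

def change_points(token_emb, thresh=0.5):
    """Hypothesize a speaker change wherever adjacent token embeddings
    are dissimilar (illustrative baseline, not the learned SCD module).

    token_emb : (N, D) per-token speaker embeddings
    Returns indices i where a change is hypothesized between tokens i-1 and i.
    """
    e = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    cos = (e[:-1] * e[1:]).sum(axis=1)  # cosine similarity of neighbours
    return np.flatnonzero(1.0 - cos > thresh) + 1

def segments_from_changes(n_tokens, changes, min_len=2):
    """Split [0, n_tokens) at the change points; drop too-short segments,
    mirroring the length/quality filtering described above."""
    bounds = [0, *changes.tolist(), n_tokens]
    segs = list(zip(bounds[:-1], bounds[1:]))
    return [s for s in segs if s[1] - s[0] >= min_len]
```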
4. End-to-End Diarization and Assignment Pipelines
The diarization process is realized through a multistage pipeline:
- Segmentation: Using the output of VAD and OSD (ASoBO) or SCD (SCDiar), the raw audio is split into homogeneous speech turns or segments.
- Embedding extraction and representation: ASoBO uses x-vectors from a ResNet-101 for clustering, while SCDiar employs specialized token and segment speaker-embeddings.
- Clustering and assignment: ASoBO follows a VBx pipeline (hierarchical agglomerative clustering followed by Variational Bayes HMM) for initial grouping; SCDiar bypasses explicit clustering by assigning every token to its closest representative segment, as determined by the affinity matrix solution.
- Overlap handling: In ASoBO, frames marked as overlapped by OSD receive a secondary speaker assignment using proximity heuristics, whereas SCDiar ensures single coverage per token via the optimization step.
- Output: Both systems produce per-frame or per-token speaker labels suitable for transcript annotation or further processing.
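SCDiar's clustering-free assignment step in the pipeline above reduces to a nearest-representative lookup, which can be sketched as:

```python
import numpy as np

def assign_tokens(token_emb, rep_emb):
    """Assign each token to its closest representative segment by cosine
    similarity (sketch of the clustering-free assignment).

    token_emb : (N, D) token speaker embeddings
    rep_emb   : (S, D) embeddings of the selected representative segments
    Returns (N,) speaker indices in [0, S).
    """
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    r = rep_emb / np.linalg.norm(rep_emb, axis=1, keepdims=True)
    return (t @ r.T).argmax(axis=1)  # cosine similarity, then nearest
```

Because each token is matched against a small, fixed set of representatives, this step is O(N·S) and needs no iterative clustering, which is what makes the online setting tractable.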
5. Empirical Results and Comparative Performance
ASoBO and SCDiar have been evaluated on substantial English and Mandarin meeting corpora:
| System | Task | Dataset | Metric | Score / Improvement |
|---|---|---|---|---|
| ASoBO | Diarization | AISHELL-4 | DER (collar 0) | 14.5% (best) |
| SCDiar | Diarization | AISHELL-4 | WDER | 3.56 (↓54.6% vs CoreSample+VBx) |
| ASoBO | Segmentation | AMI | SER | 6.53% |
| SCDiar | Diarization | In-house | WDER | 15.36 (↓20.1% vs CoreSample+VBx) |
| ASoBO | Overlap assign | AISHELL-4 | DER | 20.9% (after OSD assignment) |
ASoBO achieves a diarization error rate (DER) of 14.5% (collar 0) on AISHELL-4, outperforming both single-microphone and direct-channel SACC baselines. Its explainability is established via alignment between learned attention weights and true speaker directions on spatialized LibriMix mixtures. SCDiar cuts WDER from 7.84 to 3.56 (AISHELL-4) relative to the best prior online system, and from 19.23 to 15.36 on in-house 10+ speaker meetings, closing the gap with offline upper bounds within approximately 1% (Mariotte et al., 2024, Zheng et al., 28 Jan 2025).
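For reference, the DER figures above are ratios of scored error durations to total speech time; the computation from aggregate counts is:

```python
def diarization_error_rate(miss, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / total speech.

    All arguments are durations in seconds; any scoring collar (e.g. the
    collar-0 condition quoted above) is applied upstream when the reference
    and hypothesis boundaries are compared.
    """
    return (miss + false_alarm + confusion) / total_speech
```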
6. Computational Efficiency and Real-Time Constraints
SCDiar achieves real-time operation, with an aggregate real-time factor below one on standard hardware, by bounding VAD chunk durations and leveraging efficient embedding computations and optimization solvers. The design avoids expensive online clustering by restricting core computations to affinity evaluation and bounded least-squares segment selection. ASoBO's beamformer selection and attention modules have lightweight parameterizations (e.g., SACC with 0.40M parameters), supporting deployment in streaming or low-latency contexts.
7. Significance, Limitations, and Future Directions
Best-speaker selection strategies resolve many traditional diarization challenges associated with redundant, short, or ambiguous speech turns. By explicitly modeling representativeness and enforcing non-redundancy (SCDiar) or spectrotemporal dominance (ASoBO), these systems enhance purity and accuracy, especially in online and large meeting scenarios. A significant finding is that learned self-attention weights in ASoBO serve as a soft, unsupervised estimator of direction of arrival (DoA), improving explainability.
A plausible implication is that further gains could accrue from integrating explicit speaker recognition objectives with the selection optimization to ensure maximal discriminability. Limitations include the dependency on upstream ASR quality and the need for finely tuned thresholds in SCDiar's segmentation-selection pipeline.
Collectively, these advances have established strong new baselines for both offline and online speaker diarization with best-speaker selection, demonstrating state-of-the-art accuracy on real-world, high-participant meeting data (Mariotte et al., 2024, Zheng et al., 28 Jan 2025).