Streaming Sortformer: Real-Time Speaker Diarization
- The paper introduces Streaming Sortformer, which employs an Arrival-Order Speaker Cache (AOSC) to dynamically allocate embeddings for robust, real-time speaker tracking.
- It processes fixed-length audio chunks via a Transformer encoder and score-based selection that preserves arrival-time ordering while maintaining low latency.
- Empirical evaluations demonstrate high diarization accuracy and operational efficiency, outperforming baselines on benchmarks like DIHARD and CALLHOME.
Streaming Sortformer is a streaming extension of the Sortformer speaker diarization framework, designed for real-time multi-speaker tracking with arrival-time ordering of output speakers. The framework introduces the Arrival-Order Speaker Cache (AOSC), which stores frame-level acoustic embeddings for previously observed speakers, indexed by arrival order. This mechanism dynamically allocates embeddings per speaker through a score-driven selection process, ensuring efficient cache utilization and robust speaker tracking under low-latency conditions. Streaming Sortformer achieves high diarization accuracy and operational flexibility, establishing a robust foundation for streaming multi-talker speech processing (Medennikov et al., 24 Jul 2025).
1. Architectural Overview
Streaming Sortformer processes input audio in fixed-length chunks, using the NEST pre-encoder to extract frame-level embeddings from Mel-spectrogram features. Each chunk $\mathbf{C} \in \mathbb{R}^{T \times D}$, where $T$ is the number of frames and $D$ is the embedding dimension (typically 512), is combined at each iteration with:
- $\mathbf{A}$: the AOSC output (up to $N_{\text{cache}}$ frames)
- $\mathbf{F}$: a FIFO queue containing recent embeddings
- $\mathbf{C}^{+}$: the current chunk plus $R$ frames of right context
These are concatenated into the inference sequence $[\mathbf{A}; \mathbf{F}; \mathbf{C}^{+}]$ and processed by a stack of Transformer encoder layers. Speaker activity predictions for up to $K$ speakers are computed, with predictions emitted for only the current chunk. When an old chunk is popped from the FIFO $\mathbf{F}$, its embeddings are used to update the AOSC module, thereby maintaining the cache according to the arrival order and salience of speaker frames.
2. Arrival-Order Speaker Cache (AOSC)
AOSC is a fixed-size buffer maintaining frame-level NEST embeddings for speakers ordered by their arrival. It partitions the buffer into variable-length blocks $\mathbf{B}_k$, one per active speaker $k$, each storing a list of that speaker's embeddings, and inserts an averaged silence embedding $\bar{\mathbf{e}}_{\text{sil}}$ after each speaker block. The cache is concatenated as $\mathbf{A} = [\mathbf{B}_1, \bar{\mathbf{e}}_{\text{sil}}, \mathbf{B}_2, \bar{\mathbf{e}}_{\text{sil}}, \ldots]$, with the total number of frames not exceeding the cache capacity $N_{\text{cache}}$. Speaker blocks always occupy contiguous positions, reflecting both arrival-slot assignment and local temporal order within each block.
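The cache layout admits a very small sketch: contiguous per-speaker blocks in arrival order, each followed by the averaged silence embedding. The function and variable names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def assemble_cache(speaker_blocks, sil_emb, max_frames):
    """speaker_blocks: list of (n_k, D) arrays, list index = arrival order.
    sil_emb: (D,) averaged silence embedding inserted after each block."""
    parts = []
    for block in speaker_blocks:
        parts.append(block)
        parts.append(sil_emb[None, :])   # silence separator after the block
    cache = np.concatenate(parts, axis=0)
    assert cache.shape[0] <= max_frames, "cache budget exceeded"
    return cache
```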
3. Dynamic Update and Embedding Selection Mechanism
Upon arrival of a chunk $\mathbf{C}_{\text{pop}}$ newly popped from the FIFO, AOSC compresses the concatenation of the existing cache and $\mathbf{C}_{\text{pop}}$ to at most $N_{\text{cache}}$ frames (the cache capacity) using a score-based selection procedure:
- Speaker scores for frame $t$ and speaker $k$ are computed as
$$s_{t,k} = p_{t,k} - \sum_{k' \neq k} p_{t,k'},$$
where $p_{t,k}$ is the Sortformer probability of speaker $k$ at frame $t$, so frames dominated by a single speaker score highest.
- Non-speech frames (where all $p_{t,k}$ fall below a silence threshold) are averaged into the silence embedding $\bar{\mathbf{e}}_{\text{sil}}$.
- Within speaker blocks, frames whose probability falls below the threshold are masked out ($s_{t,k} = -\infty$).
- A recency boost (e.g., $+0.05$) is applied to embeddings from the most recent chunk.
- Speaker-balance boost: the top strong-score frames of each speaker receive a positive adjustment, enforcing a minimum number of embeddings per speaker.
- Infinite-score slots per speaker correspond to copies of $\bar{\mathbf{e}}_{\text{sil}}$, guaranteeing learnable silence transitions.
- The $N_{\text{cache}}$ highest-score indices are selected, ordered by ascending speaker index and temporal order within speaker blocks.
This dynamic allocation results in the number of embeddings per speaker being data-driven but with a guaranteed floor for active speakers.
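A hedged sketch of the core of this procedure: keep the highest-scoring frames, then restore cache order (ascending speaker index, temporal order within each block). Boosts and masking are assumed to be already folded into `scores`; all names here are illustrative assumptions.

```python
import numpy as np

def compress(embs, scores, speaker_ids, max_frames):
    """embs: (N, D); scores, speaker_ids: (N,); returns ordered survivors."""
    if len(embs) <= max_frames:
        keep = np.arange(len(embs))
    else:
        keep = np.argsort(scores)[-max_frames:]      # top-scoring frames
    # lexsort: primary key = speaker index, secondary = original frame index
    keep = keep[np.lexsort((keep, speaker_ids[keep]))]
    return embs[keep], speaker_ids[keep]
```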
4. Computational Complexity and Latency
Streaming Sortformer operates efficiently in real time:
- Transformer encoder self-attention is applied over the $L$ concatenated frames (cache, FIFO, chunk, and right context) at $O(L^2 D)$ computational cost.
- AOSC scoring is linear in the number of candidate frames and speakers.
- Top-$N_{\text{cache}}$ selection involves an $O(n \log n)$ sort over the $n$ candidate frames.
Empirical measurements on an NVIDIA RTX 6000 Ada show the following real-time factors (RTF):
- RTF = 0.005 at 10 s latency
- RTF = 0.093 at 1.04 s latency
- RTF = 0.18 at 0.32 s latency
Memory usage for cache and FIFO is a few hundred kilobytes.
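For concreteness, the real-time factor is simply processing time divided by audio duration; RTF below 1 means faster than real time. A one-line helper (illustrative, not from the paper):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return processing_seconds / audio_seconds
```

At the reported RTF of 0.18 with 0.32 s latency, each 0.32 s chunk is processed in roughly 0.06 s, leaving headroom within the latency budget.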
5. Training Objectives and Implementation
Offline Sortformer is first trained with permutation-invariant training (PIT) via binary cross-entropy on speaker activity streams, together with a "Sort Loss" that enforces output ordering by each speaker's first ground-truth speech frame. Streaming Sortformer is then fine-tuned with windowed training (15 s steps, cache size 188 frames), using 90 s training segments with up to 4 speakers drawn from 5150 h of simulated and 2030 h of real data. No SpecAugment or noise/RIR augmentation is applied; robustness stems from corpus diversity and the cache mechanism. Augmentations are limited to random permutation of speaker cache slots and 50% right-context dropout per step.
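The key idea of Sort Loss can be illustrated in a few lines: instead of searching over speaker permutations as in plain PIT, targets are reordered once by each speaker's first active frame, and standard binary cross-entropy is applied. This is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def sort_targets(activity):
    """activity: (T, K) binary speaker-activity matrix.
    Reorders columns so speaker 0 is the first to speak, etc."""
    T = activity.shape[0]
    first = [int(np.argmax(activity[:, k])) if activity[:, k].any() else T
             for k in range(activity.shape[1])]
    return activity[:, np.argsort(first, kind="stable")]

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on speaker-activity streams."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
```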
6. Experimental Evaluation
Streaming Sortformer is trained on Fisher, AMI-IHM, DIHARD III Dev, VoxConverse, ICSI, AISHELL-4, CALLHOME Part 1, AliMeeting, DiPCo. Evaluation uses DIHARD III Eval, CALLHOME Part 2, and CH109. Metrics include Diarization Error Rate (DER) with overlap, latency (chunk size + right context), and RTF. Baselines encompass BW-EDA-EEND (10 s latency), EEND-EDA+FW-STB, EEND-GLA+BW-STB, FS-EEND+VCT, LS-EEND (all 1 s latency), and offline Sortformer (infinite-latency).
Results:
- CALLHOME (all speakers): Streaming Sortformer-AOSC 13.32% DER vs. 14.93% (EEND-EDA+STB), 12.11% (LS-EEND)
- DIHARD (all speakers): 18.97% DER vs. 25.09% (EEND-EDA+STB), 19.61% (LS-EEND)
- Sub-second latency (0.32 s) yields only modest DER degradation: 19.32% (DIHARD), 11.50% (CALLHOME)
Streaming Sortformer surpasses its offline counterpart on long recordings, where the offline model (trained on 90 s segments) suffers a training-test length mismatch, demonstrating the streaming approach's suitability for extended multi-speaker scenarios.
7. Strengths, Limitations, and Extensions
Streaming Sortformer achieves end-to-end learned arrival-time ordering with no explicit attractor network or permutation solver. Permutation resolution is inherent to the framework: AOSC and Sort Loss enforce cross-chunk speaker consistency. Adaptive per-speaker memory allocation and low-latency operation (RTF < 0.2 at 0.3 s latency) are notable strengths.
Limitations include the current cap of four speakers, with extension to larger speaker counts planned. Reliance on NEST's 80 ms frame step may constrain temporal granularity; integrating finer-resolution pre-encoders could address this. Integration with streaming ASR to enable speaker-attributed transcription is a natural progression.
In sum, Streaming Sortformer with AOSC provides an efficient, accurate solution for low-latency, real-time speaker diarization, bridging offline performance with practical streaming requirements for multi-talker speech applications (Medennikov et al., 24 Jul 2025).