
Streaming Sortformer: Real-Time Speaker Diarization

Updated 29 January 2026
  • The paper introduces Streaming Sortformer, which employs an Arrival-Order Speaker Cache (AOSC) to dynamically allocate embeddings for robust, real-time speaker tracking.
  • It processes fixed-length audio chunks via a Transformer encoder and score-based selection that preserves arrival-time ordering while maintaining low latency.
  • Empirical evaluations demonstrate high diarization accuracy and operational efficiency, outperforming baselines on benchmarks like DIHARD and CALLHOME.

Streaming Sortformer is a streaming extension of the Sortformer speaker diarization framework, designed for real-time multi-speaker tracking with arrival-time ordering of output speakers. The framework introduces the Arrival-Order Speaker Cache (AOSC), which stores frame-level acoustic embeddings for previously observed speakers, with speaker indices assigned in order of arrival. This mechanism dynamically allocates embeddings per speaker through a score-driven selection process, ensuring efficient cache utilization and robust speaker tracking under low-latency conditions. Streaming Sortformer achieves high diarization accuracy and operational flexibility, establishing itself as a robust foundation for streaming multi-talker speech processing (Medennikov et al., 24 Jul 2025).

1. Architectural Overview

Streaming Sortformer processes input audio in fixed-length chunks, using a pre-encoder called NEST to extract frame-level embeddings from Mel-spectrogram features. Each chunk $C_n \in \mathbb{R}^{c \times d}$, where $c$ is the number of frames and $d$ is the embedding dimension (typically 512), is combined at each iteration with:

  • $B_n$: the AOSC output (up to $M$ frames)
  • $Q_n$: a FIFO queue containing $L$ recent embeddings
  • $C_n$: the current chunk plus $r$ frames of right context

These are concatenated into the inference sequence $E_n = [B_n; Q_n; C_n]$ and processed by a stack of Transformer encoder layers. Speaker activity predictions $P_n \in [0,1]^{(c+r) \times s}$ for up to $s \leq 4$ speakers are computed, with predictions emitted only for the current chunk. When an old chunk is removed from the FIFO $Q_n$, its embeddings are used to update the AOSC module, thereby maintaining the cache according to the arrival order and salience of speaker frames.
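
The per-chunk loop can be sketched schematically as follows. The Transformer encoder is replaced by an identity stand-in, and all names and sizes here (`speaker_head`, the toy `d=8`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def process_chunk(cache, fifo, chunk, encoder, speaker_head, c):
    """One streaming step: build E_n = [B_n; Q_n; C_n], run the encoder,
    and emit speaker activities for the current chunk's c frames only.

    cache : (M', d) AOSC embeddings (B_n)
    fifo  : (L', d) recent embeddings (Q_n)
    chunk : (c + r, d) current chunk plus right context (C_n)
    """
    e_n = np.concatenate([cache, fifo, chunk], axis=0)  # inference sequence
    h = encoder(e_n)                                    # Transformer stand-in
    p = 1.0 / (1.0 + np.exp(-h @ speaker_head))         # sigmoid activities
    # Predictions correspond to the chunk portion at the end of E_n;
    # only the c current-chunk frames (not the r right-context frames) are emitted.
    chunk_start = cache.shape[0] + fifo.shape[0]
    return p[chunk_start:chunk_start + c]               # (c, s) values in [0, 1]

# Toy usage with made-up sizes (d = 8 instead of 512 for brevity).
rng = np.random.default_rng(0)
d, s, c, r = 8, 4, 6, 2
cache = rng.normal(size=(10, d))
fifo = rng.normal(size=(12, d))
chunk = rng.normal(size=(c + r, d))
encoder = lambda x: x                                   # identity stand-in
speaker_head = rng.normal(size=(d, s))
preds = process_chunk(cache, fifo, chunk, encoder, speaker_head, c)
print(preds.shape)  # (6, 4): per-frame activities for up to s speakers
```

Note that only the $c$ current-chunk rows are returned; the $r$ right-context rows exist solely to give the encoder look-ahead.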

2. Arrival-Order Speaker Cache (AOSC)

AOSC is a fixed-size buffer maintaining frame-level NEST embeddings for speakers ordered by their arrival. It partitions the buffer into variable-length blocks, one for each active speaker $i \in \{1,\ldots,S\}$, storing a list of embeddings $E^c_i$ and inserting an averaged silence embedding $E^s$ after each speaker block. The cache is concatenated as $[E^c_1; E^s; E^c_2; E^s; \ldots; E^c_S; E^s]$, with the total number of frames not exceeding $M$. Speaker blocks always occupy contiguous positions, reflecting both arrival-slot assignment and local temporal order within each block.
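
A minimal sketch of this block layout, assuming hypothetical per-speaker blocks and a zero silence embedding; the budget check is simplified:

```python
import numpy as np

def build_cache(speaker_blocks, silence_emb, max_frames):
    """Concatenate per-speaker embedding blocks in arrival order,
    inserting one averaged silence embedding after each block:
    [E^c_1; E^s; E^c_2; E^s; ...; E^c_S; E^s].

    speaker_blocks : list of (n_i, d) arrays, list index = arrival order
    silence_emb    : (d,) averaged silence embedding E^s
    max_frames     : cache budget M (total rows must not exceed it)
    """
    parts = []
    for block in speaker_blocks:
        parts.append(block)
        parts.append(silence_emb[None, :])  # separator after each speaker
    cache = np.concatenate(parts, axis=0)
    assert cache.shape[0] <= max_frames, "cache exceeds the M-frame budget"
    return cache

d = 4
blocks = [np.ones((3, d)), 2 * np.ones((2, d))]  # two speakers, arrival order
e_s = np.zeros(d)                                # toy silence embedding
cache = build_cache(blocks, e_s, max_frames=8)
print(cache.shape)  # (7, 4): 3 + 1 + 2 + 1 rows
```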

3. Dynamic Update and Embedding Selection Mechanism

Upon arrival of a newly popped chunk $X_{new} \in \mathbb{R}^{c_r \times d}$, AOSC compresses the concatenation of the existing cache and $X_{new}$ to $M$ frames using a score-based selection procedure:

  • Speaker score for frame $k$ and speaker $i$:

$$S^k_i = \log P^k_i + \sum_{j \neq i} \log(1-P^k_j)$$

where $P^k_j$ is the Sortformer probability of speaker $j$ at frame $k$.

  • Non-speech frames (where $\forall i,\ P^k_i < \tau_{silence}$, e.g., $\tau_{silence}=0.1$) are averaged into $E^s$.
  • For speaker blocks, frames with $P^k_i < 0.5$ are masked out ($S^k_i = -\infty$).
  • A recency boost ($\delta$, e.g., $0.05$) is applied to embeddings from the most recent chunk.
  • Speaker-balance boost: the top $K$ strong-score frames per speaker receive a positive adjustment $\Delta$ (e.g., $\Delta = -\log 0.5 \approx 0.693$), enforcing a minimum number of embeddings per speaker.
  • Infinite-score slots per speaker correspond to copies of $E^s$, guaranteeing learnable silence transitions.
  • The $M$ highest-score indices are selected, ordered by ascending speaker index and temporal order within speaker blocks.

This dynamic allocation makes the number of embeddings per speaker $|E^c_i|$ data-driven, with a guaranteed floor of $K$ embeddings for active speakers.
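
The selection steps above can be sketched in deliberately simplified form. Silence-frame averaging and the infinite-score $E^s$ slots are omitted, and the function name, defaults, and toy inputs are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def select_cache_frames(probs, is_recent, M, K=2, delta=0.05, big=-1e9):
    """Score-based cache compression to M frames (simplified sketch).

    probs     : (N, s) Sortformer speaker probabilities P^k_i
    is_recent : (N,) bool, True for frames from the newly popped chunk
    Returns {speaker index: sorted list of kept frame indices}.
    """
    eps = 1e-8
    n, s = probs.shape
    logp = np.log(probs + eps)
    log1m = np.log(1.0 - probs + eps)
    # S^k_i = log P^k_i + sum_{j != i} log(1 - P^k_j)
    scores = logp + (log1m.sum(axis=1, keepdims=True) - log1m)
    scores[probs < 0.5] = big                  # mask weakly active frames
    scores[is_recent] += delta                 # recency boost
    Delta = -np.log(0.5)                       # speaker-balance boost
    for i in range(s):
        top_k = np.argsort(scores[:, i])[-K:]  # top-K frames of speaker i
        ok = scores[top_k, i] > big / 2        # only boost unmasked frames
        scores[top_k[ok], i] += Delta
    # Pick the M highest-scoring (frame, speaker) slots, drop masked ones,
    # then regroup by ascending speaker index and temporal order within blocks.
    flat = np.argsort(scores, axis=None)[-M:]
    flat = flat[scores.ravel()[flat] > big / 2]
    frame, spk = np.unravel_index(flat, scores.shape)
    return {i: sorted(frame[spk == i].tolist()) for i in range(s)}

rng = np.random.default_rng(1)
probs = rng.uniform(size=(20, 3))              # toy probabilities
is_recent = np.zeros(20, dtype=bool)
is_recent[-5:] = True                          # last 5 frames just popped
kept = select_cache_frames(probs, is_recent, M=8)
total = sum(len(v) for v in kept.values())
print(total)  # at most M = 8 embeddings survive the compression
```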

4. Computational Complexity and Latency

Streaming Sortformer operates efficiently in real time:

  • Transformer encoder self-attention spans $T = M + L + (c + r)$ frames, at computational cost $O(T^2 d)$.
  • AOSC scoring across $M + c_r$ frames incurs $O((M + c_r) S)$ cost.
  • Top-$M$ selection involves $O((M + c_r) \log(M + c_r))$ operations.

Empirical measurements on an NVIDIA RTX 6000 Ada show the following real-time factors (RTF):

  • RTF = 0.005 at 10 s latency
  • RTF = 0.093 at 1.04 s latency
  • RTF = 0.18 at 0.32 s latency

Memory usage for cache and FIFO is a few hundred kilobytes.
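
As a quick sanity check on these figures (real-time factor being processing time divided by audio duration), the reported RTFs imply the following compute per minute of audio:

```python
# RTF = processing_time / audio_duration, so a 60 s stream costs RTF * 60 s
# of compute. Latency/RTF pairs are the measured values quoted above.
for latency_s, rtf in [(10.0, 0.005), (1.04, 0.093), (0.32, 0.18)]:
    compute_s = rtf * 60.0
    print(f"latency {latency_s:>5} s -> {compute_s:.2f} s compute per 60 s audio")
```

All three operating points stay well under real time, with the expected trade-off that lower latency costs more compute.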

5. Training Objectives and Implementation

Offline Sortformer is first trained with permutation-invariant training (PIT) via binary cross-entropy on speaker activity streams, together with a "Sort Loss" that enforces speaker ordering according to the ground-truth first speech frame. Streaming Sortformer is then fine-tuned with windowed training (15 s steps, cache size 188 frames), using 90 s training segments with up to 4 speakers drawn from 5150 h of simulated and 2030 h of real data. No SpecAugment or noise/RIR augmentation is used; robustness stems from corpus diversity and the cache mechanism. Augmentations include random permutation of speaker cache slots and 50% right-context dropout per step.
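
The two augmentations mentioned (random permutation of speaker cache slots, 50% right-context dropout) can be sketched as follows; the function name and block shapes are hypothetical:

```python
import numpy as np

def augment_step(speaker_blocks, right_context, rng, p_drop=0.5):
    """Training-time augmentation (sketch): shuffle the slot assignment of
    cached speaker blocks, and drop the right-context frames for this step
    with probability p_drop."""
    order = rng.permutation(len(speaker_blocks))  # permute cache slots
    shuffled = [speaker_blocks[i] for i in order]
    rc = None if rng.random() < p_drop else right_context
    return shuffled, rc

rng = np.random.default_rng(0)
blocks = [np.full((2, 4), i, dtype=float) for i in range(4)]  # toy blocks
rc = np.zeros((3, 4))                                         # toy right context
shuffled, maybe_rc = augment_step(blocks, rc, rng)
print(len(shuffled))  # 4 blocks, in a new random slot order
```

Slot permutation discourages the model from associating a fixed cache position with a fixed speaker identity, while right-context dropout trains the model to also work at minimal look-ahead.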

6. Experimental Evaluation

Streaming Sortformer is trained on Fisher, AMI-IHM, DIHARD III Dev, VoxConverse, ICSI, AISHELL-4, CALLHOME Part 1, AliMeeting, DiPCo. Evaluation uses DIHARD III Eval, CALLHOME Part 2, and CH109. Metrics include Diarization Error Rate (DER) with overlap, latency (chunk size + right context), and RTF. Baselines encompass BW-EDA-EEND (10 s latency), EEND-EDA+FW-STB, EEND-GLA+BW-STB, FS-EEND+VCT, LS-EEND (all 1 s latency), and offline Sortformer (infinite-latency).

Results:

  • CALLHOME (all speakers): Streaming Sortformer-AOSC 13.32% DER vs. 14.93% (EEND-EDA+STB), 12.11% (LS-EEND)
  • DIHARD (all speakers): 18.97% DER vs. 25.09% (EEND-EDA+STB), 19.61% (LS-EEND)
  • Sub-second latency (0.32 s) yields only modest DER degradation: 19.32% (DIHARD), 11.50% (CALLHOME)

Streaming Sortformer surpasses its offline counterpart on long recordings (>5 speakers), avoiding training-test mismatch and demonstrating efficiency for extended multi-speaker scenarios.

7. Strengths, Limitations, and Extensions

Streaming Sortformer achieves end-to-end learned arrival-time ordering with no explicit attractor network or permutation solver. Permutation resolution is inherent to the framework: AOSC and Sort Loss enforce cross-chunk speaker consistency. Adaptive per-speaker memory allocation and low-latency operation (RTF < 0.2 at 0.3 s latency) are notable strengths.

Limitations include the current cap of $S=4$ speakers, with extension to $S=8$ planned. Reliance on NEST’s 80 ms frame step may constrain temporal granularity; integrating finer pre-encoders could address this. Integration with streaming ASR to enable speaker-attributed transcriptions is a natural progression.

In sum, Streaming Sortformer with AOSC provides an efficient, accurate solution for low-latency, real-time speaker diarization, bridging offline performance with practical streaming requirements for multi-talker speech applications (Medennikov et al., 24 Jul 2025).
