Streaming Sortformer: Real-Time Speaker Diarization
- The paper introduces Streaming Sortformer, which employs an Arrival-Order Speaker Cache (AOSC) to dynamically allocate embeddings for robust, real-time speaker tracking.
- It processes fixed-length audio chunks via a Transformer encoder and score-based selection that preserves arrival-time ordering while maintaining low latency.
- Empirical evaluations demonstrate high diarization accuracy and operational efficiency, outperforming baselines on benchmarks like DIHARD and CALLHOME.
Streaming Sortformer is a streaming extension of the Sortformer speaker diarization framework, designed for real-time multi-speaker tracking with arrival-time ordering of output speakers. The framework introduces the Arrival-Order Speaker Cache (AOSC), which stores frame-level acoustic embeddings for previously observed speakers, indexed by arrival order. This mechanism dynamically allocates embeddings per speaker through a score-driven selection process, ensuring efficient cache utilization and robust speaker tracking under low-latency conditions. Streaming Sortformer achieves high diarization accuracy and operational flexibility, establishing a robust foundation for streaming multi-talker speech processing (Medennikov et al., 24 Jul 2025).
1. Architectural Overview
Streaming Sortformer processes input audio in fixed-length chunks, using the NEST pre-encoder to extract frame-level embeddings from Mel-spectrogram features. Each chunk $\mathbf{C} \in \mathbb{R}^{T \times D}$, where $T$ is the number of frames and $D$ is the embedding dimension (typically 512), is combined at each iteration with:
- $\mathbf{A}$: the AOSC output (up to $N_{\text{cache}}$ frames)
- $\mathbf{F}$: a FIFO queue containing recent embeddings
- $\mathbf{C}^{+}$: the current chunk plus $R$ frames of right context
These are concatenated into the inference sequence $[\mathbf{A}; \mathbf{F}; \mathbf{C}^{+}]$ and processed by a stack of Transformer encoder layers. Speaker activity predictions for up to $K$ speakers are computed, with predictions emitted for only the current chunk. When an old chunk is popped from the FIFO $\mathbf{F}$, its embeddings are used to update the AOSC module, thereby maintaining the cache according to the arrival order and salience of speaker frames.
2. Arrival-Order Speaker Cache (AOSC)
AOSC is a fixed-size buffer maintaining frame-level NEST embeddings for speakers ordered by their arrival. It partitions the buffer into variable-length blocks $\mathbf{B}_k$, one per active speaker $k$, each storing a list of that speaker's embeddings, and inserts an averaged silence embedding $\bar{\mathbf{e}}_{\text{sil}}$ after each speaker block. The cache is concatenated as $\mathbf{A} = [\mathbf{B}_1, \bar{\mathbf{e}}_{\text{sil}}, \mathbf{B}_2, \bar{\mathbf{e}}_{\text{sil}}, \ldots]$, with the total number of frames not exceeding the cache capacity $N_{\text{cache}}$. Speaker blocks always occupy contiguous positions, reflecting both arrival-slot assignment and local temporal order within each block.
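The cache layout admits a very small sketch: contiguous per-speaker blocks in arrival order, each followed by the averaged silence embedding. The function and variable names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def assemble_cache(speaker_blocks, sil_emb, max_frames):
    """speaker_blocks: list of (n_k, D) arrays, list index = arrival order.
    sil_emb: (D,) averaged silence embedding inserted after each block."""
    parts = []
    for block in speaker_blocks:
        parts.append(block)
        parts.append(sil_emb[None, :])   # silence separator after the block
    cache = np.concatenate(parts, axis=0)
    assert cache.shape[0] <= max_frames, "cache budget exceeded"
    return cache
```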
3. Dynamic Update and Embedding Selection Mechanism
Upon arrival of a chunk $\mathbf{C}_{\text{pop}}$ newly popped from the FIFO, AOSC compresses the concatenation of the existing cache and $\mathbf{C}_{\text{pop}}$ to at most $N_{\text{cache}}$ frames (the cache capacity) using a score-based selection procedure:
- Speaker scores for frame $t$ and speaker $k$ are computed as
$$s_{t,k} = p_{t,k} - \sum_{k' \neq k} p_{t,k'},$$
where $p_{t,k}$ is the Sortformer probability of speaker $k$ at frame $t$, so frames dominated by a single speaker score highest.
- Non-speech frames (where all $p_{t,k}$ fall below a silence threshold) are averaged into the silence embedding $\bar{\mathbf{e}}_{\text{sil}}$.
- Within speaker blocks, frames whose probability falls below the threshold are masked out ($s_{t,k} = -\infty$).
- A recency boost (e.g., $+0.05$) is applied to embeddings from the most recent chunk.
- Speaker-balance boost: the top strong-score frames of each speaker receive a positive adjustment, enforcing a minimum number of embeddings per speaker.
- Infinite-score slots per speaker correspond to copies of $\bar{\mathbf{e}}_{\text{sil}}$, guaranteeing learnable silence transitions.
- The $N_{\text{cache}}$ highest-score indices are selected, ordered by ascending speaker index and temporal order within speaker blocks.
This dynamic allocation results in the number of embeddings per speaker being data-driven but with a guaranteed floor for active speakers.
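A hedged sketch of the core of this procedure: keep the highest-scoring frames, then restore cache order (ascending speaker index, temporal order within each block). Boosts and masking are assumed to be already folded into `scores`; all names here are illustrative assumptions.

```python
import numpy as np

def compress(embs, scores, speaker_ids, max_frames):
    """embs: (N, D); scores, speaker_ids: (N,); returns ordered survivors."""
    if len(embs) <= max_frames:
        keep = np.arange(len(embs))
    else:
        keep = np.argsort(scores)[-max_frames:]      # top-scoring frames
    # lexsort: primary key = speaker index, secondary = original frame index
    keep = keep[np.lexsort((keep, speaker_ids[keep]))]
    return embs[keep], speaker_ids[keep]
```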
4. Computational Complexity and Latency
Streaming Sortformer operates efficiently in real time:
- Transformer encoder self-attention is applied over the $L$ concatenated frames (cache, FIFO, chunk, and right context) at $O(L^2 D)$ computational cost.
- AOSC scoring is linear in the number of candidate frames and speakers.
- Top-$N_{\text{cache}}$ selection involves an $O(n \log n)$ sort over the $n$ candidate frames.
Empirical measurements on an NVIDIA RTX 6000 Ada show the following real-time factors (RTF):
- RTF = 0.005 at 10 s latency
- RTF = 0.093 at 1.04 s latency
- RTF = 0.18 at 0.32 s latency
Memory usage for cache and FIFO is a few hundred kilobytes.
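For concreteness, the real-time factor is simply processing time divided by audio duration; RTF below 1 means faster than real time. A one-line helper (illustrative, not from the paper):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return processing_seconds / audio_seconds
```

At the reported RTF of 0.18 with 0.32 s latency, each 0.32 s chunk is processed in roughly 0.06 s, leaving headroom within the latency budget.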
5. Training Objectives and Implementation
Offline Sortformer is first trained with permutation-invariant training (PIT) via binary cross-entropy on speaker activity streams, together with a "Sort Loss" that enforces output ordering by each speaker's first ground-truth speech frame. Streaming Sortformer is then fine-tuned with windowed training (15 s steps, cache size 188 frames), using 90 s training segments with up to 4 speakers drawn from 5150 h of simulated and 2030 h of real data. No SpecAugment or noise/RIR augmentation is applied; robustness stems from corpus diversity and the cache mechanism. Augmentations are limited to random permutation of speaker cache slots and 50% right-context dropout per step.
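The key idea of Sort Loss can be illustrated in a few lines: instead of searching over speaker permutations as in plain PIT, targets are reordered once by each speaker's first active frame, and standard binary cross-entropy is applied. This is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def sort_targets(activity):
    """activity: (T, K) binary speaker-activity matrix.
    Reorders columns so speaker 0 is the first to speak, etc."""
    T = activity.shape[0]
    first = [int(np.argmax(activity[:, k])) if activity[:, k].any() else T
             for k in range(activity.shape[1])]
    return activity[:, np.argsort(first, kind="stable")]

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on speaker-activity streams."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
```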
6. Experimental Evaluation
Streaming Sortformer is trained on Fisher, AMI-IHM, DIHARD III Dev, VoxConverse, ICSI, AISHELL-4, CALLHOME Part 1, AliMeeting, DiPCo. Evaluation uses DIHARD III Eval, CALLHOME Part 2, and CH109. Metrics include Diarization Error Rate (DER) with overlap, latency (chunk size + right context), and RTF. Baselines encompass BW-EDA-EEND (10 s latency), EEND-EDA+FW-STB, EEND-GLA+BW-STB, FS-EEND+VCT, LS-EEND (all 1 s latency), and offline Sortformer (infinite-latency).
Results:
- CALLHOME (all speakers): Streaming Sortformer-AOSC 13.32% DER vs. 14.93% (EEND-EDA+STB), 12.11% (LS-EEND)
- DIHARD (all speakers): 18.97% DER vs. 25.09% (EEND-EDA+STB), 19.61% (LS-EEND)
- Sub-second latency (0.32 s) yields only modest DER degradation: 19.32% (DIHARD), 11.50% (CALLHOME)
Streaming Sortformer surpasses its offline counterpart on long recordings, where the offline model (trained on 90 s segments) suffers a training-test length mismatch, demonstrating the streaming approach's suitability for extended multi-speaker scenarios.
7. Strengths, Limitations, and Extensions
Streaming Sortformer achieves end-to-end learned arrival-time ordering with no explicit attractor network or permutation solver. Permutation resolution is inherent to the framework: AOSC and Sort Loss enforce cross-chunk speaker consistency. Adaptive per-speaker memory allocation and low-latency operation (RTF < 0.2 at 0.3 s latency) are notable strengths.
Limitations include the current cap of four speakers, with extension to larger speaker counts planned. Reliance on NEST's 80 ms frame step may constrain temporal granularity; integrating finer-resolution pre-encoders could address this. Integration with streaming ASR to enable speaker-attributed transcription is a natural progression.
In sum, Streaming Sortformer with AOSC provides an efficient, accurate solution for low-latency, real-time speaker diarization, bridging offline performance with practical streaming requirements for multi-talker speech applications (Medennikov et al., 24 Jul 2025).