End-to-End Neural Diarization (EEND)
- EEND is a neural diarization framework that directly classifies frame-level speaker activity from acoustic features, effectively handling overlapped speech.
- It employs deep architectures such as BLSTM, Transformer, and Conformer with permutation-invariant training to optimize diarization error rates.
- Variants like LS-EEND and AED-EEND enable real-time, streaming diarization with dynamic speaker allocation and state-of-the-art performance on benchmarks.
End-to-End Neural Diarization (EEND) is a paradigm for speaker diarization that reformulates the task as direct, frame-level multi-label classification using deep neural networks. EEND approaches have demonstrated substantial superiority over clustering-based pipelines, particularly in the presence of overlapped speech, and enable the integration of advanced deep learning architectures and end-to-end optimization of Diarization Error Rate (DER) (Fujita et al., 2020). The following sections provide a comprehensive overview of EEND and its major contemporary variants, including LS-EEND, covering architectural principles, streaming and long-form diarization, training methodologies, empirical performance, and system-level implications.
1. Core Principles and Canonical EEND Framework
The foundation of EEND is the direct mapping from input acoustic features, typically log-Mel filterbanks with context stacking, to frame-wise speaker activity posteriors $\hat{y}_{t,s}$ for each of $S$ candidate speakers at every frame $t$. The network structure consists of:
- A deep encoder, originally BLSTM- or Transformer-based, that produces frame-level embeddings $\mathbf{e}_t$.
- A multi-label permutation-invariant training (PIT) objective, which considers all permutations of the output streams and minimizes binary cross-entropy with respect to the optimal reference ordering for each utterance, thus resolving the "label permutation" ambiguity:

$$\mathcal{L} = \frac{1}{TS} \min_{\phi \in \mathrm{perm}(S)} \sum_{t=1}^{T} \sum_{s=1}^{S} \mathrm{BCE}\big(\hat{y}_{t,s},\, y_{t,\phi(s)}\big)$$
The output is interpreted as frame-level multi-label speaker activity, naturally handling speech overlaps—i.e., multiple concurrent active speakers (Fujita et al., 2020).
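The permutation-invariant objective can be sketched in a few lines of NumPy. This is an illustrative toy version (exhaustive over permutations, so only practical for small speaker counts), not the authors' implementation:

```python
import itertools
import numpy as np

def bce(p, y, eps=1e-7):
    """Mean frame-wise binary cross-entropy between posteriors p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def pit_bce(posteriors, labels):
    """Permutation-invariant BCE: evaluate every assignment of the S output
    streams to the S reference speakers and keep the minimum-loss one.
    posteriors, labels: arrays of shape (T, S)."""
    S = posteriors.shape[1]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        loss = bce(posteriors[:, list(perm)], labels)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two output streams that match the reference only after swapping:
labels = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
posts = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]], dtype=float)
loss, perm = pit_bce(posts, labels)  # perm == (1, 0)
```

In practice the minimum over permutations is found with the Hungarian algorithm rather than exhaustive search once the speaker count grows.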
2. Architectural Evolution and Attractor Mechanisms
A central challenge in EEND is supporting variable and flexible numbers of speakers. The Encoder-Decoder Attractor (EDA) module enables this by producing a variable set of "attractor" vectors via a recurrent sequence-to-sequence architecture (LSTM or Transformer variant), conditioned on the encoder output (Horiguchi et al., 2021, Samarakoon et al., 2023). For each attractor $\mathbf{a}_s$, the model computes the frame-wise activity posterior $\hat{y}_{t,s} = \sigma(\mathbf{e}_t^{\top}\mathbf{a}_s)$. A speaker-existence mechanism (a sigmoid-activated scalar per attractor) dynamically selects the number of speakers during inference.
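A minimal NumPy sketch of the attractor-based output layer follows. The existence-classifier weights `w`, `b` and all data here are illustrative stand-ins, not parameters from the cited papers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attractor_activities(embeddings, attractors):
    """Frame-wise speaker activity: y[t, s] = sigmoid(e_t . a_s).
    embeddings: (T, D) encoder outputs; attractors: (S, D)."""
    return sigmoid(embeddings @ attractors.T)

def count_speakers(attractors, w, b, threshold=0.5):
    """Speaker existence: one sigmoid-activated scalar per attractor;
    decoding stops at the first attractor below the threshold."""
    exist = sigmoid(attractors @ w + b)
    n = 0
    for p in exist:
        if p < threshold:
            break
        n += 1
    return n, exist

rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 16))   # stand-in encoder embeddings
att = rng.standard_normal((4, 16))     # stand-in attractors
y = attractor_activities(emb, att)     # shape (100, 4), values in (0, 1)
n, exist = count_speakers(att, rng.standard_normal(16), 0.0)
```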
Notable architectural variants and enhancements include:
- Transformer and Conformer Encoders: Employ multi-head self-attention (Transformer), or hybrid convolutional-attention blocks (Conformer) to capture both long-range speaker relations and local context. Conformer-based EEND further reduces DER, especially when conversational statistics match target data (Liu et al., 2021, Liang et al., 2024).
- Attention-based Decoder Replacements: Transformer-based attractor decoders (as in AED-EEND, EEND-TA) improve both convergence and efficiency compared to LSTM-based EDA (Chen et al., 2023, Samarakoon et al., 2023).
- Embedding Enhancement: Embedding enhancer modules (cross-attention refinements post-attractor extraction) further improve discrimination and robustness to unseen speaker numbers (Chen et al., 2023).
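The cross-attention refinement idea behind such enhancer modules can be sketched as follows. This is a single-head, residual toy version with no learned projections; the published modules are multi-head and trained end-to-end:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def enhance_attractors(attractors, frame_embeddings):
    """Refine each attractor by attending over the frame embeddings,
    with a residual connection (toy single-head enhancer)."""
    return attractors + cross_attention(attractors, frame_embeddings,
                                        frame_embeddings)

rng = np.random.default_rng(1)
att = rng.standard_normal((3, 8))       # 3 attractors, dimension 8
emb = rng.standard_normal((50, 8))      # 50 frame embeddings
refined = enhance_attractors(att, emb)  # same shape as att
```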
3. Streaming and Long-Form Diarization (LS-EEND and Related)
Standard self-attention incurs quadratic complexity in sequence length, limiting conventional EEND to short to medium-length audio. LS-EEND achieves true online, frame-synchronous diarization with linear temporal complexity, enabling diarization of hour-long recordings in streaming mode (Liang et al., 2024).
- Causal Conformer Encoder: Each block uses multi-head "retention" (a recurrence-compatible attention surrogate), causal convolutions (preserving causality), and L2-normalization of embeddings.
- Online Attractor Decoder: Maintains per-speaker attractors $\mathbf{a}_s$, updated at every frame using along-time retention ($O(1)$ state update per frame), and cross-attractor self-attention in the speaker dimension for enhanced speaker separation.
- Retention Mechanism: Retention, replacing self-attention softmax with an exponential decay mask, allows efficient accumulation of past context in both training (parallel recurrent updates) and inference (sequential recurrence).
- Frame-in-Frame-out Processing: Every input frame directly updates all speaker attractors and yields a diarization prediction with minimal delay (≤1 s latency with lookahead).
- Progressive Training: Multi-stage curriculum over increasing speaker count and audio length, chunk-wise retention during long-form adaptation, and output-anchored losses for direct scale-up.
- Performance: LS-EEND achieves new state-of-the-art online DER on CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), and AMI (20.76%) with linear time complexity (RTF ≈ 0.028), outperforming buffer-based and block-wise streaming systems by 3–7 DER points while reducing computational cost by an order of magnitude (Liang et al., 2024).
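The dual parallel/recurrent form of retention, which underlies the linear complexity claimed above, can be illustrated with a single-head toy example (no learned projections or normalization; both functions compute the same output):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training-time (parallel) form: attention-like product with an
    exponential decay mask D[t, u] = gamma**(t - u) for u <= t, else 0."""
    T = Q.shape[0]
    idx = np.arange(T)
    D = np.where(idx[:, None] >= idx[None, :],
                 float(gamma) ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference-time (recurrent) form: constant-size state per step,
    S_t = gamma * S_{t-1} + k_t v_t^T,  o_t = q_t S_t."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
# The two forms agree; the recurrent one needs only O(1) memory per frame.
```

The equivalence holds because unrolling the recurrence gives $o_t = \sum_{u \le t} \gamma^{t-u} (q_t \cdot k_u)\, v_u$, exactly the masked parallel product.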
4. Training Methodologies and Data Simulation
EEND training is characterized by large-scale simulation, domain adaptation, and recent advances in "teacher forcing" and enhanced data synthesis:
- Data Simulation: Both simulated mixtures (random pause distributions) and more realistic simulated conversations (empirical pause/overlap statistics) are used. The latter better emulates target data conversational structure, reducing reliance on fine-tuning and improving generalization (Landini et al., 2022).
- Curriculum Learning: Progressive increase in the number of speakers and utterance length, as in LS-EEND's staged training (Liang et al., 2024).
- Loss Functions: PIT binary cross-entropy for diarization, auxiliary losses (e.g., embedding similarity, intermediate attractor losses), and cross-entropy over auxiliary subtasks (speech activity detection, overlap detection) in multitask setups (Chen et al., 2023, Takashima et al., 2021).
- Pseudo-label and Semi-supervised Training: Iterative pseudo-labeling and committee-based fusion allow effective adaptation to unlabeled domains, yielding up to 37.4% relative DER reduction without ground-truth frame annotations (Takashima et al., 2021).
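As a toy illustration of label simulation, frame-level speaker activity can be drawn from independent two-state Markov chains. Real simulated conversations fit the on/off statistics to corpus pause and overlap distributions rather than the fixed probabilities assumed here:

```python
import numpy as np

def simulate_activity(n_speakers, n_frames, rng, p_on=0.05, p_off=0.02):
    """Draw frame-level multi-label speaker activity from independent
    two-state Markov chains: a silent speaker starts talking with
    probability p_on, a talking speaker stops with probability p_off."""
    labels = np.zeros((n_frames, n_speakers), dtype=int)
    state = np.zeros(n_speakers, dtype=int)
    for t in range(n_frames):
        u = rng.random(n_speakers)
        state = np.where(state == 0,
                         (u < p_on).astype(int),    # silent -> maybe start
                         (u >= p_off).astype(int))  # active -> maybe stop
        labels[t] = state
    return labels

labels = simulate_activity(3, 500, np.random.default_rng(0))
overlap_rate = (labels.sum(axis=1) >= 2).mean()  # overlapped-speech fraction
```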
5. Empirical Performance and Benchmarking
EEND and its derivatives establish state-of-the-art results across diarization benchmarks. Representative best accuracies as reported:
| System | CALLHOME | DIHARD II | DIHARD III | AMI | RTF |
|---|---|---|---|---|---|
| LS-EEND (Liang et al., 2024) | 12.11% | 27.58% | 19.61% | 20.76% | 0.028 |
| AED-EEND+Enh/Conformer (Chen et al., 2023) | 10.08% | 24.64% | 13.00% | — | — |
LS-EEND and AED-EEND variants, without oracle SAD, robustly outperform prior online and offline systems, including those based on offline EEND-EDA, FLEX-STB, and buffer-based inference (Liang et al., 2024, Chen et al., 2023).
6. Extensions: Multi-Channel, Unlimited-Speaker, and System Calibration
- Multi-Channel Diarization: EEND extends naturally to distributed microphone settings via spatio-temporal and co-attention encoder variants, leveraging spatial diversity for superior DER, even in asynchronous or spatially ambiguous conditions (Horiguchi et al., 2021).
- Unlimited-number-of-Speakers: Offline and online block-wise inference coupled with clustering over local attractors allows EEND-GLA to handle recordings with more speakers than seen during training, by relaxing the cap on output tracks via post-hoc clustering (Horiguchi et al., 2022).
- Calibration and Fusion: Probability-level calibration (joint or per-speaker), probability-space fusion, and "Fuse-then-Calibrate" schemes significantly reduce DER and enable use of EEND outputs in risk-aware diarization or system combination. Soft probability fusion outperforms hard-segmentation methods such as DOVER-Lap (Alvarez-Trejos et al., 2025).
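A minimal sketch of probability-space fusion with affine log-odds calibration. The fusion weights and calibration parameters `a`, `b` would be learned on held-out data; this is an illustrative simplification, not the cited method's exact formulation:

```python
import numpy as np

def fuse_posteriors(posterior_list, weights=None):
    """Probability-space fusion: weighted average of frame-level speaker
    posteriors from several systems, assuming outputs are already aligned
    to a common speaker ordering (e.g. via permutation matching)."""
    P = np.stack(posterior_list)                  # (n_systems, T, S)
    if weights is None:
        weights = np.full(len(posterior_list), 1.0 / len(posterior_list))
    return np.tensordot(weights, P, axes=1)       # (T, S)

def calibrate(posteriors, a, b, eps=1e-7):
    """Affine calibration in log-odds space: sigmoid(a * logit(p) + b)."""
    p = np.clip(posteriors, eps, 1 - eps)
    logit = np.log(p) - np.log1p(-p)
    return 1.0 / (1.0 + np.exp(-(a * logit + b)))

p1 = np.array([[0.2, 0.9]])   # system 1 posteriors, one frame, two speakers
p2 = np.array([[0.8, 0.7]])   # system 2 posteriors
fused = fuse_posteriors([p1, p2])  # [[0.5, 0.8]]
```

With `a = 1, b = 0` calibration is the identity; fitting `a`, `b` on development data adjusts over- or under-confident posteriors before thresholding.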
7. Methodological Implications and Future Directions
EEND and LS-EEND unify embedding extraction, attractor generation, and diarization into a single, end-to-end optimized, causal network, enabling deployment in real-time, low-latency scenarios such as meeting transcription, conferencing, and streaming ASR front-ends. While current architectures cap maximum speakers by decoder dimension or block design, ongoing advances in unsupervised attractor clustering, dynamic attractor selection, semi-supervised adaptation, and multimodal conditioning continue to extend EEND's flexibility and accuracy envelope (Liang et al., 2024, Alvarez-Trejos et al., 2025, Horiguchi et al., 2022).
Recent results confirm that self-attention and retention-based architectures, progressive and adversarial training, probability calibration, and multi-task learning (including ASR feature conditioning) all contribute significantly to state-of-the-art diarization performance, robust to long-session, multi-speaker, and overlapping-speech conditions.