Serialized Output Training Scheme Overview
- Serialized Output Training (SOT) is an end-to-end approach that converts complex multi-speaker/multi-domain transcriptions into a serialized sequence using role and boundary tokens.
- It employs a single decoder to generate multiple outputs sequentially, simplifying traditional architectures and reducing inference complexity compared to permutation-invariant methods.
- SOT has been applied in ASR, joint ASR+ST, and multi-label classification, yielding improved accuracy and lower latency in streaming recognition.
Serialized Output Training (SOT) is an end-to-end technique that reformulates multi-speaker and multi-domain sequence modeling as a single-branch, autoregressive generation problem, leveraging explicit serialization of output targets with task-specific structural tokens. SOT originated in overlapped multi-speaker automatic speech recognition (ASR), where the challenge of transcribing concurrent utterances precluded traditional one-output-per-input architectures. By imposing a linear ordering and special tokens between constituent targets (e.g., speaker-change or task markers), SOT enables a single decoder to emit all necessary outputs sequentially, supporting arbitrary speaker counts and modalities and facilitating efficient streaming recognition and translation.
1. Conceptual Foundations and Serialization Principles
SOT addresses the problem of mapping complex, overlapping input domains (such as multi-talker speech mixtures, joint transcription + translation, or multi-label classification) into a single output stream. The key principle involves constructing a serialized target sequence by concatenating each individual ground-truth output with separator tokens that demarcate boundaries (e.g., ⟨sc⟩ for speaker change, ⟨ASR⟩/⟨ST⟩ for domain switch), followed by an end-of-sequence marker (Kanda et al., 2020). For multi-speaker ASR with S speakers, this means:

Y = [Y_π(1), ⟨sc⟩, Y_π(2), ⟨sc⟩, …, Y_π(S), ⟨eos⟩]

where Y_s is the transcript for speaker s, and the permutation π is determined by a data-driven criterion (e.g., FIFO onset order, learned dominance, or minimal loss) (Shi et al., 2024). SOT generalizes naturally to other tasks:
- Joint ASR+ST: interleave source and target language tokens using statistical or neural alignment, marked by task tokens (Papi et al., 2023, Papi et al., 2023).
- Multi-label XMC: serialize candidate labels with an evolving sparse prediction mask (Ullah et al., 2024).
This serialized representation provides a unified training and inference target, simplifying sequence prediction architectures and supporting variable output cardinality.
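The serialization described above can be made concrete with a short sketch. This is an illustrative construction, not code from any cited system; the token spellings `<sc>` and `<eos>` and the `(onset, tokens)` input format are assumptions:

```python
SC = "<sc>"    # speaker-change boundary token (assumed spelling)
EOS = "<eos>"  # end-of-sequence token (assumed spelling)

def serialize_targets(utterances):
    """Build a serialized SOT target from per-speaker references.

    utterances: list of (onset_time, token_list), one entry per speaker.
    Uses FIFO ordering: speakers are sorted by utterance onset time.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    target = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            target.append(SC)   # boundary between consecutive speakers
        target.extend(tokens)
    target.append(EOS)
    return target

# Two overlapping speakers; the second entry starts earlier, so it comes first.
mix = [(1.2, ["hello", "there"]), (0.4, ["good", "morning"])]
print(serialize_targets(mix))
# ['good', 'morning', '<sc>', 'hello', 'there', '<eos>']
```

Because the target is a single flat sequence, the same routine handles any speaker count, which is the property that lets one decoder serve variable output cardinality.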
2. Architectures, Model Variants, and Token Integration
The canonical SOT instantiation employs a single attention-based encoder-decoder (AED) or streaming transducer (TT/RNN-T), extended only by augmenting the output vocabulary with boundary and role tokens. No additional output heads or branches are required—context across speakers or domains is captured implicitly in the decoder history (Kanda et al., 2020, Kanda et al., 2022). Notable architectural variants include:
- Classic AED SOT: BLSTM encoder + LSTM decoder; uses ⟨sc⟩ for speaker changes (Kanda et al., 2020).
- Streaming t-SOT: Transformer-Transducer backbone with time-sorted token insertion for streaming operation and latency optimization (Kanda et al., 2022).
- Boundary-aware SOT (BA-SOT): Auxiliary decoder head for speaker-change detection and boundary constraint regularization, enforcing temporal boundaries in cross-attention (Liang et al., 2023).
- Speaker-aware SOT (SA-SOT): Token-level masking loss and speaker-similarity-modified self-attention to minimize cross-speaker context bleeding (Fan et al., 2024).
- Dominance-based SOT (DOM-SOT): Auxiliary CTC module computes per-speaker dominance scores for serialization order; AED trained to produce output in dominance-derived sequence (Shi et al., 2024).
- SOT in joint ASR+ST: Shared decoder emits both transcription and translation tokens, interleaved by linguistic alignment and domain tags (Papi et al., 2023, Papi et al., 2023).
- SOT for XMC and LLM prompting: Sparse classifier layers and serialized prompt extraction for high-cardinality labels or multi-speaker LLM input scaffolding (Ullah et al., 2024, Shi et al., 1 Sep 2025).
Token integration is managed by simply extending the output vocabulary and adapting post-processing to segment serialized outputs according to markers.
3. Training Methodologies and Objective Functions
Training in SOT is governed by cross-entropy on the serialized target, optionally combined with auxiliary frame-level or boundary-aware objectives. The standard loss is:

L_SOT = −∑_{n=1}^{N} log P(y_n | y_1, …, y_{n−1}, X)

where y_n is the n-th token in the serialized sequence of length N and X is the input. Label-permutation considerations are handled by either:
- Exhaustive minimum-loss search over all S! reference permutations (factorial cost, rare in practice),
- First-in-first-out ordering by onset time (O(S) complexity, widely used) (Kanda et al., 2020),
- Model-driven ordering via dominance scores or learned serialization modules (Shi et al., 2024).
Additional terms may include:
- CTC loss (frame-wise monotonic alignment) on the non-⟨sc⟩ token sequence (Liang et al., 2023, Shi et al., 2024).
- Speaker-change detection or boundary loss via auxiliary classifier (Liang et al., 2023).
- Masked speaker loss for identity-aware training (Fan et al., 2024).
- Hybrid CTC/Attention loss after encoder separation modules (Shi et al., 2024).
- Meta-classifier or cluster loss for sparse output label tasks (Ullah et al., 2024).
The training pipeline is otherwise equivalent to single-output or single-task models, with serialization logic implemented in data preprocessing and target generation.
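The cross-entropy objective above can be illustrated with a toy sketch. The fixed table of next-token distributions stands in for decoder outputs and is purely an assumption for illustration; a real system would compute P(y_n | y_<n, X) from the model:

```python
import math

def sot_cross_entropy(target_tokens, predicted_dists):
    """Per-token negative log-likelihood of the serialized target sequence.

    target_tokens:   serialized reference, e.g. ["hi", "<sc>", "yo", "<eos>"]
    predicted_dists: one dict of token probabilities per target position
    """
    nll = 0.0
    for tok, dist in zip(target_tokens, predicted_dists):
        nll -= math.log(dist[tok])
    return nll / len(target_tokens)

# Toy serialized target and made-up decoder distributions.
target = ["hi", "<sc>", "yo", "<eos>"]
dists = [
    {"hi": 0.9, "yo": 0.1},
    {"<sc>": 0.8, "<eos>": 0.2},
    {"yo": 0.7, "hi": 0.3},
    {"<eos>": 0.95, "<sc>": 0.05},
]
print(round(sot_cross_entropy(target, dists), 4))  # 0.1841
```

Note that the boundary token ⟨sc⟩ is scored like any other token, which is exactly what makes the pipeline equivalent to single-output training once serialization is done in preprocessing.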
4. Comparative Analysis: SOT versus PIT and Alternative Schemes
Permutation Invariant Training (PIT) was the de facto strategy for multi-output mapping, requiring multiple parallel output heads and minimization over all output-reference permutations. PIT suffers from scalability limitations (fixed maximum output count, factorial training cost), inability to model inter-output dependencies, and high inference complexity (Kanda et al., 2020).
SOT overcomes these limitations by:
- Using a single decoder with serialization, supporting arbitrary speaker or domain counts,
- Avoiding permutation matching at inference (the output order is fixed during training via start times or model-based rules),
- Allowing inter-output context via sequential decoding (previous output history accessible to the decoder),
- Achieving state-of-the-art recognition accuracy with lower architectural and operational complexity (Kanda et al., 2020, Kanda et al., 2022, Shi et al., 2024).
Boundary-aware, speaker-aware, and dominance-based SOT further extend these benefits, mitigating misplacement of boundary tokens and optimizing output order adaptively (Liang et al., 2023, Fan et al., 2024, Shi et al., 2024).
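The cost contrast between PIT and SOT can be sketched directly. The pairwise loss table below is made-up data for illustration, and the function names are not from any cited codebase:

```python
import itertools
import math

def pit_best_assignment(pairwise_loss):
    """PIT-style assignment: score all S! output-reference permutations.

    pairwise_loss[i][j] is the loss of output head i against reference j.
    """
    S = len(pairwise_loss)
    best = math.inf
    for perm in itertools.permutations(range(S)):  # S! candidates
        best = min(best, sum(pairwise_loss[i][perm[i]] for i in range(S)))
    return best

def sot_order(onsets):
    """SOT-style ordering: a single sort by onset, no permutation search."""
    return sorted(range(len(onsets)), key=lambda i: onsets[i])

losses = [[0.2, 0.9], [0.8, 0.3]]
print(pit_best_assignment(losses))  # 0.5 (identity assignment wins)
print(sot_order([1.2, 0.4]))        # [1, 0]
```

For S speakers, the PIT search grows factorially while the SOT ordering is a single sort, which mirrors the scalability argument made above.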
5. Application Domains and Empirical Results
SOT and its derivatives have been applied to numerous complex sequence tasks:
- Multi-speaker ASR: Significant reductions in WER compared to PIT and single-branch streaming baselines. Variable-S training achieves generalized models for arbitrary speaker counts; e.g., SOT yields 4.6/11.2/24.0% WER on LibriSpeech for 1/2/3 speakers (Kanda et al., 2020). State-of-the-art streaming WER (13.7%/15.5% on AMI development/evaluation) is achieved by integrating VarArray separation and t-SOT (Kanda et al., 2022).
- Speaker-aware and boundary-aware ASR: SA-SOT and BA-SOT methods achieve up to 22% relative cpWER reduction and 19.9% relative UD-CER reduction by explicit context modeling and boundary regularization (Fan et al., 2024, Liang et al., 2023).
- Joint ASR and translation: SOT architectures (INTER TIME, ALIGN) outperform separate and multitask systems for ASR+ST, halving inference cost and reducing latency by ≈200 ms while achieving equivalent or superior WER/BLEU (Papi et al., 2023, Papi et al., 2023).
- High-cardinality classification and XMC: SOT in dynamic sparse classifiers reduces GPU memory footprint by 3–5x (e.g., 13.5 GiB vs 46.3 GiB for 2.8M labels at 83% sparsity), with auxiliary objectives recovering dense accuracy (Ullah et al., 2024).
- LLM-based multi-talker ASR: SOP-based SOT pipelines enable structured prompting for LLMs, improving multi-talker recognition in challenging overlap conditions (Shi et al., 1 Sep 2025).
- Joint ASR and speaker-role tagging: Augmenting Whisper with SOT and role tokens yields >10% reduction in multi-talker WER across datasets (Xu et al., 12 Jun 2025).
A representative summary of empirical gains in multi-speaker ASR is shown below; entries are drawn from different datasets and metrics (WER vs. CER), so values are indicative rather than directly comparable.
| Model Type | 2-spk WER (%) | 3-spk WER (%) | Speaker Counting Accuracy (%) |
|---|---|---|---|
| PIT | 11.9 | — | — |
| SOT (fifo) | 11.2 | 24.0 | 74.2 (3→3 correct) |
| SA-SOT | 4.45 | — | — |
| BA-SOT | 20.6 (CER) | — | — |
| DOM-SOT | 5.56 | 9.96 | — |
6. Limitations and Directions for Future Work
Key limitations of SOT include:
- Occasional underestimation of speaker count in closely overlapped mixtures (e.g., 24% of 3-way mixtures classified as 2 speakers) (Kanda et al., 2020).
- Assumption that speaker onset times do not coincide (rare, ties broken arbitrarily).
- Sensitivity of boundary token placement—frequent turns and overlap increase risk of misplacement, mitigated by boundary-aware strategies (Liang et al., 2023).
- In streaming and high-cardinality settings, gradient starvation in sparse output layers necessitates auxiliary adaptations (Ullah et al., 2024).
- For LLM-based ASR, vanilla SOT underperforms in highly complex, overlapping scenes without explicit serialized prompting scaffolds (Shi et al., 1 Sep 2025).
Active research seeks to address these constraints via more advanced separation modules, joint diarization and ASR with speaker-identity tags, multi-head attention, auxiliary training objectives, dynamic output fusion, and application of SOT-style serialization to joint recognition, translation, and speaker role annotation across modalities (Liang et al., 2023, Fan et al., 2024, Papi et al., 2023).
7. Cross-domain Extensions and Methodological Implications
SOT is extensible beyond speech recognition and translation:
- In extreme multi-label classification, serialization of label outputs aligned to learned dominance or confidence metrics supports scalable training for millions of categories (Ullah et al., 2024).
- For joint multi-modal tasks, serialized interleaving of tokens with attention to emission timestamps, explicit role tags, or alignment information enables unified models spanning recognition, translation, tagging, and structured generation (Papi et al., 2023, Papi et al., 2023, Xu et al., 12 Jun 2025, Shi et al., 1 Sep 2025).
- Methodologically, SOT reframes sequence modeling by decoupling output dimensionality from architectural complexity, introducing a flexible target structure amenable to variable output counts, streaming inference, and cross-domain transfer learning.
A plausible implication is that serialized output training will further influence end-to-end modeling in scenarios where output ordering, structure, or cardinality is variable, such as multi-role dialogue, joint ASR-NLU pipelines, and data-augmented LLM prompting. Research continues into model-driven serialization criteria, adaptive boundary and role token placement, and integration with unsupervised speaker and domain identification.