Full-Duplex Dialogue System
- Full-duplex dialogue systems are interactive agents that enable simultaneous bidirectional audio, supporting overlapping speech, mid-utterance interruptions, and backchannels.
- They deploy either engineered synchronization with explicit state management or end-to-end learned approaches to achieve low-latency, coherent conversation.
- Evaluation metrics include turn-taking accuracy, speech latency, and interruption success rate, while challenges remain in data scarcity and noise robustness.
A full-duplex dialogue system is an interactive speech-based agent that enables real-time, simultaneous bidirectional communication—allowing both the user and the agent to speak, listen, and "think" concurrently. Unlike traditional turn-based (half-duplex) systems that strictly alternate between listening and speaking, full-duplex systems are designed to model key aspects of natural dialogue: overlapping speech, mid-utterance interruptions ("barge-in"), backchannels, and tightly coordinated turn-taking (Chen et al., 18 Sep 2025). Achieving true full-duplexity in spoken dialogue systems (SDS) requires sophisticated architectural, algorithmic, and training strategies to resolve concurrency, minimize latency, manage behavioral arbitration, and sustain semantic coherence across continuous multimodal streams. Contemporary research has converged on a split between modular (engineered synchronization) and end-to-end (learned synchronization) paradigms, both aiming for scalable, interpretable, and robust synchronous interaction.
## 1. Defining Full-Duplex Dialogue: Formal Characterizations
Full-duplex dialogue systems permit simultaneous streaming of user and agent audio, with each channel being encoded, processed, and produced in parallel at low latency. Mathematically, true full-duplex operation can be described by two token streams, $S^\mathcal{E} = (e_1, \dots, e_T)$ (user/environment) and $S^\mathcal{A} = (a_1, \dots, a_T)$ (agent), which are temporally aligned such that

$P\bigl(S^\mathcal{E}, S^\mathcal{A}\bigr) = \prod_{t=1}^{T} P\bigl(e_t, a_t \mid S^\mathcal{E}_{<t}, S^\mathcal{A}_{<t}\bigr) \tag{1}$

with loss function

$\mathcal{L}_{\mathrm{NTPP}}(\theta) = -\mathbb{E}_{(S^\mathcal{E}, S^\mathcal{A})} \sum_{t=1}^{T} \log P\bigl(e_t, a_t \mid S^\mathcal{E}_{<t}, S^\mathcal{A}_{<t}; \theta\bigr) \tag{2}$
This "next-token-pair prediction" (NTPP) paradigm is foundational to end-to-end approaches that treat both channels as causally dependent, enabling overlapping output and concurrent processing (Chen et al., 18 Sep 2025; Ge et al., 26 Sep 2025). Engineered synchronization approaches may instead mediate concurrency with explicit state machines or control heads atop modular ASR–LM–TTS pipelines (Liao et al., 19 Feb 2025).
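The NTPP objective of Eq. (2) can be made concrete with a minimal sketch. The `pair_logprob` callable below is a stand-in for any model that scores the next (user, agent) token pair given both channel histories; the toy uniform model and tiny vocabularies are purely illustrative:

```python
import math

def ntpp_loss(pair_logprob, user_stream, agent_stream):
    """Mean negative log-likelihood of jointly predicting the next
    (user, agent) token pair from both channel histories, as in Eq. (2)."""
    assert len(user_stream) == len(agent_stream)  # channels are time-aligned
    loss = 0.0
    for t in range(len(user_stream)):
        # P(e_t, a_t | S^E_{<t}, S^A_{<t}); `pair_logprob` is a stand-in model
        loss -= pair_logprob(user_stream[:t], agent_stream[:t],
                             user_stream[t], agent_stream[t])
    return loss / len(user_stream)

# Toy stand-in model: uniform joint distribution over a 4x4 pair vocabulary.
uniform = lambda e_hist, a_hist, e, a: math.log(1.0 / 16)

user  = [0, 1, 2, 3]   # e.g. codec tokens on the user/environment channel
agent = [3, 2, 1, 0]   # codec tokens on the agent channel
print(ntpp_loss(uniform, user, agent))  # per-step NLL = log(16) ≈ 2.77
```

In a real system the two streams would be discrete audio (or interleaved audio/text) tokens and `pair_logprob` a causal transformer over both channels; the loss structure stays the same.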
Critical behavioral phenomena modeled by true FD systems include:
- Floor-holding: Agent continues to speak while user attempts to interrupt, requiring immediate recognition and response (barge-in management).
- Backchanneling: Agent can emit acknowledgments or encouragements ("uh-huh", "go on") mid-user-utterance without forcing a user turn switch.
- Precise low-latency turn arbitration: Agent transitions between listening and speaking within 100–300 ms, matching or exceeding human conversation standards (Chen et al., 18 Sep 2025; Peng et al., 25 Jul 2025).
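The 100–300 ms arbitration window above can be checked directly from timestamped channel events; a minimal sketch, where the event format and the exact band thresholds are illustrative assumptions:

```python
def floor_transfer_offset(user_end_s, agent_start_s):
    """Offset between the end of the user's turn and the agent's first audio.
    Positive = gap; negative = overlap (agent started before the user finished)."""
    return agent_start_s - user_end_s

def within_human_band(offset_s, lo=0.100, hi=0.300):
    """Human-like turn transitions typically land in roughly 100-300 ms."""
    return lo <= offset_s <= hi

print(within_human_band(floor_transfer_offset(4.20, 4.38)))  # 180 ms gap -> True
print(within_human_band(floor_transfer_offset(4.20, 4.85)))  # 650 ms gap -> False
```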
## 2. Taxonomies: Architectural Approaches to Full-Duplexity
Full-duplex systems can be classified into two primary paradigms (Chen et al., 18 Sep 2025):
A) Engineered Synchronization (Modular)
- Separate "duplex controller" or FSM mediates between ASR, LLM, and TTS components, decoupling logic for flexibility. Examples: FlexDuo (Liao et al., 19 Feb 2025), FireRedChat (Chen et al., 8 Sep 2025), Easy Turn (Li et al., 28 Sep 2025).
- Plug-and-play designs allow full-duplex overlays atop legacy half-duplex pipelines without retraining core models (Liao et al., 19 Feb 2025; Chen et al., 8 Sep 2025).
- Controllers predict among three or more dialogue states (Listen, Speak, Idle) and issue corresponding actions at fine-grained intervals (e.g., every 120 ms in FlexDuo).
- Streaming dialogue managers use semantic VADs (e.g., LLM-based (Zhang et al., 19 Feb 2025)) or bimodal turn-detection heads, often trained to robustly classify more granular states than simple VAD.
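A toy three-state controller in the spirit of FlexDuo's Listen/Speak/Idle arbitration, making one decision per frame (e.g., every 120 ms). The transition rules here are deliberately simplified assumptions, not the published policy:

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"      # neither party holds the floor
    LISTEN = "listen"  # user speaking; forward audio to ASR
    SPEAK = "speak"    # agent responding; keep monitoring for barge-in

def step(state, user_voice, agent_has_reply):
    """One arbitration decision per frame (e.g. every 120 ms).
    Simplified rule set: user speech always wins the floor (barge-in)."""
    if user_voice:
        return State.LISTEN            # includes barge-in while agent speaks
    if agent_has_reply:
        return State.SPEAK
    return State.IDLE

# Simulate a few frames: user talks, keeps talking, pauses while the agent
# replies, then barges in mid-response.
frames = [(True, False), (True, False), (False, True), (True, True)]
state = State.IDLE
trace = []
for user_voice, agent_has_reply in frames:
    state = step(state, user_voice, agent_has_reply)
    trace.append(state.name)
print(trace)  # ['LISTEN', 'LISTEN', 'SPEAK', 'LISTEN']
```

A production controller would replace the two boolean inputs with semantic VAD scores or classifier logits, and would gate transitions on hysteresis to avoid flapping on noisy frames.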
B) Learned Synchronization (End-to-End, E2E)
- Monolithic models are trained to autoregressively consume and produce both user and agent audio streams, sometimes with embedded text or control tokens (Zhang et al., 2024; Cui et al., 10 Aug 2025; Yu et al., 2024).
- Direct NTPP or hierarchical, graph-based causal structures (e.g., SCoT's chain-of-thought blockwise dependencies (Arora et al., 2 Oct 2025)) govern the streaming generation process, supporting simultaneous listening and speaking.
- Representations range from codec-free continuous embeddings (e.g., SALMONN-omni (Yu et al., 2024)) to neural codec tokenization (Moshi, SyncLLM (Veluri et al., 2024; Ohashi et al., 3 Jun 2025)), or mixtures of both.
- End-to-end decision-making enables emergent handling of interruptions, backchannels, and echo cancellation without explicit state modules (Yu et al., 2024).
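One common E2E serialization, used by chunk-interleaved designs in the SyncLLM family, flattens the two synchronized channels into a single autoregressive sequence of fixed-size chunks prefixed by speaker tags; a minimal sketch, with illustrative tag names and chunk size:

```python
def interleave_chunks(user_tokens, agent_tokens, chunk=2,
                      user_tag="<user>", agent_tag="<agent>"):
    """Flatten two time-aligned token streams into one sequence of
    alternating fixed-size chunks. Each speaker tag doubles as a timing
    anchor, so the model always knows whose channel the next chunk fills."""
    assert len(user_tokens) == len(agent_tokens)  # streams are synchronized
    seq = []
    for i in range(0, len(user_tokens), chunk):
        seq += [user_tag] + user_tokens[i:i + chunk]
        seq += [agent_tag] + agent_tokens[i:i + chunk]
    return seq

seq = interleave_chunks(["u1", "u2", "u3", "u4"], ["a1", "a2", "a3", "a4"])
print(seq)
# ['<user>', 'u1', 'u2', '<agent>', 'a1', 'a2',
#  '<user>', 'u3', 'u4', '<agent>', 'a3', 'a4']
```

Because each chunk covers a fixed slice of wall-clock time, the model can tolerate network jitter: the decoder simply keeps emitting agent chunks at the fixed cadence while user chunks arrive asynchronously.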
| Paradigm | Control/Turn Arbitration | Example Systems / Models |
|---|---|---|
| Modular (Engineered Sync) | FSM, LLM-based state heads, VAD | FlexDuo, FireRedChat, Easy Turn |
| End-to-End (Learned Sync) | NTPP, explicit/implicit tokens | SALMONN-omni, OmniFlatten, SCoT |
## 3. Component Methods and Behavioral Arbitration
Modular FSM/Control-Head Approaches
Modular systems enforce explicit state transitions, typically among at least three dialogue states:
- Listen: User speaking; forward audio to ASR.
- Speak: Agent responding; monitor for user interrupts.
- Idle: Neither party has the floor; filter out noise and non-informative backchannels (Liao et al., 19 Feb 2025).
The state manager predicts atomic actions—maintain state, perform transitions (e.g., Listen→Speak)—every 100–200 ms window. The controller may use:
- Bimodal classifiers (audio+text) (Li et al., 28 Sep 2025)
- LLM-based semantic VADs for nuanced control tokens such as <|Continue-Listening|>, <|Start-Speaking|>, <|Start-Listening|>, <|Continue-Speaking|>, capturing query incompleteness, intentional/unintentional barge-ins, and hold/continue logic (Zhang et al., 19 Feb 2025).
Plug-and-Play Modularity: Control modules (e.g., FlexDuo) are trainable independently from ASR/LLM/TTS, allowing integration or replacement without retraining or re-architecting core models (Liao et al., 19 Feb 2025).
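The plug-and-play pattern can be sketched as a thin overlay wrapping pre-existing, unmodified ASR/LLM/TTS callables; the class and method names below are an illustration of the idea, not any system's actual API:

```python
class DuplexOverlay:
    """Mediates between frozen half-duplex components. Only the lightweight
    `controller` would be trained or swapped out; the core models stay fixed."""

    def __init__(self, asr, llm, tts, controller):
        self.asr, self.llm, self.tts = asr, llm, tts   # frozen components
        self.controller = controller                    # trainable duplex logic

    def on_audio_frame(self, frame):
        action = self.controller(frame)                 # e.g. "listen" / "speak"
        if action == "listen":
            return None                                 # keep buffering into ASR
        text = self.asr(frame)
        return self.tts(self.llm(text))                 # produce a reply

# Stub components standing in for real ASR/LLM/TTS models.
overlay = DuplexOverlay(
    asr=lambda f: "hello",
    llm=lambda t: t.upper(),
    tts=lambda t: f"[audio:{t}]",
    controller=lambda f: "speak" if f == "silence" else "listen",
)
print(overlay.on_audio_frame("voice"))    # None (keep listening)
print(overlay.on_audio_frame("silence"))  # [audio:HELLO]
```

The design point is the interface boundary: because the controller only sees frames and emits actions, it can be retrained or replaced per language or domain while the ASR/LLM/TTS stack ships unchanged.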
E2E/Joint Streaming Methods
End-to-end systems (e.g., SCoT, SALMONN-omni, OmniFlatten) operate under causal prediction:
- Simultaneously encode context from both user and agent streams and emit control/behavior tokens (e.g., <start_speak>, <end_speak>).
- Train with hierarchical objectives: ASR, chain-of-thought/intent, semantic text, and audio token prediction (blockwise or synchronous).
- Planning-inspired strategies (e.g., TurnGuide) segment assistant speech into turns and emit turn-level plans ahead of speech, aligning insertion timing precisely (Cui et al., 10 Aug 2025; Arora et al., 2 Oct 2025).
- Explicit modeling of speaker states enables responsiveness to interruptions (interrupt latency, barge-in response) and robust emergent turn-taking.

Notable technical advances:

- Chain-of-Thought streaming (SCoT): blockwise forced alignments produce interpretable reasoning chains and reduced latency (Arora et al., 2 Oct 2025).
- Codec-free full-duplex LLMs (SALMONN-omni): continuous embeddings avoid quantization bottlenecks, enabling fully differentiable streaming behavior and integrated echo cancellation (Yu et al., 2024).
- Real-time synchronization: synchronous LLMs use chunked interleaved streams with explicit timing anchors (speaker tags) to ensure alignment under arbitrary (~160–240 ms) network latency (Veluri et al., 2024).

## 4. Evaluation, Benchmarking, and Metrics

Rigorous, reproducible evaluation is foundational for benchmarking progress in full-duplex dialogue. State-of-the-art systems are assessed across four primary axes (Chen et al., 18 Sep 2025; Peng et al., 25 Jul 2025; Lin et al., 9 Oct 2025):

1. Temporal Dynamics
   - First-Turn Offset (FTO): time between user turn end and agent response start.
   - Speech Latency (SL): mean token/chunk-level system latency.
   - Interrupt-Response Delay (IRD), First-Speech-Emit Delay (FSED) (Peng et al., 25 Jul 2025).

2. Behavioral Arbitration
   - Interruption Success Rate (ISR): fraction of correct barge-in or interruption terminations.
   - Early-Interrupt Rate (EIR) and Success-Interrupt Rate (SIR): timely and accurate interruption management (Peng et al., 25 Jul 2025).
   - Turn-taking accuracy, overlap ratio, transition prediction F1 (Chen et al., 18 Sep 2025; Arora et al., 2 Oct 2025).

3. Semantic Coherence & Conversational Quality
   - Perplexity of streamed agent responses.
   - Multi-turn instruction following, stage completion (e.g., the Full-Duplex-Bench-v2 stage-gated examiner (Lin et al., 9 Oct 2025)).
   - Conditioned Perplexity (C-PPL), dialogic QA accuracy, entity/co-reference tracking.

4. Acoustic and Perceptual Performance
   - Mean Opinion Score (MOS): naturalness and global speech quality.
   - UTMOS for TTS/fused audio.
   - Robustness under noise and channel conditions.

Benchmark frameworks such as FLEXI (Ge et al., 26 Sep 2025), Full-Duplex-Bench(-v2) (Lin et al., 9 Oct 2025), FD-Bench (Peng et al., 25 Jul 2025), and others provide LLM-examiner-driven, task-specialized, and latency-anchored evaluation, enabling side-by-side comparisons of open and closed systems under multi-turn, interruption-rich scenarios.

## 5. Datasets, Training, and Multilingual Instantiations

High-quality full-duplex data remains a core bottleneck (Chen et al., 18 Sep 2025):

- Real stereo conversational corpora (e.g., Fisher, AMI, ICSI) are limited; most systems leverage large-scale synthetic speech dialogues generated via TTS, with controlled insertion of backchannels, interruptions, and staged goals (Lin et al., 9 Oct 2025; Ohashi et al., 3 Jun 2025).
- Multilingual extensions (e.g., the Japanese full-duplex Moshi adaptation (Ohashi et al., 3 Jun 2025)) require vocabulary swaps, language-specific text head weights, and often fine-tuning of pre-trained neural codecs and temporal transformers to match new language statistics and turn-taking conventions (e.g., backchanneling rates).
- Modular pipelines (Easy Turn, FlexDuo, FireRedChat) support cross-lingual extension via retraining or prompt tuning of the lightweight duplex controller, whereas monolithic E2E models require complete multi-language corpora.
- Recent architectures employ hybrid corpora that combine simulated dialogue (via staged, controllable LLM/TTS synthesis), adversarial noise/interruption injection, and human labeling for critical events (Arora et al., 2 Oct 2025; Pan et al., 25 Dec 2025).

## 6. Limitations, Open Challenges, and Future Research Directions

Full-duplex dialogue modeling faces several persistent challenges:

- Data Scarcity: A deficit of natural two-/multi-party, multi-turn, overlapped spoken dialogue datasets stifles representation learning for authentic backchanneling and interruption patterns (Chen et al., 18 Sep 2025).
- Architectural Fragmentation: Lack of convergence on primitives for synchronization, control, and interface standardization. Divergent codec (token indices vs. continuous embeddings), controller, and fusion designs impede transferability (Chen et al., 18 Sep 2025; Yu et al., 2024).
- Evaluation Gaps: Many systems lack in-depth, scenario-specific behavioral benchmarks, with limited stress-testing of correction, safety, and semantic consistency under pressure (Lin et al., 9 Oct 2025; Peng et al., 25 Jul 2025).
- Latency–Intelligence Trade-offs: Architectures that minimize latency sometimes underperform on semantic or context-tracking metrics; conversely, high-level semantic chains (CoT) can add inference overhead (Arora et al., 2 Oct 2025).
- Noise and Environment Robustness: Handling false-positive interrupts due to background noise and distinguishing intentional from unintentional barge-ins (Li et al., 28 Sep 2025; Liao et al., 19 Feb 2025).
- Safety and Real-Time Filtering: Live filtering for policy/safety under simultaneous output, and robust management of agent-initiated barge-ins, remain underexplored (Chen et al., 18 Sep 2025; Lin et al., 9 Oct 2025).

Future research priorities include:

- End-to-end architectures scalable to multi-modal, multi-party, and multi-lingual contexts, with integrated safety filters and content moderation (Yao et al., 2 Jun 2025; Yu et al., 2024; Pan et al., 25 Dec 2025).
- Realistic conversational data generation, automated diarization for pseudo-multichannel bootstrapping, and expansion of open multi-turn, multi-overlap corpora (Ohashi et al., 3 Jun 2025; Peng et al., 25 Jul 2025).
- A unified API and sync-token specification for system interoperability and adoption (Chen et al., 18 Sep 2025).
- Incorporation of visual (lip, gaze) and paralinguistic cues into duplex dialogue (Zhang et al., 19 Feb 2025; Yu et al., 2024).
- Adaptive, context-sensitive management of chunking, buffering, and decision thresholds to optimize both latency and robustness (Liao et al., 19 Feb 2025; Peng et al., 25 Jul 2025).

## 7. Historical Context and Industrial Deployment

The notion of full-duplex dialogue has evolved from classical telephony principles: half-duplex corresponds to push-to-talk radios (exactly one party speaking at a time), full-duplex to simultaneous telephone conversations (Lin et al., 2022). Early implementations focused on modular pipelines with voice activity segmentation and barge-in detectors (Lin et al., 2022). Recent years have witnessed a transition to LLM-driven, deeply integrated E2E solutions, ranging from production-scale customer service deployments at Alibaba (Lin et al., 2022) to open-source E2E speech-text LLMs and low-latency, omnimodal agents (Zhang et al., 2024; Yao et al., 2 Jun 2025).
Empirical studies robustly demonstrate improvements in floor-transfer latency, interruption precision, and subjective naturalness, with deployment-level reductions in user-perceived latency of nearly 50% and >8% improvement in barge-in precision against leading commercial systems (Wang et al., 2024; Lin et al., 2022).

---

In summary, full-duplex dialogue systems constitute a rapidly maturing area at the intersection of speech processing, LLMs, and behavioral modeling, with architectures spanning modular plug-in controllers to monolithic E2E transformers. Advances in behavioral arbitration, efficient joint stream modeling, turn reasoning via chain-of-thought, and multi-modal fusion are establishing new technical baselines. Persistent challenges in data, scalability, standardization, and safety define the current research frontier (Chen et al., 18 Sep 2025; Lin et al., 9 Oct 2025; Peng et al., 25 Jul 2025).