
Listening-while-Speaking Language Model (LSLM)

Updated 2 December 2025
  • LSLM is a speech-driven neural system designed for continuous, interactive spoken communication by interleaving listening and speaking in real time.
  • It employs multi-stream architectures and chain-of-thought reasoning to optimize timing, interruption handling, and latency–accuracy trade-offs.
  • Empirical evaluations reveal rapid response times, robust interruption detection, and enhanced performance under noisy and dynamic conditions.

A Listening-while-Speaking LLM (LSLM) is a speech-driven neural system designed for full-duplex, real-time spoken interaction, integrating both continual perception of user audio and immediate response generation. Unlike turn-based dialogue systems, LSLMs fuse listening and speaking by interleaving input streams, controlling output timing, and supporting interruption, which yields human-like responsiveness and robust interaction even under challenging reasoning or latency constraints. Recent frameworks implement LSLMs using sophisticated multi-stream architectures, semantic triggers, preference-driven optimization strategies, and explicit mechanisms for policy adaptation, interruption detection, and streaming fusion.

1. Fundamental Architectures in Listening-while-Speaking LMs

LSLMs employ multi-stream or unified autoregressive designs to process input and output concurrently. For speech reasoning, a canonical architecture utilizes streams for user audio tokens A^{U}_t, system audio tokens A^{S}_t, and system text tokens T^{S}_t, all tightly time-aligned. At each timestep, the streams are updated as follows:

  • A^{U}_t, A^{S}_t, T^{S}_t \xrightarrow{\text{Temporal Transformer}} T^{S}_{t+1}
  • T^{S}_{t+1} \xrightarrow{\text{Depth Transformer}} A^{S}_{t+1}

The model maximizes p(A^{S}_{t+1}, T^{S}_{t+1} \mid A^{S}_{\leq t}, T^{S}_{\leq t}, A^{U}_{\leq t}) via negative log-likelihood over all streams. Text tokens are padded and interleaved such that audio and monologue channels remain fully revisable mid-utterance.
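The per-timestep update above can be sketched as a minimal loop. The two "transformers" here are hypothetical stubs standing in for the real Temporal and Depth modules; only the data flow between the three token streams follows the text.

```python
# Minimal sketch of the three-stream timestep update. `temporal_transformer`
# and `depth_transformer` are stand-ins for the real modules; here they are
# stubs that return deterministic placeholder token ids.

def temporal_transformer(user_audio, sys_audio, sys_text):
    # Predict the next system text token T^S_{t+1} from all three
    # time-aligned streams (stub).
    return hash((tuple(user_audio), tuple(sys_audio), tuple(sys_text))) % 1000

def depth_transformer(next_text_token):
    # Predict the next system audio token A^S_{t+1} conditioned on the
    # freshly generated text token (stub).
    return (next_text_token * 7 + 3) % 1000

def step(user_audio, sys_audio, sys_text, new_user_token):
    """One full-duplex timestep: ingest a user audio token (listening),
    then emit the next system text and audio tokens (speaking)."""
    user_audio.append(new_user_token)                       # A^U_t
    t_next = temporal_transformer(user_audio, sys_audio, sys_text)
    a_next = depth_transformer(t_next)                      # A^S_{t+1}
    sys_text.append(t_next)
    sys_audio.append(a_next)
    return t_next, a_next

# Usage: a few timesteps of simultaneous listening and speaking.
ua, sa, st = [], [], []
for tok in [11, 42, 7]:
    step(ua, sa, st, tok)
print(len(ua), len(sa), len(st))
```

The point of the sketch is that all three streams advance in lockstep: each incoming user token immediately conditions the next system text and audio tokens.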

Alternative approaches include speech-to-speech LLMs with implicit chain-of-thought (ICoT) internalization that gradually drop explicit transcription steps in training, as described for A-T-A systems, compressing ASR reasoning into latent model states (Yuen et al., 2024). End-to-end designs may also fuse streaming self-supervised encoders for live audio with decoder-only TTS blocks and integrate both channels at multiple points (early/middle/late fusion) for robust interruption handling (Ma et al., 2024). Modular full-duplex systems coordinate LLMs with neural finite state machines (FSM), streaming ASR, and TTS, presenting interaction as next-token autoregression on a serialized tape (Wang et al., 2024).

2. Reasoning and Timing: Chain-of-Thought and Question Completeness

Complex spoken reasoning in LSLMs leverages chain-of-thought (CoT) methodologies. Systems are fine-tuned on triplets (Q^A,\ R^T,\ A^A), where in the target text stream the transcribed question Q^T precedes the reasoning R^T (bracketed by <start_cot> / <end_cot>) and the answer A^T. Standard next-token cross-entropy is applied:

L_{\rm SFT} = -\sum_t \log \pi_\theta(u_t \mid u_{<t})

yielding substantial accuracy boosts (2.4× baseline on reasoning tasks, e.g., ARC-E from 30.2% to 77.7%) (Shih et al., 8 Oct 2025).
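The serialization of one training triplet can be illustrated in a few lines. The exact marker spelling and spacing below are assumptions for illustration; only the <start_cot> / <end_cot> bracketing follows the text.

```python
def format_cot_example(question, reasoning, answer):
    """Serialize one (Q, R, A) triplet into a single training string for
    next-token cross-entropy SFT, with the reasoning span bracketed by the
    <start_cot> / <end_cot> markers described above."""
    return f"{question} <start_cot> {reasoning} <end_cot> {answer}"

# Hypothetical example, not from the cited benchmarks.
sample = format_cot_example(
    "Which gas do plants absorb?",
    "Plants perform photosynthesis, which consumes CO2.",
    "Carbon dioxide",
)
print(sample)
```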

To reduce latency, LSLMs implement semantic triggers for early reasoning via a question-completeness score \zeta(p):

\zeta(p) = 1 - D_{KL}[X_N \| X_p] \,/\, D_{KL}[X_N \| X_0]

where X_p denotes the distribution over reasoning and answer given only the first p words of the question. A threshold \theta sets the inflection point, enabling reasoning to begin before the end of the spoken query. Entropy proxies serve as simpler alternatives but are less robust. This mechanism yields fine-grained latency–accuracy trade-offs and traces out convex Pareto frontiers.
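The completeness score can be computed directly from the definition. The three answer distributions below are toy numbers, not model outputs; they illustrate that \zeta(p) is 0 when a prefix is as uninformative as no question at all and 1 when it already pins down the full-question distribution.

```python
import math

def kl(p, q):
    # Discrete KL divergence D_KL(p || q) over matched supports.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def completeness(x_full, x_prefix, x_empty):
    """Question-completeness zeta(p) = 1 - KL(X_N||X_p) / KL(X_N||X_0)."""
    return 1 - kl(x_full, x_prefix) / kl(x_full, x_empty)

# Toy distributions over 3 candidate answers (hypothetical numbers).
x_full   = [0.90, 0.05, 0.05]   # X_N: conditioned on the whole question
x_empty  = [1 / 3, 1 / 3, 1 / 3]  # X_0: no question seen yet
x_prefix = [0.70, 0.20, 0.10]   # X_p: first p words only

zeta = completeness(x_full, x_prefix, x_empty)
theta = 0.75  # trigger threshold from the text
print(round(zeta, 3), zeta >= theta)
```

Here the prefix already carries most of the full question's information, so \zeta(p) clears the \theta = 0.75 threshold and reasoning would begin before the query ends.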

3. Policy-Making and Simultaneous Generation

LSLMs incorporate explicit policy-makers to optimize when to emit responses. In simultaneous generation settings, LLM-driven frameworks like LSG prompt the LLM to choose an action in \{\mathrm{READ},\, \mathrm{WRITE}\} at each time step. The core policy improvement relies on comparing the KL divergence between the current and baseline next-token distributions:

\Delta_{KL} = D_{KL}(p_{\mathrm{cur}} \,\|\, p_{\mathrm{base}})

where writing is triggered if \Delta_{KL} > \delta or if model confidence exceeds \alpha. This approach achieves state-of-the-art latency–quality trade-offs and does not require offline policy-module training (Guo et al., 1 Jan 2025).

Full-duplex LSLMs implement FSM-driven control tokens for responsive behaviors: [S.SPEAK], [C.SPEAK], [S.LISTEN], [C.LISTEN]. Each step involves maximizing either control-token or content-token probabilities,

x_t^* = \arg\max_x P(x \mid x_{<t}, a_{\leq t})

with the transition function \delta(s,c) explicitly formalized.
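A minimal sketch of the FSM-driven control loop follows. The four control tokens come from the text; the transition table itself is an assumption for illustration, since the source does not spell out \delta(s,c).

```python
# Sketch of an FSM for duplex control. State names and the transition
# semantics (start/continue speaking, start/continue listening) are
# illustrative assumptions.

TRANSITIONS = {
    ("LISTEN", "[S.SPEAK]"):  "SPEAK",   # start speaking
    ("SPEAK",  "[C.SPEAK]"):  "SPEAK",   # continue speaking
    ("SPEAK",  "[S.LISTEN]"): "LISTEN",  # yield the floor
    ("LISTEN", "[C.LISTEN]"): "LISTEN",  # keep listening
}

def delta(state, control_token):
    """Transition function delta(s, c): next dialogue state given the
    current state and an emitted control token (self-loop on unknown)."""
    return TRANSITIONS.get((state, control_token), state)

# Usage: one listen -> speak -> listen exchange.
state = "LISTEN"
for tok in ["[C.LISTEN]", "[S.SPEAK]", "[C.SPEAK]", "[S.LISTEN]"]:
    state = delta(state, tok)
print(state)  # back to "LISTEN" after the exchange
```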

4. Streaming Fusion and Interruption Handling

LSLMs realize simultaneous listening and speaking via streaming fusion. Middle fusion, which injects listening embeddings into each Transformer layer, performs best, preserving speech synthesis quality (WER near baseline) while providing rapid, precise interruption detection (precision/recall/F1 often \geq 97\% under noise) (Ma et al., 2024). Early fusion corrupts generation quality, while late fusion yields less robust interruption boundaries.
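The middle-fusion idea can be shown with a toy numeric forward pass: the listening-channel embedding is added into the hidden state at every layer of the speaking decoder, rather than only at the input (early fusion) or the output (late fusion). Real systems use learned projections per layer; here each layer is a stand-in elementwise transform.

```python
def layer(hidden):
    # Stand-in for one Transformer decoder layer (toy transform).
    return [2 * h for h in hidden]

def middle_fusion_forward(speak_emb, listen_emb, n_layers=3):
    """Toy middle-fusion pass: the streaming listening embedding is
    injected into the hidden state before every layer."""
    hidden = list(speak_emb)
    for _ in range(n_layers):
        hidden = [h + l for h, l in zip(hidden, listen_emb)]  # inject
        hidden = layer(hidden)
    return hidden

out = middle_fusion_forward([1.0, 0.5], [0.1, 0.2])
print(out)
```

Because the listening signal touches every layer, the speaking path can react to it at multiple depths, which is the intuition the text gives for middle fusion's superior interruption detection.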

Interruption handling is enacted via special vocabulary tokens (e.g., IRQ), with the loss function:

\mathcal{L}_{\mathrm{LS}}(\theta) = \begin{cases} -\sum_{t=1}^{t_{\mathrm{IRQ}}} \log P_\theta(r^q_t \mid R^q_{1:t-1}, S^p_{1:t-1}, C) & \text{(with interruption)} \\ -\sum_{t=1}^{T_{\mathrm{EOS}}} \log P_\theta(r^q_t \mid R^q_{1:t-1}, S^p_{1:t-1}, C) & \text{(no interruption)} \end{cases}

Interruptions are detected within 0.5 s and model output ceases accordingly.
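The two-case loss amounts to truncating the summation at the IRQ token when an interruption occurs. A minimal sketch, with hypothetical per-token log-probabilities in place of real model outputs:

```python
def ls_loss(token_logprobs, tokens, irq_token="IRQ"):
    """Listening-while-speaking loss sketch: negative sum of response-token
    log-probs, truncated at the IRQ token when the user interrupts (so the
    model is only trained to speak up to the interruption point); otherwise
    the sum runs to EOS."""
    if irq_token in tokens:
        cutoff = tokens.index(irq_token) + 1  # include the IRQ token itself
    else:
        cutoff = len(tokens)                  # run to end of sequence
    return -sum(token_logprobs[:cutoff])

# Hypothetical log-probs for a 5-token response, interrupted at token 3.
lps = [-0.1, -0.2, -0.3, -0.4, -0.5]
print(ls_loss(lps, ["hi", "there", "IRQ", "pad", "pad"]))
print(ls_loss(lps, ["hi", "there", "friend", "bye", "EOS"]))
```

The interrupted case stops accumulating loss after the IRQ token, mirroring the upper branch of the equation; the uninterrupted case sums over the whole sequence, mirroring the lower branch.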

5. Optimization Strategies and Accuracy–Latency Trade-Offs

Direct Preference Optimization (DPO) extends LSLM fine-tuning to maximize the Pareto frontier for accuracy and latency. LSLMs sample contrastive pairs, preferring shorter or more accurate traces:

L_{\rm DPO}(\pi_\theta; \pi_{\rm ref}) = -\,\mathbb{E}_{(x,y_w,y_l)} \left[ \log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{\rm ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\rm ref}(y_l|x)} \right] \right) \right]

Adding an NLL regularization on the preferred path stabilizes training.
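For a single preference pair, the loss is a one-liner over four sequence log-probabilities. The numbers and \beta below are hypothetical; a real implementation would batch this over sampled contrastive traces.

```python
import math

def dpo_loss(logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref, beta=0.1):
    """Per-pair DPO loss sketch: -log sigmoid(beta * margin), where the
    margin is the policy-vs-reference log-ratio of the preferred trace y_w
    minus that of the dispreferred trace y_l."""
    margin = (logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical sequence log-probs: the policy already favors y_w slightly.
loss = dpo_loss(logp_w_theta=-10.0, logp_w_ref=-12.0,
                logp_l_theta=-15.0, logp_l_ref=-13.0, beta=0.1)
print(round(loss, 4))
```

When the margin is zero the loss equals log 2; it decreases as the policy pushes probability mass toward the preferred (shorter or more accurate) trace relative to the reference.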

Empirical findings show:

  • CoT fine-tuning: \sim2–3× gain on reasoning accuracy
  • Early reasoning (QC-based, \theta = 0.75): 75\% latency reduction at a 14\% drop in absolute accuracy
  • DPO for early CoT: restores 3–4\% accuracy with minimal latency cost
  • Length-based DPO: shrinks reasoning traces by \sim70\% (e.g., 50 \to 15 tokens) with no accuracy degradation (Shih et al., 8 Oct 2025)

6. Evaluation, Benchmarks, and Performance Metrics

LSLMs are evaluated under scenarios including ARC-E, ARC-C, SIQA, PIQA, GSM8K, LibriSpeech, and multi-agent social deduction environments:

  • Response latency: subsecond FTED, with >50\% of responses under 500 ms (Wang et al., 2024).
  • Interruption precision: LSLM achieves an 8\% absolute gain over commercial models.
  • Duplexing robustness: models sustain WER close to vanilla TTS even under heavy noise, and precise turn-taking (F1 up to 98\% in controlled settings) (Ma et al., 2024).
  • Multi-agent games: listening-while-speaking agents double win rates compared to policy-only RL baselines, producing human-like grounded discussions and accurate hidden-state inference (Sarkar et al., 9 Feb 2025).

7. Limitations, Open Questions, and Future Directions

Current limitations include:

  • Model generalization under real-world accents and background noise.
  • Turn-taking beyond stop events, overlapping talkers, and multi-modal interruptions.
  • Computation overhead at inference, scaling for very large LLMs (quantization and sparse prediction are plausible future remedies) (Deng et al., 2024).

Open directions include integrating domain-specific adapters for speech or dialogue, extending LSLM capabilities to continuous multi-turn, cross-lingual, and multimodal (e.g., audio-visual) settings, and refining policy and preference learning for richer simultaneous interaction. Empirical extension to real recordings, low-resource scenarios, and rapid resetting for dynamic dialogue segmentation remain essential avenues.


LSLMs represent the foundational shift toward continuous, responsive, and reasoning-capable spoken language systems, leveraging multi-stream architectures, explicit reasoning triggers, streaming fusion for interruption resilience, and advanced optimization techniques to balance accuracy and latency for real-world deployment (Shih et al., 8 Oct 2025, Yuen et al., 2024, Ma et al., 2024, Wang et al., 2024, Guo et al., 1 Jan 2025, Deng et al., 2024, Sarkar et al., 9 Feb 2025, Novitasari et al., 2020).
