SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Published 17 May 2025 in cs.CL and cs.AI | (2505.17060v1)

Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30\% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel architecture integrating a streaming speech encoder, LLM backbone, and speech synthesizer for efficient full-duplex speech processing.
The model employs a unique codec-free integration and dynamic thinking strategy to enable smooth turn-taking and backchanneling in natural conversations.
Reinforcement learning enhancements yield a 35.9% performance boost in turn-taking and dialogue accuracy, demonstrating improved conversational fluidity.

SALMONN-omni: A Standalone Speech LLM for Full-duplex Conversation

Introduction and Background

SALMONN-omni introduces a novel method for full-duplex speech interaction without the need for codec injection in its token space. It operates as a standalone model capable of simultaneously processing speech input and output, significantly improving conversational AI's fluidity and responsiveness. Traditional duplex systems, using modular architectures, suffer from performance degradation and error accumulation due to the necessity of auxiliary components like voice activity detectors and dialogue controllers. Conversely, SALMONN-omni achieves superior performance by leveraging a new dynamic thinking mechanism that facilitates transitions between listening and speaking states.

The model's innovations lie in integrating a streaming speech encoder, an LLM backbone devoid of codec tokens, and a speech synthesizer. This architecture enhances interaction by interleaving speech and textual embeddings into the LLM, achieving seamless state transitions necessary for natural conversation dynamics like turn-taking and backchanneling.

Figure 1: The architecture of SALMONN-omni. Two input streams, the environment stream and the assistant stream, are processed by the streaming speech encoder.

Methodology

SALMONN-omni employs a comprehensive methodology encompassing several core innovations:

Codec-free Streaming Integration: The streaming speech encoder enables SALMONN-omni to handle long-sequence conversational contexts while fine-tuning embeddings through log-Mel filter bank features fed into Mamba LLM blocks. This approach omits codec tokens, thus preserving the LLM's text-based functionality and avoiding modality gaps.
Full-duplex Processing: A unified stream sequence composed of environment and assistant speech streams is interleaved into the model's processing pipeline. SALMONN-omni accommodates the environmental audio stream while generating assistant responses, effectively managing real-time echo-related challenges through continuous interleaving of input-output dynamics.
Figure 2: Illustration of "Implicit" and "Explicit" thinking strategies. The tokens on the top of the LLM are predicted by speech embeddings, while the bottom ones are predicted by textual embeddings.
Thinking Strategy for State Transitions: An exclusive explicit "thinking" strategy is employed to ensure state transitions occur correctly. State changes (e.g., listening to speaking) are dictated by special tokens correlating with natural language processing, creating an internally responsive system.
Three-stage Training Regime: The training methodology introduces a systematic, staged protocol to solidify connections between model components and optimize performance using reinforcement learning adjustments to ensure the LLM can appropriately manage dialogue interruptions.
Figure 3: Three-stage training strategy for SALMONN-omni.

Experimental Evaluation

SALMONN-omni demonstrates substantial improvements over existing models in both full-duplex and half-duplex scenarios. In key evaluations like spoken QA and open-domain dialogue, it achieves a 35.9% relative performance increase against alternative duplex models.

Turn-Taking Success: Achieving a consistent high rate of success in predicting turn-taking moments showcases the model's adeptness at conversational flow management.
Figure 4: Spoken QA: SALMONN-omni can handle turn-taking in spoken question answering scenarios with "thinking" mechanism.
Barge-in and Backchanneling: Contextual tests highlight the model’s ability to discern between user interruptions and conversational cues, significantly improving user interaction quality.
Figure 5: Open-domain dialogue: SALMONN-omni can handle turn-taking in spoken question answering scenarios with "thinking" mechanism.
Reinforcement Learning Enhancements: Applying reinforcement learning to improve interaction modeling has shown to reduce interruption tendencies, refining overall dialogue engagement dynamics.

Conclusion

SALMONN-omni represents a significant advancement in the development of full-duplex conversational AI systems. Through its novel architecture and methodological advancements, it efficiently addresses the challenges of natural human-like interaction without reliance on audio codec injections, offering promising directions for future AI research in standalone speech models. Its robust performance and innovative training protocols support further exploration into dynamic, real-time conversational capabilities, solidifying SALMONN-omni as a pivotal development in full-duplex AI technologies.