
LLaMA-Omni: Real-Time Modular Speech AI

Updated 22 February 2026
  • LLaMA-Omni is a family of modular speech-language models designed for end-to-end, low-latency speech interaction with direct speech-to-speech capability.
  • It integrates pretrained speech encoders, lightweight adaptors, and frozen LLM backbones with a streaming decoder to enable sub-second response times.
  • LLaMA-Omni 2 scales from 0.5B to 14B parameters and demonstrates competitive ASR-WER and latency performance, advancing real-time conversational AI.

LLaMA-Omni is a family of end-to-end modular speech-LLMs (SpeechLMs) designed for real-time, low-latency, and high-quality spoken interaction with LLMs. Progressing from the original LLaMA-Omni, built atop Llama-3.1-8B-Instruct, to LLaMA-Omni 2, which leverages Qwen2.5-Instruct architectures and supports model sizes from 0.5B to 14B parameters, these systems are distinguished by direct speech-to-speech capability, synchronous streaming inference, and competitive conversational performance using only modestly sized training corpora (Fang et al., 2024, Fang et al., 5 May 2025).

1. Model Family Overview and Objectives

LLaMA-Omni and its successor, LLaMA-Omni 2, are engineered to facilitate seamless and efficient audio-based conversational agents. Unlike traditional cascaded ASR→LLM→TTS systems, both iterations process speech input and emit both text and speech responses in a unified, streaming fashion. Key innovations include modularity — integrating pretrained speech encoders (Whisper-large-v3), lightweight trainable adaptors, frozen open-source LLM backbones (Llama-3.1-8B-Instruct, Qwen2.5-Instruct of varying sizes), and streaming speech decoders that obviate the need for intermediate speech transcription. LLaMA-Omni models are designed for user-facing, low-latency interaction with sub-second end-to-end latency and robust real-time performance on spoken language comprehension and generation benchmarks (Fang et al., 2024, Fang et al., 5 May 2025).

2. Architecture and Inference Design

LLaMA-Omni employs the following sequential modules:

  • Speech Encoder ($\mathcal{E}$): A frozen Whisper-large-v3 encoder transforms the raw waveform $X^S$ into a sequence of hidden states $H = [h_1, \ldots, h_N]$.
  • Speech Adaptor ($\mathcal{A}$): Down-samples $H$ by concatenating every $k$ frames ($k=5$), followed by a two-layer MLP with ReLU activation: $S = \mathcal{A}(H) = \mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(H')))$, where $H' = \mathrm{DownSample}(H)$ has length $\lfloor N/k \rfloor$.
  • LLM Backbone ($\mathcal{M}$): A pretrained Llama-3.1-8B-Instruct model receives prompt templates with inserted speech embeddings $S$ and autoregressively generates text tokens $y^T$ under the cross-entropy objective:

$$\mathcal{L}_{\mathrm{LLM}} = -\sum_{i=1}^{M} \log P\!\left(y^T_i \,\middle|\, P(S),\, y^T_{<i}\right)$$

  • Streaming Speech Decoder ($\mathcal{D}$): A two-layer non-autoregressive Transformer (hidden dim 4096, feed-forward dim 11008, 32 heads) for speech unit generation. It operates on upsampled LLM hidden states $\hat{C}$ and outputs logits over $K+1$ clusters (HuBERT-based, $K=1000$), trained with a CTC loss:

$$\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{A \in \beta^{-1}(Y^U)} \prod_{i=1}^{\lambda M} P(a_i \mid O)$$

During inference, each new text token $c_i$ triggers parallel decoding of $\lambda$ speech units. When the accumulated units reach the chunk size $\Omega$, they are immediately synthesized by a vocoder, enabling synchronous streaming speech output. This token-level streaming architecture delivers response latencies as low as 226 ms at $\Omega = 10$, outperforming non-modular and cascaded systems (Fang et al., 2024).
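
The adaptor's downsample-and-project step can be sketched in a few lines of NumPy. Only the $k=5$ frame concatenation and the Linear–ReLU–Linear structure come from the paper; the weights and dimensions below are random stand-ins for illustration:

```python
import numpy as np

def downsample(H, k=5):
    """Concatenate every k consecutive frames: (N, d) -> (N // k, k * d)."""
    N, d = H.shape
    N_trunc = (N // k) * k                  # drop trailing frames that don't fill a group
    return H[:N_trunc].reshape(N // k, k * d)

def adaptor(H, W1, b1, W2, b2, k=5):
    """S = Linear2(ReLU(Linear1(DownSample(H)))), as in the LLaMA-Omni adaptor."""
    Hp = downsample(H, k)
    return np.maximum(Hp @ W1 + b1, 0) @ W2 + b2

# Toy shapes: 50 encoder frames of dim 8, projected to a hypothetical LLM dim of 16.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 8))
W1, b1 = rng.normal(size=(40, 32)), np.zeros(32)   # 40 = k * 8 after concatenation
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
S = adaptor(H, W1, b1, W2, b2)
print(S.shape)  # (10, 16): sequence length reduced by k=5
```

The down-sampling is what makes the speech sequence short enough to prepend to the LLM prompt at reasonable cost.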

LLaMA-Omni 2 introduces further architectural refinements:

  • Built on Qwen2.5-Instruct, with parameter scales from 0.5B to 14B.
  • Speech understanding uses Whisper-large-v3 encoders and a two-layer feed-forward adapter (hidden size 2048), with downsampling by $k=5$.
  • Output employs an autoregressive streaming decoder with a semantic speech tokenizer using finite scalar quantization (FSQ) over SenseVoice-Large, mapping mel-spectrogram frames to discrete tokens $y^U_i \in \{0, \ldots, 6560\}$ at 25 tokens/sec.
  • The TTS model $\mathcal{M}_{\mathrm{TTS}}$ is a Transformer initialized from Qwen2.5-0.5B. Gate fusion combines LLM hidden states and embedded ground-truth tokens:

$$\mathbf{g}_i = \sigma\!\left(W_g \left[\mathbf{e}_i^{\mathrm{hidden}} \,\Vert\, \mathbf{e}_i^{\mathrm{emb}}\right] + b_g\right)$$

$$\mathbf{c}_i = \mathbf{g}_i \odot \mathbf{e}_i^{\mathrm{hidden}} + (1 - \mathbf{g}_i) \odot \mathbf{e}_i^{\mathrm{emb}}$$

  • The “Read-$\mathcal{R}$/Write-$\mathcal{W}$” streaming strategy interleaves text generation and speech token emission for efficient, causal output (Fang et al., 5 May 2025).
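
Because the sigmoid gate lies in $(0, 1)$, gate fusion is a per-dimension convex combination of the two embeddings. A minimal NumPy sketch (dimensions and weights are illustrative, not the model's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fusion(e_hidden, e_emb, W_g, b_g):
    """c = g * e_hidden + (1 - g) * e_emb, with g = sigmoid(W_g [e_hidden || e_emb] + b_g)."""
    x = np.concatenate([e_hidden, e_emb], axis=-1)   # [e^hidden || e^emb]
    g = sigmoid(x @ W_g + b_g)                        # per-dimension gate in (0, 1)
    return g * e_hidden + (1 - g) * e_emb

rng = np.random.default_rng(0)
d = 8                                                 # toy embedding dim
e_hidden, e_emb = rng.normal(size=d), rng.normal(size=d)
W_g, b_g = rng.normal(size=(2 * d, d)), np.zeros(d)
c = gate_fusion(e_hidden, e_emb, W_g, b_g)
print(c.shape)  # (8,)
```

Each output coordinate is guaranteed to lie between the corresponding coordinates of the two inputs, which is what lets the gate smoothly trade off LLM context against the token embedding.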

3. Training Protocols and Data Construction

LLaMA-Omni utilizes the InstructS2S-200K dataset, comprising 200K speech instructions and corresponding responses:

  • 50K prompts from Alpaca and 150K from UltraChat (single-turn).
  • Instructions are rewritten by Llama-3-70B-Instruct to inject natural speech fillers and verbalized numbers.
  • TTS synthesis uses CosyVoice-300M-SFT for instructions (varied genders) and VITS-LJSpeech for responses.
  • Speech feature quantization leverages HuBERT with $K$-means clustering ($K=1000$) to produce discrete units for the speech decoder.
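
Unit extraction amounts to nearest-centroid assignment over HuBERT features. A toy sketch with 2-D features and three centroids standing in for $K=1000$; collapsing consecutive duplicate units is a common post-processing step in unit-based pipelines and is assumed here:

```python
import numpy as np

def quantize_units(features, centroids):
    """Assign each frame to its nearest K-means centroid (HuBERT-style discrete units)."""
    # Squared Euclidean distance from every frame to every centroid: shape (T, K)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def dedup(units):
    """Collapse runs of consecutive identical units."""
    out = [units[0]]
    for u in units[1:]:
        if u != out[-1]:
            out.append(u)
    return out

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
feats = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [2.1, 1.9]])
units = quantize_units(feats, centroids)
print(dedup(units.tolist()))  # [0, 1, 2]
```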

A two-stage training protocol is applied:

  1. Speech-to-Text (S2T): Train the speech adapter and LLM with cross-entropy objective; keep encoder and decoder frozen.
  2. Text-to-Speech (T2S): Train the speech decoder using the CTC objective; all other components frozen.

Training on 4×NVIDIA L40 GPUs requires ≈65 hours (<3 days) (Fang et al., 2024).

LLaMA-Omni 2 expands data synthesis:

  • 200K multi-turn dialogues generated from Alpaca & UltraChat. Each dialogue comprises 1–5 turns (sampled $N \sim \mathrm{Poisson}(2)$), expanded using Llama-3.3-70B-Instruct.
  • Instruction turns employ randomly selected speaker voices (“seed prompt” via fish-speech-1.5, cloned in CosyVoice2-0.5B); response turns use a single target voice.
  • No external ASR or TTS corpora beyond these synthesized dialogues.
  • Training occurs in two stages: S2T (cross-entropy, adapter+LLM) and TTS (cross-entropy on speech tokens, with and without gate fusion), all under streaming settings (Fang et al., 5 May 2025).
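
The turn-count sampling can be sketched as follows. Clipping $N$ into the stated 1–5 range is an assumption on our part; the source gives the range and the Poisson(2) distribution but not the exact truncation scheme:

```python
import math
import random

def sample_num_turns(lam=2.0, lo=1, hi=5, rng=random):
    """Sample N ~ Poisson(lam) via Knuth's method, clipped to [lo, hi] turns."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= L:
            break
    n = k - 1
    return min(max(n, lo), hi)

random.seed(0)
counts = [sample_num_turns() for _ in range(10000)]
print(min(counts), max(counts))  # both endpoints lie in [1, 5] by construction
```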

4. Performance Evaluation and Benchmarks

LLaMA-Omni models are evaluated with ChatGPT (GPT-4o) ratings of response content and style (1–5), ASR-WER/CER for speech–text alignment, UTMOS (predicted MOS) for speech naturalness, and end-to-end decoding time.

LLaMA-Omni main results on InstructS2S-Eval:

| Model | S2TIF content/style | S2SIF content/style | ASR-WER (%) | UTMOS | Latency (ms) |
|---|---|---|---|---|---|
| SpeechGPT | 2.59 / 3.15 | 1.58 / 1.81 | 47.62 | — | — |
| SALMONN + TTS | 2.57 / 2.79 | 2.46 / 2.84 | 21.77 | — | — |
| Qwen2-Audio + TTS | 2.73 / 2.64 | 2.32 / 2.58 | 55.72 | — | — |
| LLaMA-Omni | 3.23 / 3.81 | 2.69 / 3.12 | 11.61 | — | 226 |

Decoding time: S2TIF = 1.49s, S2SIF = 1.92s for LLaMA-Omni, whereas S2SIF time for SpeechGPT is 25.60s.

LLaMA-Omni 2 highlights:

| Model | LlamaQ S2T | LlamaQ S2S | WebQ S2T | WebQ S2S | ChatGPT S2T | ChatGPT S2S | ASR-WER | UTMOS | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 64.7 | 50.7 | 32.2 | 15.9 | 4.16 | 4.09 | 9.02 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 67.7 | 49.0 | 33.4 | 23.7 | 3.99 | 3.52 | 5.95 | 3.67 | 346.7 |
| Omni2-7B | 70.3 | 60.7 | 34.5 | 31.3 | 4.28 | 4.15 | 3.26 | 4.19 | 582.9 |
| Omni2-14B | 73.0 | 62.7 | 40.4 | 37.1 | 4.56 | 4.35 | 3.89 | 4.20 | 663.3 |

The S2S drop (S2T→S2S) is significantly reduced in LLaMA-Omni 2 compared to earlier systems. UTMOS scores approach 4.2 in streaming mode, with mean latencies well below 1s, substantially outperforming GLM-4-Voice and LLaMA-Omni (Fang et al., 5 May 2025).

5. Latency and Real-Time Streaming Mechanisms

LLaMA-Omni's latency is measured from the input endpoint to the playback of the first response frame. For chunk size $\Omega = 10$, measured latency is 226 ms, corresponding to a lag of approximately 1.82 text words. Adjusting $\Omega$ shifts the trade-off between latency and UTMOS: lower $\Omega$ yields lower latency and better ASR alignment, while higher $\Omega$ increases perceived speech naturalness at the cost of lag (Fang et al., 2024).
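
The chunk-based emission described above amounts to buffering decoded units and flushing a chunk once $\Omega$ accumulate. A minimal sketch, with a list standing in for the vocoder synthesis call:

```python
def stream_speech(unit_stream, omega=10):
    """Buffer discrete speech units; flush a chunk to the vocoder once omega accumulate."""
    buffer, chunks = [], []
    for unit in unit_stream:
        buffer.append(unit)
        if len(buffer) == omega:          # chunk boundary reached
            chunks.append(buffer)         # stand-in for vocoder synthesis + playback
            buffer = []
    if buffer:                            # flush any trailing partial chunk at end of turn
        chunks.append(buffer)
    return chunks

chunks = stream_speech(range(25), omega=10)
print([len(c) for c in chunks])  # [10, 10, 5]
```

Smaller `omega` means the first chunk reaches the vocoder sooner (lower latency), while larger `omega` gives the vocoder more context per synthesis call (higher naturalness), mirroring the trade-off above.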

LLaMA-Omni 2’s streaming TTS protocol employs a Read-$\mathcal{R}$/Write-$\mathcal{W}$ strategy: after reading $\mathcal{R}$ text tokens, the model writes (emits) $\mathcal{W}$ speech tokens, enabling parallel LLM and TTS decoding. For Omni2-7B at $\mathcal{R}=3$, $\mathcal{W}=10$, the measured stage latencies are:

  • $T_{\mathrm{LLM}}(3) = 231.2$ ms
  • $T_{\mathrm{TTS}}(10) = 165.8$ ms
  • $T_{\mathrm{FM+Voc}} = 185.9$ ms
  • $T_{\mathrm{total}} \approx 582.9$ ms

Even the smallest (0.5B) model achieves approximately 543 ms. This synchronous streaming mechanism ensures that the response can be rendered in real time, with latency that is competitive with or superior to proprietary LLM-based agents (Fang et al., 5 May 2025).
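
The reported total is, to within rounding, just the sum of the three pipeline stages (we read "FM+Voc" as the token-to-waveform stage, e.g. a flow-matching model plus vocoder, though the abbreviation is not expanded here):

```python
# Reported per-stage first-chunk latencies for Omni2-7B at R=3, W=10 (milliseconds).
t_llm = 231.2      # LLM reads R=3 text tokens
t_tts = 165.8      # streaming TTS emits W=10 speech tokens
t_fm_voc = 185.9   # speech-token-to-waveform stage ("FM+Voc")
total = t_llm + t_tts + t_fm_voc
print(round(total, 1))  # 582.9
```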

6. Ablation Insights and Trade-offs

Empirical ablation in LLaMA-Omni demonstrates that two-stage training, separating speech comprehension (S2T) from speech generation (T2S), stabilizes learning. The non-autoregressive streaming decoder, interfaced via CTC, introduces negligible computational overhead (1.28× vs. text-only decoding), whereas traditional cascaded pipelines incur up to a 6× slowdown. The chunk size $\Omega$ serves as an effective control variable for balancing latency against naturalness. The simple speech adaptor design (down-sampling plus a lightweight MLP, with no SLA/LoRA) suffices to align Whisper features with Llama/Qwen embeddings (Fang et al., 2024, Fang et al., 5 May 2025).

7. Limitations and Future Directions

LLaMA-Omni 2 currently provides speech output in a uniform voice, lacking control over paralinguistic qualities such as emotion, prosody, or dialect, irrespective of input speech style. Planned research directions include explicit paralinguistic style conditioning and annotation, as well as fine-tuning the end-to-end pipeline for broader expressive and interactive capabilities. Known risks inherent to underlying LLMs — including potential factual errors or hallucinations — also apply, which renders output validation essential for deployment in safety-critical scenarios (Fang et al., 5 May 2025).

References

  • Fang et al., 2024. LLaMA-Omni: Seamless Speech Interaction with Large Language Models.
  • Fang et al., 5 May 2025. LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis.
