SpeechLMs: Unified Speech and Text Modeling

Updated 4 February 2026

SpeechLMs are unified neural architectures that simultaneously process speech, text, and paralinguistic cues without intermediate text conversion.
They employ coupled, semi-decoupled, and fully decoupled tokenization strategies to enhance cross-modal alignment and reduce computational overhead.
Scalable training strategies, including warm initialization, joint fine-tuning, and reinforcement learning, drive improved performance in ASR, translation, and dialogue applications.

Speech LLMs (SpeechLMs) are foundational neural architectures that process, understand, and generate speech in an end-to-end manner, tightly coupling linguistic and paralinguistic information without the lossy intermediate text representation that characterizes classical ASR→LLM→TTS pipelines. Research in SpeechLMs has rapidly advanced from single-turn spoken question answering to highly multimodal, multi-turn, and instruction-following dialogue systems. Current designs support not only standard automatic speech recognition (ASR), speech-to-text translation (AST), speech-based question answering, and text-to-speech (TTS), but also emergent capabilities such as prosody control, speaker identity handling, and mixed-modal dialogue. SpeechLMs now underpin real-time digital agents, cross-modal understanding frameworks, and specialized applications such as medical consultation.

1. Core Architecture and Unified Modeling Paradigm

Modern SpeechLMs typically employ a decoder-only Transformer backbone—initialized from a text LLM—to autoregressively model sequences of discrete units representing speech, text, or jointly interleaved modalities. The unified modeling paradigm posits that any speech or text task can be cast as next-token prediction over a single, modular stream of tokens. Inputs may include:

Speech waveform, quantized into discrete tokens (semantic or acoustic).
Text tokens from a standard subword tokenizer.
Special control or task-identifying tokens.

The autoregressive model factorizes the probability of a token sequence $X = (x_1, \dots, x_T)$ as:

$p(X) = \prod_{t=1}^T p(x_t \mid x_{<t})$
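The factorization can be made concrete with a minimal sketch that sums per-token conditional log-probabilities. The vocabulary, the token names, and the uniform `uniform_model` are hypothetical placeholders for illustration; a real SpeechLM would obtain each conditional from the Transformer's softmax output.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Return log p(X) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])  # everything generated before step t
        total += math.log(cond_prob(token, prefix))
    return total

# Hypothetical mixed vocabulary: a task token, two speech units, one text token.
VOCAB = ["<asr>", "s_17", "s_42", "hello"]

def uniform_model(token, prefix):
    """Toy conditional that ignores the prefix (assumption, not a real model)."""
    return 1.0 / len(VOCAB)

tokens = ["<asr>", "s_17", "s_42", "hello"]
log_p = sequence_log_prob(tokens, uniform_model)
print(log_p)  # equals 4 * log(1/4) for the uniform toy model
```

Working in log space avoids numerical underflow for long sequences, which matters for speech because token streams are typically much longer than the corresponding text.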

Encoding for speech and text is harmonized by either embedding discrete speech units into the LLM’s token space, or via modality bridging adapters that project continuous speech features (such as from a Conformer, Whisper, or custom SSL encoder) to match LLM embedding dimensions (Peng et al., 2024, Lu et al., 2024, Tian et al., 21 Feb 2025, Qian et al., 27 Jul 2025).

A generic, modular, multi-stream approach is often used, where semantic, acoustic, and text tokens are delay-interleaved or otherwise multiplexed within the global context window (Tian et al., 21 Feb 2025, Tian et al., 21 Jun 2025).
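One common reading of delay interleaving is that stream k is shifted right by k steps, so that at each step the model emits one token per stream while later streams can condition on earlier ones. The sketch below assumes that pattern; the padding value and function names are illustrative and are not taken from the cited toolkits.

```python
PAD = -1  # hypothetical padding id for positions with no token yet

def delay_interleave(streams, pad=PAD):
    """Shift stream k right by k steps and return one frame per time step."""
    n_streams = len(streams)
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    frames = []
    for t in range(length + n_streams - 1):
        frame = []
        for k, stream in enumerate(streams):
            src = t - k  # position in stream k that lands on step t
            frame.append(stream[src] if 0 <= src < length else pad)
        frames.append(frame)
    return frames

def undo_delay(frames):
    """Invert delay_interleave, recovering the original aligned streams."""
    n_streams = len(frames[0])
    length = len(frames) - n_streams + 1
    return [[frames[t + k][k] for t in range(length)] for k in range(n_streams)]

semantic = [1, 2, 3]
acoustic = [10, 20, 30]
frames = delay_interleave([semantic, acoustic])
print(frames)  # [[1, -1], [2, 10], [3, 20], [-1, 30]]
```

The cost of the pattern is a few extra steps (one per additional stream), in exchange for keeping the sequence length close to that of a single stream.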

Modular components include:

Component	Purpose	Examples
Speech Tokenizer	Quantizes waveform to discrete units	HuBERT, EnCodec
Speech Encoder	Extracts high-level features	Conformer, Whisper
Modality Adapter	Maps speech features to LLM space	FFN, CNN, Q-former
LLM Backbone	Unified next-token modeling	Gemma, Llama, SmolLM2
Linear/LoRA Adaptation	Efficient adaptation without catastrophic forgetting	LoRA, Adapter layers
Vocoder/Decoder	Synthesizes waveform from tokens	HiFi-GAN, neural codec

This architecture allows SpeechLMs to condition on arbitrary interleavings of speech and text, unify training objectives, and flexibly target a wide range of tasks and modalities (Tian et al., 21 Feb 2025, Tian et al., 21 Jun 2025, Peng et al., 2024).
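As an illustration of the modality adapter row in the table above, the following sketch stacks consecutive encoder frames to shorten the sequence and then applies a single linear projection into the LLM embedding width. The dimensions, the stacking factor, and the random weights are assumptions chosen for readability; published adapters (FFN, CNN, Q-former) are more elaborate.

```python
import random

def stack_frames(features, k):
    """Concatenate every k consecutive frames, dropping a ragged tail."""
    usable = len(features) - len(features) % k
    return [sum(features[i:i + k], []) for i in range(0, usable, k)]

def linear(x, weight, bias):
    """Compute y = W x + b for one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def adapt(features, weight, bias, k):
    """Map encoder frames to LLM-sized embeddings at 1/k of the frame rate."""
    return [linear(f, weight, bias) for f in stack_frames(features, k)]

rng = random.Random(0)
ENC_DIM, LLM_DIM, K = 4, 6, 2  # hypothetical sizes, far smaller than real models
features = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(5)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(ENC_DIM * K)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

embeddings = adapt(features, weight, bias, K)
print(len(embeddings), len(embeddings[0]))  # 2 6
```

The resulting vectors can be placed in the input sequence alongside ordinary text-token embeddings, which is what allows the backbone to treat both modalities uniformly.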

2. Tokenization Strategies and Modality Bridging

The tokenization of speech is central to SpeechLM performance. Strategies fall into several classes:

Coupled Tokenization: A single codebook (codec, e.g., EnCodec) encodes both semantics and acoustics. This entangles linguistic and paralinguistic cues and complicates alignment with text (Fan et al., 14 Jun 2025).
Semi-Decoupled: Two-stage VQ tokenizers preserve partial separation—commonly, HuBERT provides semantic codes and residual codes capture acoustic details (Fan et al., 14 Jun 2025).
Fully Decoupled: Each frame yields distinct semantic and acoustic tokens—semantic tokens align isomorphically to text, while acoustic tokens capture speaker/timbre/prosody (Fan et al., 14 Jun 2025). This enables better cross-modal alignment, faster convergence, and improved speaker invariance.
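The two-stage idea behind semi-decoupled tokenization can be sketched with toy residual vector quantization: a first codebook picks a coarse (semantic) code, and a second codebook quantizes what remains. The two-dimensional codebooks below are invented for illustration; in the systems described above the first stage is derived from HuBERT rather than from hand-written vectors.

```python
def nearest(vector, codebook):
    """Return the index of the codebook entry closest in squared distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def two_stage_quantize(frame, semantic_cb, acoustic_cb):
    """Quantize a frame, then quantize the residual left by the first stage."""
    s = nearest(frame, semantic_cb)
    residual = [a - b for a, b in zip(frame, semantic_cb[s])]
    a = nearest(residual, acoustic_cb)
    return s, a

def reconstruct(s, a, semantic_cb, acoustic_cb):
    """Approximate the frame as semantic entry plus acoustic entry."""
    return [x + y for x, y in zip(semantic_cb[s], acoustic_cb[a])]

# Hypothetical toy codebooks.
semantic_cb = [[0.0, 0.0], [1.0, 1.0]]
acoustic_cb = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]

frame = [1.1, 0.9]
codes = two_stage_quantize(frame, semantic_cb, acoustic_cb)
approx = reconstruct(*codes, semantic_cb, acoustic_cb)
print(codes)  # (1, 1)
```

A language model can then consume only the first code when text alignment matters, and both codes when the waveform has to be resynthesized.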

Long-sequence compression is tackled by segment-to-token mapping via alignment-aware strategies. Techniques such as SSR-Connector segment and compress speech features to match text (e.g., using monotonic aligners or CTC/boundary detection) (Tan et al., 2024). Others, such as SyllableLM, leverage self-supervised boundary detection and iterative distillation (SylBoost) to form coarse, syllable-like semantic units at rates as low as 5 Hz, reducing sequence length and compute footprint while maintaining downstream modeling fidelity (Baade et al., 2024).
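Segment-to-token compression can be sketched as mean pooling of frame features between segment boundaries. Here the boundaries are given directly as a hypothetical list; in the methods cited above they would come from an aligner, CTC, or a learned boundary detector. The 50 Hz input rate is an assumption typical of HuBERT-style encoders, while the 5 Hz target is the figure quoted above.

```python
def pool_segments(frames, boundaries):
    """Mean-pool frames inside each [start, end) segment into one vector."""
    pooled = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled

# Ten toy frames with two features each.
frames = [[float(i), 1.0] for i in range(10)]
boundaries = [0, 3, 7, 10]  # hypothetical segment edges (frame indices)

units = pool_segments(frames, boundaries)
print(units)  # [[1.0, 1.0], [4.5, 1.0], [8.0, 1.0]]

# Sequence-length arithmetic under the assumed rates.
frame_rate_hz, unit_rate_hz = 50, 5
compression = frame_rate_hz / unit_rate_hz
print(compression)  # 10.0, i.e. a 30 s utterance shrinks from 1500 to 150 positions
```

Because self-attention cost grows with the square of sequence length, a tenfold shorter sequence reduces the attention term by roughly two orders of magnitude.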

Prosody encoding utilizes discrete, human-interpretable vectors per word (e.g., duration, F0 range/median/slope, energy) to embed nuanced prosodic control directly into the token stream, enabling content-prosody disentanglement and fine-grained expressivity (Qian et al., 27 Jul 2025).
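A minimal sketch of word-level prosody tokenization follows: each word's prosodic measurements are bucketed into discrete bins and emitted as tokens directly after the word. The feature subset, bin edges, and token naming are hypothetical and do not reproduce the scheme of the cited work.

```python
import bisect

# Hypothetical bin edges per feature; real systems would fit these to data.
BIN_EDGES = {
    "duration_s": [0.1, 0.2, 0.4],
    "f0_median_hz": [120.0, 180.0, 240.0],
    "energy_db": [-30.0, -20.0, -10.0],
}

def prosody_tokens(word_features):
    """Map one word's prosodic measurements to discrete bin tokens."""
    tokens = []
    for name, edges in BIN_EDGES.items():
        bin_index = bisect.bisect_right(edges, word_features[name])
        tokens.append(f"<{name}_{bin_index}>")
    return tokens

def interleave(words, features):
    """Emit each word followed by its prosody tokens in one stream."""
    stream = []
    for word, feats in zip(words, features):
        stream.append(word)
        stream.extend(prosody_tokens(feats))
    return stream

words = ["really", "great"]
features = [
    {"duration_s": 0.25, "f0_median_hz": 200.0, "energy_db": -15.0},
    {"duration_s": 0.05, "f0_median_hz": 100.0, "energy_db": -35.0},
]
stream = interleave(words, features)
print(stream)
```

Because the bins are human-interpretable, prosody can be edited at inference time by swapping a token, for example raising the duration bin of a single word, without touching the content tokens.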

3. Supervised and Transfer Learning Recipes

SpeechLMs exploit several pretraining and adaptation methodologies:

Cold Initialization: All model weights are trained from scratch using self-supervised objectives on raw speech units (masked language modeling, contrastive predictive coding) (Cuervo et al., 2024). This approach exhibits slow scaling and massive data requirements to achieve semantic competence comparable to text LLMs.
Warm Initialization (Transfer from Text LMs): Pretrained LLM backbones are adapted by swapping in speech tokenizers and reinitializing only the embedding layer (TWIST), or by inserting low-rank (LoRA) adapters in every linear layer (Hassid et al., 2023, Peng et al., 2024). Warm init accelerates convergence and upholds scaling properties and data efficiency.
Single-Stage Joint Fine-Tuning: Interleaving heterogeneous data types (text-only, ASR, AST, speech-based QA, mixed speech/text) in a single training loop combats catastrophic forgetting and preserves original LLM performance on text tasks while adding speech competence (Peng et al., 2024). This holistic SFT recipe is critical for emergent multimodal capabilities.
Multi-Stage and Modality-Aligned Training: Some approaches decouple capability injection (large-scale text SFT on task-specific datasets) and modality realignment (modest speech SFT with paired data), which is highly efficient for domain adaptation (e.g., medical consultation) with minimal speech data requirements (Chen et al., 8 Jan 2026).
Reinforcement Learning for Alignment: Exposure bias and inter-modal discrepancy are mitigated by RL algorithms such as Direct Preference Optimization (DPO), leveraging synthetic teacher-generated data, with reward models scoring the instruction-following fidelity (Liu et al., 25 Aug 2025).
Descriptive Speech-Text Alignment: Speech captioning datasets, generated via LLMs from speech metadata, align complex speech features with their text descriptions, enhancing generalization and zero-shot performance on instruction-following and paralinguistic tasks (Lu et al., 2024).
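For the preference-based alignment step in the list above, the standard DPO objective for a single preference pair can be written in a few lines. The inputs are sequence log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model; the numeric values and `beta=0.1` are illustrative assumptions, not settings from the cited work.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities."""
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    return neg_log_sigmoid(beta * (chosen_gain - rejected_gain))

# Hypothetical log-probabilities for one (chosen, rejected) pair.
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0, beta=0.1)
print(loss)

# With identical policy and reference the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls below log 2 only when the policy has shifted probability toward the chosen response relative to the reference, which is the behavior the preference data is meant to reinforce.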

Synthetic interleaving of speech-text spans, either via TTS-augmented web text or span-masked token replacements, greatly boosts cross-modal representation learning and coverage of rare domains (Zeng et al., 2024, Udandarao et al., 22 Oct 2025).
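Span-level interleaving can be sketched as follows: a text document is cut into fixed-length spans, and each span is replaced by speech tokens with some probability. The `fake_speech_tokens` function is a hypothetical stand-in for a TTS system followed by a speech tokenizer, and the span length, ratio, and `<speech>` markers are assumptions for illustration.

```python
import random

def fake_speech_tokens(words):
    """Hypothetical stand-in for TTS + speech tokenizer: two units per word."""
    return [f"<s_{(sum(map(ord, w)) + i) % 100}>" for w in words for i in range(2)]

def interleave_spans(words, span_len, speech_ratio, rng):
    """Replace each span of words with speech units with probability speech_ratio."""
    out = []
    for start in range(0, len(words), span_len):
        span = words[start:start + span_len]
        if rng.random() < speech_ratio:
            out.append("<speech>")
            out.extend(fake_speech_tokens(span))
            out.append("</speech>")
        else:
            out.extend(span)
    return out

words = "the quick brown fox jumps over the lazy dog".split()
mixed = interleave_spans(words, span_len=3, speech_ratio=0.5, rng=random.Random(0))
print(mixed)
```

Training on such sequences exposes the model to both text-to-speech and speech-to-text transitions within a single context window, without requiring naturally occurring parallel data.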

4. Data Curation, Scaling, and Efficiency

Empirical findings indicate that both model scale and data scale are paramount for strong linguistic and semantic competence in SpeechLMs. Key principles include:

Scaling Laws: SpeechLMs obey “Chinchilla-style” power-law scaling in loss with respect to parameter count and token count (exponents ≈0.25, slower than text LLMs at ≈0.35). To match LLM-level semantic accuracy, one to two orders of magnitude more compute and data are necessary for speech-only pretraining (Cuervo et al., 2024).
Data-Centric Curation: Fine-grained chunking (avoiding merged speaker segments), deterministic alternation of text/speech tokens, and domain-aware synthetic data construction (QA-format, rich in reasoning) each contribute quantifiable performance gains: more than 7 pp on spoken QA (SQA), 1 pp on alignment, and up to 10 pp over models three times larger (Udandarao et al., 22 Oct 2025).
Synthetic Data Generation: Large-scale TTS-backed synthetic corpora boost coverage of knowledge, reasoning, and rare speech/event types. Span-masked synthetic interleaving achieves coverage at scale in the absence of large parallel datasets (Zeng et al., 2024).
Tokenization Granularity: Coarse tokenization (syllable-level) drastically reduces training compute and wall-clock inference time, with only modest hits to semantic accuracy; overly coarse units (unigram, 5k SentencePiece) degrade performance and saturate quickly on semantic tasks (Baade et al., 2024, Cuervo et al., 2024).
Efficiency-Driven Engineering: Modular toolkits (ESPnet-SpeechLM, OpusLM) automate preprocessing, multi-stream tokenization, distributed training, and evaluation, ensuring reproducibility and open-source accessibility (Tian et al., 21 Feb 2025, Tian et al., 21 Jun 2025).
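The practical meaning of the scaling exponents quoted above can be illustrated with simple arithmetic. The sketch assumes a pure power law, loss proportional to C^(-α) for a resource C (parameters, tokens, or compute), and ignores any irreducible loss term; it therefore shows only the relative effect of the exponent and does not reproduce the fitted constants behind the "one to two orders of magnitude" estimate.

```python
def scale_factor_for_loss_ratio(loss_ratio, exponent):
    """Multiplier on resource C needed to scale loss by loss_ratio, if L ∝ C^(-exponent)."""
    return loss_ratio ** (-1.0 / exponent)

# Resource multiplier needed to halve the (reducible) loss.
speech = scale_factor_for_loss_ratio(0.5, 0.25)  # exponent quoted for SpeechLMs
text = scale_factor_for_loss_ratio(0.5, 0.35)    # exponent quoted for text LLMs

print(speech)  # 16.0
print(round(text, 2))
print(round(speech / text, 2))  # how much more the speech model needs per halving
```

Under these assumptions each halving of the loss costs the speech model more than twice the additional resources that a text model would need, and the gap compounds with every further halving.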

Table: Impact of Data and Model Choices on SQA Benchmark (from Udandarao et al., 22 Oct 2025)

Model	Size (B)	Text+Speech SQA (%)	Δ vs Best Baseline (pp)
Qwen-Audio	8.4	40.7	—
Kimi-Audio	10.5	41.6	—
SpeLangy	3.8	51.8	+10.2

5. Evaluation Metrics, Capabilities, and Benchmarks

SpeechLM evaluation encompasses a broad suite of metrics:

ASR Quality: Word error rate (WER) and character error rate (CER) on standardized datasets (LibriSpeech, CommonVoice, FLEURS).
Speech Translation: BLEU on speech-to-text and speech-to-speech translation tasks (FLEURS, CoVoST2).
Spoken QA and Reasoning: Accuracy on spoken QA tasks (Spoken-Web-Questions, LlamaQ, TriviaQA, StoryCloze), assessed by log-likelihood ranking, GPT-4 scoring, or human evaluation (Zeng et al., 2024, Udandarao et al., 22 Oct 2025, Peng et al., 2024).
Mixed-Modal and Emergent Reasoning: Multi-turn, mixed-modal IFEval strict accuracy, robustness to interleaved/unseen prompts, instruction-following, and generalization to novel output formats or language pairs (Peng et al., 2024).
Speech Synthesis Quality: MOS (Mean Opinion Score), UTMOS, speaker similarity, style transfer accuracy, and prosody control in TTS or SVS settings (Qian et al., 27 Jul 2025, Zhao et al., 16 Dec 2025).
Cross-modal and Zero-/Few-Shot: StoryCloze and MMLU for zero-shot generalization; cross-modal success on S→T, T→S, as well as downstream human-judged metrics (Lu et al., 2024).
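Of the metrics listed above, WER is the simplest to compute directly: it is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below uses whitespace tokenization and no text normalization, which is an assumption; benchmark implementations usually normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333: one substitution and one deletion over six words
```

CER follows the same recurrence over characters instead of words. Note that WER can exceed 1.0 when the hypothesis contains many insertions.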

6. Specialized Applications, Limitations, and Open Challenges

Specialized SpeechLMs: Domain adaptation with extremely limited in-domain speech (10k samples) is feasible via two-stage adaptation (injection of knowledge via text, then modality realignment via limited speech), achieving superior benchmark performance in medical consultation scenarios (Chen et al., 8 Jan 2026).
Paralinguistic and Prosodic Reasoning: Emergent abilities to manipulate and interpret prosody, emotion, or speaker-specific information are achieved using explicit prosody-tokenization or speaker-aware conditioning; hybrid semantic/acoustic modeling enables robust style and voice transfer (Qian et al., 27 Jul 2025, Fan et al., 14 Jun 2025).
Bias and Fairness: Integrated SpeechLMs may exhibit unintended gender differentiation and stereotype reinforcement, largely inherited from upstream speech encoders (e.g., Whisper), even after neutralization and backbone isolation. Diagnostic analyses reveal that current pipelines overprioritize general fairness at the expense of context-appropriate personalization (Choi et al., 25 Sep 2025).
Streaming and Deployment: Highly efficient serving and streaming technologies (VoxServe) abstract model execution, decouple inference pipeline stages, and optimize for low-latency, high-throughput real-time human–AI interaction (Kamahori et al., 30 Jan 2026).
Open Problems: End-to-end differentiable training (from waveform to waveform), optimized tokenization tradeoffs, cross-lingual and low-resource generalization, multi-modal grounding (vision, video), and robust RLHF tuning remain open. Scaling law disparities between speech and text LMs pose resource constraints for pure speech-first models, making warm-start or hybrid approaches the pragmatic path forward (Cuervo et al., 2024).

Speech LLMs have evolved from basic next-unit prediction over speech tokens to unified, cross-modal frameworks supporting highly nuanced mixed-modality conversational AI. Advances in model architecture, tokenization, scaling, and data curation have yielded systems capable of simultaneous excellence on ASR, speech translation, spoken QA, instruction following, and prosody generation, while highlighting new challenges in fairness, bias, efficiency, and multi-modality. The field is characterized by an interplay of deep architectural innovations, data-centric best practices, and rigorous benchmarking, setting the agenda for future foundational models and end-to-end multimodal AI (Peng et al., 2024, Tan et al., 2024, Lu et al., 2024, Tian et al., 21 Feb 2025, Fan et al., 14 Jun 2025, Udandarao et al., 22 Oct 2025, Liu et al., 25 Aug 2025, Zhao et al., 16 Dec 2025, Qian et al., 27 Jul 2025).