
Textless Spoken Language Modeling

Updated 17 February 2026
  • Textless Spoken Language Modeling is a paradigm that converts continuous speech into discrete tokens using advanced audio quantization and self-supervised feature extraction techniques.
  • It employs autoregressive and masked language modeling strategies to predict token sequences, achieving lower word error rates and improved linguistic coherence.
  • The approach supports unsupervised learning, cross-lingual processing, and speaker-aware synthesis while addressing challenges in codebook scalability and token alignment.

Textless Spoken Language Modeling (SLM) encompasses a paradigm in which raw speech is modeled directly as a sequence of discrete tokens without reference to text transcriptions or written language resources. Unlike hybrid speech–text systems or speech-aware textual LMs, textless SLMs are characterized by operating entirely within the speech domain—using units derived from audio quantization as the atomic elements of sequence modeling. This approach enables unsupervised and cross-lingual spoken language processing and is critical for expanding language modeling to unwritten and low-resource languages (Arora et al., 11 Apr 2025).

1. Tokenization: Discrete Speech Units and Quantization

Textless SLMs are premised on the transformation of a continuous speech waveform $x \in \mathbb{R}^T$ into a finite sequence of discrete tokens $v = (v_1, \ldots, v_N)$, $v_t \in V$, with $|V|$ denoting the quantization codebook size. The tokenization process involves:

  • Feature Extraction: A self-supervised encoder such as HuBERT, wav2vec 2.0, or a neural codec (SoundStream, EnCodec) generates dense frame-level or segment-level representations $h = E(x)$ (Arora et al., 11 Apr 2025).
  • Quantization:
    • k-means on SSL features creates "phonetic tokens" by clustering HuBERT or wav2vec features into $|V| \approx 50$–$1000$ classes; used in GSLM, TWIST (Arora et al., 11 Apr 2025).
    • Vector-Quantized VAEs (VQ-VAE), Residual Vector Quantization: multi-level codebooks (with $|V|$ entries per level) optimized for waveform reconstruction, as in AudioLM, UniAudio (Arora et al., 11 Apr 2025).
    • ASR-based quantization: framewise assignment using posteriors from ASR encoders, e.g., WhisperSpeech (Arora et al., 11 Apr 2025).
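As a concrete illustration of the k-means route above, the sketch below assigns each frame-level feature vector to its nearest cluster centroid and collapses consecutive repeats, as is commonly done in GSLM-style pipelines. The feature matrix and centroids are random stand-ins for real HuBERT features and trained k-means centers.

```python
import numpy as np

def tokenize(features, centroids, dedup=True):
    """Map frame features (N, D) to discrete unit IDs via nearest centroid."""
    # Squared Euclidean distance from every frame to every centroid: (N, |V|)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)
    if dedup:
        # Collapse runs of identical units (GSLM-style run-length deduplication)
        keep = np.r_[True, tokens[1:] != tokens[:-1]]
        tokens = tokens[keep]
    return tokens

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))    # stand-in for SSL frame features
codebook = rng.normal(size=(8, 16))  # stand-in for |V| = 8 k-means centers
units = tokenize(feats, codebook)
assert units.ndim == 1 and units.max() < 8
```

In practice the centroids come from k-means fitted on large unlabeled corpora, and the deduplication step is optional depending on the downstream LM.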

A typical loss for VQ-VAE quantization is $L_\text{quant} = \|E(x) - \text{Quant}(E(x))\|^2 + \beta \sum_{v} \|\mathrm{sg}[E(x)] - e_v\|^2$, where $e_v$ is a codebook vector and $\mathrm{sg}$ is the stop-gradient operator (Arora et al., 11 Apr 2025).
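A minimal numerical sketch of this quantization step and loss, assuming the commitment term is taken over the codebook entries actually selected for each frame (numpy has no autodiff, so the stop-gradient is noted only in comments):

```python
import numpy as np

def vq_quant_loss(enc, codebook, beta=0.25):
    """Nearest-neighbour quantization plus the two VQ loss terms.

    enc:      encoder outputs E(x), shape (N, D)
    codebook: embedding vectors e_v, shape (|V|, D)
    In an autodiff framework sg[.] is a stop-gradient; numerically it is
    the identity, so the two terms coincide here up to the beta weight.
    """
    dists = ((enc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)              # Quant(E(x)) indices
    quantized = codebook[idx]
    recon_term = ((enc - quantized) ** 2).sum()          # ||E(x) - Quant(E(x))||^2
    commit_term = beta * ((enc - quantized) ** 2).sum()  # beta * ||sg[E(x)] - e_v||^2
    return idx, recon_term + commit_term

rng = np.random.default_rng(1)
idx, loss = vq_quant_loss(rng.normal(size=(10, 4)), rng.normal(size=(6, 4)))
assert idx.shape == (10,) and loss > 0
```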

Empirical results show that discretization is essential: it strips away speaker/channel variation and forces the Transformer to focus on linguistic content. Discrete-unit models outperform continuous-feature models on lexical and syntactic metrics, as demonstrated by HuBERT-based SLMs on Zero Resource Speech Challenge metrics (e.g., sWUGGY = 83.29 and sBLIMP = 61.93 for 500 clusters, vs. 60.56 and 53.33 for continuous features) (Nguyen et al., 2022).

2. LLM Architectures and Sequence Modeling

After tokenization, textless SLMs use sequence models to learn the probability of token sequences $P(v_1, \ldots, v_N)$, predominantly via:

  • Autoregressive Transformer LMs: Each token is predicted conditioned on all prior tokens: $P(v_1, \ldots, v_N) = \prod_{t=1}^N P(v_t \mid v_{<t}; \theta)$. The standard objective is the negative log-likelihood $L_{LM} = -\sum_{t=1}^N \log P(v_t \mid v_{<t}; \theta)$ (Arora et al., 11 Apr 2025).
  • Masked/non-autoregressive LMs: Employ MaskGIT-style objectives for faster generation by predicting masked tokens in parallel and refining via several passes (e.g., SoundStorm) (Arora et al., 11 Apr 2025).
  • Multi-task/Contrastive Extensions: Models such as pGSLM predict both phonetic and prosodic streams in a multi-head setup; encoder-stage contrastive pretraining (e.g., CPC, wav2vec 2.0) is also employed (Arora et al., 11 Apr 2025).
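The autoregressive objective can be made concrete with a toy example: given the model's per-step distributions over a small unit vocabulary, $L_{LM}$ is the sum of negative log-probabilities of the observed tokens. The distributions below are illustrative stand-ins for Transformer outputs, not real model predictions.

```python
import numpy as np

def nll(probs, tokens):
    """L_LM = -sum_t log P(v_t | v_<t); probs is (N, |V|), where row t is
    the model's distribution over the next unit given the prefix v_<t."""
    return -np.log(probs[np.arange(len(tokens)), tokens]).sum()

# Toy next-unit distributions over |V| = 4 units for a 3-token sequence
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.1, 0.1, 0.1, 0.7]])
tokens = np.array([0, 2, 3])
loss = nll(probs, tokens)
assert np.isclose(loss, -(np.log(0.7) + np.log(0.25) + np.log(0.7)))
```

Token-level perplexity (Section 4) is then simply `np.exp(loss / len(tokens))`.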

Certain advanced models (e.g., Flow-SLM) extend the architecture to joint generation of semantic tokens and frame-level continuous acoustic vectors using flow-matching objectives, overcoming the limitations of strict two-stage token–vocoder pipelines (Chou et al., 12 Aug 2025).

3. Training Methodologies and Optimization Paradigms

Primary training strategies in textless SLM include:

  • Self-supervised Pre-training: LMs are trained on large unlabeled corpora (LibriLight, VoxPopuli), seeking to predict the next quantized token in speech without text transcripts (Arora et al., 11 Apr 2025).
  • Continual Pretraining/Domain Adaptation: Fine-tuning pre-existing LMs from text or speech domains to target sets or languages; curriculum learning supports adaptation to low-resource languages or specific domains by continued prediction training (Arora et al., 11 Apr 2025).
  • Post-training via Preference Optimization: Align-SLM introduces direct preference optimization (DPO), using AI feedback (e.g., LLM-based ratings on generated speech) to select semantically superior continuations and finetune the LM, thus improving semantic coherence and topic relevance over vanilla likelihood-trained models (Lin et al., 2024).

Multi-token prediction (MTP) addresses the mismatch between dense speech token rates (hundreds per second) and text rates (~20 Hz) by enabling a single hidden state to decode $g$ tokens in parallel, reducing decoding steps (up to 12× speedup) and halving word error rates (WER from 6.07% to 3.01% in FACodec MTP-12H) (Fan et al., 14 Jun 2025).
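The step-count arithmetic behind MTP is simple: emitting $g$ tokens per hidden state cuts the number of decoding steps from $N$ to $\lceil N/g \rceil$. A minimal sketch (the grouping and padding scheme here is illustrative, not a specific paper's format):

```python
import math

def mtp_steps(n_tokens, g):
    """Decoding steps when each hidden state emits g tokens in parallel."""
    return math.ceil(n_tokens / g)

def chunk(tokens, g, pad=0):
    """Group a unit sequence into g-token blocks, padding the final block."""
    steps = mtp_steps(len(tokens), g)
    padded = tokens + [pad] * (steps * g - len(tokens))
    return [padded[i * g:(i + 1) * g] for i in range(steps)]

seq = list(range(25))                # 25 speech tokens
assert mtp_steps(len(seq), 12) == 3  # vs. 25 steps decoding one token at a time
assert chunk(seq, 12)[0] == list(range(12))
```

With $g = 12$ (as in MTP-12H), a sequence of several hundred tokens per second needs roughly a twelfth as many sequential decoding steps.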

4. Evaluation Benchmarks and Metrics

Textless SLMs are benchmarked via a variety of intrinsic and extrinsic tasks (Arora et al., 11 Apr 2025):

| Metric/Task | Description | Example Score |
| --- | --- | --- |
| Perplexity (PPL) | Token-level perplexity over held-out speech | PPL values reported |
| sWUGGY | Lexical discrimination (real vs. nonce words) | 83.29 (discrete) |
| sBLIMP | Sensitivity to grammaticality | 61.93 (discrete) |
| Speech StoryCloze | Semantic coherence/cloze accuracy | +5.8 pp w/ interleaving |
| Subjective MOS | Human-perceived audio naturalness/meaningfulness | 4.08 (SLIDE-2) |
| Speaker SIM | Speaker similarity to reference voices | 0.60 (MTP-12H, speaker-aware) |
| WER | Transcription word error rate (%) | 3.01 (MTP-12H) |

Extrinsic tasks include zero-shot keyword spotting, ASR-free recognition, and speaker-role QA (RoleTriviaQA) (Fan et al., 14 Jun 2025, Moumen et al., 1 Dec 2025). Human evaluation (MOS), LLM-based scores from GPT-4o or Mistral, and speaker similarity metrics are also frequently used.
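WER, the transcription-level metric in the table above, is a normalized word-level edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal implementation:

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)   # cost of deleting all reference words
    d[0, :] = np.arange(len(h) + 1)   # cost of inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)

assert wer("the cat sat", "the cat sat") == 0.0
assert np.isclose(wer("the cat sat down", "the cat down"), 0.25)  # 1 deletion / 4 words
```

For a textless SLM, the hypothesis is an ASR transcription of generated speech, so this measures intelligibility rather than token accuracy.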

5. Extensions: Cross-Lingual, Multitask, and Speaker-Aware SLMs

Recent work expands SLM capabilities in several directions:

  • Cross-Lingual Interleaving: Concatenation of sentence-aligned speech token sequences from different languages (e.g., EN–FR) in training drives robust cross-lingual alignment, improves semantic transfer (+5.8 points on sSC for EN→FR), and supports multilingual processing without textual supervision (Moumen et al., 1 Dec 2025).
  • Speech-to-Speech Translation: Multitask SLMs (e.g., MSLM-S2ST) enable end-to-end speech-to-speech translation by jointly modeling semantic and acoustic token streams, preserving speaker style with cosine similarity up to 0.43 and BLEU scores ~24.8 (Es→En) (Peng et al., 2024).
  • Speaker-Aware Modeling: Integration of pretrained timbre or speaker embeddings (X_user) into the context allows SLMs to control and preserve speaker identity, critical for role-playing, dialogue, and QA scenarios (e.g., SIM rises to 0.60 in MTP-12H with speaker conditioning) (Fan et al., 14 Jun 2025).
  • Hybrid Integration: Systems such as SLIDE combine LLM-based textual generation with SLM-guided naturalistic speech, achieving M-MOS (meaning–coherence) scores of 4.08 (close to human ground-truth 4.63) while matching natural turn-taking and non-verbal vocalizations (Lu et al., 1 Jan 2025).
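One way to picture cross-lingual interleaving is as a single training stream that alternates sentence-aligned unit sequences from the two languages. The sketch below is a minimal illustration; the `<en>`/`<fr>` tag tokens are assumptions for readability, not the exact format used in the cited work.

```python
def interleave(pairs):
    """Concatenate sentence-aligned unit sequences from two languages,
    alternating languages sentence by sentence with language tags.
    Tag tokens "<en>"/"<fr>" are illustrative placeholders."""
    out = []
    for en_units, fr_units in pairs:
        out += ["<en>"] + en_units + ["<fr>"] + fr_units
    return out

aligned = [([12, 7, 7, 3], [44, 2, 9]),   # toy EN/FR discrete-unit sequences
           ([5, 5, 18], [31, 6, 6, 20])]
seq = interleave(aligned)
assert seq[:5] == ["<en>", 12, 7, 7, 3]
```

Training an autoregressive LM on such streams forces it to predict across language boundaries, which is what drives the cross-lingual alignment reported above.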

6. Limitations, Open Challenges, and Future Directions

Key limitations and research challenges include:

  1. Codebook Scalability: Increasing V|V| enhances phonetic granularity but slows training and inference; efficient quantization and decoding strategies are necessary (Arora et al., 11 Apr 2025).
  2. Linguistic Unit Alignment: Many tokenizers lack alignment with phonemes, syllables, or subword units; better linguistic alignment is needed to improve interpretability and cross-lingual transfer (Arora et al., 11 Apr 2025).
  3. Cross-Lingual Transfer and Low-Resource Languages: Most SLMs are still English-centric; approaches such as interleaving and joint training with a high-resource language can partially mitigate data scarcity (Moumen et al., 1 Dec 2025).
  4. Integration with Downstream Tasks: Minimizing cascading errors in pipelines that combine SLMs with ASR/ST/understanding components remains unresolved (Arora et al., 11 Apr 2025).
  5. Unified Benchmarks: Metrics and benchmarks for speech tokenization and modeling remain fragmented; a unified suite covering codebook quality, PPL, sWUGGY, sBLIMP, and generation is lacking (Arora et al., 11 Apr 2025).

Future work aims to scale up both model and data, improve isomorphic alignment between speech/text, advance controllable and expressive speech generation, address prosodic/style alignment, and broaden applicability to more languages, paralinguistic phenomena, and complex interactive tasks (Chou et al., 12 Aug 2025, Lin et al., 2024, Fan et al., 14 Jun 2025).
