
Units-to-Speech (U2S) Overview

Updated 17 February 2026
  • Units-to-Speech (U2S) is a framework that converts symbolic acoustic units into intelligible speech, central to textless and multimodal applications.
  • It employs self-supervised learning techniques to extract discrete units via clustering or vector quantization, ensuring phonetic and linguistic fidelity.
  • U2S architectures integrate advanced vocoders and prosody models to balance acoustic detail with abstract representations, enhancing synthesis quality.

A Units-to-Speech (U2S) system converts symbolic, pseudo-linguistic or acoustic "units" into an intelligible, natural-sounding speech waveform. In contemporary research, these units are typically discrete symbols, obtained without text supervision, that encode phonetic, subword, or mid-level acoustic information derived from raw speech or, in specialized settings, from alternative sensory modalities. U2S is a critical component in textless speech-to-speech and multimodal generative systems, enabling applications such as direct speech-to-speech translation, silent speech interfaces, and TTS without intermediate text. The following sections survey the major classes of U2S systems, their mathematical underpinnings, representation choices, generator/vocoder architectures, evaluation regimes, and the principal technical and empirical insights motivating this research direction.

1. Discrete Unit Representations and Extraction

Most state-of-the-art U2S systems leverage discrete representations derived from self-supervised learning (SSL) models such as HuBERT, Wav2Vec2, mHuBERT, or XLS-R. These models project raw audio to high-dimensional latent features, from which discrete units are extracted by clustering or vector quantization:

  • K-means clustering on SSL features: Frame-level hidden vectors h_t (sampled every 20 ms) are clustered into K centroids, yielding hard cluster assignments u_t ∈ {1, ..., K}. Typical K values range from 200 to 1024; optimal K balances phonetic granularity and robustness to noise artifacts (Duret et al., 2024, Kim et al., 2023).
  • VQ-VAE and related quantizers: Vector-Quantized Autoencoders and their variants directly learn codebooks and quantization assignments end-to-end. This is often used in zero-resource and unsupervised TTS pipelines (Dunbar et al., 2019).
  • Run-length encoding and deduplication: To ensure compactness and temporal alignment, consecutive repeated unit indices are collapsed, optionally producing a parallel duration or boundary sequence (Chen et al., 2022, Cheng et al., 2023).
  • Non-acoustic alternatives: For silent-speech and multimodal pipelines, alternative unit sources include ultrasound tongue images (Ultra2Speech: mapping US images to formant trajectories) (Saha et al., 2020), or segmental units from visual grounding models (Hsu et al., 2020).

These procedures yield a temporally aligned sequence of discrete units u_1, ..., u_N, which serves as the symbolic input to the U2S generator.
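As a concrete sketch of the extraction pipeline above, the following pure-Python toy shows hard nearest-centroid assignment (standing in for a trained k-means codebook over SSL features) followed by run-length deduplication; the 2-D features and centroids are invented for illustration:

```python
import math

def assign_units(features, centroids):
    """Assign each frame-level feature vector to its nearest centroid index,
    mimicking the hard k-means assignment u_t in {0, ..., K-1}."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(range(len(centroids)), key=lambda k: dist(f, centroids[k]))
            for f in features]

def deduplicate(units):
    """Run-length encode: collapse consecutive repeated unit indices,
    returning the reduced sequence and the duration of each collapsed run."""
    reduced, durations = [], []
    for u in units:
        if reduced and reduced[-1] == u:
            durations[-1] += 1
        else:
            reduced.append(u)
            durations.append(1)
    return reduced, durations

# Toy example: 2-D "SSL features" and K = 2 centroids (real systems use
# high-dimensional features and K in the hundreds).
centroids = [(0.0, 0.0), (1.0, 1.0)]
feats = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.1, 0.9), (0.0, 0.1)]
units = assign_units(feats, centroids)   # [0, 0, 1, 1, 0]
print(deduplicate(units))                # ([0, 1, 0], [2, 2, 1])
```

The duration list produced alongside the reduced sequence is exactly the parallel duration/boundary stream some systems feed to an explicit duration predictor.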

2. Architectural Patterns for Units-to-Speech Synthesis

U2S decoders fall broadly into three categories:

a. Neural Vocoder Architectures

  • Unit-based HiFi-GAN: The dominant approach is to embed unit indices via a learned lookup E(u_t) and feed these embeddings through a stack of upsampling (transposed-convolutional) blocks and multi-rate residual stacks, culminating in waveform synthesis. Discriminators (multi-period, multi-scale) enforce adversarial and perceptual alignment (Mingote et al., 2023, Kim et al., 2023, Rashidi et al., 16 Nov 2025, Hwang et al., 2024). The generator loss combines adversarial, feature-matching, and STFT or mel-spectrogram reconstruction terms.
  • Tacotron and variants: Some systems utilize attention-based sequence-to-sequence models (e.g., Tacotron-2), predicting mel-spectrograms from unit sequences before inversion via WaveGlow or HiFi-GAN. This pipeline is frequently used when explicit duration modeling or expressive prosody control is required (Hsu et al., 2020, Chen et al., 2022).
  • Diffusion-based decoders: Diffusion probabilistic models invert noise to spectrograms conditioned on unit encodings and speaker embeddings. UnitSpeech, for example, integrates a diffusion TTS backbone with a unit encoder, enabling adaptation and rapid voice conversion (Kim et al., 2023).
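The front end shared by these decoders can be sketched as a lookup plus temporal upsampling. The sizes below are illustrative, and simple frame repetition stands in for the learned transposed-convolutional stack of a real HiFi-GAN-style generator:

```python
import random

random.seed(0)

# Hypothetical sizes; real unit vocoders use K ~ 200-1024 and dim ~ 128-512.
K, EMB_DIM, UPSAMPLE = 8, 4, 2

# Learned lookup table E(u_t): one embedding vector per discrete unit.
embedding = [[random.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(K)]

def embed(units):
    """Map a unit sequence to frame-level embedding vectors via table lookup."""
    return [embedding[u] for u in units]

def upsample(frames, factor):
    """Toy stand-in for the transposed-convolutional upsampling stack:
    repeat each frame `factor` times to reach a higher temporal rate.
    (A real generator interleaves learned upsampling with multi-receptive-
    field residual blocks and ends in a convolution to the waveform.)"""
    return [f for f in frames for _ in range(factor)]

units = [3, 3, 5, 1]
frames = upsample(embed(units), UPSAMPLE)
print(len(frames))  # 8 frames at the upsampled rate
```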

b. Prosody and Expressivity Modeling

  • Explicit duration, pitch, and energy prediction: Several U2S systems use cascaded or parallel networks for estimating duration, F0, energy, and voicing for each unit, yielding richer prosodic control and better transferability of rhythm and style (Chen et al., 2022, Cheng et al., 2023, Hwang et al., 2024).
  • Expressivity encoders and style transfer: FastSpeech2-style models, ECAPA-TDNN embeddings, and Feature-wise Linear Modulation (FiLM) layers enable conditioning on speaker or vocal style, even under domain-heterogeneous or noisy conditions (Hwang et al., 2024).
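The FiLM conditioning mentioned above amounts to a per-channel scale and shift of the hidden features, with (gamma, beta) predicted from a speaker or style embedding; a minimal sketch with hand-picked values standing in for the conditioning network:

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: modulate each channel of every frame
    with a conditioning-derived scale (gamma) and shift (beta)."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]

# Two frames with three channels; gamma/beta would normally be the output of
# a small network applied to a speaker or expressivity embedding.
frames = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
gamma, beta = [2.0, 1.0, 0.5], [0.1, 0.0, -0.5]
print(film(frames, gamma, beta))  # [[2.1, 2.0, 1.0], [0.1, 1.0, -1.0]]
```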

c. Specialized and Multimodal Decoding

  • Formant-based synthesis: For articulatory-to-acoustic mapping, systems such as Ultra2Speech predict vocal tract formant trajectories from non-acoustic input (e.g., ultrasound) and synthesize speech via classical formant-based synthesizers (Saha et al., 2020).
  • Audio-visual speech synthesis: TransFace’s Unit2Lip module jointly generates both speech and synchronized lip movement frames from unit sequences, incorporating a bounded duration predictor and synchronization loss against lip-sync encodings (Cheng et al., 2023).

3. Training, Objective Functions, and Data Regimes

U2S model training typically proceeds in distinct stages and employs specialized objective functions:

  • Adversarial and feature-matching losses: HiFi-GAN-based decoders are trained with a weighted sum of adversarial, feature-matching, and STFT/mel losses, ensuring fidelity and naturalness across a range of speakers and languages (Mingote et al., 2023, Duret et al., 2024, Rashidi et al., 16 Nov 2025).
  • Spectrogram and duration losses: Tacotron-style pipelines are optimized with mean-squared error (MSE) on mel-spectrograms, cross-entropy for stop tokens, and duration prediction MSE (Hsu et al., 2020, Chen et al., 2022).
  • Diffusion and alignment losses: Diffusion models minimize direct score-matching objectives for noisy spectrogram restoration, with additional encoder-decoder alignment and monotonicity losses (Kim et al., 2023).
  • Prosody and expressivity: Additional BCE and cross-entropy terms model pitch bin distributions, voicing, and energy, allowing style-transfer and robust rhythmic alignment (Chen et al., 2022, Cheng et al., 2023, Hwang et al., 2024).
  • Self-supervised distillation: DINO-style cross-entropy distillation losses align noisy and clean expressivity embeddings via teacher-student architectures, improving robustness (Hwang et al., 2024).
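A sketch of the weighted generator objective described above, computed over flat Python lists rather than batched tensors; the default weights follow the original HiFi-GAN recipe (lambda_fm = 2, lambda_mel = 45), which unit-based systems typically retune:

```python
def l1(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def generator_loss(disc_fake, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_fm=2.0, lambda_mel=45.0):
    """HiFi-GAN-style generator objective: least-squares adversarial term on
    discriminator outputs for generated audio, feature-matching L1 over
    discriminator activations, and an L1 mel-spectrogram reconstruction term."""
    l_adv = sum((d - 1.0) ** 2 for d in disc_fake) / len(disc_fake)
    l_fm = sum(l1(fr, ff) for fr, ff in zip(feats_real, feats_fake)) / len(feats_real)
    l_mel = l1(mel_real, mel_fake)
    return l_adv + lambda_fm * l_fm + lambda_mel * l_mel

# Perfect generator: discriminator fooled, features and mel match exactly.
print(generator_loss([1.0, 1.0], [[0.0]], [[0.0]], [0.5], [0.5]))  # 0.0
```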

Transfer and fine-tuning protocols differ: some systems pre-train on large multi-speaker corpora and fine-tune for a new speaker or style with a single (unit, speech) pair, whereas others employ synthetic parallel data to augment low-resource settings (Kim et al., 2023, Rashidi et al., 16 Nov 2025).

4. Evaluation Metrics and Empirical Findings

U2S quality is assessed via several complementary metrics:

| Metric | Subjective/Objective | Purpose |
|---|---|---|
| Mean Opinion Score (MOS) | Subjective | Speech naturalness, fluidity |
| Character Error Rate (CER) | Objective | Intelligibility, ASR transcribability |
| BLEU (ASR-BLEU) | Objective | Translation/TTS intelligibility between ASR hypothesis and reference |
| Speaker similarity (SMOS/SES/SIM) | Mixed | Voice fidelity |
| Prosody metrics (AutoPCP, LSE-*) | Mixed | Rhythm, pitch, style transfer, audio-visual sync |
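Of these metrics, CER is simple to compute from ASR transcripts: the character-level Levenshtein edit distance normalized by reference length. A self-contained sketch:

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein distance between hypothesis and
    reference transcripts, normalized by the reference length."""
    r, h = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (rc != hc)))    # substitution
        prev = curr
    return prev[-1] / len(r)

print(cer("hello", "hallo"))  # 0.2 (one substitution over five characters)
```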

Empirical findings include:

  • HiFi-GAN models conditioned on units achieve MOS up to 3.8 on reconstruction and 4.2+ in speaker conversion (Duret et al., 2024, Kim et al., 2023).
  • For speech translation, optimal unit choices for U2S (maximizing MOS) often diverge from those optimizing BLEU or CER; thus, S2ST system design must trade off linguistic abstraction and acoustic fidelity (Duret et al., 2024).
  • Diffusion decoders and explicit prosody networks have yielded substantial improvements in voice similarity and style transfer with minimal adaptation data (Kim et al., 2023, Chen et al., 2022).
  • Self-supervised distillation confers strong robustness to channel and environmental noise in expressive S2ST scenarios, outperforming conventional PRETSSEL under SNR degradation (Hwang et al., 2024).
  • Cascade and direct pipelines perform comparably in high-resource settings, but unit-based models often close the gap in low-resource regimes by decoupling the vocoder from the translation model (Mingote et al., 2023, Rashidi et al., 16 Nov 2025).
  • In cross-modal and multimodal settings (e.g., silent-speech, textless image-to-speech), learned units enable U2S synthesis without text or phonetic supervision at all (Hsu et al., 2020, Saha et al., 2020).

5. U2S in Multilingual, Textless, and Multimodal Systems

U2S models are a foundational component in textless and multimodal generative pipelines:

  • Textless S2ST and TTST: U2S enables direct mapping from speech or text in any input language to a discretized intermediate (pseudo-text), from which a single universal vocoder generates waveforms, greatly simplifying many-to-many translation (Kim et al., 2023, Mingote et al., 2023, Zhang et al., 21 May 2025).
  • Silent and visual speech interface: Non-acoustic “units” such as tongue ultrasound (Ultra2Speech) or visual-codebook units (ResDAVEnet-VQ) extend U2S technology to domains where text or audio is absent (Saha et al., 2020, Hsu et al., 2020).
  • Audio-visual generation: Joint synthesis models such as TransFace’s Unit2Lip module generate isochronous, length-matched audio and lip-motion streams from a shared unit sequence (Cheng et al., 2023).
  • Prosody and voice conversion: Unified pipelines integrate unit-based conditioning with one-shot adaptivity in prosody, speaker, and emotion, allowing rapid transfer across speaker, language, and style (Kim et al., 2023, Chen et al., 2022).
  • Expressivity under noise: DINO-PRETSSEL demonstrates significant gains in noisy and real-world conditions, maintaining voice identity and expressivity even under severe degradation (Hwang et al., 2024).

6. Methodological Insights, Limitations, and Tradeoffs

Key technical and empirical insights include:

  • Unit design is central: The performance ceiling of U2S is set as much by unit extraction (choice of SSL model, clustering granularity, layer selection) as by the subsequent generator (Duret et al., 2024, Kim et al., 2023).
  • Decoupling translation and synthesis: By using discrete units as a pivot representation, the same vocoder can serve direct speech-to-speech, text-to-speech, and even multimodal pipelines, supporting modularity and efficient adaptation (Mingote et al., 2023, Zhang et al., 21 May 2025).
  • Acoustic–linguistic tradeoff: More detailed units (larger K, lower-layer features) typically improve resynthesis but may impair translation or linguistic abstraction; the converse holds for higher-layer, more abstract units (Duret et al., 2024).
  • Prosody modeling remains an open challenge: While explicit modeling of duration, F0, and energy improves naturalness and expressivity, automatic transfer of prosody, especially under minimal supervision or in unseen styles, is still imperfect (Chen et al., 2022, Kim et al., 2023).
  • Synthetic data and pretraining regimes: In low-resource language scenarios, pipeline augmentation using synthetic parallel data, pretrained multilingual encoders, and self-supervised adaptation substantially closes performance gaps to high-resource settings (Rashidi et al., 16 Nov 2025, Mingote et al., 2023).
  • No universal optimum: Empirical optimization of both acoustic and translation metrics is required; no single set of unit extraction parameters is optimal for all downstream purposes (Duret et al., 2024).

7. Historical Perspectives and Applications

U2S technology has evolved from early unsupervised TTS (e.g., ZR19: clustering plus Merlin, Tacotron, or WORLD/Griffin-Lim vocoders) (Dunbar et al., 2019) to present-day large-scale, multilingual S2ST systems using robust SSL encoders, adversarial/generative vocoders, and explicit prosody modeling (Kim et al., 2023, Chen et al., 2022, Hwang et al., 2024). Applications now include S2ST, expressive/voice-transfer TTS, textless image-to-speech, silent-speech and medical interfaces, and multimodal talking-head translation.

Comprehensive empirical studies and shared benchmark datasets (e.g., Zero Resource Challenge, CVSS, mExpresso, VoxPopuli, LJSpeech, VCTK) provide objective and subjective comparisons along MOS, CER, BLEU, SIM, AutoPCP, and lip-sync metrics.

Ongoing research addresses open challenges in optimizing unit selection, robust cross-lingual transfer, multimodal synthesis synchrony, and the joint design of unit extraction and U2S generative models. The U2S paradigm remains at the core of textless and modular speech generation architectures for both research and applied domains.
