CLaM-TTS: Zero-Shot Text-to-Speech
- CLaM-TTS is a zero-shot text-to-speech system that combines a probabilistic residual VQ-enhanced Mel-VAE with a latent Transformer LM to generate speech efficiently.
- It achieves competitive naturalness, intelligibility, and speaker similarity by emitting multiple tokens in parallel, significantly reducing autoregressive steps.
- The system leverages advanced text encoders like ByT5 and XPhoneBERT, demonstrating robust performance in both monolingual and multilingual settings.
CLaM-TTS (Codec Language Model for Text-to-Speech) is a zero-shot TTS system designed to address the scalability challenges of neural audio codec-based generation. It combines probabilistic residual vector quantization within a Mel-spectrogram variational autoencoder (Mel-VAE) with a latent Transformer language model to achieve high compression efficiency and simultaneous multi-token generation. CLaM-TTS delivers competitive or superior results in naturalness, intelligibility, and speaker similarity compared to state-of-the-art neural codec TTS models (Kim et al., 2024).
1. System Architecture
CLaM-TTS integrates two core components: a Mel-VAE with probabilistic residual vector quantization (Mel-VAE w/ PRVQ) and a latent Transformer-based language model (LM). The architecture operates as follows:
- Stage 1: Mel-VAE (w/ Probabilistic Residual VQ)
- The encoder maps an input mel-spectrogram into a sequence of continuous latent vectors at a reduced temporal rate (e.g., 10 Hz).
- Each latent vector $z_t$ is quantized by a depth-$D$ probabilistic residual vector quantizer into a tuple of discrete code indices $c_t = (c_{t,1}, \ldots, c_{t,D})$, producing $D$ parallel latent code streams.
- The Mel-VAE decoder reconstructs the output spectrogram from the quantized embeddings $\hat{z}_t = \sum_{d=1}^{D} e_\psi(c_{t,d}; d)$.
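In shape terms, Stage 1 can be sketched as follows. This is a toy illustration: the 80 mel bins, latent dimension 512, 8× temporal downsampling, and the zero-valued stand-in encoder/quantizer are assumptions for the sketch; only the depth D = 32 and the resulting ~10 Hz code rate come from the text.

```python
import numpy as np

n_mels, latent_dim, D = 80, 512, 32   # 80/512 are illustrative assumptions
downsample = 8                        # 8x temporal downsampling (see trade-offs)

def encode(mel):
    # Stand-in for the Mel-VAE encoder: only the output shape matters here.
    return np.zeros((latent_dim, mel.shape[1] // downsample))

def quantize(z):
    # Stand-in for the PRVQ: one code index per depth, per latent frame.
    return np.zeros((D, z.shape[1]), dtype=int)

mel = np.zeros((n_mels, 400))         # ~5 s of mel frames at an assumed ~80 fps
z = encode(mel)                       # -> (512, 50): 50 frames / 5 s = 10 Hz
codes = quantize(z)                   # -> (32, 50): D codes per latent frame
```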
- Stage 2: Latent LM
- Input text is embedded using a frozen text encoder (ByT5-large).
- An autoregressive Transformer decoder predicts a mixture-of-Gaussians (MoG) distribution over the continuous latent $z_t$ at each time step, conditioned on the text embedding and the previously generated codes $c_{<t}$.
- At generation, the latent $z_t$ is sampled from the predicted mixture, quantized by the learned PRVQ, and all $D$ code indices for that timestep are emitted in parallel.
- An EOS head signals termination via a thresholded probability.
The stack of discrete codes is decoded by the Mel-VAE and synthesized as audio using a BigVGAN vocoder.
Key technical parameters include a codeword rate of 10 Hz, an RVQ depth of $D = 32$ with a vocabulary of 1,024 codewords per depth, and a 12-layer Transformer decoder with 16 attention heads and 1,536 hidden dimensions.
End-to-end inference pseudocode:
```
def CLaM_TTS_Generate(text_x):
    C = []                                   # generated code tuples
    for t in range(1, T_max + 1):
        h_t = TransformerDecoder(text_x, C)
        p_k = Softmax(W_k @ h_t)             # mixture-index probabilities
        k_t = sample(p_k)
        mu_t = MLP_mu(k_t, h_t)
        z_t = mu_t + sigma_psi * epsilon     # epsilon ~ N(0, I)
        c_t = RQ_psi(z_t)                    # all D code indices in one step
        C.append(c_t)
        if sigmoid(W_eos @ h_t) > 0.5:       # EOS head
            break
    # decode the summed codebook embeddings:
    # y_hat = MelVAE_Decoder({sum_{d=1}^D e_psi(c_{t,d}; d)}_{t=1:|C|})
    y_hat = MelVAE_Decoder(C)
    waveform = BigVGAN(y_hat)
    return waveform
```
2. Probabilistic Residual Vector Quantization (PRVQ)
PRVQ in CLaM-TTS extends standard residual vector quantization (RVQ) by learning the codebooks within a variational-inference framework, preventing codebook collapse and improving the utilization of codeword space.
- Standard RVQ iteratively assigns code indices to minimize the squared reconstruction error of the running residual at each depth:
$c_{t,d} = \arg\min_{c} \left\| r_{t,d} - e_\psi(c; d) \right\|^2, \quad r_{t,1} = z_t, \quad r_{t,d+1} = r_{t,d} - e_\psi(c_{t,d}; d)$
The quantized vector is reconstructed as $\hat{z}_t = \sum_{d=1}^{D} e_\psi(c_{t,d}; d)$.
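Greedy RVQ can be written in a few lines of NumPy. This is a minimal sketch with toy sizes; `codebooks[d][c]` plays the role of $e_\psi(c; d)$:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Greedy residual VQ: pick the nearest codeword at each depth,
    then quantize the remaining residual at the next depth."""
    residual = z.copy()
    codes = []
    for cb in codebooks:                            # depth d = 1..D
        dists = np.sum((cb - residual) ** 2, axis=1)
        c = int(np.argmin(dists))                   # argmin squared error
        codes.append(c)
        residual = residual - cb[c]                 # pass residual onward
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected codewords over depths.
    return sum(cb[c] for c, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 8)) for _ in range(4)]  # toy V=1024, D=4
z = rng.normal(size=8)
codes = rvq_encode(z, codebooks)
z_hat = rvq_decode(codes, codebooks)
```

PRVQ replaces the hard argmin above with a stochastic code posterior, as described next.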
- PRVQ instead adopts a mean-field variational posterior over the code tuple:
$Q(c_{1:D} \mid z) = \prod_{d=1}^{D} Q_d(c_d \mid z)$
The optimal code posterior for component $d$ is derived from the evidence lower bound (ELBO):
$Q_d(c_d \mid z) \propto \exp\left( \mathbb{E}_{Q(c_{-d} \mid z)}\left[ \log p(z \mid c_{1:D}) \right] \right)$
In practice, the expectation over the other components $c_{-d}$ is approximated by conditioning on the greedy codes from standard RVQ.
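At a single depth, the stochastic code selection amounts to sampling from a softmax over negative squared distances to the codewords, with the variance acting as a temperature. The exact posterior form and the σ² value below are illustrative assumptions, not the paper's equations:

```python
import numpy as np

def sample_code(residual, codebook, sigma2=0.1, rng=None):
    """Sample a code index with probability proportional to
    exp(-||residual - codeword||^2 / (2 * sigma2))."""
    rng = rng or np.random.default_rng()
    d2 = np.sum((codebook - residual) ** 2, axis=1)
    logits = -d2 / (2.0 * sigma2)
    logits -= logits.max()                  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(codebook), p=probs)), probs

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))                   # toy codebook
residual = codebook[3] + 0.01 * rng.normal(size=4)    # near codeword 3
c, probs = sample_code(residual, codebook, sigma2=0.05, rng=rng)
# with small sigma2, mass concentrates on the nearest codeword
```

As σ² shrinks, the distribution collapses onto the argmin and PRVQ recovers greedy RVQ; larger σ² spreads probability mass, which is what discourages codebook collapse.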
- Empirical Effects: PRVQ yields more uniform usage of codewords and higher reconstruction quality at fixed bit rates. For speech reconstruction, PRVQ achieves PESQ 2.63 (vs. 2.54 for baseline RVQ) and ViSQOL 4.48 (vs. 4.44); compared to EnCodec (6 kbps), CLaM-TTS shows notable improvements (PESQ 2.95 vs. 2.59, ViSQOL 4.66 vs. 4.26). It combines high compression (5 s of audio → 1,600 tokens at 10 Hz with $D = 32$) with efficiency: all 32 codes per frame are emitted in parallel, drastically reducing autoregressive steps.
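A back-of-envelope token count makes the compression concrete. The EnCodec figures here (75 Hz frame rate, 8 codebooks at 6 kbps) are assumptions from outside this text; the CLaM figures follow from the numbers above:

```python
dur_s = 5.0

# CLaM-TTS: 10 Hz frames, D = 32 codes per frame, emitted in parallel
clam_tokens = int(dur_s * 10 * 32)     # total discrete tokens
clam_ar_steps = int(dur_s * 10)        # one AR step per frame

# EnCodec at 6 kbps (assumed: 75 Hz frames, 8 codebooks)
encodec_tokens = int(dur_s * 75 * 8)

print(clam_tokens, clam_ar_steps, encodec_tokens)  # 1600 50 3000
```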
3. Multi-Token Generation Strategy
A principal advantage over prior neural codec LMs is the elimination of cascaded, framewise autoregression over multiple audio token streams. In CLaM-TTS:
- The LM predicts the entire continuous latent $z_t$ (as a $K$-component MoG) directly at each timestep.
- The shared PRVQ quantizes $z_t$ to the complete code tuple $(c_{t,1}, \ldots, c_{t,D})$ in a single step.
As a result, the LM loop proceeds over $T$ steps rather than $T \times D$ steps for $D$ audio token streams.
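The step-count claim can be checked with toy numbers (T = 100 frames, i.e. a 10 s utterance at 10 Hz, is illustrative):

```python
T, D = 100, 32                 # frames (10 s at 10 Hz) and RVQ depth

flattened_lm_steps = T * D     # one token per AR step across all streams
clam_lm_steps = T              # all D codes per frame emitted in parallel

print(flattened_lm_steps, clam_lm_steps)  # 3200 100
```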
Training objective:
$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log \sum_{k=1}^{K} \pi_{t,k}\, \mathcal{N}\left(z_t;\ \mu_{t,k},\ \sigma^2 I\right)$
with an additional binary EOS cross-entropy loss.
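The per-step MoG likelihood can be computed stably with log-sum-exp. This sketch assumes isotropic Gaussian components with a shared variance σ²; the dimensions and mixture count are illustrative:

```python
import numpy as np

def mog_nll(z, log_pi, mus, sigma2):
    """Negative log-likelihood of z under sum_k pi_k N(z; mu_k, sigma2*I)."""
    d = z.shape[-1]
    sq = np.sum((mus - z) ** 2, axis=-1)                       # (K,)
    log_comp = -0.5 * (sq / sigma2 + d * np.log(2 * np.pi * sigma2))
    a = log_pi + log_comp
    m = a.max()                                                # log-sum-exp
    return float(-(m + np.log(np.exp(a - m).sum())))

K, d = 4, 8
rng = np.random.default_rng(0)
mus = rng.normal(size=(K, d))
log_pi = np.log(np.full(K, 1.0 / K))       # uniform mixture weights
z = mus[1] + 0.01 * rng.normal(size=d)     # a latent near component 1
nll = mog_nll(z, log_pi, mus, sigma2=0.1)  # low NLL: z is well explained
```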
During inference, the LM evaluates three parallel heads at each step: (1) a softmax over the mixture index $k$, (2) the mixture means $\mu_{t,1:K}$, and (3) the EOS probability. The process samples $k_t$ and $z_t$, quantizes, appends the resulting codes, and repeats until EOS.
4. Text Encoder Choice and Pretraining Analysis
CLaM-TTS employs a frozen text encoder and systematically benchmarks several T5-derived models:
- T5 (subword)
- mT5 (multilingual)
- ByT5 (byte-level)
- Flan-T5 (instruction-tuned)
- T5-lm-adapt (prefix LM adapted)
- XPhoneBERT (phoneme-based)
In an ablation on English continuation (LibriSpeech test-clean), the best results are obtained with phoneme-based models. Among subword/byte-level models, ByT5 yields the lowest WER and CER, indicating that large-scale pretraining on byte-level data enhances the LM-to-speech mapping, while a slight advantage remains for phoneme supervision where available.
5. Experimental Protocol and Quantitative Performance
CLaM-TTS systems are trained in monolingual (CLaM-en; 55K hours of English) and multilingual (CLaM-multi; 100K hours, 11 languages) settings. Baselines include YourTTS (VITS-based), VALL-E, SPEAR-TTS, and VoiceBox (a non-autoregressive phoneme/duration model).
Evaluation Tasks:
- Continuation: Extend a 3 s speech prompt with matching transcript.
- Cross-sentence: Synthesize new text in the same voice after a 3 s speech prompt.
Metrics:
- Intelligibility: WER, CER (ASR via HuBERT-large, Whisper)
- Speaker Similarity: SIM-o (output–target), SIM-r (reconstruction–target via WavLM)
- Subjective: QMOS, SMOS, CMOS
- Inference speed: Time to synthesize 10 s of audio on a single A100 GPU
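WER is the word-level Levenshtein distance normalized by reference length; this minimal sketch (not the paper's exact scoring script, whose transcripts come from HuBERT-large/Whisper ASR) illustrates the metric:

```python
def wer(ref, hyp):
    """Word error rate: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(r), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

CER is the same computation over characters instead of words.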
Key Results (English continuation):
| Model | WER (%) | CER (%) | SIM-o | SIM-r | Inference (s) |
|---|---|---|---|---|---|
| GT | 2.2 | 0.61 | 0.754 | 0.754 | - |
| YourTTS | 7.57 | 3.06 | 0.393 | - | - |
| VALL-E | 3.8 | - | 0.452 | 0.508 | 6.2 |
| VoiceBox | 2.0 | - | 0.593 | 0.616 | 6.4 |
| CLaM-en | 2.36 | 0.79 | 0.477 | 0.513 | 4.15 |
On cross-sentence tasks, CLaM-en achieves WER 5.11%, CER 2.87%, and SIM-o 0.495 (all second-best among tested models). Subjective evaluations for quality and similarity also reflect gains: QMOS for CLaM-en is 3.87 ± 0.12 (vs. 2.39 ± 0.19 for YourTTS, 4.45 for GT), and SMOS is 3.49 ± 0.14.
In multilingual continuation over 11 languages, WER ranges from 4% to 20% and SIM-o from 0.42 to 0.60, showing broad applicability.
6. Trade-offs, Limitations, and Prospects
Trade-offs:
- Lower codeword rates (e.g., 10 Hz) reduce AR steps but can degrade perceptual quality (PESQ), while higher rates boost quality but increase LM computation. Empirically, 8× downsampling is effective.
- PRVQ combats codebook collapse, facilitating denser code usage, though excessive compression increases latent variance and challenges the LM.
Speed vs. Expressiveness:
- The single-pass LM with PRVQ generates a 10 s utterance in ≈4 s, versus ≥6 s for cascaded approaches. Non-AR models with phoneme supervision may offer faster inference via fewer diffusion steps but require additional labeling.
Limitations:
- Autoregressive LMs can still exhibit issues such as word omission or repetition; non-AR paradigms (diffusion, flow-matching) may ameliorate these.
- Expressiveness is constrained by predominantly audiobook-style training data; increasing the diversity (conversational, singing) is likely to enhance style control.
- Conditioning on speaker metadata for age, accent, and emotion is promising.
- Further scaling in model size and dataset duration (>100K hours) is a projected avenue for progress toward ground-truth parity.
In summary, CLaM-TTS demonstrates that combining probabilistic RVQ-driven Mel-VAE compression with an efficient, multi-token latent Transformer LM establishes strong zero-shot TTS capabilities in terms of naturalness, intelligibility, speaker similarity, and inference speed, all without reliance on phoneme or duration supervision (Kim et al., 2024).