
Qwen3-TTS Series: Advanced Multilingual TTS

Updated 24 January 2026
  • Qwen3-TTS Series is a family of advanced multilingual text-to-speech models that integrate zero-shot voice cloning, description-based control, and ultra-low latency synthesis through a dual-track autoregressive LM architecture.
  • It employs innovative dual-track processing with separate text and acoustic tokenizers (25Hz and 12Hz), enabling efficient, high-fidelity streaming and rapid voice cloning.
  • Benchmark results demonstrate state-of-the-art performance in intelligibility, speaker similarity, and controllable prosody, backed by extensive multilingual training and robust data stratification.

Qwen3-TTS Series is a family of advanced multilingual, controllable, robust, and streaming text-to-speech (TTS) models designed to unify zero-shot voice cloning, description-based control, and ultra-low-latency synthesis within a single autoregressive framework. The Qwen3-TTS models leverage a dual-track language model (LM) architecture with two specialized speech tokenizers to achieve state-of-the-art performance across objective and subjective benchmarks. The series supports precise style conditioning, rapid voice cloning from just seconds of reference speech, and high-fidelity streaming generation, all backed by large-scale multilingual training data. All primary variants, model weights, and tokenizers are released under the Apache 2.0 license, fostering broad community adoption and research (Hu et al., 22 Jan 2026).

1. Dual-Track Autoregressive LM Architecture

Qwen3-TTS employs a decoder-only Transformer backbone that interleaves two distinct tracks:

  • Text Track: Standard Qwen token embeddings representing input text, with a special “no-text” token signifying text sequence completion.
  • Acoustic Track: Speech tokens discretized by one of the Qwen-TTS-Tokenizers.

At each LM step $t$, the model ingests the current text token $x_t$ (or the "no-text" token) together with previously generated acoustic tokens $\{c_1, \ldots, c_{t-1}\}$, and immediately predicts the next group of acoustic tokens $c_t$. For Qwen3-TTS-25Hz, $c_t$ is a single code; for Qwen3-TTS-12Hz, $c_t$ comprises $N_Q$ residual codebook entries. In the 12Hz variant, code prediction is two-staged: a lightweight linear head predicts the first (semantic) codebook, then a Multi-Token Prediction (MTP) block, conditioned on the semantic code, predicts the $N_Q - 1$ residual codebooks in parallel.
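The dual-track input schedule can be illustrated with a short sketch. This is a toy illustration of how text and acoustic tokens interleave per step, not the model's actual implementation; the `NO_TEXT` placeholder ID is an assumption.

```python
NO_TEXT = -1  # placeholder ID for the "no-text" token (assumed for illustration)

def dual_track_inputs(text_tokens, acoustic_tokens):
    """For each LM step t, pair the current text token (or NO_TEXT once the
    text sequence is exhausted) with the acoustic tokens generated so far."""
    steps = []
    for t in range(len(acoustic_tokens)):
        x_t = text_tokens[t] if t < len(text_tokens) else NO_TEXT
        steps.append((x_t, list(acoustic_tokens[:t])))
    return steps

steps = dual_track_inputs([101, 102], [7, 8, 9])
# step 0 pairs the first text token with an empty acoustic history;
# by step 2 the text is exhausted, so the model sees NO_TEXT plus [7, 8].
```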

Joint training is performed with the negative log-likelihood of the true acoustic sequence given the text and any control tokens:

$$\mathcal{L} = - \sum_{t=1}^{T} \log P\bigl(c_t \mid x_{1:L},\, c_{<t}\bigr) = \sum_{t=1}^{T} \mathrm{CE}\bigl(y_t,\; P(\cdot \mid x, c_{<t})\bigr)$$

where $\mathrm{CE}$ is standard cross-entropy and $y_t$ is the one-hot ground-truth code(s).
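The loss above is a sum of per-step cross-entropies on the ground-truth codes. A minimal numeric sketch with toy probability distributions (not model outputs):

```python
import math

def sequence_nll(probs_per_step, true_codes):
    """Negative log-likelihood of the ground-truth code sequence:
    the sum over steps of -log P(c_t), i.e. per-step cross-entropy."""
    return -sum(math.log(probs[c]) for probs, c in zip(probs_per_step, true_codes))

# Two steps over a toy 3-code vocabulary:
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = sequence_nll(probs, [0, 1])  # = -(log 0.7 + log 0.8)
```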

2. Speech Tokenization: Qwen-TTS-Tokenizer-25Hz & 12Hz

Qwen3-TTS is distinguished by its use of two discrete speech tokenizers:

  • Qwen-TTS-Tokenizer-25Hz: Single-codebook vector quantization (VQ) with 15-bit codes ($2^{15} = 32{,}768$ entries) at 25 codes/sec, giving a bitrate of

$$R_{25} = 15\,\mathrm{bits/code} \times 25\,\mathrm{codes/sec} = 375\,\mathrm{bit/sec}$$

Streaming waveform reconstruction is achieved via block-wise Flow Matching diffusion (DiT) and BigVGAN, with codes grouped in blocks of 8 (320 ms window), a 4-block receptive field in DiT (3 past, 1 lookahead), and BigVGAN providing a fixed 130 ms right-context lookahead.

  • Qwen-TTS-Tokenizer-12Hz: Two-stream residual vector quantization (RVQ) with 1 semantic codebook (size 2,048, 11 bits) and 15 residual codebooks (also size 2,048). Total codes per frame $N_Q = 16$; per-frame bits $= 16 \times 11 = 176$. At 12.5 FPS:

$$R_{12} = 176\,\mathrm{bits/frame} \times 12.5\,\mathrm{frames/sec} = 2200\,\mathrm{bit/sec}\ (2.2\,\mathrm{kbit/sec})$$

The encoder and ConvNet decoder are purely causal; each 80 ms frame is decoded immediately, with inference amortized via emission of 4 tokens per packet (320 ms total latency).
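The bitrates and block timings above reduce to simple arithmetic; a quick sanity check using only the figures stated in this section:

```python
def bitrate_bps(bits_per_code, codes_per_frame, frames_per_sec):
    """Tokenizer bitrate in bits/sec."""
    return bits_per_code * codes_per_frame * frames_per_sec

r25 = bitrate_bps(15, 1, 25)     # 25Hz: one 15-bit code per frame -> 375 bit/s
r12 = bitrate_bps(11, 16, 12.5)  # 12Hz: 16 codebooks x 11 bits at 12.5 FPS -> 2200 bit/s

block_ms = 8 / 25 * 1000    # 25Hz DiT block: 8 codes -> 320 ms window
packet_ms = 4 / 12.5 * 1000  # 12Hz packet: 4 tokens -> 320 ms of audio
```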

| Variant | Bitrate | Tokenizer Type | Latency (first packet) |
|---------|---------|----------------|------------------------|
| 25Hz | 375 bit/sec | Single-codebook VQ | Block-wise DiT, ~320 ms |
| 12Hz | 2.2 kbit/sec | 16-codebook RVQ + causal ConvNet | 97 ms (93 ms LM + 4 ms tokenizer) |

3. Voice Cloning and Description-Based Control

Qwen3-TTS provides two complementary mechanisms for speaker and style control:

  • 3-Second Voice Cloning: A lightweight speaker encoder, trained jointly with the backbone LM, extracts a fixed speaker embedding $s$ from a 3-second reference utterance. During fine-tuning (Stage 3), the base LM can be adapted to a target speaker by minimizing:

$$\mathcal{L}_{\text{clone}} = -\sum_t \log P\bigl(c_t \mid x, s, c_{<t}\bigr) + \lambda \|s - s_{\text{ref}}\|^2$$

  • Description-Based Control: Natural language prompts, using the ChatML format (e.g., "a warm elderly male voice with slow tempo"), are prepended to the text track as learned control token embeddings (e.g., 〈style:warm〉, 〈pitch:+2〉). The LM is conditioned on these tokens for fine-grained manipulation of output prosody, timbre, and style.

A plausible implication is that Qwen3-TTS’s design supports both the synthesis of entirely novel voices and highly specific speaker, style, and prosody edits based on description or small reference samples.
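A description-conditioned request can be sketched as a ChatML-style prompt; the exact field layout and special-token strings below are illustrative assumptions, not the model's documented template:

```python
def build_instruction(description, text):
    """Prepend a natural-language voice description to the synthesis text
    as ChatML-style turns (role names and token strings are illustrative)."""
    return (
        "<|im_start|>system\n" + description + "<|im_end|>\n"
        "<|im_start|>user\n" + text + "<|im_end|>\n"
    )

prompt = build_instruction(
    "a warm elderly male voice with slow tempo",
    "Hello there.",
)
```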

4. Multilingual Training and Data Stratification

Pretraining comprises over 5 million hours of speech data spanning 10 languages (including CommonVoice, LibriSpeech, and additional audiobook/news corpora). Training is organized into three stages:

  • S1—General Pretraining: Large multilingual dataset pooled without adapters; model learns parameter sharing across languages.
  • S2—Quality Stratification: Data scored using ASR models; high-quality utterances are upsampled, noisy data downsampled for robustness.
  • S3—Long-Context Training: Maximum text+code input length increased from 8,192 to 32,768 tokens; long speech segments are upsampled to improve handling of extended utterances.

Balanced sampling ensures comprehensive non-English language representation without dedicated adapters.
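The Stage 2 quality stratification amounts to reweighting utterances by an ASR-derived quality score. A minimal sketch, assuming a hypothetical score threshold and up/down factors (the paper's actual values are not given here):

```python
def sampling_weights(asr_scores, up=2.0, down=0.5, threshold=0.8):
    """Upsample utterances whose ASR quality score clears the threshold,
    downsample noisy ones. Threshold and factors are illustrative."""
    return [up if s >= threshold else down for s in asr_scores]

w = sampling_weights([0.95, 0.60, 0.85])  # high, noisy, high
```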

5. Objective and Subjective Performance Benchmarks

Qwen3-TTS demonstrates state-of-the-art results in diverse evaluation scenarios:

  • Streaming Efficiency: On vLLM + CUDA Graph, Qwen3-TTS-12Hz-0.6B achieves first-packet latency of 97 ms (93 ms LM + 4 ms tokenizer), and the 1.7B variant achieves 101 ms.
  • Zero-Shot Cloning (Seed-TTS test, Word Error Rate ↓):
    • Seed-TTS baseline: 1.12 (ZH) | 2.25 (EN)
    • 25Hz-1.7B: 1.10 | 1.49
    • 12Hz-1.7B: 0.77 | 1.24
  • Multilingual Test (10 languages): Qwen3-TTS outperforms MiniMax-Speech and ElevenLabs on 6/10 languages for intelligibility (WER↓) and all languages for speaker similarity (SIM↑).
  • Controllable Generation (InstructTTSEval), Zh (12Hz-1.7B-VD):
    • APS↑ = 85.2%
    • DSD↑ = 81.1%
    • RP↑ = 65.1%
    • Target-Speaker Editing: +28% APS over GPT-4o-mini-tts (ZH)
  • Long Speech Synthesis (≥10 min), WER↓:
    • VibeVoice: 22.6 (ZH) | 1.78 (EN)
    • 25Hz-1.7B: 1.517 | 1.225
  • Cross-Lingual Cloning (zh→ko, MixER↓):
    • CosyVoice3: 14.4%
    • Qwen3-TTS: 4.82%

This suggests that Qwen3-TTS excels not only in low-latency generation but also in high-fidelity voice preservation, multilingual accuracy, style transfer, and robustness for extended-duration synthesis.

6. Deployment, Streaming Pipeline, and Licensing

Deployment utilizes a streaming inference pipeline comprising:

  • LM decoding with vLLM V0 (torch.compile + CUDA Graph)
  • Token-to-waveform conversion using block-wise DiT + BigVGAN (25Hz) or a purely causal ConvNet decoder (12Hz)
  • Service SLA for API: first-packet emission ≈ 100 ms, steady-state real-time factor (RTF) ≤ 0.5 at six-way concurrency.
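The SLA figures above decompose straightforwardly: first-packet latency is the LM time plus tokenizer time, and RTF is compute time over audio duration. A quick check of both (the 4.0 s compute figure in the RTF example is hypothetical):

```python
def first_packet_ms(lm_ms, tokenizer_ms):
    """Time to emit the first audio packet: LM decode + tokenizer decode."""
    return lm_ms + tokenizer_ms

def real_time_factor(compute_sec, audio_sec):
    """RTF = compute time / audio duration; <= 0.5 means audio is produced
    at least twice as fast as real time."""
    return compute_sec / audio_sec

latency = first_packet_ms(93, 4)                # 97 ms, the 12Hz-0.6B figure
meets_sla = real_time_factor(4.0, 10.0) <= 0.5  # hypothetical 4 s for 10 s audio
```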

Released artifacts (Apache 2.0 license) include:

  • Qwen3-TTS models (0.6B, 1.7B sizes; both 12 Hz and 25 Hz tokenization variants)
  • Both tokenizers (25Hz & 12Hz)
  • Seamless integration with Qwen-Audio via a shared audio encoder

A plausible implication is that the public release of all components under a permissive license allows rapid experimentation, benchmarking, and integration in commercial and community contexts.

7. Significance and Synthesis

Qwen3-TTS Series constitutes a unified framework addressing multiple long-standing challenges in TTS: zero-shot voice cloning with minimal reference, fine-grained control via natural language or symbolic style tokens, robust and balanced multilingual performance, and streaming generation with extremely low latency. The dual-track LM and hybrid discrete tokenization enable a design that balances expressive capacity with practical efficiency, yielding leading results across intelligibility, similarity, and style controllability benchmarks (Hu et al., 22 Jan 2026).
