Qwen3-TTS Series: Advanced Multilingual TTS
- Qwen3-TTS Series is a family of advanced multilingual text-to-speech models that integrate zero-shot voice cloning, description-based control, and ultra-low latency synthesis through a dual-track autoregressive LM architecture.
- It employs innovative dual-track processing of text and acoustic tokens, paired with two speech tokenizers (25Hz and 12Hz variants), enabling efficient, high-fidelity streaming and rapid voice cloning.
- Benchmark results demonstrate state-of-the-art performance in intelligibility, speaker similarity, and controllable prosody, backed by extensive multilingual training and robust data stratification.
Qwen3-TTS Series is a family of advanced multilingual, controllable, robust, and streaming text-to-speech (TTS) models designed to unify zero-shot voice cloning, description-based control, and ultra-low-latency synthesis within a single autoregressive framework. The Qwen3-TTS models leverage a dual-track language model (LM) architecture with two specialized speech tokenizers to achieve state-of-the-art performance across objective and subjective benchmarks. The series supports precise style conditioning, rapid voice cloning from mere seconds of reference speech, and high-fidelity streaming generation, all backed by large-scale multilingual training data. All primary variants, model weights, and tokenizers are released under the Apache 2.0 license, fostering broad community adoption and research (Hu et al., 22 Jan 2026).
1. Dual-Track Autoregressive LM Architecture
Qwen3-TTS employs a decoder-only Transformer backbone that interleaves two distinct tracks:
- Text Track: Standard Qwen token embeddings representing input text, with a special “no-text” token signifying text sequence completion.
- Acoustic Track: Speech tokens discretized by one of the Qwen-TTS-Tokenizers.
At each LM step $t$, the model ingests the current text token (or the "no-text" token) together with the previously generated acoustic tokens and immediately predicts the next group of acoustic tokens. For Qwen3-TTS-25Hz, this group is a single code; for Qwen3-TTS-12Hz, it comprises 16 codebook entries (one semantic plus 15 residual). In the 12Hz variant, code prediction is two-staged: a lightweight linear head predicts the first (semantic) codebook, then a Multi-Token Prediction (MTP) block, conditioned on the semantic code, predicts the residual codebooks in parallel.
Joint training minimizes the negative log-likelihood of the true acoustic sequence given the text and any control tokens:

$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(a_t \mid a_{<t},\, x,\, c\right)$$

where each term is standard cross-entropy against the one-hot ground-truth code(s) $a_t$, with $x$ the text tokens and $c$ the control tokens.
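The training objective reduces to per-step cross-entropy over discrete acoustic codes. A minimal NumPy sketch, with a toy vocabulary and random logits standing in for the Transformer's outputs (real codebooks have up to 32,768 entries):

```python
import numpy as np

def nll_loss(logits, targets):
    """Negative log-likelihood over acoustic codes: log-softmax + gather."""
    # logits: (T, V) unnormalized scores; targets: (T,) ground-truth code ids
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 3 LM steps, vocabulary of 8 acoustic codes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8))
targets = np.array([1, 4, 7])
loss = nll_loss(logits, targets)
```

In the full model, the logits at each step come from the decoder-only backbone conditioned on the interleaved text and acoustic tracks; for the 12Hz variant, the same loss is summed over the semantic head and the MTP residual heads.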
2. Speech Tokenization: Qwen-TTS-Tokenizer-25Hz & 12Hz
Qwen3-TTS is distinguished by its use of two discrete speech tokenizers:
- Qwen-TTS-Tokenizer-25Hz: A single-codebook vector quantization (VQ) codec with $2^{15} = 32{,}768$ entries (15 bits per code) at 25 frames per second (FPS). The resulting bitrate is $15 \times 25 = 375$ bit/s.
Streaming waveform reconstruction is achieved via block-wise Flow Matching diffusion (DiT) and BigVGAN, with codes grouped in blocks of 8 (320 ms window), a 4-block receptive field in DiT (3 past, 1 lookahead), and BigVGAN providing a fixed 130 ms right-context lookahead.
- Qwen-TTS-Tokenizer-12Hz: Two-stream residual vector quantization (RVQ) with 1 semantic codebook (size 2,048, 11 bits) and 15 residual codebooks (also size 2,048). Total codes per frame: 16; per-frame bits: $16 \times 11 = 176$. At 12.5 FPS, the bitrate is $176 \times 12.5 = 2{,}200$ bit/s ≈ 2.2 kbit/s.
The encoder and ConvNet decoder are purely causal; each 80 ms frame is decoded immediately, with inference amortized via emission of 4 tokens per packet (320 ms total latency).
| Variant | Bitrate | Tokenizer Type | Latency (first packet) |
|---|---|---|---|
| 25Hz | 375 bit/s | Single-codebook VQ | Block-wise DiT, ~320 ms |
| 12Hz | 2.2 kbit/s | 16-codebook RVQ + causal ConvNet | 97 ms (93 ms LM + 4 ms tokenizer) |
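The bitrate and packet figures above follow from simple arithmetic, sketched here as a quick check:

```python
# 25Hz tokenizer: single codebook, 15 bits per code, 25 frames per second.
bitrate_25hz = 15 * 25            # bit/s

# 12Hz tokenizer: 16 codebooks x 11 bits = 176 bits per frame, at 12.5 FPS.
bitrate_12hz = 16 * 11 * 12.5     # bit/s

# Streaming granularity: 4 tokens per packet at 80 ms per frame
# gives 320 ms of audio per emitted packet.
packet_audio_ms = 4 * 80
```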
3. Voice Cloning and Description-Based Control
Qwen3-TTS provides two leading mechanisms for speaker and style control:
- 3-Second Voice Cloning: A lightweight speaker encoder, trained jointly with the backbone LM, extracts a fixed hidden speaker embedding from a 3-second reference utterance. During fine-tuning (Stage 3), the base LM can be adapted to a target speaker by minimizing the negative log-likelihood conditioned on that embedding:

$$\mathcal{L}_{\text{spk}} = -\sum_{t} \log p_\theta\!\left(a_t \mid a_{<t},\, x,\, s\right)$$

where $s$ denotes the speaker embedding.
- Description-Based Control: Natural language prompts, using the ChatML format (e.g., "a warm elderly male voice with slow tempo"), are prepended to the text track as learned control token embeddings (e.g., 〈style:warm〉, 〈pitch:+2〉). The LM is conditioned on these tokens for fine-grained manipulation of output prosody, timbre, and style.
A plausible implication is that Qwen3-TTS’s design supports both the synthesis of entirely novel voices and highly specific speaker, style, and prosody edits based on description or small reference samples.
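A minimal sketch of how a description prompt might be assembled in ChatML form; the role layout and placement of the description are assumptions for illustration, not the released prompt format:

```python
def build_control_prompt(description: str, text: str) -> str:
    """Assemble a ChatML-style prompt that conditions synthesis on a
    natural-language voice description (layout hypothetical)."""
    return (
        "<|im_start|>system\n"
        f"{description}<|im_end|>\n"
        "<|im_start|>user\n"
        f"{text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_control_prompt(
    "a warm elderly male voice with slow tempo",
    "Welcome to the evening news.",
)
```

In the actual model, learned control token embeddings (e.g., 〈style:warm〉) would be prepended to the text track after tokenizing such a prompt.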
4. Multilingual Training and Data Stratification
Pretraining comprises over 5 million hours of speech spanning 10 languages, drawn from corpora including CommonVoice, LibriSpeech, and additional audiobook/news sources. Training is organized into three stages:
- S1—General Pretraining: Large multilingual dataset pooled without adapters; model learns parameter sharing across languages.
- S2—Quality Stratification: Data scored using ASR models; high-quality utterances are upsampled, noisy data downsampled for robustness.
- S3—Long-Context Training: Maximum text+code input length increased from 8,192 to 32,768 tokens; long speech segments are upsampled to improve handling of extended utterances.
Balanced sampling ensures comprehensive non-English language representation without dedicated adapters.
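The S2 quality-stratification step can be sketched as score-dependent sampling weights. The thresholds and weight values below are hypothetical, chosen only to illustrate upsampling clean utterances and downsampling noisy ones:

```python
def sampling_weights(asr_scores, lo=0.3, hi=0.8):
    """Map per-utterance ASR quality scores in [0, 1] to normalized
    sampling weights (thresholds and multipliers hypothetical)."""
    raw = []
    for score in asr_scores:
        if score >= hi:
            raw.append(2.0)    # upsample high-quality speech
        elif score <= lo:
            raw.append(0.25)   # downsample noisy speech
        else:
            raw.append(1.0)    # keep mid-quality speech at base rate
    total = sum(raw)
    return [w / total for w in raw]

weights = sampling_weights([0.9, 0.1, 0.5])
```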
5. Objective and Subjective Performance Benchmarks
Qwen3-TTS demonstrates state-of-the-art results in diverse evaluation scenarios:
- Streaming Efficiency: On vLLM + CUDA Graph, Qwen3-TTS-12Hz-0.6B achieves first-packet latency of 97 ms (93 ms LM + 4 ms tokenizer), and the 1.7B variant achieves 101 ms.
- Zero-Shot Cloning (Seed-TTS test, Word Error Rate ↓):
- Seed-TTS baseline: 1.12 (ZH) | 2.25 (EN)
- 25Hz-1.7B: 1.10 | 1.49
- 12Hz-1.7B: 0.77 | 1.24
- Multilingual Test (10 languages): Qwen3-TTS outperforms MiniMax-Speech and ElevenLabs on 6/10 languages for intelligibility (WER↓) and all languages for speaker similarity (SIM↑).
- Controllable Generation (InstructTTSEval), Zh (12Hz-1.7B-VD):
- APS↑ = 85.2%
- DSD↑ = 81.1%
- RP↑ = 65.1%
- Target-Speaker Editing: +28% APS over GPT-4o-mini-tts (ZH)
- Long Speech Synthesis (≥10 min), WER↓:
- VibeVoice: 22.6 (ZH) | 1.78 (EN)
- 25Hz-1.7B: 1.517 | 1.225
- Cross-Lingual Cloning (zh→ko, MixER↓):
- CosyVoice3: 14.4%
- Qwen3-TTS: 4.82%
This suggests that Qwen3-TTS excels not only in low-latency generation but also in high-fidelity voice preservation, multilingual accuracy, style transfer, and robustness for extended-duration synthesis.
6. Deployment, Streaming Pipeline, and Licensing
Deployment utilizes a streaming inference pipeline comprising:
- LM decoding with vLLM V0 (torch.compile + CUDA Graph)
- Token-to-waveform conversion using block-wise DiT + BigVGAN (25Hz) or a purely causal ConvNet decoder (12Hz)
- Service SLA for API: first-packet emission ≈ 100 ms, steady-state real-time factor (RTF) ≤ 0.5 at six-way concurrency.
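The latency and throughput targets above compose from the per-component figures reported earlier; a quick arithmetic check, assuming the 12Hz variant's 4-frame packets of 80 ms each:

```python
# First-packet latency for the 12Hz 0.6B variant: LM decode + tokenizer.
first_packet_ms = 93 + 4          # reported total: 97 ms

# Each steady-state packet carries 4 frames x 80 ms = 320 ms of audio.
packet_audio_ms = 4 * 80

# To hold real-time factor (RTF) <= 0.5, each 320 ms packet must be
# produced in at most half that much wall-clock compute time.
max_compute_ms = packet_audio_ms * 0.5
```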
Released artifacts (Apache 2.0 license) include:
- Qwen3-TTS models (0.6B, 1.7B sizes; both 12 Hz and 25 Hz tokenization variants)
- Both tokenizers (25Hz & 12Hz)
- Seamless integration with Qwen-Audio via a shared audio encoder
A plausible implication is that the public release of all components under a permissive license allows rapid experimentation, benchmarking, and integration in commercial and community contexts.
7. Significance and Synthesis
Qwen3-TTS Series constitutes a unified framework addressing multiple long-standing challenges in TTS: zero-shot voice cloning with minimal reference, fine-grained control via natural language or symbolic style tokens, robust and balanced multilingual performance, and streaming generation with extremely low latency. The dual-track LM and hybrid discrete tokenization enable a design that balances expressive capacity with practical efficiency, yielding leading results across intelligibility, similarity, and style controllability benchmarks (Hu et al., 22 Jan 2026).