Qwen3-TTS: Advanced Multilingual Speech Synthesis
- Qwen3-TTS is an advanced, open-source multilingual text-to-speech system that utilizes a dual-track autoregressive architecture for real-time, streaming speech synthesis.
- It supports zero-shot voice cloning and detailed prosody control using a 3-second reference mechanism and descriptive style inputs across 10 languages.
- The model suite achieves low latency and high concurrency through efficient tokenizers and a causal Code2Wav decoder, making it ideal for both research and production.
Qwen3-TTS is a family of advanced, open-source, multilingual text-to-speech (TTS) models designed for highly controllable, robust, and real-time streaming speech synthesis. Leveraging over 5 million hours of training data across 10 languages, Qwen3-TTS introduces a dual-track autoregressive LLM architecture and novel speech tokenizers that enable state-of-the-art performance in zero-shot voice cloning, prosody and style control, cross-lingual speaking, and long-form streaming synthesis. It is released under an Apache 2.0 license and is closely integrated with the broader Qwen3-Omni multimodal framework, providing high concurrency and low latency suitable for production and research deployments (Hu et al., 22 Jan 2026, Xu et al., 22 Sep 2025).
1. Architectural Overview
Qwen3-TTS extends the Qwen3 LLM with a dual-track autoregressive framework. At each decoding step, the model concatenates a textual track (input text tokens) and an acoustic track (discrete speech tokens derived from a dedicated TTS tokenizer). Upon receiving a new text token, the model immediately predicts the corresponding block of acoustic tokens, interleaving text- and audio-token generation for real-time streaming synthesis.
The joint model is factorized as

$$P(\mathbf{a}, \mathbf{t}) \;=\; \prod_{i} P\big(t_i \mid \mathbf{t}_{<i},\, \mathbf{a}_{<s(i)}\big)\, P\big(\mathbf{a}_{s(i):s(i+1)} \mid \mathbf{t}_{\le i},\, \mathbf{a}_{<s(i)}\big),$$

where $\mathbf{a}$ and $\mathbf{t}$ are the acoustic and textual streams and $s(\cdot)$ denotes the frame-to-token correspondence (form paraphrased from report) (Hu et al., 22 Jan 2026).
The generated acoustic tokens are immediately streamed to a Code2Wav decoder—either based on a block-wise Diffusion Transformer (DiT) or a lightweight causal ConvNet—enabling waveform reconstruction with extremely low latency.
In the Qwen3-Omni framework, speech synthesis is handled by a two-part Thinker–Talker MoE transformer stack. The Thinker module produces high-level multimodal features and intermediate text outputs. The Talker module, a 3B MoE transformer (0.3B activated parameters per token), autoregressively predicts speech codebook indices and feeds them to a causal ConvNet for waveform synthesis, supporting fully decoupled text and audio styling (Xu et al., 22 Sep 2025).
2. Voice Cloning and Controllability
Qwen3-TTS provides two primary mechanisms for speech synthesis control: 3-second reference voice cloning and description-based (instructional) prosody and style control.
3-Second Voice Cloning: A learnable speaker encoder maps a 3-second reference waveform to a fixed-dimensional speaker embedding $e_s$. This embedding is incorporated into the LM hidden state, either via concatenation or a learned projection, to condition the acoustic token distribution, enabling accurate reproduction of timbre and speaker identity from brief samples (Hu et al., 22 Jan 2026).
Description-Based Voice Design: Prosodic, energetic, and stylistic attributes can be controlled by prepending natural-language instructions in ChatML format. During training, text–speech pairs are annotated with human-written style descriptions (e.g., pitch, energy, speaking rate). A second style encoder embeds the description as $e_d$, which is combined with the LM context, yielding a form

$$\tilde{h} = h + W_s e_s + W_d e_d$$

with learned projection matrices $W_s$ and $W_d$ (form paraphrased from report) (Hu et al., 22 Jan 2026).
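The conditioning step can be sketched numerically. All shapes and the random projection matrices below are assumptions standing in for the learned components described in the report.

```python
# Minimal sketch of conditioning the LM hidden state on speaker and
# style embeddings via learned projections (one plausible additive
# form: h_cond = h + W_s @ e_s + W_d @ e_d). Dimensions and weights
# here are illustrative placeholders, not the model's actual values.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_spk, d_style = 1024, 256, 256  # assumed dimensions

W_s = rng.standard_normal((d_model, d_spk)) * 0.02    # speaker projection
W_d = rng.standard_normal((d_model, d_style)) * 0.02  # style projection

h = rng.standard_normal(d_model)    # LM hidden state at one step
e_s = rng.standard_normal(d_spk)    # speaker embedding (3-s reference)
e_d = rng.standard_normal(d_style)  # style-description embedding

h_cond = h + W_s @ e_s + W_d @ e_d  # conditioned hidden state
print(h_cond.shape)  # (1024,)
```

Because the two embeddings enter through separate projections, speaker identity and speaking style can be manipulated independently, which is what enables compositional voice design.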
This architecture allows fine-grained, compositional manipulation of voice attributes, enabling both novel speaker creation and flexible style-based synthesis.
3. Speech Tokenizers and Waveform Decoding
Qwen3-TTS employs two discrete tokenizers, each trading off semantic abstraction, bitrate, and latency:
| Tokenizer Name | Codebook Type / Size | Frame Rate | Bitrate | Latency | Waveform Decoder |
|---|---|---|---|---|---|
| Qwen-TTS-Tokenizer-25Hz | Single VQ, 32,768 entries (15 bit) | 25 Hz | 375 bit/s | 320 ms block | Block-wise DiT + BigVGAN |
| Qwen-TTS-Tokenizer-12Hz | RVQ, 16 codebooks × 2,048 entries (11 bit each) | 12.5 Hz | 2,200 bit/s | 97 ms | Causal ConvNet |
25 Hz Tokenizer: Based on a single-codebook vector quantizer (32,768 entries, 15 bits/token), this codec integrates natively with the Qwen-Audio stack. After streaming tokens are grouped into 8-token (320 ms) blocks, a block-wise Diffusion Transformer with Flow Matching and a BigVGAN vocoder reconstructs the waveform. Look-back/look-ahead buffers of 3/1 blocks balance quality against chunked decoding latency.
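The 25 Hz tokenizer's numbers are self-consistent and can be checked directly; the arithmetic below only restates figures from the text.

```python
# Sanity check of the 25 Hz tokenizer figures: 25 tokens/s at
# 15 bits/token gives the stated 375 bit/s, and an 8-token block at
# 40 ms/token spans the stated 320 ms.
frame_rate_hz = 25
bits_per_token = 15  # i.e. a codebook of 2**15 = 32,768 entries
block_tokens = 8

bitrate = frame_rate_hz * bits_per_token        # bit/s
block_ms = block_tokens * 1000 / frame_rate_hz  # ms per block
print(bitrate, block_ms)  # 375 320.0
```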
12 Hz Tokenizer: A multi-codebook Residual VQ with 16 quantizers (2,048 entries and 11 bits per codebook, 176 bits/frame), a fully causal encoder/decoder, and a streaming-friendly design. Semantic codes are guided by WavLM features, and acoustic details are modeled by a 15-layer RVQ. Waveform packets (4 tokens = 320 ms) are decoded immediately with a causal ConvNet (≈20 layers, GLU activations), enabling first-packet latency as low as 97 ms (Hu et al., 22 Jan 2026, Xu et al., 22 Sep 2025).
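The residual-quantization idea can be sketched in a few lines. The codebooks below are random placeholders, so reconstruction quality is not meaningful; the point is the staged residual structure and the resulting bit budget.

```python
# Toy residual vector quantization (RVQ) in the spirit of the 12 Hz
# tokenizer: each of 16 codebooks quantizes the residual left over by
# the previous stage, and 16 codes x 11 bits gives 176 bits/frame.
# Codebooks are random stand-ins, purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim, n_q, codebook_size = 64, 16, 2048  # assumed latent dim; real sizes differ

codebooks = rng.standard_normal((n_q, codebook_size, dim)) * 0.1


def rvq_encode(x):
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual -= cb[idx]  # the next stage models what is left over
    return codes


def rvq_decode(codes):
    # Reconstruction is the sum of the selected codewords across stages.
    return sum(codebooks[q][i] for q, i in enumerate(codes))


x = rng.standard_normal(dim)
codes = rvq_encode(x)
print(len(codes), len(codes) * 11)  # 16 176
```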
In the Qwen3-Omni Talker, each 80 ms frame is encoded as a tuple of discrete codes, and a dense autoregressive head (“MTP”, 80M parameters) predicts the remaining codebooks at each step. The causal ConvNet replaces block-wise diffusion, rendering the waveform in a single causal pass per frame and improving latency by more than an order of magnitude (Xu et al., 22 Sep 2025).
4. Training Data, Pre-training, and Multilinguality
Qwen3-TTS is trained on over 5 million hours of speech data spanning 10 languages: English, Chinese, German, Italian, Spanish, Portuguese, Japanese, Korean, French, and Russian.
Training pipeline:
- Stage S1: General pre-training on all data (text context up to 8,192 tokens).
- Stage S2: High-quality fine-tuning on curated subsets to suppress hallucination and boost fidelity.
- Stage S3: Long-context extension (up to 32,768 tokens), upsampling long utterances for context robustness.
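The three-stage pipeline can be summarized as a configuration sketch. Field and stage names below are hypothetical, not from the Qwen3-TTS codebase; only the data scopes and context lengths come from the text.

```python
# Hypothetical summary of the three training stages; keys are
# illustrative, but data scope and context lengths follow the report.
stages = [
    {"name": "S1_pretrain", "data": "all", "max_text_tokens": 8_192},
    {"name": "S2_finetune", "data": "curated high-quality subsets", "max_text_tokens": 8_192},
    {"name": "S3_long_context", "data": "long utterances (upsampled)", "max_text_tokens": 32_768},
]

for stage in stages:
    print(stage["name"], stage["max_text_tokens"])
```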
Augmentations include ASR-robust pre-training (25 Hz tokenizer) and WavLM-based semantic supervision (12 Hz tokenizer). Targeted oversampling methods correct language imbalances, providing strong zero-shot and cross-lingual capabilities (Hu et al., 22 Jan 2026, Xu et al., 22 Sep 2025).
5. Benchmarking and Comparative Evaluation
Qwen3-TTS achieves competitive or state-of-the-art results across a broad spectrum of evaluation settings:
- Streaming Latency: Qwen3-TTS-12Hz-0.6B achieves first-packet latency (FP) of 97 ms (TTFP=93 ms, tokenizer decode=4 ms, RTF≈0.29); Qwen3-TTS-25Hz-1.7B, FP=150 ms (TTFP=125 ms, tokenizer decode=25 ms) (Hu et al., 22 Jan 2026).
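The first-packet latency figures decompose as model time-to-first-packet plus tokenizer decode time, which the reported numbers confirm:

```python
# Checking the reported first-packet (FP) latency decomposition:
# FP = time-to-first-packet (model) + tokenizer decode time.
fp_12hz = 93 + 4    # ms, Qwen3-TTS-12Hz-0.6B -> 97 ms
fp_25hz = 125 + 25  # ms, Qwen3-TTS-25Hz-1.7B -> 150 ms
print(fp_12hz, fp_25hz)  # 97 150
```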
- Zero-Shot Cloning (Seed-TTS eval, WER): Qwen3-TTS-12Hz-1.7B achieves 0.77 (ZH) and 1.24 (EN), outperforming Seed-TTS and matching or exceeding CosyVoice 3 (Hu et al., 22 Jan 2026).
- Multilingual Synthesis: Best WER in 6/10 languages (e.g., Chinese: 0.928 vs. MiniMax 2.252; SIM: 0.799 vs. 0.677), with an overall speaker similarity of 0.75 across languages (Hu et al., 22 Jan 2026).
- Cross-Lingual Voice Transfer: Substantial WER/CER gains in cross-language scenarios (e.g., ZH→KO: 4.82 vs. 14.4 for CosyVoice 3) (Hu et al., 22 Jan 2026).
- Voice Design/Instructional Synthesis (InstructTTSEval): Qwen3-TTS-12Hz-1.7B-VD sets a new open-source SOTA for Voice Design (ZH: APS 85.2%; EN: APS 82.9%).
- Long-Form Streaming: Qwen3-TTS-25Hz-1.7B-CustomVoice yields WERs of 1.517 (long-ZH) and 1.225 (long-EN) for input segments exceeding 10 minutes, outperforming chunk-based and non-streaming baselines by more than 3× in some scenarios (Hu et al., 22 Jan 2026).
Comparative studies within Qwen3-Omni confirm that Qwen3-TTS matches or outperforms closed-source systems including Seed-TTS, CosyVoice, and ElevenLabs, with significantly lower latency and quality maintained across streaming contexts (Xu et al., 22 Sep 2025).
6. Real-Time Deployment and Scaling
Qwen3-TTS is optimized for low-latency, streaming inference in high-concurrency settings. Its streaming workflow involves:
- Immediate acoustic token emission in lock-step with incoming text.
- Real-time waveform decoding enabled by the causal Code2Wav ConvNet.
- Sub-100 ms first-packet latency achievable with 0.6B-parameter models; steady-state RTF consistently below 0.5, even under 6 concurrent requests (Hu et al., 22 Jan 2026, Xu et al., 22 Sep 2025).
The Qwen3-Omni integration supports asynchronous prefill and chunk-wise generation, preserving consistent end-to-end latencies under concurrent usage (e.g., RTF=0.56 with 4 users, 0.66 with 6) (Xu et al., 22 Sep 2025).
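Real-time factor (RTF) is synthesis wall time divided by audio duration, so values below 1.0 mean faster-than-real-time generation. A quick check of what the reported concurrency figures imply:

```python
# RTF = wall-clock synthesis time / audio duration. At the reported
# RTF of 0.66 with 6 concurrent users, 10 s of audio takes ~6.6 s
# of wall time per stream; any RTF < 1.0 keeps up with playback.
def wall_time(audio_seconds, rtf):
    return audio_seconds * rtf


print(wall_time(10.0, 0.66))  # ~6.6 s for 10 s of audio
print(wall_time(10.0, 0.56))  # ~5.6 s at 4 concurrent users
```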
7. Licensing and Open-Source Release
The full Qwen3-TTS model suite—including the Qwen-TTS-Tokenizer-25Hz and Qwen-TTS-Tokenizer-12Hz, pretrained weights, and inference code—is distributed under the Apache 2.0 license, supporting both academic and commercial applications. Source repositories and deployment tools are available at https://github.com/QwenLM/Qwen3-TTS (Hu et al., 22 Jan 2026).
Qwen3-TTS represents a unified approach to controllable, robust, real-time, and highly multilingual TTS synthesis. Its architecture, training pipeline, and empirical performance establish a new baseline for open-source speech synthesis, enabling downstream applications in voice cloning, dialogue agents, cross-lingual content generation, and multimodal integration (Hu et al., 22 Jan 2026, Xu et al., 22 Sep 2025).