
Qwen3-TTS Technical Report

Published 22 Jan 2026 in cs.SD, cs.CL, and eess.AS | (2601.15621v1)

Abstract: In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.

Summary

  • The paper presents a dual-track autoregressive architecture that decouples text and acoustic tokens to enable low-latency, high-quality streaming synthesis.
  • It leverages two innovative tokenizers—one at 25Hz for full-fidelity streaming and another at 12Hz for ultra-low latency via hierarchical codebook assignment.
  • Extensive evaluations show significant improvements in WER, speaker similarity, and long-form consistency, establishing state-of-the-art performance in multilingual TTS.

Qwen3-TTS: Multilingual, Controllable, and Streaming Text-to-Speech at Scale

Introduction and Model Positioning

Qwen3-TTS is a family of advanced multilingual TTS models characterized by controllability, robustness, state-of-the-art voice cloning, and ultra-low-latency streaming synthesis. It is designed for broad coverage across languages and speaker traits, robust instruction-following for fine-grained output control, and seamless integration with LLM-based architectures for large-scale, real-time voice applications.

Figure 1: Qwen3-TTS is a multilingual, controllable, robust, and streaming text-to-speech model that supports diverse tasks such as voice cloning, creation, and control with various complex text inputs.

The model is trained on over 5 million hours of curated speech spanning 10 languages. Through a dual-track language modeling framework, it decouples text and acoustic processing, supporting immediate inference with low first-packet latency (as low as 97 ms). Qwen3-TTS is open-sourced with permissive licensing for research and development purposes.

Tokenization and Discretization

A central design choice in Qwen3-TTS is explicit speech tokenization via two complementary codecs:

Figure 2: Overview of Qwen-TTS tokenizers, detailing the architecture and workflow in 25Hz and 12Hz modes.

Qwen-TTS-Tokenizer-25Hz is a single-codebook codec operating at 25 Hz, integrating semantic and acoustic information using a two-stage training pipeline. The encoder is fine-tuned from Qwen2-Audio with ASR supervision; vector quantization is performed at an intermediate layer, and reconstruction is optimized through a mel-spectrogram decoder. A diffusion transformer (DiT) enables block-wise streaming waveform synthesis with efficient context management. This design is well-suited for full-fidelity streaming but has inherent temporal resolution and latency trade-offs due to codebook composition and DiT lookahead.
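To make the lookahead trade-off concrete, here is a minimal sketch of block-wise streaming decoding. The chunk and lookahead sizes are hypothetical (the paper does not specify them here); the point is that the first block cannot be emitted until chunk-plus-lookahead tokens exist, which is where the 25Hz path pays its latency cost.

```python
# Illustrative sketch, NOT the paper's implementation: block-wise streaming
# with lookahead, in the spirit of a chunked DiT decoder. `chunk` and
# `lookahead` are assumed values chosen for illustration.

def blockwise_chunks(tokens, chunk=8, lookahead=4):
    """Yield (block, future_context) pairs: each block of `chunk` tokens is
    decoded together with up to `lookahead` future tokens so that block
    boundaries stay smooth."""
    out = []
    for start in range(0, len(tokens), chunk):
        block = tokens[start:start + chunk]
        future = tokens[start + chunk:start + chunk + lookahead]
        out.append((block, future))
    return out

tokens = list(range(20))
chunks = blockwise_chunks(tokens)

# The first block waits for chunk + lookahead tokens; at 25 tokens/s that is
# (8 + 4) / 25 = 0.48 s of token time before any audio can be synthesized.
first_block_wait_s = (8 + 4) / 25
```

A fully causal decoder (as in the 12Hz path) sets the lookahead to zero, which is exactly why it can emit its first packet so much earlier.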

Qwen-TTS-Tokenizer-12Hz uses a multi-codebook design at 12.5 Hz for extreme bitrate reduction and minimal latency. Semantic and acoustic representations are disentangled through hierarchical codebook assignment (with WavLM as semantic teacher for the first codebook and a 15-layer RVQ for acoustic detail). A GAN-driven adversarial framework sharpens generation fidelity. Crucially, this codec supports full left-context causal streaming, enabling direct synthesis upon token availability—a critical factor in achieving ultra-low streaming latency.
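The "extreme bitrate reduction" claim can be sanity-checked with simple arithmetic. The frame rate (12.5 Hz) and codebook count (1 semantic + 15 RVQ layers) come from the paper; the per-codebook size of 1024 entries is an assumption for illustration only.

```python
import math

frame_rate_hz = 12.5   # token steps per second (from the paper)
num_codebooks = 16     # 1 semantic codebook + 15-layer acoustic RVQ
codebook_size = 1024   # ASSUMED entries per codebook (not stated in this summary)

bits_per_code = math.log2(codebook_size)                      # 10 bits per index
bitrate_bps = frame_rate_hz * num_codebooks * bits_per_code   # 2000 bits/s
```

Under that assumption the full 16-layer stream is only about 2 kbit/s, orders of magnitude below raw waveform bitrates, which is what makes the lightweight causal ConvNet decoder viable.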

Model Architecture and Training

Figure 3: The overview of Qwen3-TTS including its dual-track LM design and streaming pipeline, with dashed lines showing optional model components.

Qwen3-TTS employs a dual-track autoregressive LM architecture. Text and discrete speech tokens are processed in parallel, with learnable speaker embeddings for identity control. Textual tokens are immediately decoded to acoustic tokens, which are then mapped to waveforms via the Code2Wav module (chunkwise DiT/BigVGAN for the 25Hz codec, lightweight causal ConvNet for 12Hz).
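The dual-track idea can be sketched as two parallel token lanes advancing position by position. This toy layout (including the padding scheme) is a simplification for intuition; the actual scheduling of text versus speech tokens in Qwen3-TTS is more involved.

```python
# Toy dual-track layout, assuming a simple pad-to-equal-length scheme
# (an illustrative simplification, not the paper's exact scheduler).

PAD = "<pad>"

def dual_track(text_tokens, speech_tokens):
    """Place text and speech tokens in parallel positions, padding the
    shorter track so both lanes share one timeline."""
    n = max(len(text_tokens), len(speech_tokens))
    text = text_tokens + [PAD] * (n - len(text_tokens))
    speech = speech_tokens + [PAD] * (n - len(speech_tokens))
    return list(zip(text, speech))

rows = dual_track(["hi", "there"], ["a", "b", "c"])
```

Because both lanes advance together, the model can begin emitting acoustic tokens while text is still arriving, which is the property the streaming pipeline depends on.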

Pretraining is divided into:

  • General Stage (S1): Initial pretraining on 5M+ hours of multilingual speech, aligning multilingual text with speech tokens.
  • High-Quality Stage (S2): Continued pretraining on filtered high-fidelity data to reduce hallucinations and improve output quality.
  • Long-Context Stage (S3): Increasing token length support up to 32,768 to enable long-form speech and complex prompt handling.

Post-training involves direct preference optimization (DPO), rule-based reward learning (GSPO), and lightweight speaker adaptation. ChatML serves as a unified, instruction-following text-speech interface.

Controllability and Streaming Efficiency

Qwen3-TTS provides advanced control primitives:

  • Cloning: Zero-shot voice (3-second reference) or in-context learning with text-speech pair, supporting robust, emotion-preserving cloning.
  • Voice Design: Creation and manipulation via natural language instructions, leveraging prompt engineering for style, prosody, and identity.
  • Streaming: Both codecs enable streaming output; the 12Hz codec achieves immediate first-packet emission with 4-token packets (320 ms of speech), while the 25Hz variant supports near-real-time output with blockwise chunking.

Under high concurrency, Qwen3-TTS maintains low latency and stable RTFs. The causal-only 12Hz variant is notably efficient for high-demand, low-latency production scenarios.
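The packet numbers above fit together as follows: 4 tokens at 12.5 Hz cover 320 ms of audio, so reaching a 97 ms first packet means the model must produce and decode that packet well under real time. The budget split below is a back-of-envelope sketch, not a measured breakdown.

```python
# Back-of-envelope check of the streaming numbers reported in the paper.

frame_rate_hz = 12.5   # 12Hz-codec token rate
packet_tokens = 4      # tokens per first packet

packet_audio_ms = packet_tokens / frame_rate_hz * 1000  # 320 ms of speech

# To emit that packet in ~97 ms wall-clock, generation + decoding must run
# at roughly this fraction of real time (a hypothetical budget, for intuition):
first_packet_ms = 97
required_rtf = first_packet_ms / packet_audio_ms  # ~0.30
```

In other words, the 12Hz pipeline has to synthesize speech at better than ~3x real time for the advertised first-packet latency to hold, which the causal ConvNet decoder makes feasible.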

Evaluation and Numerical Highlights

Speech Tokenizer Evaluation

  • Qwen-TTS-Tokenizer-25Hz matches or outperforms prominent semantic tokenizers on ASR benchmarks (CommonVoice, Fleurs), with the lowest WER in both English and Mandarin in its ASR-supervised variant, confirming effective semantic capture.
  • Qwen-TTS-Tokenizer-12Hz sets a new SOTA in speech reconstruction on LibriSpeech (STOI=0.96, UTMOS=4.16, speaker similarity SIM=0.95), with significant improvements over systems such as SpeechTokenizer, XCodec, XY-Tokenizer, Mimi, and FireredTTS 2 at a markedly lower bitrate.

Speech Generation

  • Zero-Shot Generation: Qwen3-TTS-12Hz-1.7B achieves WER of 0.77 on Chinese and 1.24 on English (Seed-TTS test set), significantly outperforming large commercial and open-source models (e.g., CosyVoice 3, MaskGCT).
  • Multilingual and Cross-Lingual: Qwen3-TTS dominates or matches prior SOTA in content intelligibility (lowest WER) in 6/10 target languages and displays the highest speaker similarity in all cases; for difficult cross-lingual setups (e.g., zh-to-ko), Qwen3-TTS produces a 66% error rate reduction over CosyVoice 3.
  • Controllability (Voice Design): Qwen3-TTS outperforms commercial (Hume) and specialized (VoiceSculptor) models in both Description-Speech Consistency (DSD) and Response Precision (RP) on InstructTTSEval.
  • Long-Form Consistency: 25Hz CustomVoice variant achieves lowest WER on 10-min+ long-form outputs (Chinese=1.517, English=1.225), outperforming chunk-based architectures such as Higgs-Audio-v2 and substantially surpassing VibeVoice on Mandarin.
  • Fine-Tuned Target Speaker: After speaker adaptation, Qwen3-TTS outperforms GPT-4o-Audio Preview in 7/10 target languages for WER (e.g., Japanese 3.88 vs. 5.00, Korean 1.74 vs. 2.76).

Implications and Outlook

Qwen3-TTS establishes a scalable, open-source blueprint for unified multilingual, controllable TTS with robust streaming and state-of-the-art voice modeling. Its integration of semantic-aware and acoustic-rich tokenization, along with a dual-track architecture, makes it directly suitable for omni-modal conversational agents, interactive voice assistants, and large-scale generative audio systems.

In contrast to the very fine-grained acoustic modeling favored in classical codec-driven TTS, Qwen3-TTS demonstrates that a carefully balanced semantic-acoustic codec yields superior generalization, lower error accumulation, and improved long-form robustness, particularly in cross-lingual and instruction-controlled settings.

Looking forward, the scalable architecture and open-source codebase enable rapid extension to additional languages, more granular prosody and emotion control, and integration with multi-modal LLMs for seamless audio-visual-linguistic interaction. This positions Qwen3-TTS as a core component for multi-function agentic AI systems, bridging the performance and latency gap between research-grade and production-grade TTS solutions.

Conclusion

Qwen3-TTS delivers a robust, versatile, and efficient solution for high-fidelity, low-latency text-to-speech in multilingual, multi-speaker, and controllable settings. With open access, SOTA empirical performance, and strong architectural design choices, it paves the way for future developments in next-generation audio AI, including unified speech-LLMs, closed-loop conversational agents, and universal deployment in real-time, multi-lingual human–computer interaction frameworks (2601.15621).

Explain it Like I'm 14

Overview

This paper introduces Qwen3-TTS, a family of text-to-speech (TTS) models that can turn written text into natural-sounding speech in many languages. These models are fast, can start speaking almost immediately, can copy a person’s voice from just 3 seconds of audio, and can follow style instructions like “speak calmly” or “use a cheerful tone.” They were trained on over 5 million hours of speech in 10 languages and are released for the community to use under the Apache 2.0 open-source license.

Key Objectives

The paper focuses on four main goals:

  • Make speech sound stable, natural, and human-like across different languages.
  • Allow fine-grained control over how the voice sounds (tone, speed, emotion) using simple text instructions.
  • Support quick voice cloning from a few seconds of audio and offer preset, high-quality voices.
  • Enable real-time “streaming” speech, so the model can start speaking with very low delay and keep speaking smoothly as text arrives.

Methods and Approach

To help explain the technical ideas, here are simple analogies and definitions used in the paper:

  • Tokens: Think of tokens as tiny building blocks. For TTS, speech is broken into small units (tokens) so the model can process and rebuild sound step by step.
  • Tokenizer: A tool that turns audio into tokens and back again. It’s like a translator that converts sound into symbols and then back to sound.
  • Codebook: A dictionary of symbols the model uses to represent sounds.
  • Hz (Hertz): Times per second. For example, 12.5 Hz means 12.5 token steps per second.
  • Autoregressive model: A “one-step-at-a-time” model that uses what it has already said to decide what to say next.
  • Streaming: The model starts speaking quickly and keeps producing audio in small packets, so you hear it almost right away instead of waiting for the whole sentence to be processed.
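The "one-step-at-a-time" idea above can be shown with a toy loop: each new token is a function of everything produced so far. The stand-in "model" here is just a lambda chosen for illustration.

```python
# Toy autoregressive generation: the next token depends on the whole
# sequence generated so far. The step function is a hypothetical stand-in
# for a real model.

def autoregressive_generate(start, step_fn, n_steps):
    seq = [start]
    for _ in range(n_steps):
        seq.append(step_fn(seq))  # next token sees all previous tokens
    return seq

# A trivial "model" that just reports how long the sequence is:
toy = autoregressive_generate(0, lambda seq: len(seq), 4)
# toy == [0, 1, 2, 3, 4]
```

A real TTS model works the same way in outline, except each step emits a speech token, which is why audio can begin streaming before the full sentence has been processed.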

The Qwen3-TTS system has two different tokenizers and a dual-track architecture:

Two Tokenizers (Ways to Represent Speech)

  1. Qwen-TTS-Tokenizer-25Hz
  • Works at 25 steps per second (25 Hz) with one codebook.
  • Mixes both meaning (what is being said) and sound details (how it’s said).
  • For rebuilding audio, it uses a “Diffusion Transformer” (DiT) with “Flow Matching” and a vocoder called BigVGAN.
  • Streaming is done in chunks: the model looks a bit ahead to keep speech smooth. This gives high quality but needs a little extra wait time before the first sound.
  2. Qwen-TTS-Tokenizer-12Hz
  • Works at 12.5 steps per second with multiple codebooks (layers of detail).
  • The first layer focuses on meaning; the other layers add sound texture (pitch, emotion, speaking style).
  • Rebuilds audio using a lightweight, fast decoder (a small causal ConvNet), so it can emit the first sound packet very quickly (about 97–101 ms).
  • Fully “left-context” streaming means it doesn’t need to peek ahead; it can speak as soon as it has enough tokens.
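The "layers of detail" idea is residual quantization: each codebook encodes whatever the previous layers missed. Here is a minimal scalar sketch (real RVQ works on vectors with learned codebooks; the codebooks below are made up for illustration).

```python
# Toy residual quantization on scalars. Each layer picks the codeword
# closest to the remaining residual, then later layers refine the error.
# The codebooks are hypothetical illustration values.

def rvq_encode(x, codebooks):
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]  # next layer only sees what is left over
    return codes, residual

codebooks = [
    [-1.0, 0.0, 1.0],      # coarse layer (think: semantic/broad shape)
    [-0.25, 0.0, 0.25],    # medium detail
    [-0.05, 0.0, 0.05],    # fine detail (think: texture, timbre)
]
codes, err = rvq_encode(0.8, codebooks)
```

The pattern mirrors the 12Hz tokenizer: the first layer carries the broad (semantic) content, and each additional layer shrinks the reconstruction error at a small bitrate cost.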

Dual-Track Architecture (Text + Audio Together)

  • The model reads text tokens and, almost at the same time, predicts audio tokens.
  • A speaker encoder helps keep the voice identity consistent (so if it’s cloning someone, it sounds like them).
  • The “Code2Wav” module turns audio tokens into sound you can hear.
  • In the 12Hz models, a hierarchical prediction scheme first predicts the main (semantic) codebook, then fills in the detailed ones with a Multi-Token Prediction module. This keeps voices expressive and consistent while staying fast.
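The hierarchical prediction step can be sketched as "semantic code first, detail codes second." Both heads below are stand-in lambdas; this shows only the control flow, not the actual Multi-Token Prediction module.

```python
# Toy hierarchical prediction: predict the semantic codebook for a frame,
# then let an MTP-style head fill the remaining acoustic codebooks in one
# pass. Both heads are hypothetical stand-ins.

def hierarchical_predict(semantic_head, mtp_head, context):
    semantic = semantic_head(context)            # main (meaning) code
    acoustic = mtp_head(context, semantic)       # detail codes, conditioned on it
    return [semantic] + acoustic

frame = hierarchical_predict(
    lambda ctx: sum(ctx) % 8,                         # stand-in semantic head
    lambda ctx, s: [(s + k) % 8 for k in range(1, 4)],  # stand-in 3-layer MTP head
    [1, 2, 3],
)
```

Predicting the semantic code first keeps content stable, while the single-pass fill of the detail codebooks avoids running the full autoregressive loop once per layer.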

Training Strategy

The training has two big phases:

  • Pre-training (three stages):
    • Stage 1: Train on 5+ million hours of multilingual speech to learn general TTS skills.
    • Stage 2: Continue training on higher-quality data to reduce errors caused by noisy data and improve sound quality.
    • Stage 3: Teach the model to handle very long inputs by increasing token length limits and practicing on longer speech.
  • Post-training (three stages):
    • Use Direct Preference Optimization (DPO) with human feedback to make outputs match what people prefer.
    • Add rule-based rewards (GSPO) to improve stability and control.
    • Lightweight speaker fine-tuning so the model can adopt specific voices more naturally and expressively.

Features

  • 3-second voice cloning, streaming voice cloning, and predefined voices.
  • Voice design: describe a new voice in text, and the model creates it (e.g., “soft, warm, slightly husky”).
  • Fine-grained control using simple instructions (speed, energy, emotion, pauses).
  • Robust streaming under load: stays stable even when many users are listening at the same time.

Main Findings

Here are the key results and why they matter:

  • Very low delay: The 12Hz models can start speaking in about 97–101 milliseconds. This feels almost instant.
  • Better reconstruction quality: The 12Hz tokenizer sets new records on standard sound quality tests (PESQ, STOI, UTMOS, speaker similarity), while being very efficient.
  • Strong zero-shot cloning: Without training on a specific voice, Qwen3-TTS-12Hz-1.7B achieved state-of-the-art word accuracy on English test sets and performed robustly across multiple languages.
  • Multilingual performance: Across 10 languages, Qwen3-TTS beats or matches top commercial systems in understanding (low word error) and voice similarity (keeps the same timbre and style).
  • Cross-lingual voice transfer: Keeps the same speaker identity across languages with lower error than prior models, especially in challenging pairs like Chinese-to-Korean.
  • Controllable speech: On the InstructTTSEval benchmark, Qwen3-TTS follows voice design instructions better than other open-source models and even outperforms some commercial systems in matching the requested style.
  • Long speech generation: Produces clear, consistent audio for more than 10 minutes with fewer mistakes than other open-source systems; the 25Hz version is particularly stable for very long texts.
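Since word error rate (WER) anchors most of the results above, here is the standard way it is computed: word-level edit distance between a reference and a hypothesis, divided by the reference length. This is the conventional definition, not a detail taken from the paper.

```python
# Word error rate via word-level Levenshtein distance (standard definition).

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

perfect = wer("the cat sat", "the cat sat")   # 0.0
one_sub = wer("the cat sat", "the bat sat")   # 1 substitution out of 3 words
```

In practice WER depends heavily on text normalization and the ASR system used for transcription, which is why cross-system comparisons (as the paper's own Knowledge Gaps note) need standardized protocols.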

Why this matters:

  • Low delay makes conversations with AI feel natural.
  • High-quality voice and strong cloning can be used in assistants, audiobooks, games, and content creation.
  • Multi-language support helps global users and cross-language applications (like dubbing or multilingual education).
  • Fine control lets creators shape exactly how the voice should sound.

Why It Matters and Potential Impact

Qwen3-TTS brings together fast streaming, voice cloning, multilingual support, and style control in one system. This makes it useful for:

  • Real-time voice assistants that respond naturally and quickly.
  • Education tools and accessibility (e.g., reading assistance for different languages and voices).
  • Media production: audiobooks, podcasts, videos, and games with tailored voices.
  • Cross-lingual applications: keeping a speaker’s unique voice while speaking different languages (useful in dubbing and global live events).

Because both the models and tokenizers are open-source, researchers and developers can build on them, improve them, and create new tools. The paper suggests future work in adding more languages, more precise style controls, and expanding to broader audio tasks, which could further improve how humans interact with computers through natural voice.

Knowledge Gaps


The following list summarizes what remains missing, uncertain, or unexplored in the paper.

  • Data transparency: No disclosure of the 5M-hour corpus sources, per-language hours, licenses, collection methodology, speaker consent, recording conditions, or noise profiles, making reproducibility and ethical auditing difficult.
  • Quality stratification pipeline: The “High-Quality Stage (S2)” mentions a dedicated pipeline but lacks concrete criteria, models, thresholds, and measurable effects on data distribution and downstream performance; an ablation is needed.
  • Speaker diversity and fairness: Absent demographics (age, gender, accent, sociolect), and no fairness analysis across groups or accents; bias audits and stratified evaluations are missing.
  • Reproducibility of training: No training hyperparameters, optimizer details, schedules, curriculum strategies, compute budget, hardware, dataset sampling policies, or random seeds for pre- and post-training.
  • Post-training specifics: DPO/GSPO reward designs, preference pair construction protocols, annotator demographics, quality control, and per-language coverage are unspecified; ablations on alignment methods are needed.
  • Safety and misuse: No safeguards for unauthorized 3-second voice cloning (e.g., watermarking, consent verification, cloneability thresholds, identity-protection filters), nor an abuse report pipeline.
  • Privacy risks: No evaluation of speaker memorization or leakage (membership inference, voice fingerprint extraction) and no privacy-preserving training mechanisms (e.g., differential privacy).
  • Human evaluation: Heavy reliance on automatic metrics (WER, PESQ, STOI, UTMOS, SIM); missing multilingual human MOS, prosody/emotion naturalness ratings, and listener studies (with transparent protocols and inter-rater reliability).
  • Benchmark comparability: Evaluation protocols (text normalization, punctuation handling, ASR choice, language-specific scoring) are not standardized across systems; unclear if results are directly comparable to baselines (e.g., unusually high WER for some commercial systems).
  • Long-speech evaluation bias: Using in-house Qwen3-ASR for transcription may bias WER; third-party ASR and human verification are needed for long-form tests.
  • Cross-lingual identity: Cross-lingual evaluations report content error rates only; cross-lingual speaker similarity/timbre preservation metrics (and human judgments) are missing.
  • Code-switching: Performance on intra-utterance code-switching and multilingual mix (common in real usage) is not assessed.
  • Robustness to noisy references: 3-second voice cloning robustness under background noise, far-field microphones, reverberation, and low-SNR input is unreported.
  • Streaming reliability: No evaluation under network jitter, packet loss, buffering delays, or variable server load; latency jitter and audio continuity under adverse conditions are unknown.
  • Edge deployment: Memory footprint, quantization strategies, CPU-only/mobile performance, power consumption, and thermal behavior are not measured.
  • Lip-sync alignment: No assessment of phoneme-timing precision for audiovisual applications (lip sync accuracy and latency jitter).
  • Expressive control granularity: The mapping from textual attributes to acoustic controls (prosody, emotion, speaking rate, emphasis) lacks a formal schema; coverage, compositionality, and conflict resolution are not tested.
  • “Thinking pattern” ablation: The probabilistically activated thinking mechanism’s design, triggers, and ablation impacts on instruction following (and failure modes) are not provided.
  • Tokenizer trade-offs: The claimed semantic–acoustic balance for 25Hz and 12Hz is not quantified; ablations over codebook size, FPS, RVQ depth, and token rate vs. expressivity/stability are missing.
  • MTP module details: Architecture, training objective, error propagation across codebooks, and ablation against alternative multi-codebook predictors are absent; failure cases (e.g., token interference) not analyzed.
  • Packet sizing policy: The choice of 4 tokens per packet at 12.5 Hz is heuristic; no study of packet size vs. latency, scheduling overhead, and perceptual continuity, nor adaptive packet strategies.
  • Long-horizon stability: Claims of >10-min seamless generation are not supported by measurable prosody drift, repetition, omission, or monotony metrics; no robustness stress tests (e.g., dynamic speaking rate, topic shifts).
  • Non-speech/para-linguistic coverage: Handling of laughter, breaths, disfluencies, singing, and emotional extremes is not evaluated; tokenizer fidelity for such events is unknown.
  • Complex text robustness: Performance on numbers, dates, URLs, spelling, mathematical expressions, abbreviations, and rare named entities (including multilingual proper nouns) is not reported.
  • Language scalability: Method’s behavior on low-resource languages, dialects, and expansion beyond the current 10 languages is not studied; adaptation data requirements and transfer strategies are unclear.
  • Interactive integration: Turn-taking, barge-in handling, latency management, and stability in live LLM-driven conversational loops (text–audio–text) are not evaluated.
  • Voice uniqueness: For description-based voice creation, there is no metric or protocol to ensure identity uniqueness (avoid collisions) and to quantify similarity/distinctiveness among generated voices.
  • Preset voice profiles: The number (“x curated”) and selection criteria of predefined voices, along with coverage/diversity and licensing, are unspecified.
  • Adversarial prompts: No analysis of prompt-based attacks (e.g., bypassing control constraints, unauthorized cloning) or guardrail efficacy.
  • Tokenizer release details: Interface specifications, on-disk format, codebook mapping, streaming APIs, and backward compatibility guarantees are not documented for community use.
  • Stage-1 vs Stage-2 TTS impact: 25Hz tokenizer’s ASR degradation in Stage 2 is noted, but its concrete improvement for TTS (e.g., MOS, prosody) vs Stage 1 is not quantified.
  • Sampling and decoding: No details on sampling temperature, top-k/p, or beam strategies for stable/expressive generation; guidance for reproducible inference settings is absent.
  • Concurrency scaling: Efficiency is reported up to concurrency 6; scaling beyond that, queuing effects, and tail-latency distributions are not examined.
  • Alignment to video frame rates: No study of perceived synchronization when streaming at different FPS/code rates with external media timelines.
  • Open-sourcing scope: Models and tokenizers are said to be released, but training data will not be; reproducibility of reported performance without data parity remains an open question.

Practical Applications

Immediate Applications

Below are practical, deployable use cases that leverage Qwen3-TTS’s multilingual, controllable, low-latency streaming TTS, 3-second voice cloning, and description-based voice control. Each item includes sector, potential tools/products/workflows, and key assumptions/dependencies.

  • Real-time multilingual voice assistants and IVR systems
    • Sector: software, customer support, telecom
    • Tools/products/workflows: “Qwen Voice SDK” for chatbots; IVR agents with <150 ms first-packet latency; call-center bots that follow voice-style prompts (e.g., “calm, empathetic”)
    • Assumptions/dependencies: GPU/CPU capacity for concurrency; PSTN/SIP integration; consent and disclosure policies for synthetic voice in customer interactions
  • Automated localization and dubbing for video, e-learning, and games
    • Sector: media/entertainment, education, enterprise training
    • Tools/products/workflows: pipeline from script → style prompt → streaming TTS → human-in-the-loop QA; cross-lingual voice transfer preserving timbre for multilingual content
    • Assumptions/dependencies: accurate text alignment/subtitle timing; per-language pronunciation QA; rights management for cloned voices
  • Voice design studio for brand personas and creator tooling
    • Sector: marketing, creator economy
    • Tools/products/workflows: description-based voice creation (“warm, trustworthy middle-aged narrator”) and preset voice libraries; batch synthesis for ads and promos
    • Assumptions/dependencies: brand safety and approvals; prompt governance and versioning; watermarking or labeling of synthetic audio
  • Accessibility: personalized screen readers and content voice-over
    • Sector: healthcare, public sector, education
    • Tools/products/workflows: customizable voices for screen readers; dyslexia-friendly narration with controllable prosody; multilingual public service announcements
    • Assumptions/dependencies: device compatibility (edge vs. cloud); clinical validation for certain populations; safeguards to avoid mispronunciation of critical information
  • Live translation with voice-preserving output
    • Sector: conferencing, global operations, events
    • Tools/products/workflows: speech-to-text → LLM translation → Qwen3-TTS cross-lingual synthesis preserving speaker timbre; real-time meeting assistants
    • Assumptions/dependencies: upstream ASR/MT quality; latency budgets for live events; consent and labeling for translated synthetic voices
  • Audiobooks and long-form content generation
    • Sector: publishing, education
    • Tools/products/workflows: stable >10-minute narration with style/pace controls; batch workflows with “voice QA” and error-checking (WER-based)
    • Assumptions/dependencies: editorial QA; licensing for cloned narrator voices; consistent prosody across chapters
  • On-device or edge TTS for embedded systems
    • Sector: robotics, automotive, IoT
    • Tools/products/workflows: 0.6B 12Hz variant for low-latency synthesis on edge GPUs; vehicle assistants and home devices with instant speech
    • Assumptions/dependencies: hardware acceleration (CUDA/Metal/OpenCL); power/thermal constraints; packet scheduling tuned for edge
  • Synthetic voice anonymization for privacy-preserving communications
    • Sector: healthcare, social services, journalism
    • Tools/products/workflows: replace caller/patient voice with consistent synthetic persona; configurable “non-identifiable” voice thumbnails for recordings
    • Assumptions/dependencies: policy and consent workflows; clear labeling; safeguards against re-identification through prosody cues
  • Developer tooling for speech UX prototyping
    • Sector: software, HCI research
    • Tools/products/workflows: programmable prosody/style prompts; A/B testing harness; “prompt-to-voice” unit tests for voice UX
    • Assumptions/dependencies: reproducible prompt semantics; version control for voice profiles; integration with CI/CD
  • Multilingual compliance and finance communications
    • Sector: finance, government, enterprise compliance
    • Tools/products/workflows: templated disclosures, statements, and alerts with consistent voice; rapid language rollout via 10-language support
    • Assumptions/dependencies: regulatory review of synthetic voice usage; accurate pronunciation of legal terminology; logging/audit trails
  • Customer service quality and training simulations
    • Sector: HR/L&D, customer support
    • Tools/products/workflows: role-play scenarios with controllable voice personas; multilingual simulation of challenging calls
    • Assumptions/dependencies: scenario design; guardrails to avoid harmful stereotypes in voice prompts; data privacy for training logs
  • Research baselines and reproducible benchmarks
    • Sector: academia (speech, NLP, HCI)
    • Tools/products/workflows: open-source Apache-2.0 models/tokenizers for studies in prosody control, semantic-acoustic disentanglement, streaming TTS latency; InstructTTSEval replications
    • Assumptions/dependencies: availability of evaluation datasets; documented training settings; compute for long-context experiments
  • Voice-driven education tools and language learning
    • Sector: education
    • Tools/products/workflows: tutor voices with adjustable speed, emotion, and accent; cross-lingual practice preserving a familiar tutor’s timbre
    • Assumptions/dependencies: pedagogical validation; per-learner customization without bias; curriculum integration

Long-Term Applications

Below are use cases that are promising but require further research, scaling, or development in areas like safety, regulation, on-device optimization, or broader language coverage.

  • Universal voice communication layer for real-time global collaboration
    • Sector: enterprise collaboration, conferencing
    • Tools/products/workflows: seamless cross-lingual voice transfer with near-human prosody, low jitter, and lip-sync for video avatars
    • Assumptions/dependencies: tighter ASR/MT integration; end-to-end latency optimization; robust lip-sync and audiovisual alignment
  • On-device, fully offline multilingual TTS for consumer hardware
    • Sector: mobile, wearables, automotive
    • Tools/products/workflows: quantized/distilled variants of Qwen3-TTS-12Hz for smartphones and AR devices; battery-aware synthesis scheduling
    • Assumptions/dependencies: model compression and DSP integration; localized language packs; privacy guarantees without cloud reliance
  • Regulated synthetic voice identity management and provenance
    • Sector: policy, compliance
    • Tools/products/workflows: consent registries, provenance attestations, and standardized watermarking for synthetic audio; “voice license” dashboards
    • Assumptions/dependencies: industry standards and legal frameworks; interoperable watermarking; public education on synthetic voice labeling
  • Safety frameworks for deepfake-resistant voice ecosystems
    • Sector: cybersecurity, public safety, finance
    • Tools/products/workflows: detection services and “challenge-response” anti-spoofing; API-level safeguards to restrict risky cloning (e.g., high-profile voices)
    • Assumptions/dependencies: robust adversarial testing; community norms; integration with KYC/identity verification
  • Emotionally aware and therapeutic speech applications
    • Sector: healthcare (mental health, speech therapy)
    • Tools/products/workflows: fine-grained affect control (“supportive, steady, low arousal”) for therapeutic settings; clinician-in-the-loop tuning
    • Assumptions/dependencies: clinical trials and efficacy studies; bias/safety audits for affect prompts; tailored datasets
  • Large-scale voice personalization for millions of users
    • Sector: consumer platforms, gaming, social
    • Tools/products/workflows: per-user voice personas with cloud sync; fast in-context adaptation; moderation pipelines for prompt safety at scale
    • Assumptions/dependencies: cost-effective serving under high concurrency; scalable storage of voice profiles; content moderation and governance
  • Robotics with situationally adaptive speech
    • Sector: robotics, industrial automation
    • Tools/products/workflows: context-aware, multilingual speech that adapts to noise and user stress levels; cohesive multi-robot voice coordination
    • Assumptions/dependencies: robust environment sensing; multimodal fusion (audio, vision, state); reliability in safety-critical situations
  • Broadcast-grade synthetic presenters and dynamic news pipelines
    • Sector: media, public sector
    • Tools/products/workflows: live, multilingual anchors with style consistency, emergency-tone presets, and audience-specific prosody
    • Assumptions/dependencies: editorial standards; real-time fact-checking; strict labeling and provenance
  • Educational co-pilots with individualized prosody and pacing
    • Sector: education, EdTech
    • Tools/products/workflows: adaptive TTS that models student engagement and comprehension; long-context narratives with topic-aware modulation
    • Assumptions/dependencies: learning analytics integration; privacy-preserving personalization; longitudinal efficacy studies
  • Domain-specific tokenizers and prosody control libraries
    • Sector: academia, enterprise R&D
    • Tools/products/workflows: specialized codebooks (medical, legal, broadcast) and prosody APIs; multi-token prediction refinements for domain stability
    • Assumptions/dependencies: domain corpora and licensing; reproducible training setups; benchmark extensions beyond WER/SIM
  • Multimodal, omni-capable audio generation systems
    • Sector: software, creative tools
    • Tools/products/workflows: unified architecture for speech, sound effects, and music; text-and-audio-conditioned generation for immersive experiences
    • Assumptions/dependencies: expanded training modalities; safety guidelines for generative audio; creator-friendly licensing
  • Smart city and public service voice infrastructure
    • Sector: public sector, energy/utilities
    • Tools/products/workflows: multilingual announcements and alerts with controllable urgency; personalized accessibility streams in public spaces
    • Assumptions/dependencies: civic policies on synthetic voice use; infrastructure for low-latency delivery; resilience and redundancy planning
  • Financial services voice analytics and synthetic response co-pilots
    • Sector: finance
    • Tools/products/workflows: compliant, multilingual responses with consistent personas; integration with risk analytics to modulate tone for sensitive disclosures
    • Assumptions/dependencies: regulatory acceptance; auditability and logs; controlled cloning to prevent fraud

Cross-cutting assumptions and dependencies

  • Compute and latency: achieving 97–150 ms first-packet latency depends on optimized runtimes (e.g., torch.compile, CUDA Graphs, vLLM) and hardware acceleration.
  • Consent and rights: voice cloning (3-second sample) requires explicit consent, robust identity governance, and clear user controls.
  • Safety and labeling: synthetic voice must be labeled and ideally watermarked; deployment policies should mitigate impersonation risks.
  • Language coverage and bias: performance depends on training coverage and may vary by language/accent; ongoing evaluation and fine-tuning are needed.
  • Integration stack: end-to-end quality relies on upstream ASR/MT and downstream vocoders; streaming packet design and concurrency tuning affect user experience.
  • Licensing and openness: Apache 2.0 enables commercial use; productization requires compliance with local regulations and sector-specific standards.
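The latency figures above decompose into LM time-to-first-packet tokens (TTFP), per-packet tokenizer decode time (TPP), and any decoder look-ahead. A minimal sketch of that accounting, with illustrative placeholder timings rather than measured values:

```python
# Hypothetical streaming-latency accounting for a token-based TTS pipeline.
# All component timings below are illustrative placeholders, not measurements.

def first_packet_latency_ms(ttfp_ms: float, tpp_ms: float,
                            lookahead_ms: float = 0.0) -> float:
    """First audible packet = LM time-to-first-packet tokens (TTFP)
    + tokenizer decode time for that packet (TPP)
    + any decoder look-ahead (zero for a pure left-context decoder)."""
    return ttfp_ms + tpp_ms + lookahead_ms

def rtf(processing_s: float, audio_s: float) -> float:
    """Real-time factor: processing time / audio duration (lower is faster)."""
    return processing_s / audio_s

# A causal (left-context) decoder needs no look-ahead, so the first packet
# can be emitted as soon as TTFP + TPP have elapsed.
print(first_packet_latency_ms(ttfp_ms=80.0, tpp_ms=17.0))  # 97.0
print(rtf(processing_s=0.5, audio_s=2.0))                  # 0.25
```

This also makes plain why a decoder with right-context look-ahead (e.g., a DiT or vocoder needing future frames) adds a fixed term to first-packet latency regardless of how fast the LM is.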

Glossary

  • AGI: Artificial General Intelligence; a broad goal of creating AI with human-level capabilities across tasks. "Stable, controllable, and human-like speech synthesis is widely viewed as a key capability on the path to AGI."
  • acoustic codebook: A discrete dictionary of tokens that encode low-level acoustic details like timbre and prosody. "a semantic codebook capturing high-level semantic content and an acoustic codebook modeling acoustic detail, prosody, and others."
  • autoregressive language modeling: A modeling approach that predicts the next token conditioned on previous tokens. "combined with autoregressive language modeling of discrete units"
  • BigVGAN: A neural vocoder architecture for waveform synthesis from spectrograms. "a modified BigVGAN reconstructs the waveform from the generated mel-spectrogram."
  • block-wise DiT: Using a Diffusion Transformer in chunked blocks for efficient streaming generation. "enables streaming waveform reconstruction via a block-wise DiT."
  • block-wise flow matching: Performing flow-matching-based reconstruction in chunks to support streaming. "with waveform reconstruction via block-wise flow matching to enable streaming synthesis"
  • causal ConvNet: A convolutional network restricted to past context, enabling streaming/real-time decoding. "a lightweight causal ConvNet."
  • CER: Character Error Rate; a content intelligibility metric for text recognition from speech. "Mixed Error Rate (WER for English, CER for others,"
  • ChatML: A markup format for structuring dialog data used to standardize controllable inputs. "All data is formatted in ChatML to standardize inputs and support controllable speech generation."
  • chunk-wise inference: Processing or decoding in fixed-size chunks to reduce latency and memory. "Qwen-TTS-Tokenizer-25Hz performs code-to-waveform synthesis through chunk-wise inference."
  • Code2Wav: The module that converts predicted audio tokens into time-domain waveforms. "converted into waveforms by the Code2Wav module."
  • codec: A learned encoder-decoder that discretizes and reconstructs audio signals. "Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content"
  • codebook: The finite set of discrete symbols used to quantize continuous speech features. "employs a 25 Hz single-codebook representation"
  • continual pre-training (CPT): Further pre-training a model on curated data after initial training. "perform continual pre-training (CPT) with high-quality data."
  • Diffusion Transformer (DiT): A transformer architecture trained with diffusion/flow-matching objectives for generative modeling. "we use a Diffusion Transformer (DiT) trained with Flow Matching."
  • Direct Preference Optimization (DPO): A post-training method aligning model outputs with human preference comparisons. "we introduce Direct Preference Optimization (DPO)~\citep{rafailov2024direct}"
  • discrete speech representations: Tokenized audio units used instead of continuous features for modeling and generation. "we use discrete speech representations as the cornerstone of our architecture"
  • Flow Matching: A training objective related to continuous normalizing flows for generative modeling. "The input code sequence is first mapped to a mel-spectrogram via Flow Matching"
  • GAN-based framework: Training with a Generator and Discriminator to improve realism of reconstructions. "Training adopts a GAN-based framework in which the generator operates directly on raw waveforms"
  • GSPO: Group Sequence Policy Optimization; a post-training reinforcement learning approach used with rule-based rewards to improve stability and capability. "we employ rule-based rewards and leverage GSPO to comprehensively enhance the model's capabilities and stability across tasks."
  • hierarchical prediction scheme: Predicting tokens in stages (e.g., base layer then residual layers) to capture detail efficiently. "It adopts a hierarchical prediction scheme: the backbone ingests aggregated codebook features to predict the zeroth codebook"
  • in-context learning: Conditioning on provided examples (e.g., text–speech pairs) to adapt style without gradient updates. "via in-context learning, which better preserves prosody."
  • left-context streaming codec decoder: A decoder that uses only past context, enabling immediate, low-latency audio emission. "uses a pure left-context streaming codec decoder"
  • look-ahead: Additional future context required by a model for decoding or synthesis. "Due to the look-ahead requirement in the DiT module"
  • mel-spectrogram: A time–frequency representation of audio using the mel scale, commonly used in TTS. "reconstructs mel-spectrograms from the audio tokens."
  • Mimi: An architecture employing semantic–acoustic disentangled quantization for speech tokenization. "Building on the semantic–acoustic disentangled quantization strategy of the Mimi architecture"
  • Multi-Token Prediction (MTP): Predicting multiple tokens (e.g., across codebooks) at once to reduce latency and improve modeling. "incorporates a Multi-Token Prediction (MTP) module to effectively model the multi-codebook sequence"
  • multi-codebook tokenizer: A tokenizer that uses several codebooks (e.g., semantic + acoustic) to represent speech at multiple levels. "Qwen-TTS-Tokenizer-12Hz is a 12.5 Hz multi-codebook tokenizer"
  • multi-scale mel-spectrogram reconstruction loss: A loss computed at multiple time–frequency scales to improve reconstruction fidelity. "A multi-scale mel-spectrogram reconstruction loss further enforces time–frequency consistency."
  • PESQ: Perceptual Evaluation of Speech Quality; an objective measure of speech quality. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
  • probabilistically activated thinking pattern: A training-time mechanism to occasionally trigger intermediate reasoning for better instruction following. "we introduce a probabilistically activated thinking pattern during training to improve instruction following"
  • prosody: The rhythm, stress, and intonation patterns in speech. "an acoustic codebook modeling acoustic detail, prosody, and others."
  • receptive field: The temporal context a model component can attend to when generating outputs. "The DiT’s receptive field is restricted to 4 blocks"
  • residual vector quantization (RVQ): A multi-stage quantization method where each stage encodes residual errors from prior stages. "The acoustic path employs a 15-layer residual vector quantization (RVQ) module"
  • RTF: Real-time factor; ratio of processing time to audio duration (lower is faster). "The first-packet latency and RTF reported in our table are computed based on the above setup."
  • semantic codebook: The codebook layer designed to capture high-level linguistic/semantic content. "a semantic codebook capturing high-level semantic content"
  • semantic–acoustic disentangled quantization: A strategy that separates semantic and acoustic information into different token streams. "Building on the semantic–acoustic disentangled quantization strategy of the Mimi architecture"
  • Short-Time Objective Intelligibility (STOI): An objective metric estimating speech intelligibility. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
  • SFT: Speaker fine-tuning; adapting a base TTS model to a specific speaker with additional data. "We analyze the generalization performance of our speaker fine-tuned (SFT) model variants"
  • sliding-window block attention: An attention scheme that limits tokens to a local block context for streaming efficiency. "we propose a sliding-window block attention mechanism that restricts each token to a limited context."
  • speaker embedding: A learned vector representing a speaker’s voice characteristics for conditioning/cloning. "via a speaker embedding, enabling real-time cloning"
  • speaker encoder: A model component that learns speaker identity features for conditioning generation. "we jointly train a learnable speaker encoder with the backbone."
  • speaker vector extraction: Deriving a fixed-length representation of a speaker’s identity from audio. "eliminating the need for speaker vector extraction or complex diffusion models"
  • Streaming Detokenizer: The module/procedure that converts token sequences back to audio in a streaming fashion. "Streaming Detokenizer"
  • streaming synthesis: Real-time generation of audio as text tokens arrive, without waiting for full input. "with waveform reconstruction via block-wise flow matching to enable streaming synthesis"
  • timbre: The tone color/quality of a voice that distinguishes speakers. "preserving timbre across language barriers"
  • torch.compile: A PyTorch optimization utility to speed up model execution. "with optimizations applied via torch.compile and CUDA Graph acceleration"
  • TPP: Time per packet; the per-packet decode/generation time used in streaming measurements. "tokenizer decode time for per-packet (TPP)"
  • TTFP: Time to first-packet tokens; the latency for the LM to produce the first group of tokens needed to emit the initial audio packet. "LM time-to-first packet tokens (TTFP)"
  • vLLM: A high-throughput inference engine for LLMs. "on our internal vLLM engine (vLLM V0 backend)"
  • Vector Quantization (VQ): Discretizing continuous features by assigning them to nearest codebook entries. "a vector quantization (VQ) layer inserted at an intermediate position."
  • vocoder: A model that converts spectrogram-like features into waveforms. "the BigVGAN vocoder introduces an extra right-context look-ahead (130 ms)."
  • waveform reconstruction: The process of generating time-domain audio from intermediate representations (e.g., tokens or spectrograms). "with waveform reconstruction via block-wise flow matching"
  • WavLM: A pretrained speech model used as a teacher for semantic alignment. "For the semantic path, WavLM~\citep{wavlm} serves as a teacher"
  • WER: Word Error Rate; an ASR-based metric for content accuracy in generated speech. "Performance is measured by Word Error Rate (WER, \downarrow), where lower is better."
  • UTMOS: A neural estimate of MOS (Mean Opinion Score) for perceived speech quality. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
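Several of the entries above (codebook, vector quantization, RVQ) describe stages of one mechanism. A minimal NumPy sketch of residual vector quantization, using random codebooks purely for illustration (real tokenizers learn codebooks jointly with the encoder):

```python
import numpy as np

# Minimal residual vector quantization (RVQ) sketch.
# Codebooks here are random stand-ins for learned ones; the point is
# the stage-wise structure: each layer quantizes the residual left
# over from all previous layers.

rng = np.random.default_rng(0)
dim, codebook_size, n_layers = 8, 64, 4
codebooks = rng.normal(size=(n_layers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; return per-layer token ids and the
    cumulative reconstruction."""
    recon = np.zeros_like(x)
    ids = []
    for cb in codebooks:
        residual = x - recon
        # pick the codebook entry nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        recon = recon + cb[idx]
    return ids, recon

x = rng.normal(size=dim)
ids, recon = rvq_encode(x, codebooks)
print(ids)                            # one discrete token per RVQ layer
print(np.linalg.norm(x - recon))      # remaining quantization error
```

With learned codebooks, later layers specialize in ever-finer residual detail, which is why truncating to fewer layers trades fidelity for bitrate.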

Open Problems

We found no open problems mentioned in this paper.
