Qwen3-TTS Technical Report
Abstract: In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained control over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
Explain it Like I'm 14
Overview
This paper introduces Qwen3-TTS, a family of text-to-speech (TTS) models that can turn written text into natural-sounding speech in many languages. These models are fast, can start speaking almost immediately, can copy a person’s voice from just 3 seconds of audio, and can follow style instructions like “speak calmly” or “use a cheerful tone.” They were trained on over 5 million hours of speech in 10 languages and are released for the community to use under the Apache 2.0 open-source license.
Key Objectives
The paper focuses on four main goals:
- Make speech sound stable, natural, and human-like across different languages.
- Allow fine-grained control over how the voice sounds (tone, speed, emotion) using simple text instructions.
- Support quick voice cloning from a few seconds of audio and offer preset, high-quality voices.
- Enable real-time “streaming” speech, so the model can start speaking with very low delay and keep speaking smoothly as text arrives.
Methods and Approach
To help explain the technical ideas, here are simple analogies and definitions used in the paper:
- Tokens: Think of tokens as tiny building blocks. For TTS, speech is broken into small units (tokens) so the model can process and rebuild sound step by step.
- Tokenizer: A tool that turns audio into tokens and back again. It’s like a translator that converts sound into symbols and then back to sound.
- Codebook: A dictionary of symbols the model uses to represent sounds.
- Hz (Hertz): Times per second. For example, 12.5 Hz means 12.5 token steps per second.
- Autoregressive model: A “one-step-at-a-time” model that uses what it has already said to decide what to say next.
- Streaming: The model starts speaking quickly and keeps producing audio in small packets, so you hear it almost right away instead of waiting for the whole sentence to be processed.
The Qwen3-TTS system has two different tokenizers and a dual-track architecture:
Two Tokenizers (Ways to Represent Speech)
- Qwen-TTS-Tokenizer-25Hz
- Works at 25 steps per second (25 Hz) with one codebook.
- Mixes both meaning (what is being said) and sound details (how it’s said).
- For rebuilding audio, it uses a “Diffusion Transformer” (DiT) with “Flow Matching” and a vocoder called BigVGAN.
- Streaming is done in chunks: the model looks a bit ahead to keep speech smooth. This gives high quality but needs a little extra wait time before the first sound.
- Qwen-TTS-Tokenizer-12Hz
- Works at 12.5 steps per second with multiple codebooks (layers of detail).
- The first layer focuses on meaning; the other layers add sound texture (pitch, emotion, speaking style).
- Rebuilds audio using a lightweight, fast decoder (a small causal ConvNet), so it can emit the first sound packet very quickly (about 97–101 ms).
- Fully “left-context” streaming means it doesn’t need to peek ahead; it can speak as soon as it has enough tokens.
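The latency story above comes down to simple arithmetic. The following sketch works through it, assuming the packet size of 4 tokens at 12.5 Hz described in the paper; the ~97 ms first-packet figure is the reported measurement, not something derived here.

```python
# Back-of-the-envelope streaming arithmetic for the 12Hz tokenizer.
# Assumes 4 tokens per packet at a 12.5 Hz token rate, per the report.

TOKEN_RATE_HZ = 12.5      # audio tokens per second of speech
TOKENS_PER_PACKET = 4     # packet size used by the streaming pipeline

token_duration_ms = 1000 / TOKEN_RATE_HZ                  # 80 ms of audio per token
packet_audio_ms = TOKENS_PER_PACKET * token_duration_ms   # 320 ms of audio per packet

print(f"audio per token:  {token_duration_ms:.0f} ms")
print(f"audio per packet: {packet_audio_ms:.0f} ms")
# The reported ~97 ms first-packet latency thus means the system produces
# a 320 ms audio packet in well under real time before emitting it.
```

Because each packet carries far more audio (320 ms) than it takes to generate, the decoder stays ahead of playback and the stream never stalls.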
Dual-Track Architecture (Text + Audio Together)
- The model reads text tokens and, almost at the same time, predicts audio tokens.
- A speaker encoder helps keep the voice identity consistent (so if it’s cloning someone, it sounds like them).
- The “Code2Wav” module turns audio tokens into sound you can hear.
- In the 12Hz models, a hierarchical prediction scheme first predicts the main (semantic) codebook, then fills in the detailed ones with a Multi-Token Prediction module. This keeps voices expressive and consistent while staying fast.
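To make the hierarchical scheme concrete, here is an illustrative sketch (not the released implementation) of one decoding step: a backbone commits to the semantic (zeroth) codebook token, then a lightweight stand-in for the MTP module predicts each residual acoustic codebook conditioned on the backbone's hidden state and the token just chosen. All module names, sizes, and the greedy decoding are hypothetical simplifications.

```python
import torch
import torch.nn as nn

class HierarchicalDecoderStep(nn.Module):
    """One step of hierarchical multi-codebook prediction (illustrative only)."""
    def __init__(self, hidden=256, vocab=1024, n_codebooks=16):
        super().__init__()
        self.vocab = vocab
        self.semantic_head = nn.Linear(hidden, vocab)  # predicts codebook 0
        # One small head per residual codebook, standing in for the MTP module.
        self.mtp_heads = nn.ModuleList(
            nn.Linear(hidden + vocab, vocab) for _ in range(n_codebooks - 1)
        )

    def forward(self, h):                        # h: (batch, hidden)
        tokens = [self.semantic_head(h).argmax(-1)]  # greedy, for illustration
        for head in self.mtp_heads:
            # Condition each residual codebook on the hidden state plus the
            # previously predicted token (one-hot encoded).
            prev = nn.functional.one_hot(tokens[-1], self.vocab).float()
            tokens.append(head(torch.cat([h, prev], dim=-1)).argmax(-1))
        return torch.stack(tokens, dim=-1)       # (batch, n_codebooks)

step = HierarchicalDecoderStep()
codes = step(torch.randn(2, 256))
print(codes.shape)  # torch.Size([2, 16])
```

The point of the structure is that the expensive backbone runs once per time step, while the residual codebooks are filled in by cheap heads, which is what keeps the 12Hz models fast despite their 16-layer token stack.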
Training Strategy
The training has two big phases:
- Pre-training (three stages):
- Stage 1: Train on 5+ million hours of multilingual speech to learn general TTS skills.
- Stage 2: Continue training on higher-quality data to reduce errors caused by noisy data and improve sound quality.
- Stage 3: Teach the model to handle very long inputs by increasing token length limits and practicing on longer speech.
- Post-training (three stages):
- Use Direct Preference Optimization (DPO) with human feedback to make outputs match what people prefer.
- Apply rule-based rewards via GSPO to improve stability and control.
- Lightweight speaker fine-tuning so the model can adopt specific voices more naturally and expressively.
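The DPO stage cited above optimizes a simple preference objective. Below is a minimal scalar sketch of that objective (Rafailov et al., 2024), omitting batching and sequence-level log-prob accumulation; the `beta` value and the example log-probs are illustrative, not the paper's settings.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair of log-probabilities.

    The loss is -log sigmoid(beta * margin), where the margin measures how
    much more the policy prefers the chosen output over the rejected one,
    relative to a frozen reference model.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# With zero margin the loss sits at log(2); when the policy already prefers
# the chosen sample more than the reference does, the loss drops below that.
print(dpo_loss(-10.0, -12.0, -11.0, -11.5))
```

In the TTS setting, "chosen" and "rejected" would be synthesized utterances ranked by human listeners, so minimizing this loss nudges generation toward preferred prosody and quality without an explicit reward model.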
Features
- 3-second voice cloning, streaming voice cloning, and predefined voices.
- Voice design: describe a new voice in text, and the model creates it (e.g., “soft, warm, slightly husky”).
- Fine-grained control using simple instructions (speed, energy, emotion, pauses).
- Robust streaming under load: stays stable even when many users are listening at the same time.
Main Findings
Here are the key results and why they matter:
- Very low delay: The 12Hz models can start speaking in about 97–101 milliseconds. This feels almost instant.
- Better reconstruction quality: The 12Hz tokenizer sets new records on standard sound quality tests (PESQ, STOI, UTMOS, speaker similarity), while being very efficient.
- Strong zero-shot cloning: Without training on a specific voice, Qwen3-TTS-12Hz-1.7B achieved state-of-the-art word accuracy on English test sets and performed robustly across multiple languages.
- Multilingual performance: Across 10 languages, Qwen3-TTS beats or matches top commercial systems in intelligibility (low word error rate) and voice similarity (keeps the same timbre and style).
- Cross-lingual voice transfer: Keeps the same speaker identity across languages with lower error than prior models, especially in challenging pairs like Chinese-to-Korean.
- Controllable speech: On the InstructTTSEval benchmark, Qwen3-TTS follows voice design instructions better than other open-source models and even outperforms some commercial systems in matching the requested style.
- Long speech generation: Produces clear, consistent audio for more than 10 minutes with fewer mistakes than other open-source systems; the 25Hz version is particularly stable for very long texts.
Why this matters:
- Low delay makes conversations with AI feel natural.
- High-quality voice and strong cloning can be used in assistants, audiobooks, games, and content creation.
- Multi-language support helps global users and cross-language applications (like dubbing or multilingual education).
- Fine control lets creators shape exactly how the voice should sound.
Why It Matters and Potential Impact
Qwen3-TTS brings together fast streaming, voice cloning, multilingual support, and style control in one system. This makes it useful for:
- Real-time voice assistants that respond naturally and quickly.
- Education tools and accessibility (e.g., reading assistance for different languages and voices).
- Media production: audiobooks, podcasts, videos, and games with tailored voices.
- Cross-lingual applications: keeping a speaker’s unique voice while speaking different languages (useful in dubbing and global live events).
Because both the models and tokenizers are open-source, researchers and developers can build on them, improve them, and create new tools. The paper suggests future work in adding more languages, more precise style controls, and expanding to broader audio tasks, which could further improve how humans interact with computers through natural voice.
Knowledge Gaps, Limitations, and Open Questions
The following list summarizes what remains missing, uncertain, or unexplored in the paper.
- Data transparency: No disclosure of the 5M-hour corpus sources, per-language hours, licenses, collection methodology, speaker consent, recording conditions, or noise profiles, making reproducibility and ethical auditing difficult.
- Quality stratification pipeline: The “High-Quality Stage (S2)” mentions a dedicated pipeline but lacks concrete criteria, models, thresholds, and measurable effects on data distribution and downstream performance; an ablation is needed.
- Speaker diversity and fairness: Absent demographics (age, gender, accent, sociolect), and no fairness analysis across groups or accents; bias audits and stratified evaluations are missing.
- Reproducibility of training: No training hyperparameters, optimizer details, schedules, curriculum strategies, compute budget, hardware, dataset sampling policies, or random seeds for pre- and post-training.
- Post-training specifics: DPO/GSPO reward designs, preference pair construction protocols, annotator demographics, quality control, and per-language coverage are unspecified; ablations on alignment methods are needed.
- Safety and misuse: No safeguards for unauthorized 3-second voice cloning (e.g., watermarking, consent verification, cloneability thresholds, identity-protection filters), nor an abuse report pipeline.
- Privacy risks: No evaluation of speaker memorization or leakage (membership inference, voice fingerprint extraction) and no privacy-preserving training mechanisms (e.g., differential privacy).
- Human evaluation: Heavy reliance on automatic metrics (WER, PESQ, STOI, UTMOS, SIM); missing multilingual human MOS, prosody/emotion naturalness ratings, and listener studies (with transparent protocols and inter-rater reliability).
- Benchmark comparability: Evaluation protocols (text normalization, punctuation handling, ASR choice, language-specific scoring) are not standardized across systems; unclear if results are directly comparable to baselines (e.g., unusually high WER for some commercial systems).
- Long-speech evaluation bias: Using in-house Qwen3-ASR for transcription may bias WER; third-party ASR and human verification are needed for long-form tests.
- Cross-lingual identity: Cross-lingual evaluations report content error rates only; cross-lingual speaker similarity/timbre preservation metrics (and human judgments) are missing.
- Code-switching: Performance on intra-utterance code-switching and multilingual mix (common in real usage) is not assessed.
- Robustness to noisy references: 3-second voice cloning robustness under background noise, far-field microphones, reverberation, and low-SNR input is unreported.
- Streaming reliability: No evaluation under network jitter, packet loss, buffering delays, or variable server load; latency jitter and audio continuity under adverse conditions are unknown.
- Edge deployment: Memory footprint, quantization strategies, CPU-only/mobile performance, power consumption, and thermal behavior are not measured.
- Lip-sync alignment: No assessment of phoneme-timing precision for audiovisual applications (lip sync accuracy and latency jitter).
- Expressive control granularity: The mapping from textual attributes to acoustic controls (prosody, emotion, speaking rate, emphasis) lacks a formal schema; coverage, compositionality, and conflict resolution are not tested.
- “Thinking pattern” ablation: The probabilistically activated thinking mechanism’s design, triggers, and ablation impacts on instruction following (and failure modes) are not provided.
- Tokenizer trade-offs: The claimed semantic–acoustic balance for 25Hz and 12Hz is not quantified; ablations over codebook size, FPS, RVQ depth, and token rate vs. expressivity/stability are missing.
- MTP module details: Architecture, training objective, error propagation across codebooks, and ablation against alternative multi-codebook predictors are absent; failure cases (e.g., token interference) not analyzed.
- Packet sizing policy: The choice of 4 tokens per packet at 12.5 Hz is heuristic; no study of packet size vs. latency, scheduling overhead, and perceptual continuity, nor adaptive packet strategies.
- Long-horizon stability: Claims of >10-min seamless generation are not supported by measurable prosody drift, repetition, omission, or monotony metrics; no robustness stress tests (e.g., dynamic speaking rate, topic shifts).
- Non-speech/para-linguistic coverage: Handling of laughter, breaths, disfluencies, singing, and emotional extremes is not evaluated; tokenizer fidelity for such events is unknown.
- Complex text robustness: Performance on numbers, dates, URLs, spelling, mathematical expressions, abbreviations, and rare named entities (including multilingual proper nouns) is not reported.
- Language scalability: Method’s behavior on low-resource languages, dialects, and expansion beyond the current 10 languages is not studied; adaptation data requirements and transfer strategies are unclear.
- Interactive integration: Turn-taking, barge-in handling, latency management, and stability in live LLM-driven conversational loops (text–audio–text) are not evaluated.
- Voice uniqueness: For description-based voice creation, there is no metric or protocol to ensure identity uniqueness (avoid collisions) and to quantify similarity/distinctiveness among generated voices.
- Preset voice profiles: The number (“x curated”) and selection criteria of predefined voices, along with coverage/diversity and licensing, are unspecified.
- Adversarial prompts: No analysis of prompt-based attacks (e.g., bypassing control constraints, unauthorized cloning) or guardrail efficacy.
- Tokenizer release details: Interface specifications, on-disk format, codebook mapping, streaming APIs, and backward compatibility guarantees are not documented for community use.
- Stage-1 vs Stage-2 TTS impact: 25Hz tokenizer’s ASR degradation in Stage 2 is noted, but its concrete improvement for TTS (e.g., MOS, prosody) vs Stage 1 is not quantified.
- Sampling and decoding: No details on sampling temperature, top-k/p, or beam strategies for stable/expressive generation; guidance for reproducible inference settings is absent.
- Concurrency scaling: Efficiency is reported up to concurrency 6; scaling beyond that, queuing effects, and tail-latency distributions are not examined.
- Alignment to video frame rates: No study of perceived synchronization when streaming at different FPS/code rates with external media timelines.
- Open-sourcing scope: Models and tokenizers are said to be released, but training data will not be; reproducibility of reported performance without data parity remains an open question.
Practical Applications
Immediate Applications
Below are practical, deployable use cases that leverage Qwen3-TTS’s multilingual, controllable, low-latency streaming TTS, 3-second voice cloning, and description-based voice control. Each item includes sector, potential tools/products/workflows, and key assumptions/dependencies.
- Real-time multilingual voice assistants and IVR systems
- Sector: software, customer support, telecom
- Tools/products/workflows: “Qwen Voice SDK” for chatbots; IVR agents with <150 ms first-packet latency; call-center bots that follow voice-style prompts (e.g., “calm, empathetic”)
- Assumptions/dependencies: GPU/CPU capacity for concurrency; PSTN/SIP integration; consent and disclosure policies for synthetic voice in customer interactions
- Automated localization and dubbing for video, e-learning, and games
- Sector: media/entertainment, education, enterprise training
- Tools/products/workflows: pipeline from script → style prompt → streaming TTS → human-in-the-loop QA; cross-lingual voice transfer preserving timbre for multilingual content
- Assumptions/dependencies: accurate text alignment/subtitle timing; per-language pronunciation QA; rights management for cloned voices
- Voice design studio for brand personas and creator tooling
- Sector: marketing, creator economy
- Tools/products/workflows: description-based voice creation (“warm, trustworthy middle-aged narrator”) and preset voice libraries; batch synthesis for ads and promos
- Assumptions/dependencies: brand safety and approvals; prompt governance and versioning; watermarking or labeling of synthetic audio
- Accessibility: personalized screen readers and content voice-over
- Sector: healthcare, public sector, education
- Tools/products/workflows: customizable voices for screen readers; dyslexia-friendly narration with controllable prosody; multilingual public service announcements
- Assumptions/dependencies: device compatibility (edge vs. cloud); clinical validation for certain populations; safeguards to avoid mispronunciation of critical information
- Live translation with voice-preserving output
- Sector: conferencing, global operations, events
- Tools/products/workflows: speech-to-text → LLM translation → Qwen3-TTS cross-lingual synthesis preserving speaker timbre; real-time meeting assistants
- Assumptions/dependencies: upstream ASR/MT quality; latency budgets for live events; consent and labeling for translated synthetic voices
- Audiobooks and long-form content generation
- Sector: publishing, education
- Tools/products/workflows: stable >10-minute narration with style/pace controls; batch workflows with “voice QA” and error-checking (WER-based)
- Assumptions/dependencies: editorial QA; licensing for cloned narrator voices; consistent prosody across chapters
- On-device or edge TTS for embedded systems
- Sector: robotics, automotive, IoT
- Tools/products/workflows: 0.6B 12Hz variant for low-latency synthesis on edge GPUs; vehicle assistants and home devices with instant speech
- Assumptions/dependencies: hardware acceleration (CUDA/Metal/OpenCL); power/thermal constraints; packet scheduling tuned for edge
- Synthetic voice anonymization for privacy-preserving communications
- Sector: healthcare, social services, journalism
- Tools/products/workflows: replace caller/patient voice with consistent synthetic persona; configurable “non-identifiable” voice thumbnails for recordings
- Assumptions/dependencies: policy and consent workflows; clear labeling; safeguards against re-identification through prosody cues
- Developer tooling for speech UX prototyping
- Sector: software, HCI research
- Tools/products/workflows: programmable prosody/style prompts; A/B testing harness; “prompt-to-voice” unit tests for voice UX
- Assumptions/dependencies: reproducible prompt semantics; version control for voice profiles; integration with CI/CD
- Multilingual compliance and finance communications
- Sector: finance, government, enterprise compliance
- Tools/products/workflows: templated disclosures, statements, and alerts with consistent voice; rapid language rollout via 10-language support
- Assumptions/dependencies: regulatory review of synthetic voice usage; accurate pronunciation of legal terminology; logging/audit trails
- Customer service quality and training simulations
- Sector: HR/L&D, customer support
- Tools/products/workflows: role-play scenarios with controllable voice personas; multilingual simulation of challenging calls
- Assumptions/dependencies: scenario design; guardrails to avoid harmful stereotypes in voice prompts; data privacy for training logs
- Research baselines and reproducible benchmarks
- Sector: academia (speech, NLP, HCI)
- Tools/products/workflows: open-source Apache-2.0 models/tokenizers for studies in prosody control, semantic-acoustic disentanglement, streaming TTS latency; InstructTTSEval replications
- Assumptions/dependencies: availability of evaluation datasets; documented training settings; compute for long-context experiments
- Voice-driven education tools and language learning
- Sector: education
- Tools/products/workflows: tutor voices with adjustable speed, emotion, and accent; cross-lingual practice preserving a familiar tutor’s timbre
- Assumptions/dependencies: pedagogical validation; per-learner customization without bias; curriculum integration
Long-Term Applications
Below are use cases that are promising but require further research, scaling, or development in areas like safety, regulation, on-device optimization, or broader language coverage.
- Universal voice communication layer for real-time global collaboration
- Sector: enterprise collaboration, conferencing
- Tools/products/workflows: seamless cross-lingual voice transfer with near-human prosody, low jitter, and lip-sync for video avatars
- Assumptions/dependencies: tighter ASR/MT integration; end-to-end latency optimization; robust lip-sync and audiovisual alignment
- On-device, fully offline multilingual TTS for consumer hardware
- Sector: mobile, wearables, automotive
- Tools/products/workflows: quantized/distilled variants of Qwen3-TTS-12Hz for smartphones and AR devices; battery-aware synthesis scheduling
- Assumptions/dependencies: model compression and DSP integration; localized language packs; privacy guarantees without cloud reliance
- Regulated synthetic voice identity management and provenance
- Sector: policy, compliance
- Tools/products/workflows: consent registries, provenance attestations, and standardized watermarking for synthetic audio; “voice license” dashboards
- Assumptions/dependencies: industry standards and legal frameworks; interoperable watermarking; public education on synthetic voice labeling
- Safety frameworks for deepfake-resistant voice ecosystems
- Sector: cybersecurity, public safety, finance
- Tools/products/workflows: detection services and “challenge-response” anti-spoofing; API-level safeguards to restrict risky cloning (e.g., high-profile voices)
- Assumptions/dependencies: robust adversarial testing; community norms; integration with KYC/identity verification
- Emotionally aware and therapeutic speech applications
- Sector: healthcare (mental health, speech therapy)
- Tools/products/workflows: fine-grained affect control (“supportive, steady, low arousal”) for therapeutic settings; clinician-in-the-loop tuning
- Assumptions/dependencies: clinical trials and efficacy studies; bias/safety audits for affect prompts; tailored datasets
- Large-scale voice personalization for millions of users
- Sector: consumer platforms, gaming, social
- Tools/products/workflows: per-user voice personas with cloud sync; fast in-context adaptation; moderation pipelines for prompt safety at scale
- Assumptions/dependencies: cost-effective serving under high concurrency; scalable storage of voice profiles; content moderation and governance
- Robotics with situationally adaptive speech
- Sector: robotics, industrial automation
- Tools/products/workflows: context-aware, multilingual speech that adapts to noise and user stress levels; cohesive multi-robot voice coordination
- Assumptions/dependencies: robust environment sensing; multimodal fusion (audio, vision, state); reliability in safety-critical situations
- Broadcast-grade synthetic presenters and dynamic news pipelines
- Sector: media, public sector
- Tools/products/workflows: live, multilingual anchors with style consistency, emergency-tone presets, and audience-specific prosody
- Assumptions/dependencies: editorial standards; real-time fact-checking; strict labeling and provenance
- Educational co-pilots with individualized prosody and pacing
- Sector: education, EdTech
- Tools/products/workflows: adaptive TTS that models student engagement and comprehension; long-context narratives with topic-aware modulation
- Assumptions/dependencies: learning analytics integration; privacy-preserving personalization; longitudinal efficacy studies
- Domain-specific tokenizers and prosody control libraries
- Sector: academia, enterprise R&D
- Tools/products/workflows: specialized codebooks (medical, legal, broadcast) and prosody APIs; multi-token prediction refinements for domain stability
- Assumptions/dependencies: domain corpora and licensing; reproducible training setups; benchmark extensions beyond WER/SIM
- Multimodal, omni-capable audio generation systems
- Sector: software, creative tools
- Tools/products/workflows: unified architecture for speech, sound effects, and music; text-and-audio-conditioned generation for immersive experiences
- Assumptions/dependencies: expanded training modalities; safety guidelines for generative audio; creator-friendly licensing
- Smart city and public service voice infrastructure
- Sector: public sector, energy/utilities
- Tools/products/workflows: multilingual announcements and alerts with controllable urgency; personalized accessibility streams in public spaces
- Assumptions/dependencies: civic policies on synthetic voice use; infrastructure for low-latency delivery; resilience and redundancy planning
- Financial services voice analytics + synthetic response co-pilots
- Sector: finance
- Tools/products/workflows: compliant, multilingual responses with consistent personas; integration with risk analytics to modulate tone for sensitive disclosures
- Assumptions/dependencies: regulatory acceptance; auditability and logs; controlled cloning to prevent fraud
Cross-cutting assumptions and dependencies
- Compute and latency: achieving 97–150 ms first-packet latency depends on optimized runtimes (e.g., torch.compile, CUDA Graphs, vLLM) and hardware acceleration.
- Consent and rights: voice cloning (3-second sample) requires explicit consent, robust identity governance, and clear user controls.
- Safety and labeling: synthetic voice must be labeled and ideally watermarked; deployment policies should mitigate impersonation risks.
- Language coverage and bias: performance depends on training coverage and may vary by language/accent; ongoing evaluation and fine-tuning are needed.
- Integration stack: end-to-end quality relies on upstream ASR/MT and downstream vocoders; streaming packet design and concurrency tuning affect user experience.
- Licensing and openness: Apache 2.0 enables commercial use; productization requires compliance with local regulations and sector-specific standards.
Glossary
- AGI: Artificial General Intelligence; a broad goal of creating AI with human-level capabilities across tasks. "Stable, controllable, and human-like speech synthesis is widely viewed as a key capability on the path to AGI."
- acoustic codebook: A discrete dictionary of tokens that encode low-level acoustic details like timbre and prosody. "a semantic codebook capturing high-level semantic content and an acoustic codebook modeling acoustic detail, prosody, and others."
- autoregressive language modeling: A modeling approach that predicts the next token conditioned on previous tokens. "combined with autoregressive language modeling of discrete units"
- BigVGAN: A neural vocoder architecture for waveform synthesis from spectrograms. "a modified BigVGAN reconstructs the waveform from the generated mel-spectrogram."
- block-wise DiT: Using a Diffusion Transformer in chunked blocks for efficient streaming generation. "enables streaming waveform reconstruction via a block-wise DiT."
- block-wise flow matching: Performing flow-matching-based reconstruction in chunks to support streaming. "with waveform reconstruction via block-wise flow matching to enable streaming synthesis"
- causal ConvNet: A convolutional network restricted to past context, enabling streaming/real-time decoding. "a lightweight causal ConvNet."
- CER: Character Error Rate; a content intelligibility metric for text recognition from speech. "Mixed Error Rate (WER for English, CER for others)"
- ChatML: A markup format for structuring dialog data used to standardize controllable inputs. "All data is formatted in ChatML to standardize inputs and support controllable speech generation."
- chunk-wise inference: Processing or decoding in fixed-size chunks to reduce latency and memory. "Qwen-TTS-Tokenizer-25Hz performs code-to-waveform synthesis through chunk-wise inference."
- Code2Wav: The module that converts predicted audio tokens into time-domain waveforms. "converted into waveforms by the Code2Wav module."
- codec: A learned encoder-decoder that discretizes and reconstructs audio signals. "Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content"
- codebook: The finite set of discrete symbols used to quantize continuous speech features. "employs a 25 Hz single-codebook representation"
- continual pre-training (CPT): Further pre-training a model on curated data after initial training. "perform continual pre-training (CPT) with high-quality data."
- Diffusion Transformer (DiT): A transformer architecture trained with diffusion/flow-matching objectives for generative modeling. "we use a Diffusion Transformer (DiT) trained with Flow Matching."
- Direct Preference Optimization (DPO): A post-training method aligning model outputs with human preference comparisons. "we introduce Direct Preference Optimization (DPO)~\citep{rafailov2024direct}"
- discrete speech representations: Tokenized audio units used instead of continuous features for modeling and generation. "we use discrete speech representations as the cornerstone of our architecture"
- Flow Matching: A training objective related to continuous normalizing flows for generative modeling. "The input code sequence is first mapped to a mel-spectrogram via Flow Matching"
- GAN-based framework: Training with a Generator and Discriminator to improve realism of reconstructions. "Training adopts a GAN-based framework in which the generator operates directly on raw waveforms"
- GSPO: A post-training optimization approach (not expanded in the paper) used with rule-based rewards to improve stability and capability. "we employ rule-based rewards and leverage GSPO to comprehensively enhance the model's capabilities and stability across tasks."
- hierarchical prediction scheme: Predicting tokens in stages (e.g., base layer then residual layers) to capture detail efficiently. "It adopts a hierarchical prediction scheme: the backbone ingests aggregated codebook features to predict the zeroth codebook"
- in-context learning: Conditioning on provided examples (e.g., text–speech pairs) to adapt style without gradient updates. "via in-context learning, which better preserves prosody."
- left-context streaming codec decoder: A decoder that uses only past context, enabling immediate, low-latency audio emission. "uses a pure left-context streaming codec decoder"
- look-ahead: Additional future context required by a model for decoding or synthesis. "Due to the look-ahead requirement in the DiT module"
- mel-spectrogram: A time–frequency representation of audio using the mel scale, commonly used in TTS. "reconstructs mel-spectrograms from the audio tokens."
- Mimi: An architecture employing semantic–acoustic disentangled quantization for speech tokenization. "Building on the semantic–acoustic disentangled quantization strategy of the Mimi architecture"
- Multi-Token Prediction (MTP): Predicting multiple tokens (e.g., across codebooks) at once to reduce latency and improve modeling. "incorporates a Multi-Token Prediction (MTP) module to effectively model the multi-codebook sequence"
- multi-codebook tokenizer: A tokenizer that uses several codebooks (e.g., semantic + acoustic) to represent speech at multiple levels. "Qwen-TTS-Tokenizer-12Hz is a 12.5 Hz multi-codebook tokenizer"
- multi-scale mel-spectrogram reconstruction loss: A loss computed at multiple time–frequency scales to improve reconstruction fidelity. "A multi-scale mel-spectrogram reconstruction loss further enforces time–frequency consistency."
- PESQ: Perceptual Evaluation of Speech Quality; an objective measure of speech quality. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
- probabilistically activated thinking pattern: A training-time mechanism to occasionally trigger intermediate reasoning for better instruction following. "we introduce a probabilistically activated thinking pattern during training to improve instruction following"
- prosody: The rhythm, stress, and intonation patterns in speech. "an acoustic codebook modeling acoustic detail, prosody, and others."
- receptive field: The temporal context a model component can attend to when generating outputs. "The DiT's receptive field is restricted to 4 blocks"
- residual vector quantization (RVQ): A multi-stage quantization method where each stage encodes residual errors from prior stages. "The acoustic path employs a 15-layer residual vector quantization (RVQ) module"
- RTF: Real-time factor; ratio of processing time to audio duration (lower is faster). "The first-packet latency and RTF reported in our table are computed based on the above setup."
- semantic codebook: The codebook layer designed to capture high-level linguistic/semantic content. "a semantic codebook capturing high-level semantic content"
- semantic–acoustic disentangled quantization: A strategy that separates semantic and acoustic information into different token streams. "Building on the semantic–acoustic disentangled quantization strategy of the Mimi architecture"
- Short-Time Objective Intelligibility (STOI): An objective metric estimating speech intelligibility. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
- SFT: Speaker fine-tuning; adapting a base TTS model to a specific speaker with additional data. "We analyze the generalization performance of our speaker fine-tuned (SFT) model variants"
- sliding-window block attention: An attention scheme that limits tokens to a local block context for streaming efficiency. "we propose a sliding-window block attention mechanism that restricts each token to a limited context."
- speaker embedding: A learned vector representing a speaker’s voice characteristics for conditioning/cloning. "via a speaker embedding, enabling real-time cloning"
- speaker encoder: A model component that learns speaker identity features for conditioning generation. "we jointly train a learnable speaker encoder with the backbone."
- speaker vector extraction: Deriving a fixed-length representation of a speaker’s identity from audio. "eliminating the need for speaker vector extraction or complex diffusion models"
- Streaming Detokenizer: The module/procedure that converts token sequences back to audio in a streaming fashion. "Streaming Detokenizer"
- streaming synthesis: Real-time generation of audio as text tokens arrive, without waiting for full input. "with waveform reconstruction via block-wise flow matching to enable streaming synthesis"
- timbre: The tone color/quality of a voice that distinguishes speakers. "preserving timbre across language barriers"
- torch.compile: A PyTorch optimization utility to speed up model execution. "with optimizations applied via torch.compile and CUDA Graph acceleration"
- TPP: Time per packet; the per-packet decode/generation time used in streaming measurements. "tokenizer decode time for per-packet (TPP)"
- TTFP: Time to first packet of tokens; the latency until the LM produces the first group of tokens needed for initial audio emission. "LM time-to-first packet tokens (TTFP)"
- vLLM: A high-throughput inference engine for LLMs. "on our internal vLLM engine (vLLM V0 backend)"
- Vector Quantization (VQ): Discretizing continuous features by assigning them to nearest codebook entries. "a vector quantization (VQ) layer inserted at an intermediate position."
- vocoder: A model that converts spectrogram-like features into waveforms. "the BigVGAN vocoder introduces an extra right-context look-ahead (130 ms)."
- waveform reconstruction: The process of generating time-domain audio from intermediate representations (e.g., tokens or spectrograms). "with waveform reconstruction via block-wise flow matching"
- WavLM: A pretrained speech model used as a teacher for semantic alignment. "For the semantic path, WavLM~\citep{wavlm} serves as a teacher"
- WER: Word Error Rate; an ASR-based metric for content accuracy in generated speech. "Performance is measured by Word Error Rate (WER), where lower is better."
- UTMOS: A neural estimate of MOS (Mean Opinion Score) for perceived speech quality. "Acoustic quality is assessed using Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and UTMOS"
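To make the WER entry above concrete, here is a minimal, self-contained sketch of how an ASR-based WER is typically computed: a word-level edit distance (substitutions, deletions, insertions) divided by the reference length. The function name `wer` and the example sentences are illustrative, not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In practice the hypothesis comes from running an ASR system on the synthesized speech, so WER measures content accuracy rather than audio quality.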
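The residual vector quantization (RVQ) entry can also be illustrated with a tiny NumPy sketch: each stage picks the nearest codeword for the current residual, and later stages encode what earlier stages missed. The codebooks here are random (untrained) and the sizes are arbitrary, so this shows only the mechanism, not the paper's 15-layer acoustic RVQ.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous stage."""
    residual = x.copy()
    indices = []
    recon = np.zeros_like(x)
    for cb in codebooks:
        # Nearest codeword (by Euclidean distance) for the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return indices, recon

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 32, 4  # toy sizes for illustration
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)
indices, recon = rvq_encode(x, codebooks)
# `indices` is the multi-codebook token: one index per stage.
```

With trained codebooks, each extra stage shrinks the reconstruction error, which is why stacking stages gives high fidelity at a low per-stage bitrate.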
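Finally, the sliding-window block attention entry can be made concrete with a mask-construction sketch: each token attends causally, but only to keys in its own block and a fixed number of preceding blocks. The function name and the parameters (`block_size`, `n_blocks`) are hypothetical, chosen small for illustration.

```python
import numpy as np

def block_sliding_window_mask(n_tokens: int, block_size: int, n_blocks: int):
    """Boolean attention mask: True where a query token may attend to a key token.

    A token attends only to causal positions within its own block and the
    (n_blocks - 1) immediately preceding blocks.
    """
    q_block = np.arange(n_tokens)[:, None] // block_size  # block index of each query
    k_block = np.arange(n_tokens)[None, :] // block_size  # block index of each key
    causal = np.arange(n_tokens)[:, None] >= np.arange(n_tokens)[None, :]
    in_window = (q_block - k_block >= 0) & (q_block - k_block < n_blocks)
    return causal & in_window

# 8 tokens, blocks of 2, each token sees its block plus 1 previous block.
mask = block_sliding_window_mask(8, block_size=2, n_blocks=2)
```

Restricting the context this way bounds memory and compute per step, which is what makes block-wise streaming decoding practical.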