Qwen3-ASR-1.7B: Open-Source ASR Model
- Qwen3-ASR-1.7B is an open-source automatic speech recognition model that performs end-to-end transcription and language identification across 52 languages.
- It employs a dual-module architecture with an AuT audio encoder and a Qwen3-based decoder for both streaming and offline inference.
- Benchmarking demonstrates competitive accuracy, robustness in noisy conditions, and scalability compared to leading proprietary APIs.
Qwen3-ASR-1.7B is an open-source, large-scale automatic speech recognition (ASR) model in the Qwen3-ASR family, designed to perform end-to-end ASR and language identification (LID) across 52 languages and dialects. Leveraging the large audio–language model (LALM) capabilities of its foundation model, Qwen3-Omni, Qwen3-ASR-1.7B achieves state-of-the-art results among open-source systems, supports robust streaming and offline inference, and provides a unified architecture for speech understanding and transcription tasks (Shi et al., 29 Jan 2026).
1. Model Architecture
Qwen3-ASR-1.7B comprises a dual submodule structure: an AuT audio encoder (~300M parameters) and a Qwen3-1.7B decoder (~1.7B parameters).
- AuT audio encoder: 12-layer Transformer encoder with 16 attention heads per layer. The encoder processes 128-dimensional Fbank features, sampled at 12.5 Hz, and employs an 8× down-sampling scheme. It incorporates FlashAttention with a dynamic window length (1–8 s) to support both streaming and offline inference. The feed-forward inner layers use GELU activation.
- Qwen3-1.7B decoder: 24-layer Transformer decoder (pretrained as part of Qwen3-Omni) with 16 attention heads, standard causal self-attention, and cross-attention to the encoded representations. A learned projector aligns the encoder and decoder feature spaces.
All key computations in the architecture follow standard Transformer formulations:
- Multi-head self-attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
- Layer normalization: $\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
- Feed-forward sublayer: $\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)\,W_2 + b_2$
- Cross-entropy loss for SFT: $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$
No novel Transformer blocks are introduced beyond the dynamic windowing in FlashAttention.
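To make the dynamic-window idea concrete, here is a minimal single-head sketch in NumPy (a toy illustration with hypothetical helper names, not the actual implementation, which uses multi-head FlashAttention kernels): frames may only attend within a fixed window, and widening that window is what lets one encoder serve both streaming and offline inference.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_self_attention(x, w_q, w_k, w_v, window):
    """Single-head scaled dot-product self-attention restricted to a
    local window of +/- `window` frames around each position."""
    # Project inputs to queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (T, T) similarity matrix
    idx = np.arange(x.shape[0])
    # Mask out positions more than `window` frames away.
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v
```

With `window=0` each frame attends only to itself; growing the window recovers full (offline) self-attention over the utterance.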
2. Pretraining and Fine-tuning Pipeline
Qwen3-ASR-1.7B training proceeds in four stages:
- AuT Pretraining: Conducted on approximately 40M hours of pseudo-labeled speech data (predominantly Chinese and English), using an AED-style cross-entropy objective.
- Omni Pretraining: The Qwen3-Omni foundation is pretrained on around 3 trillion tokens encompassing audio, vision, and text, with multi-task objectives targeted at general audio–language understanding.
- ASR Supervised Fine-tuning (SFT): Employs a multilingual dataset containing 30 languages and 22 Chinese dialects. Style-transfer prompts enforce a "language: X<asr_text>" output schema. Data augmentation includes non-speech detection, streaming enhancement, and context biasing techniques.
- ASR Reinforcement Learning (RL): Applies Group Sequence Policy Optimization (GSPO) on 50K utterances, stratified as 35% Chinese/English, 35% other languages, and 30% functional data. RL improves noise robustness, long-form transcription stability, and challenging scenario generalization.
This multi-stage pipeline allows for both extensive language coverage and robustness across realistic ASR use cases.
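As a small illustration of the SFT stage, the reported "language: X&lt;asr_text&gt;" output schema can be rendered as a target-formatting helper (the exact special-token spelling beyond the quoted schema is an assumption; function names are hypothetical):

```python
def format_sft_target(language, transcript):
    """Assemble one SFT training target following the reported
    'language: X<asr_text>' output schema."""
    return f"language: {language}<asr_text>{transcript}"

def build_batch(examples):
    """Format a batch of (language, transcript) pairs drawn from the
    30-language / 22-dialect SFT mixture described above."""
    return [format_sft_target(lang, text) for lang, text in examples]
```

Training then minimizes token-level cross-entropy on these targets, so the model learns to emit the language tag before the transcription.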
3. Inference and Decoding
Qwen3-ASR-1.7B uses autoregressive greedy decoding for next-token prediction, without beam search. Language identification is integrated by prompting the model with the target language identifier, e.g., `<|im_start|>assistant language <lang><asr_text>…`.
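The greedy loop can be sketched as follows (a minimal stand-in: `step_fn` is a hypothetical placeholder for the real decoder forward pass, and the token IDs are toy values):

```python
import numpy as np

def greedy_decode(step_fn, prompt_ids, eos_id, max_len=64):
    """Autoregressive greedy decoding, as used by the model: at each
    step take the argmax token over the vocabulary and append it,
    stopping at the end-of-sequence token or a length cap.

    step_fn(ids) -> logits over the vocabulary for the next token.
    """
    ids = list(prompt_ids)
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(ids)))
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

Because there is no beam, decoding cost is linear in output length, which keeps the RTF figures below low even at high concurrency.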
For streaming inference, input is chunked into 2-second windows with a 5-token fallback, retaining state over the last 4 chunks. Efficiency metrics (test conditions: single A100 GPU, bfloat16, CUDA Graph, vLLM v0.14.0) are as follows:
| Concurrency | Offline RTF | Offline Throughput (s/s) | Online TTFT_avg (ms) | Online RTF | Online Throughput (s/s) |
|---|---|---|---|---|---|
| 1 | 0.01482 | 67.48 | 102 | 0.01483 | 67.43 |
| 8 | 0.02072 | 386.10 | 224 | 0.02000 | 400.00 |
| 64 | 0.07360 | 869.57 | 1597 | 0.06208 | 1030.93 |
| 128 | 0.13056 | 980.39 | 3392 | 0.10496 | 1219.51 |
RTF: Real-Time Factor (processing time divided by audio duration; values below 1 are faster than real time). TTFT: Time To First Token. Throughput is measured in seconds of audio processed per second of wall-clock time, and in this table equals approximately concurrency / RTF.
This suggests the model scales efficiently with increased parallelization, though RTF and TTFT rise at high concurrency.
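The streaming input pipeline described above can be sketched as a chunker that yields, for each new 2-second chunk, the window the model would see (current chunk plus up to four retained previous chunks). This is an illustrative sketch only: sample rate and framing are assumptions, and the 5-token fallback on the decoder side is not modeled here.

```python
from collections import deque

def stream_chunks(samples, sr=16000, chunk_s=2.0, history=4):
    """Split an audio stream into fixed-size chunks and yield, per
    chunk, the context window: up to `history` previous chunks plus
    the current one (the reported configuration retains state over
    the last 4 chunks)."""
    size = int(sr * chunk_s)
    kept = deque(maxlen=history)
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        yield list(kept) + [chunk]
        kept.append(chunk)
```

Early windows are short (less context), then the window saturates at five chunks, bounding per-step compute regardless of stream length.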
4. Evaluation and Benchmarking
Qwen3-ASR-1.7B demonstrates competitive performance against commercial APIs (GPT-4o, Gemini-2.5) and open-source models (Whisper-large-v3, abbreviated Whisper-lv3 in the tables below) across standard and internal benchmarks.
Open-source benchmarks (WER/CER):
| Dataset | GPT-4o | Gemini-2.5 | Whisper-lv3 | Qwen3-ASR-1.7B |
|---|---|---|---|---|
| LibriSpeech (clean) | 1.39 | 2.89 | 1.51 | 1.63 |
| LibriSpeech (other) | 3.75 | 3.56 | 3.97 | 3.38 |
| GigaSpeech (en) | 25.50 | 9.37 | 9.76 | 8.45 |
| Fleurs-en | 2.40 | 2.94 | 4.08 | 3.35 |
| WenetSpeech (net) | 15.30 | 14.43 | 9.86 | 4.97 |
| AISHELL-2 | 4.24 | 11.62 | 5.06 | 2.71 |
| Keystone (Cantonese) | 26.87 | 24.71 | 28.79 | 5.10 |
| Wenet-Chuan (hard) | 43.79 | 67.30 | 26.80 | 21.63 |
Internal Robustness Suite (WER):
| Test | GPT-4o | Gemini-2.5 | Whisper-lv3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|
| Accented English | 28.56 | 23.85 | 21.30 | 16.62 | 16.07 |
| Mandarin (elder) | 14.27 | 36.93 | 10.61 | 4.48 | 3.81 |
| Mandarin (noise) | 36.11 | 29.06 | 63.17 | 17.88 | 16.17 |
| Cantonese dialog | 16.05 | 14.98 | 31.04 | 4.80 | 4.12 |
| 22-dialect mix | 45.37 | 47.70 | 44.55 | 18.24 | 15.94 |
Multilingual and LID performance:
- Average WER on Fleurs (12 languages): 4.90%
- MLS (8 languages): 8.55%
- CommonVoice (13 languages): 9.18%
- Language identification accuracy (30 languages): 97.9% (Whisper-lv3: 94.1%).
Singing and Songs (WER):
| Benchmark | Doubao | FunASR | WhisperX | Qwen3-1.7B |
|---|---|---|---|---|
| M4Singer | 7.88 | 7.29 | 13.58 | 5.98 |
| Opencpop | 3.80 | 2.98 | 9.52 | 3.08 |
| PopCS | 8.97 | 9.42 | 13.77 | 8.52 |
| EntireSongs-en | – | – | N/A | 14.60 |
| EntireSongs-zh | – | – | N/A | 13.91 |
Streaming vs. Offline:
| Mode | LibriSpeech (clean/other) | Fleurs-en | Fleurs-zh |
|---|---|---|---|
| Offline | 1.63 / 3.38 | 3.35 | 2.41 |
| Streaming | 1.95 / 4.51 | 4.02 | 2.84 |
The model exhibits SOTA open-source ASR accuracy across Mandarin, Chinese dialects, and singing benchmarks and achieves competitive results with proprietary APIs on real-world, robustness, and low-resource tasks.
5. Strengths and Limitations
Strengths:
- State-of-the-art open-source ASR performance across English, Mandarin, regional dialects, and singing.
- Competitive LID and transcription accuracy compared to leading proprietary APIs.
- Unified streaming and offline inference implemented via dynamic FlashAttention.
- Integrated prompt-based LID.
- Open sourcing under Apache-2.0 facilitates broad adoption and further research.
Limitations:
- Inference RTF (0.015–0.13) at 1.7B scale may hinder use in extremely low-latency, on-device scenarios.
- Multilingual performance degrades on long-tail languages in larger benchmarks (e.g., 30-language Fleurs).
- Decoding does not currently expose beam search, which may limit performance on noisy or ambiguous inputs where n-best rescoring is critical.
- Reproducing the training pipeline—particularly RL and large-scale pretraining—requires substantial computational resources and expertise.
6. Context and Impact
Qwen3-ASR-1.7B places open-source ASR systems in direct competition with the most capable commercial APIs in both accuracy and real-world robustness. Its architecture exemplifies the trend toward LALM-based ASR, broad multilingualism, and strong integration with language identification. The model’s release—including code for inference and finetuning—under a permissive license substantially reduces barriers to deployment in production and research environments. A plausible implication is increased uptake in commercial, academic, and multilingually diverse ASR pipelines, especially where transparent licensing and custom model adaptation are priorities (Shi et al., 29 Jan 2026).