
Qwen3-ASR-1.7B: Open-Source ASR Model

Updated 31 January 2026
  • Qwen3-ASR-1.7B is an open-source automatic speech recognition model that performs end-to-end transcription and language identification across 52 languages.
  • It employs a dual-module architecture with an AuT audio encoder and a Qwen3-based decoder for both streaming and offline inference.
  • Benchmarking demonstrates competitive accuracy, robustness in noisy conditions, and scalability compared to leading proprietary APIs.

Qwen3-ASR-1.7B is an open-source, large-scale automatic speech recognition (ASR) model in the Qwen3-ASR family, designed to perform end-to-end ASR and language identification (LID) across 52 languages and dialects. Leveraging the large audio–language model (LALM) capabilities of its foundation model, Qwen3-Omni, Qwen3-ASR-1.7B achieves state-of-the-art results among open-source systems, offers robust streaming and offline inference, and provides a unified architecture for speech understanding and transcription tasks (Shi et al., 29 Jan 2026).

1. Model Architecture

Qwen3-ASR-1.7B comprises two submodules: an AuT audio encoder (~300M parameters) and a Qwen3-1.7B decoder (~1.7B parameters).

  • AuT audio encoder: 12-layer Transformer encoder, d_model = 1024, with 16 attention heads (d_k = d_v = 64). The encoder processes 128-dimensional Fbank features, sampled at 12.5 Hz, and employs an 8× down-sampling scheme. It incorporates FlashAttention with a dynamic window length (1–8 s) to support both streaming and offline inference. The feed-forward inner layer dimension is d_ff ≈ 4096 with GELU activation.
  • Qwen3-1.7B decoder: 24-layer Transformer decoder (pretrained as part of Qwen3-Omni), d_model ≈ 1024, 16 attention heads, and standard causal self-attention plus cross-attention to the encoded representations. A learned projector aligns the encoder and decoder feature spaces.
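
The frame-rate arithmetic implied by these hyperparameters can be sketched as follows. This is an illustrative calculation only: it assumes a standard 10 ms Fbank hop (100 Hz) before the 8× down-sampling, and the function names are not from the Qwen3-ASR codebase.

```python
# Tensor-shape arithmetic implied by the reported hyperparameters.
# Assumption: a standard 10 ms Fbank hop (100 Hz) before 8x down-sampling;
# names here are illustrative, not the actual Qwen3-ASR code.

def encoder_output_frames(audio_seconds: float, fbank_hz: int = 100, downsample: int = 8) -> int:
    """Number of encoder output frames for a clip, given 8x down-sampling."""
    return int(audio_seconds * fbank_hz) // downsample

D_MODEL, N_HEADS = 1024, 16
D_HEAD = D_MODEL // N_HEADS           # 64, matching d_k = d_v = 64

frames = encoder_output_frames(10.0)  # a 10 s utterance
print(frames, D_HEAD)                 # 125 frames (12.5 Hz output rate), 64-dim heads
```

Note that 100 Hz input features with 8× down-sampling yield the 12.5 Hz encoder output rate quoted above.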

All key computations in the architecture follow standard Transformer formulations:

  • Multi-head self-attention:

Q = X W^Q, \quad K = X W^K, \quad V = X W^V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

  • Layer normalization:

\mathrm{LayerNorm}(x) = \gamma \, \frac{x - \mu(x)}{\sigma(x)} + \beta

  • Feed-forward sublayer:

\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2

  • Cross-entropy loss for SFT:

\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \sum_{v=1}^{V} \mathbf{1}(y_t = v) \log p_\theta(y_t = v \mid y_{<t}, \mathrm{audio})

No novel Transformer blocks are introduced beyond the dynamic windowing in FlashAttention.
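
The formulas above can be made concrete with a minimal, single-head NumPy sketch. This is a pedagogical rendering of the standard operations, not the model's actual (multi-head, batched, FlashAttention-based) implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

def cross_entropy(logits, targets):
    """Sum of -log p_theta(y_t) over the sequence; logits: (T, V), targets: (T,)."""
    logp = np.log(softmax(logits))
    return -logp[np.arange(len(targets)), targets].sum()

# Toy forward pass to check shapes.
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (5, 8)
```

Each function corresponds one-to-one with an equation above; only the dimensions are toy-sized.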

2. Pretraining and Fine-tuning Pipeline

Qwen3-ASR-1.7B training proceeds in four stages:

  1. AuT Pretraining: Conducted on approximately 40M hours of pseudo-labeled speech data (predominantly Chinese and English), using an AED-style cross-entropy objective.
  2. Omni Pretraining: The Qwen3-Omni foundation is pretrained on around 3 trillion tokens encompassing audio, vision, and text, with multi-task objectives targeted at general audio–language understanding.
  3. ASR Supervised Fine-tuning (SFT): Employs a multilingual dataset containing 30 languages and 22 Chinese dialects. Style-transfer prompts enforce a "language: X<asr_text>" output schema. Data augmentation includes non-speech detection, streaming enhancement, and context biasing techniques.
  4. ASR Reinforcement Learning (RL): Applies Group Sequence Policy Optimization (GSPO) on 50K utterances, stratified as 35% Chinese/English, 35% other languages, and 30% functional data. RL improves noise robustness, long-form transcription stability, and challenging scenario generalization.

This multi-stage pipeline allows for both extensive language coverage and robustness across realistic ASR use cases.
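
The "language: X<asr_text>" output schema used in SFT can be sketched as a simple target-formatting function. The exact template tokens used in training are not fully specified in the source, so the function below is an assumption about how such targets are assembled.

```python
# Illustrative construction of an SFT target following the reported
# "language: X<asr_text>" output schema. The precise template used in
# training is an assumption, not the published recipe.

def build_asr_target(language: str, transcript: str) -> str:
    """Format a supervised target: language tag first, then the transcript."""
    return f"language: {language}<asr_text>{transcript}"

print(build_asr_target("English", "hello world"))
# language: English<asr_text>hello world
```

A schema like this lets a single decoding pass emit both the LID decision and the transcription.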

3. Inference and Decoding

Qwen3-ASR-1.7B utilizes autoregressive greedy decoding for next-token prediction without beam search. Language identification is integrated by prompting the model with the target language identifier, e.g., <|im_start|>assistant language <lang><asr_text>….
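
Greedy autoregressive decoding as described reduces to a simple arg-max loop. The sketch below is schematic: `step` stands in for one decoder forward pass returning next-token logits, and the token ids are illustrative.

```python
# Minimal autoregressive greedy decoding loop (no beam search), as described.
# `step` is a stand-in for one decoder forward pass over the prefix.
import numpy as np

def greedy_decode(step, prompt_ids, eos_id, max_len=100):
    ids = list(prompt_ids)
    for _ in range(max_len):
        logits = step(ids)            # (vocab,) logits for the next token
        nxt = int(np.argmax(logits))  # greedy: take the arg-max token
        ids.append(nxt)
        if nxt == eos_id:
            break
    return ids

# Toy check: a "model" that always prefers token 2, which we treat as EOS.
toy = lambda ids: np.array([0.0, 0.1, 0.9])
print(greedy_decode(toy, [5], eos_id=2))  # [5, 2]
```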

For streaming inference, input is chunked into 2-second windows with a 5-token fallback, retaining state over the last 4 chunks. Efficiency metrics (test conditions: single A100 GPU, bfloat16, CUDA Graph, vLLM v0.14.0) are as follows:

| Concurrency | Offline RTF | Offline Throughput (s/s) | Online TTFT_avg (ms) | Online RTF | Online Throughput (s/s) |
|---|---|---|---|---|---|
| 1 | 0.01482 | 67.48 | 102 | 0.01483 | 67.43 |
| 8 | 0.02072 | 386.10 | 224 | 0.02000 | 400.00 |
| 64 | 0.07360 | 869.57 | 1597 | 0.06208 | 1030.93 |
| 128 | 0.13056 | 980.39 | 3392 | 0.10496 | 1219.51 |

RTF: Real-Time Factor (processing time divided by audio duration; lower is better). TTFT: Time To First Token.

This suggests the model scales efficiently with increased parallelization, though RTF and TTFT rise at high concurrency.
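
The streaming scheme above (2 s chunks, state retained over the last 4 chunks) can be sketched as a chunk iterator. The structure below is an assumption for illustration; the 5-token fallback is modeled only as a constant, and the real system carries encoder state rather than raw audio.

```python
# Sketch of the described streaming scheme: audio is consumed in 2 s chunks
# and the last 4 chunks are retained as left context. All structure here is
# an assumption for illustration, not the actual Qwen3-ASR implementation.

CHUNK_SECONDS = 2.0
CONTEXT_CHUNKS = 4
FALLBACK_TOKENS = 5   # per the paper: a 5-token fallback at chunk boundaries

def stream_chunks(samples, sample_rate=16000):
    """Yield (context_chunks, chunk) pairs over a mono waveform (list of floats)."""
    size = int(CHUNK_SECONDS * sample_rate)
    history = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        context = history[-CONTEXT_CHUNKS:]   # retain only the last 4 chunks
        yield context, chunk
        history.append(chunk)

pairs = list(stream_chunks([0.0] * 16000 * 12))  # 12 s of silence -> 6 chunks
print(len(pairs), len(pairs[-1][0]))             # 6 chunks; 4 context chunks at the end
```

Bounding the retained context is what keeps per-chunk compute (and hence online RTF) roughly constant as audio length grows.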

4. Evaluation and Benchmarking

Qwen3-ASR-1.7B demonstrates competitive performance against commercial APIs (GPT-4o, Gemini-2.5) and open-source models (Whisper-lv3) across standard and internal benchmarks.

Open-source benchmarks (WER/CER):

| Dataset | GPT-4o | Gemini-2.5 | Whisper-lv3 | Qwen3-ASR-1.7B |
|---|---|---|---|---|
| LibriSpeech (clean) | 1.39 | 2.89 | 1.51 | 1.63 |
| LibriSpeech (other) | 3.75 | 3.56 | 3.97 | 3.38 |
| GigaSpeech (en) | 25.50 | 9.37 | 9.76 | 8.45 |
| Fleurs-en | 2.40 | 2.94 | 4.08 | 3.35 |
| WenetSpeech (net) | 15.30 | 14.43 | 9.86 | 4.97 |
| AISHELL-2 | 4.24 | 11.62 | 5.06 | 2.71 |
| Keystone (Cantonese) | 26.87 | 24.71 | 28.79 | 5.10 |
| Wenet-Chuan (hard) | 43.79 | 67.30 | 26.80 | 21.63 |

Internal Robustness Suite (WER):

| Test | GPT-4o | Gemini-2.5 | Whisper | Qwen3-0.6B | Qwen3-1.7B |
|---|---|---|---|---|---|
| Accented English | 28.56 | 23.85 | 21.30 | 16.62 | 16.07 |
| Mandarin (elder) | 14.27 | 36.93 | 10.61 | 4.48 | 3.81 |
| Mandarin (noise) | 36.11 | 29.06 | 63.17 | 17.88 | 16.17 |
| Cantonese dialog | 16.05 | 14.98 | 31.04 | 4.80 | 4.12 |
| 22-dialect mix | 45.37 | 47.70 | 44.55 | 18.24 | 15.94 |

Multilingual and LID performance:

  • Average WER on Fleurs (12 languages): 4.90%
  • MLS (8 languages): 8.55%
  • CommonVoice (13 languages): 9.18%
  • Language identification accuracy (30 languages): 97.9% (Whisper-lv3: 94.1%).

Singing and Songs (WER):

| Benchmark | Doubao | FunASR | WhisperX | Qwen3-1.7B |
|---|---|---|---|---|
| M4Singer | 7.88 | 7.29 | 13.58 | 5.98 |
| Opencpop | 3.80 | 2.98 | 9.52 | 3.08 |
| PopCS | 8.97 | 9.42 | 13.77 | 8.52 |
| EntireSongs-en | N/A | N/A | N/A | 14.60 |
| EntireSongs-zh | N/A | N/A | N/A | 13.91 |

Streaming vs. Offline:

| Mode | LibriSpeech (clean / other) | Fleurs-en | Fleurs-zh |
|---|---|---|---|
| Offline | 1.63 / 3.38 | 3.35 | 2.41 |
| Streaming | 1.95 / 4.51 | 4.02 | 2.84 |

The model exhibits state-of-the-art open-source ASR accuracy across Mandarin, Chinese dialects, and singing benchmarks, and achieves results competitive with proprietary APIs on real-world, robustness, and low-resource tasks.
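
The WER figures throughout this section follow the standard edit-distance definition: (substitutions + deletions + insertions) divided by reference length (CER is the same computation over characters). A minimal Levenshtein-based sketch:

```python
# Standard word error rate via dynamic-programming edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words ≈ 0.333
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why some hard-condition entries in the tables above approach or exceed 60.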

5. Strengths and Limitations

Strengths:

  • State-of-the-art open-source ASR performance across English, Mandarin, regional dialects, and singing.
  • Competitive LID and transcription accuracy compared to leading proprietary APIs.
  • Unified streaming and offline inference implemented via dynamic FlashAttention.
  • Integrated prompt-based LID.
  • Open sourcing under Apache-2.0 facilitates broad adoption and further research.

Limitations:

  • Inference RTF (0.015–0.13) at 1.7B scale may hinder use in extremely low-latency, on-device scenarios.
  • Multilingual performance degrades on long-tail languages in larger benchmarks (e.g., 30-language Fleurs).
  • Decoding does not currently expose beam search, which may limit performance on noisy or ambiguous inputs where n-best rescoring is critical.
  • Reproducing the training pipeline—particularly RL and large-scale pretraining—requires substantial computational resources and expertise.

6. Context and Impact

Qwen3-ASR-1.7B places open-source ASR systems in direct competition with the most capable commercial APIs in both accuracy and real-world robustness. Its architecture exemplifies the trend toward LALM-based ASR, broad multilingualism, and strong integration with language identification. The model’s release—including code for inference and finetuning—under a permissive license substantially reduces barriers to deployment in production and research environments. A plausible implication is increased uptake in commercial, academic, and multilingually diverse ASR pipelines, especially where transparent licensing and custom model adaptation are priorities (Shi et al., 29 Jan 2026).
