Qwen-3 Transformer Architecture

Updated 11 January 2026
  • Qwen-3 Transformer Architecture is a family of LLMs featuring flexible multilingual modeling, hierarchical reasoning, and multimodal capabilities, with parameter counts ranging from 0.6B to 235B.
  • It leverages both dense and Mixture-of-Experts variants and introduces innovative thinking/non-thinking inference modes with advanced positional encodings and expert routing.
  • The architecture underpins state-of-the-art performance in code generation, mathematical reasoning, and vision-language tasks, and is open-sourced to promote academic research.

Qwen-3 refers to a family of Transformer-based LLMs that emphasizes flexible multilingual language modeling, hierarchical reasoning protocols, and multimodal capabilities, with parameter counts ranging from 0.6 billion to 235 billion. Architecturally, Qwen-3 encompasses both dense and Mixture-of-Experts (MoE) variants. Notable design choices include the introduction of "thinking" and "non-thinking" inference modes within a unified model, a thinking-budget mechanism to trade off inference latency against reasoning depth, and innovations in positional encodings and multi-branch expert routing. Qwen-3 underpins subsequent multimodal systems, such as Qwen3-VL and Qwen3-Omni, which extend the architecture to vision, audio, and video reasoning (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025, Bai et al., 26 Nov 2025).

1. Model Family, Parameterizations, and Tokenization

Qwen-3 models are released in both dense and sparse (MoE) formats, covering a spectrum of model sizes and architectural capacities:

| Model | Layers | Query/KV Heads | Experts (MoE) | Parameters | Context Window |
|---|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16/8 | – | 0.6B | 32K |
| Qwen3-1.7B | 28 | 16/8 | – | 1.7B | 32K |
| Qwen3-4B | 36 | 32/8 | – | 4B | 128K |
| Qwen3-8B | 36 | 32/8 | – | 8B | 128K |
| Qwen3-14B | 40 | 40/8 | – | 14B | 128K |
| Qwen3-32B | 64 | 64/8 | – | 32B | 128K |
| Qwen3-30B-A3B | 48 | 32/4 | 128 | 30B (3B active) | 128K |
| Qwen3-235B-A22B | 94 | 64/4 | 128 | 235B (22B active) | 128K |

All models utilize a byte-level BPE tokenizer with a vocabulary size of 151,669 (Yang et al., 14 May 2025).

A core architectural principle is the support for extensive multilingual data, extending native generation and comprehension from 29 languages (Qwen2.5) to 119 languages and dialects in Qwen-3, facilitated by a massive 36T-token multilingual pretraining corpus (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025).

2. Transformer Block Composition and Expert Routing

Each Qwen-3 block implements pre-normalization using RMSNorm, Grouped-Query Attention (GQA), rotary position embeddings (RoPE), and a SwiGLU feed-forward network. Self-attention employs QK-Norm for stability:

  • RMSNorm: $\hat{x} = \frac{x}{\mathrm{RMS}(x)} \odot g$, where $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d}\sum_i x_i^2}$ and $g$ is a learned scale vector.
  • Self-Attention with RoPE + QK-Norm:
    • $Q = W_q \hat{x}$, $K = W_k \hat{x}$, $V = W_v \hat{x}$
    • $Q', K' = \operatorname{RoPE}(Q), \operatorname{RoPE}(K)$
    • $\tilde{Q} = (Q'/\|Q'\|) \cdot c$, $\tilde{K} = (K'/\|K'\|) \cdot c$, with $c$ a learned scale
    • $A = \operatorname{softmax}(\tilde{Q}\tilde{K}^T/\sqrt{d_k})$, $y = AV$, $y_{\text{out}} = x + W_o y$
  • SwiGLU Feed-Forward: $z = W_2(\operatorname{SiLU}(W_1 \hat{x}) \odot W_1' \hat{x})$, $x' = y_{\text{out}} + z$ (Yang et al., 14 May 2025)
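The block composition above can be made concrete with a short sketch. The following NumPy code is illustrative only (a single-head simplification of GQA, with the learned QK-Norm scale $c$ folded to 1), not the reference implementation:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of x, then apply learned gain g.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * g

def rope(x, base=10_000.0):
    # Rotary position embedding over the last dimension (must be even).
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # Attention sub-layer (pre-norm). Single-head simplification: real GQA
    # shares each KV head across a group of query heads.
    h = rms_norm(x, p["g1"])
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    q, k = rope(q), rope(k)
    # QK-Norm: normalize queries/keys before the dot product.
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    x = x + softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ p["Wo"]
    # SwiGLU feed-forward sub-layer (SiLU-gated).
    h = rms_norm(x, p["g2"])
    silu = lambda z: z / (1.0 + np.exp(-z))
    return x + (silu(h @ p["W1"]) * (h @ p["W1g"])) @ p["W2"]
```

Both sub-layers are residual, matching the $y_{\text{out}} = x + W_o y$ and $x' = y_{\text{out}} + z$ forms above.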

In MoE variants, a linear routing mechanism selects the top-$k$ experts per token. Each expert is a two-layer FFN. Tokens are routed via a router softmax, and expert activation is governed by a global capacity limit and a load-balancing auxiliary loss, $L_{\text{balance}}$, which encourages uniform expert utilization:

$\mathrm{MoE}(x_i) = \sum_{e \in \text{top-}k} g_{ie} \cdot \mathrm{Expert}_e(x_i)$

Capacity per expert is $C = \lceil (N_{\text{tokens}} \cdot k / E) \cdot \alpha \rceil$ with $\alpha = 1.25$. Load balancing is evaluated by $\lambda \cdot E_e[p_e] \cdot E_e[l_e]$ (Yang et al., 14 May 2025).
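A minimal sketch of the routing just described, assuming a plain linear router and toy two-layer FFN experts (the function name `moe_route` and all parameter shapes are illustrative, not from the Qwen codebase):

```python
import math
import numpy as np

def moe_route(x, Wr, experts, k=2, alpha=1.25):
    # x: (N, d) token activations; Wr: (d, E) linear router; experts: E callables.
    N, E = x.shape[0], Wr.shape[1]
    logits = x @ Wr
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # router softmax
    topk = np.argsort(-probs, axis=-1)[:, :k]         # top-k experts per token
    capacity = math.ceil(N * k / E * alpha)           # C = ceil(N*k/E * alpha)
    counts = np.zeros(E, dtype=int)
    out = np.zeros_like(x)
    for i in range(N):
        for e in topk[i]:
            if counts[e] < capacity:                  # drop assignments over capacity
                counts[e] += 1
                out[i] += probs[i, e] * experts[e](x[i])  # gate-weighted output
    return out, counts
```

The auxiliary loss $L_{\text{balance}}$ (omitted here) would penalize skew in `counts` relative to the uniform load $N \cdot k / E$.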

3. Unified Thinking and Non-Thinking Inference Mechanism

Qwen-3 introduces a prompt-controlled, dual-mode inference protocol allowing a trade-off between rapid responses and explicit intermediate reasoning:

  • Non-Thinking Mode: No intermediate reasoning (chain-of-thought); the response is emitted immediately, triggered by /no_think or an empty <think> </think> span.
  • Thinking Mode: Generates explicit step-wise chain-of-thought reasoning within <think>...</think> spans, controlled by a user-specified reasoning token budget $B$.
  • Inference Process: Prompt structure is:

<|im_start|>user {query} {/think or /no_think} <|im_end|>
<|im_start|>assistant <think> ... </think> {final answer} <|im_end|>

  • Thinking Budget: $B$ specifies the maximum number of tokens for intermediate reasoning. If $B$ is reached before the reasoning completes, a stop token is forcibly appended.

Accuracy on reasoning-heavy tasks increases as $B$ grows, enabling users to explicitly tune the latency/accuracy trade-off at inference time (Yang et al., 14 May 2025). This mechanism is implemented within a single model, removing the need for separate chat-optimized and reasoning-optimized deployments.
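The budget enforcement can be illustrated with a toy decode loop. Here `step_fn` is a hypothetical stand-in for the model's next-token step, and the forced-stop behavior follows the description above:

```python
def generate_with_budget(step_fn, prompt_tokens, budget, max_new=64,
                         think_close="</think>"):
    # step_fn(tokens) -> next token; a hypothetical stand-in for model decoding.
    # The prompt is assumed to already contain an opening <think> token.
    tokens, spent, thinking = list(prompt_tokens), 0, True
    for _ in range(max_new):
        if thinking and spent >= budget:
            tokens.append(think_close)   # budget B exhausted: force the stop token
            thinking = False
            continue
        t = step_fn(tokens)
        tokens.append(t)
        if t == think_close:
            thinking = False             # model closed its own reasoning span
        elif thinking:
            spent += 1                   # only reasoning tokens count against B
    return tokens
```

With `budget=0` this degrades to non-thinking behavior; larger budgets allow longer reasoning spans before the forced close.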

4. Multimodal and Long-Context Extensions

Qwen-3 serves as the backbone for vision-language (VL) and multimodal variants (Omni):

  • Qwen3-VL: Extends context window up to 256K tokens (with YaRN-augmented RoPE), and supports interleaved text, image, and video. Key technical upgrades include Interleaved-MRoPE (multiaxis rotary embeddings across temporal, height, and width), DeepStack multi-level vision-language fusion (injecting hierarchical ViT features into early LLM layers), and explicit textual timestamp alignment for precise video understanding (Bai et al., 26 Nov 2025).
  • Qwen3-Omni: Implements a “Thinker–Talker” MoE pipeline: a large MoE Transformer for cross-modal reasoning (“Thinker”) and a compact MoE speech generator (“Talker”), unified by TM-RoPE for all input modalities. The system achieves 234 ms first-packet latency for audio and supports streaming generation for text, image, audio, and video tasks (Xu et al., 22 Sep 2025).
  • Positional Encoding: All multimodal models use advanced rotary schemes (Interleaved-MRoPE, TM-RoPE) and textual time alignment for robust, long-context retention in both vision and video.

5. Pretraining, Data Mixture, and Optimization

Pretraining is conducted on 36T tokens, distributed over three curriculum phases:

  1. General: 30T tokens, sequence length 4,096, broad language coverage.
  2. Reasoning-Enhanced: 5T tokens of STEM and code data, accelerated learning-rate decay, sequence length 4,096.
  3. Long-Context: Hundreds of billions of tokens at long sequence lengths (32,768+), with the RoPE base frequency extended to $10^6$ via ABF (adjusted base frequency) scaling, plus YaRN and dual-chunk attention kernels enabling inference up to 128K tokens. (Yang et al., 14 May 2025)
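The effect of raising the RoPE base frequency in the long-context phase can be seen from the rotary wavelengths. A small illustrative sketch, assuming the standard RoPE frequency schedule $\theta_i = b^{-2i/d}$:

```python
import numpy as np

def rope_inv_freq(dim, base):
    # Per-pair inverse frequencies theta_i = base^(-2i/dim) used by RoPE.
    return base ** (-np.arange(0, dim, 2) / dim)

def max_wavelength(dim, base):
    # Wavelength, in token positions, of the slowest-rotating RoPE pair.
    return 2 * np.pi / rope_inv_freq(dim, base)[-1]
```

Raising the base from $10^4$ to $10^6$ stretches the slowest pair's wavelength by roughly two orders of magnitude, so positions hundreds of thousands of tokens apart remain distinguishable before extrapolation methods like YaRN are even applied.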

Data sources include curated web text, books, code, synthetic data (from Qwen2.5-Coder/Math), and PDF-extracted text. Multilingual annotation labels on tokens enable fine-grained mixture weighting. The optimizer is AdamW; weight decay and batch size are scheduled according to scaling laws. Weights use standard normal initialization, and QK-Norm constants are tuned for stable training.

In VL and Omni variants, vision features are extracted via SigLIP2 or related ViT modules. Vision-language alignment is further refined via DeepStack mergers and frozen-backbone warm starts during pretraining on large-scale image–text/OCR corpora (Bai et al., 26 Nov 2025).

6. Performance, Applications, and Latency-Quality Tradeoffs

Qwen-3 achieves state-of-the-art results across code generation, mathematical reasoning, agent tasks, and vision-language benchmarks (Yang et al., 14 May 2025, Bai et al., 26 Nov 2025). MoE models, with sparse expert activation, are competitive with or superior to larger dense and proprietary models at lower computational cost per token. Representative results include:

| Model | Context Length | Accuracy (Reasoning QA) | Latency (ms/token) |
|---|---|---|---|
| Qwen3-32B (dense) | 128K | 77.8% | 2.0 |
| Qwen3-30B-A3B | 128K | 80.1% | 1.8 |
| Qwen3-VL-235B-A22B | 256K | 100% (up to 30-min video) | – |

Empirical evaluations demonstrate that increasing the thinking token budget $B$ improves downstream accuracy while raising inference latency in a predictable, user-tunable fashion (Yang et al., 14 May 2025).

Qwen3-VL achieves 100% accuracy on Needle-in-a-Haystack for 256K context and 99.5% on extrapolated 1M tokens with YaRN (Bai et al., 26 Nov 2025). In video and multimodal QA tasks, dense and MoE models from the Qwen-3 family match or outperform previous models with reduced parameter budgets and shorter per-token latencies.

7. Open Source Status and Future Directions

All Qwen-3 models, spanning dense and MoE, as well as multimodal extensions (VL, Omni), are released under the Apache 2.0 license. This includes flagship, instruction-tuned, and specialist models (e.g., for audio captioning). Public availability is intended to foster reproducibility, transparency, and continued academic research (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025, Bai et al., 26 Nov 2025).

A plausible implication is that the unified thinking/non-thinking protocol and scalable MoE frameworks of Qwen-3 will inform subsequent generative model design, particularly in areas needing controllable reasoning, multimodal capability, and adaptive inference budgets under diverse computational constraints.
