Qwen-3 and Llama-3 Model Families
- Qwen-3 and Llama-3 are dense Transformer-based foundation models designed for language, code, reasoning, and multimodal tasks.
- Llama-3 employs advanced techniques such as RoPE, grouped-query attention, and compositional integration for vision, video, and speech.
- Both model families use extensive training and safety alignment strategies, with Llama-3 offering higher benchmark scores and extended context support.
The Qwen-3 and Llama-3 model families represent contemporary dense Transformer-based foundation models targeting high performance across language, code, reasoning, and multimodal tasks. These systems are natively multilingual, support tool usage, and in the case of Llama 3, offer compositional integration of vision, video, and speech capabilities. The following presents an in-depth comparative overview covering architectural configuration, training methods, benchmarked capabilities, safety and alignment strategies, and multimodal extensions, focusing on those features explicitly documented for Llama 3 and Qwen-3 (Grattafiori et al., 2024).
1. Model Architectures and Hyperparameters
Both Qwen-3 and Llama-3 employ standard dense Transformer architectures with RoPE positional embeddings and grouped-query attention (GQA). Key configuration details are outlined in the table below.
| Model | Layers | Hidden Dim. | Attention Heads (GQA KV) | Context Window | Parameters |
|---|---|---|---|---|---|
| Llama 3 8 B | 32 | 4,096 | 32 (8) | 8K / 128K | 8 B |
| Llama 3 70 B | 80 | 8,192 | 64 (8) | 8K / 128K | 70 B |
| Llama 3 405 B | 126 | 16,384 | 128 (8) | 8K / 128K | 405 B |
| Qwen-3 7 B | 32 | 4,096 | 32 | 8K / 32K | 7 B |
| Qwen-3 70 B | 80 | 8,192 | 64 | 8K / 32K | 70 B |
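The GQA column above translates directly into inference-memory savings. A back-of-the-envelope sketch of per-token KV-cache size, using the Llama 3 70 B row of the table (head dimension taken as hidden dim divided by query heads; fp16 caches assumed):

```python
# KV-cache size per generated token: 2 (K and V) * layers * kv_heads * head_dim * bytes.

def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

layers, q_heads, kv_heads, hidden = 80, 64, 8, 8192
head_dim = hidden // q_heads  # 128

gqa = kv_cache_bytes_per_token(layers, kv_heads, head_dim)  # 8 shared KV heads
mha = kv_cache_bytes_per_token(layers, q_heads, head_dim)   # full multi-head baseline

print(gqa, mha, mha // gqa)  # GQA shrinks the cache 8x for this configuration
```

At a 128K-token context, that 8x reduction is the difference between a cache that fits on one accelerator and one that does not.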
Llama 3 uses RoPE with a base frequency of 500,000. For a token at position $m$, each dimension pair $(x_{2i}, x_{2i+1})$ of a query or key vector is rotated by the angle $m\theta_i$, where $\theta_i = 500{,}000^{-2i/d}$ and $d$ is the head dimension. In GQA, each group of query heads shares a single key/value head, shrinking the KV cache and improving inference speed.
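A minimal NumPy sketch of the rotary encoding, assuming the documented base of 500,000 (illustrative, not the production implementation):

```python
import numpy as np

def rope(x, pos, base=500_000.0):
    """Apply rotary position embedding to vector x at integer position pos.

    x has even length d; pair (x[2i], x[2i+1]) is rotated by pos * theta_i,
    with theta_i = base ** (-2i / d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # theta_i = base^{-2i/d}
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(128)
print(np.allclose(rope(q, 0), q))  # position 0 rotates by zero angle -> True
```

Because each dimension pair undergoes a pure rotation, vector norms are preserved and relative offsets between positions fall out of the query-key dot product.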
The Llama 3 context window is 8K tokens at baseline, extended to 128K through continued pre-training in six incremental stages. Qwen-3 offers 8K, with some 32K-token variants; no longer-context configurations are publicly documented.
2. Training Methodology
Llama 3 utilizes a two-stage process:
- Pre-training: Next-token prediction objective with cross-entropy loss over 15T multilingual tokens using AdamW optimizer. The pre-training data mix is annealed: 50% general web knowledge, 25% math and reasoning, 17% code, 8% multilingual. The Llama 3 405 B parameter model consumed 3.8×10²⁵ FLOPs. Continued pre-training extended the context window to 128K tokens using approximately 800B additional tokens.
- Post-training (Alignment): Instruction tuning with supervised fine-tuning (SFT), rejection sampling (RS), and Direct Preference Optimization (DPO), using about 1M human-preference comparisons. The DPO loss is
  $\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$
  where $y_w$/$y_l$ are the preferred/rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT reference policy, $\sigma$ is the logistic function, and $\beta$ controls deviation from the reference.
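The DPO objective can be sketched in a few lines of NumPy; the log-probabilities here are placeholders for summed response log-probs, and `beta` is the DPO temperature:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss.

    logp_* are log-probs of the chosen (w) and rejected (l) responses
    under the policy; ref_logp_* under the frozen SFT reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# If the policy favors the chosen response more than the reference does,
# the margin is positive and the loss falls below log(2) ~= 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The reference-policy terms keep the optimized model from drifting arbitrarily far from its SFT starting point while it learns the preference ordering.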
Qwen-3 is likewise pre-trained as a dense Transformer, reportedly on fewer than 10T tokens, but its dataset composition and alignment pipeline are not publicly detailed.
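The 3.8×10²⁵ FLOPs figure quoted for Llama 3 405 B is roughly consistent with the common 6·N·D training-compute rule of thumb (about 6 FLOPs per parameter per training token), taking the 15T-token corpus from the pre-training description:

```python
# Training-compute rule of thumb: FLOPs ~= 6 * parameters * tokens.
n_params = 405e9   # Llama 3 405B
n_tokens = 15e12   # pre-training token count quoted above

flops = 6 * n_params * n_tokens
print(f"{flops:.2e}")  # ~3.6e25, in line with the quoted 3.8e25
```

The small gap is plausible given extra compute from continued pre-training and annealing phases beyond the headline token count.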
3. Performance Benchmarks
Both model families provide evaluative scores on established language and code benchmarks, with Llama 3 generally matching or exceeding Qwen-3 at similar parameter scales.
| Task (Prompting) | L3 8 B | Q3 7 B | L3 70 B | Q3 70 B | L3 405 B | GPT-4 |
|---|---|---|---|---|---|---|
| MMLU (5-shot) | 69.4 | 68.5 | 83.6 | 80.2 | 87.3 | 89.1 |
| GSM8K (8-shot, CoT) | 84.5 | 81.0 | 95.1 | 92.4 | 96.8 | 96.1 |
| HumanEval (0-shot) | 72.6 | 70.2 | 80.5 | 76.8 | 89.0 | 86.6 |
| MGSM (0-shot, CoT) | 68.9 | 65.0 | 86.9 | 81.4 | 91.6 | 85.9 |
| QuALITY (5-shot) | 81.0 | 78.0 | 90.5 | — | 95.2 | 95.2 |
Scaling-law studies for Llama 3 yielded a compute-optimal configuration of approximately 402B parameters trained on 16.55T tokens. IsoFLOPs curves, modeled as second-degree polynomials in log model size, inform the optimal model-data trade-off.
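The IsoFLOPs procedure fits a parabola to validation loss as a function of log model size at a fixed compute budget and reads the compute-optimal size off the vertex. A sketch with synthetic points (illustrative data, not the paper's measurements):

```python
import numpy as np

# Synthetic (log10 params, validation loss) points along one IsoFLOP curve.
log_n = np.array([10.8, 11.0, 11.2, 11.4, 11.6, 11.8])
loss  = np.array([1.95, 1.90, 1.87, 1.86, 1.88, 1.93])

a, b, c = np.polyfit(log_n, loss, 2)  # second-degree polynomial fit
log_n_opt = -b / (2 * a)              # parabola vertex = compute-optimal size

print(f"optimal params ~ 10^{log_n_opt:.2f}")
```

Repeating this fit across several compute budgets yields the frontier of optimal (parameters, tokens) pairs from which figures like 402B at 16.55T tokens are extrapolated.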
4. Safety and Alignment Strategies
Llama 3 ships with Llama Guard 3, an 8B-parameter safety classifier that performs multi-label classification across 13 harm categories plus a code-abuse category, and is deployable as an input/output filter.
- Violation Rate (VR) is the percentage of prompts that elicit an unsafe response; False Refusal Rate (FRR) is the percentage of borderline prompts refused despite a safe answer being possible.
- On English prompts, adding Llama Guard 3 filtering reduces the Llama 3 405 B model's VR by roughly 86%, at the cost of roughly doubling FRR (+102%).
- Alignment data includes safety SFT mixes tuned per model size class (adversarial and borderline examples) plus DPO fine-tuning, which reduce VR with minimal FRR increase.
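Under the definitions above, both safety metrics are simple ratios over labeled evaluation prompts; a sketch with illustrative counts (the helper names and numbers are hypothetical):

```python
def violation_rate(unsafe_responses, adversarial_prompts):
    """Share of adversarial prompts that elicited an unsafe response (VR)."""
    return unsafe_responses / adversarial_prompts

def false_refusal_rate(refusals, borderline_prompts):
    """Share of borderline-but-safe prompts the model refused (FRR)."""
    return refusals / borderline_prompts

# Illustrative numbers only: a filter that blocks most violations
# typically raises refusals on borderline prompts at the same time.
print(violation_rate(14, 1000), false_refusal_rate(61, 1000))
```

Reporting both rates together makes the trade-off explicit: driving VR toward zero is trivial if FRR is allowed to explode, so safety tuning targets the joint frontier.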
Qwen-3's safety-specific mitigations and post-training alignment are less fully documented in the public record.
5. Multimodal and Compositional Extensions
Llama 3 introduces compositional, adapter-based integration for vision, video, and speech modalities:
- Vision: Flamingo-style cross-attention using a ViT-H/14 encoder (850M parameters), with adapters (≈100B params for L3 405 B) inserted after every fourth LLM layer. Zero-shot performance: VQAv2 78.3%, ChartQA 70.1%, DocVQA 78.9%, approaching GPT-4V baselines.
- Video: A Perceiver resampler aggregates frames with an additional 4.6B video-attention parameters added atop the image-adapter pipeline. On PerceptionTest (video MC-QA): L3 70 B achieves 60.4%, TVQA 75.2%, closely following GPT-4V.
- Speech: A 24-layer Conformer encoder (1B params) trained on 15M hours of speech, followed by a 100M-parameter adapter pipeline producing token-rate embeddings for the LM. On ASR, L3 70 B yields 1.7% WER (LibriSpeech test-clean), outperforming Whisper v2 (1.9%); BLEU for AST on FLEURS 33→En: 33.7 (L3 70 B) vs 21.9 (Whisper v2); CoVoST2 (15→En): 38.8 vs 33.8. Spoken-dialog zero-shot and multi-turn code-switching are demonstrated (toxicity VR <1%, LT >15%).
TTS components use LM embeddings as cross-attention context, enabling streaming inference with minimal lookahead and 60–64% human rater preference over non-LM-conditioned baselines.
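The compositional pattern described above, where a pretrained LLM attends to frozen modality-encoder outputs through inserted cross-attention adapters, can be sketched minimally. The every-fourth-layer insertion follows the vision description; dimensions, initialization, and the single-head simplification are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttentionAdapter:
    """Single-head cross-attention: text tokens attend to encoder outputs."""
    def __init__(self, d_model, d_enc, rng):
        self.wq = rng.normal(0, 0.02, (d_model, d_model))
        self.wk = rng.normal(0, 0.02, (d_enc, d_model))
        self.wv = rng.normal(0, 0.02, (d_enc, d_model))

    def __call__(self, text_h, enc_h):
        q, k, v = text_h @ self.wq, enc_h @ self.wk, enc_h @ self.wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return text_h + attn @ v  # residual: the LLM stream is preserved

rng = np.random.default_rng(0)
adapters = {i: CrossAttentionAdapter(64, 32, rng)
            for i in range(32) if i % 4 == 3}  # after every fourth layer

text_h = rng.normal(size=(10, 64))    # 10 text-token hidden states
image_h = rng.normal(size=(256, 32))  # e.g. ViT patch embeddings
for i in range(32):
    # ... frozen self-attention / MLP block i would run here ...
    if i in adapters:
        text_h = adapters[i](text_h, image_h)
print(text_h.shape)  # (10, 64)
```

The residual connection is the key design choice: with adapter weights near zero, the model initially behaves exactly like the text-only LLM, so multimodal capability can be trained in without disturbing language performance.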
While both families explore multimodal extensions, Llama 3 explicitly documents performance on major datasets for vision, video, and speech.
6. Comparative Summary
Both Llama 3 and Qwen-3 families leverage standard dense Transformers, RoPE, and GQA. Llama 3's largest model at 405 B parameters exceeds Qwen-3's top publicly documented size (70 B). Llama 3 further extends context support (up to 128K tokens), applies an annealed, highly diverse pre-training data mix (totaling 15T tokens), and demonstrates competitive or state-of-the-art performance on a variety of zero- and few-shot benchmarks. Safety is systematically addressed with specialized classifiers (Llama Guard 3) and alignment protocols. Llama 3's publicly available releases and detailed reporting on multimodal and compositional extensions distinguish its documentation and systems-evaluation practices in the landscape of foundation models (Grattafiori et al., 2024).