Llama-3 Architecture Overview
- Llama-3 is a family of dense Transformer-based models known for its scalable design, 128K-token context, and advanced efficiency features.
- It uses grouped-query attention to shrink the inference-time KV cache and SwiGLU activations to improve training characteristics.
- Modular compositional adapters enable seamless integration of image, video, and speech modalities, advancing multimodal research.
Llama-3 is a family of dense Transformer-based foundation models engineered for native support of multilinguality, coding, reasoning, and tool usage, with extensibility to visual and audio modalities. The architecture comprises three principal backbones ("the herd") that differ only in layer depth, model width, and attention head count: 8B, 70B, and 405B parameter versions. Notable for a 128K-token context window, grouped-query attention for inference efficiency, an extended token vocabulary, and compositional adapters enabling multimodal integration, Llama-3 is designed for scaling stability, strong empirical performance, and comprehensive safety filtering (Grattafiori et al., 2024).
1. Core Transformer Backbone and Model Scaling
The Llama 3 family is anchored in a dense Transformer backbone. The core architectural dimensions for the released models are as follows:
| Model | Layers (L) | d_model | Attention Heads (h) | Key/Value Heads | FF Hidden Size (d_ff) | Parameters |
|---|---|---|---|---|---|---|
| 8B | 32 | 4096 | 32 | 8 | 14,336 | 8 × 10⁹ |
| 70B | 80 | 8192 | 64 | 8 | 28,672 | 7 × 10¹⁰ |
| 405B | 126 | 16,384 | 128 | 8 | 53,248 | 4.05 × 10¹¹ |
Feed-forward sublayers employ a SwiGLU activation, replacing the GeLU of earlier Transformer designs for improved training characteristics. Rotary positional embeddings (RoPE) are used with a base frequency of 500,000, substantially extending the stable context window to 128K tokens. Long-context pre-training proceeds in six successive stages totaling roughly 800B additional tokens, growing the context window progressively from 8K to the final 128K. Large-scale training additionally uses context parallelism, which shards the sequence dimension across multiple GPUs and all-gathers attention keys and values.
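The table's dimensions can be sanity-checked with a back-of-envelope parameter count. The sketch below is our own illustration (function name and the 128,256-entry vocabulary size are assumptions, and norms, biases, and other small terms are ignored), yet it lands close to the stated model sizes:

```python
# Rough parameter-count estimate for a dense Llama-style Transformer from
# the table's dimensions. Illustrative only: ignores norms and other small
# terms; the helper name and vocab size are our own assumptions.

def approx_params(layers, d_model, n_heads, n_kv_heads, d_ff, vocab=128_256):
    head_dim = d_model // n_heads
    # Attention: Q and output projections are d_model x d_model; under GQA
    # the K and V projections are smaller (n_kv_heads * head_dim columns).
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU FFN has three weight matrices: gate, up, and down projections.
    ffn = 3 * d_model * d_ff
    # Token embedding plus an untied output head.
    embed = 2 * vocab * d_model
    return layers * (attn + ffn) + embed

print(f"8B:   {approx_params(32, 4096, 32, 8, 14336) / 1e9:.1f}B")    # ≈ 8.0B
print(f"70B:  {approx_params(80, 8192, 64, 8, 28672) / 1e9:.1f}B")    # ≈ 70.6B
print(f"405B: {approx_params(126, 16384, 128, 8, 53248) / 1e9:.1f}B") # ≈ 405.8B
```

The estimate recovers the released sizes to within about 1%, which also shows how the FFN (three d_model × d_ff matrices per layer) dominates the parameter budget.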
2. Attention Mechanisms and Feed-Forward Layers
Llama 3 implements grouped-query attention (GQA) to reduce the inference-time KV cache footprint while retaining dense query head representations. Specifically, query heads are partitioned into groups that share key/value projections; only 8 key/value heads are instantiated regardless of model scale. The core attention mechanism is given by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad \text{with } d_k = d_{\text{model}} / h.$$
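A minimal NumPy sketch of grouped-query attention follows (our own illustration, not the production kernel): each of the `n_kv_heads` key/value heads is shared by a contiguous group of query heads, and a causal mask restricts attention to past positions.

```python
# Minimal grouped-query attention sketch (illustrative; names are ours).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """q: (n_heads, T, d_k); k, v: (n_kv_heads, T, d_k)."""
    n_heads, T, d_k = q.shape
    group = n_heads // k.shape[0]          # query heads per KV head
    k = np.repeat(k, group, axis=0)        # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    mask = np.triu(np.full((T, T), -np.inf), k=1)  # causal: no future tokens
    return softmax(scores + mask, axis=-1) @ v
```

Only the key/value tensors are cached during autoregressive decoding, which is why shrinking `n_kv_heads` from 64 or 128 down to 8 cuts the cache so sharply.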
The feed-forward sublayer uses a SwiGLU-activated bottleneck:

$$\mathrm{FFN}(x) = W_2\left(\mathrm{SiLU}(W_1 x) \odot W_3 x\right),$$

where the input is projected into two linear spaces and one branch, passed through the SiLU (swish) nonlinearity $\mathrm{SiLU}(x) = x\,\sigma(x)$, gates the other elementwise.
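The SwiGLU sublayer can be sketched directly from the formula above (weight names and shapes are our own illustration):

```python
# SwiGLU feed-forward sketch following FFN(x) = W2 (SiLU(W1 x) * W3 x).
# Weight names and shapes are illustrative, not the paper's code.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))      # SiLU(x) = x * sigmoid(x)

def swiglu_ffn(x, w1, w3, w2):
    """x: (T, d_model); w1, w3: (d_model, d_ff); w2: (d_ff, d_model)."""
    gate = silu(x @ w1)                # gating branch through SiLU
    up = x @ w3                        # plain linear branch
    return (gate * up) @ w2            # elementwise gate, project back down

# Per the table, the 8B model would use d_model=4096, d_ff=14336 here.
```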
3. Tokenization, Positional Encoding, and Long-Context Strategy
Tokenization uses an enlarged 128K-token vocabulary (derived from 100K tiktoken tokens plus 28K additional non-English tokens), raising mean corpus compression from 3.17 to 3.94 characters/token. RoPE is applied with an increased base frequency, $\theta_{\text{base}} = 500{,}000$, to maintain stable attention scores at extended sequence lengths. For position $m$, channels $2i$ and $2i+1$ are rotated as:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = \theta_{\text{base}}^{-2i/d}.$$
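The pairwise rotation above can be sketched as follows (our own illustration; the function name is assumed, and real implementations fuse this into the attention kernel):

```python
# RoPE sketch: rotate channel pairs by position-dependent angles with the
# Llama 3 base frequency. Illustrative only; names are our own.
import numpy as np

def rope(x, base=500_000.0):
    """x: (T, d) with d even; returns the rotated activations."""
    T, d = x.shape
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)        # per-pair frequency theta_i
    m = np.arange(T)[:, None]             # token positions
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each channel pair undergoes a pure rotation, vector norms are preserved and position 0 is left unchanged; raising `base` slows the angular progression so that distant positions remain distinguishable at 128K-token lengths.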
Document-separator attention masking precludes tokens from attending across artificially concatenated documents during batched long-sequence pre-training.
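The document-separator masking described above can be sketched as an additive attention mask (a minimal illustration, assuming each token carries a document id; names are our own):

```python
# Document-separator mask sketch: tokens attend only causally and only
# within their own document. Illustrative; names are our own.
import numpy as np

def doc_causal_mask(doc_ids):
    """doc_ids: (T,) int array of document membership; returns (T, T)
    additive mask (0 = allowed, -inf = blocked)."""
    T = len(doc_ids)
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((T, T), dtype=bool))
    return np.where(same_doc & causal, 0.0, -np.inf)
```

Adding this mask to the attention scores before the softmax zeroes out cross-document attention weights, so concatenated documents in a long-sequence batch cannot leak into one another.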
4. Compositional Multimodal Extensions
Llama 3 integrates image, video, and speech modalities through lightweight compositional adapters, maintaining modality-agnostic dense Transformer cores.
- Image: A ViT-H/14 encoder (630M params), pre-trained contrastively on 2.5B image-text pairs, produces 256 patch embeddings of 7,680 dimensions each, formed by concatenating patch features from multiple encoder layers. Cross-attention adapters (128 heads, 8 KV heads) inject image representations into the 16,384-dimensional text space every fourth Transformer layer; the adapter stack adds ~100B parameters atop the 405B model. Adapter training uses 6B image-text pairs plus 500M high-quality ("annealing") samples.
- Video: Up to 64 frames are sampled, each encoded by the image ViT as above. A temporal aggregator (Perceiver Resampler) merges each run of 32 consecutive frames into one embedding, yielding two aggregated "super-frame" embeddings per clip. Video-specific cross-attention layers precede the image adapters, adding ~5B parameters to the 70B model. The image encoder and image cross-attention weights are frozen during video training; only the video-specific layers are updated, on 1M video-text pairs.
- Speech: A 24-layer Conformer (1B params; 80-dimensional mel-spectrogram; stride 40ms; latent dimension 1,536) is coupled with a streaming adapter (three 3×3 stride-2 convolutions, one Transformer block, linear projection) totaling ~100M params. The text backbone is frozen, and finetuning is performed jointly on 230K hours of ASR, 90K hours of AST, and 85K hours of synthetic dialogue data.
Adapters are modular, compositional, and parameter-efficient, enabling independent modality pre-training and seamless integration with the dense text backbone.
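The adapter layout common to the three modalities can be sketched structurally (our own illustration; the function and tag names are assumptions, and the every-fourth-layer interval follows the image adapter described above):

```python
# Structural sketch of the compositional-adapter layout: a frozen text
# backbone with cross-attention adapter layers inserted every fourth
# block. Names and the schedule representation are our own.

def build_layer_schedule(n_text_layers, adapter_every=4):
    """Return the interleaved layer sequence as (tag, index) pairs."""
    schedule = []
    for idx in range(n_text_layers):
        if idx % adapter_every == 0:
            schedule.append(("cross_attn_adapter", idx))  # trainable adapter
        schedule.append(("self_attn_block", idx))         # frozen text weights
    return schedule

print(build_layer_schedule(8))
```

Because only the adapter entries are trainable, each perception module can be pre-trained independently and fused without touching the dense text backbone.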
5. Architectural Innovations and Distinctions from Llama 2
Llama 3 introduces several architectural modifications over Llama 2 to improve efficiency, context length, and data compression:
- Replacement of standard multi-head attention with grouped-query attention, significantly reducing KV cache memory usage.
- Document-separator attention masks ensure proper attention boundaries in batched, long-sequence training.
- Expanded vocabulary, increasing coverage and tokenization efficiency for multilingual and non-English inputs.
- RoPE base frequency augmentation, enabling stable relative position encodings across orders-of-magnitude larger sequences.
- SwiGLU replaces GeLU as the default feed-forward activation.
- Maintains a fully dense Transformer architecture, eschewing Mixture-of-Experts layers to maximize stability for scale-out training.
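The KV-cache saving from the first bullet is easy to quantify with a back-of-envelope calculation (our own illustration, assuming bf16 storage at 2 bytes per element and the 70B model's dimensions from the table):

```python
# Back-of-envelope KV-cache size: GQA with 8 KV heads vs. a hypothetical
# full multi-head variant, for the 70B model at 128K context in bf16.

def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * n_kv_heads * head_dim * seq_len * dtype_bytes

seq = 128 * 1024
gqa = kv_cache_bytes(80, 8, 128, seq)    # 70B with GQA (8 KV heads)
mha = kv_cache_bytes(80, 64, 128, seq)   # hypothetical full MHA (64 KV heads)
print(f"GQA: {gqa / 2**30:.0f} GiB, MHA: {mha / 2**30:.0f} GiB")  # 40 vs. 320
```

With 8 shared KV heads instead of 64, the per-sequence cache drops eightfold, from 320 GiB to 40 GiB at full context, which is the difference between infeasible and merely demanding long-context inference.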
6. Safety Framework: Llama Guard 3
Safety in Llama 3 is enforced both in-model and at the system level via the Llama Guard 3 module. This guard is an 8B-parameter Llama 3 variant finetuned as a multi-label classifier over 13 harm categories (including hate, self-harm, and child sexual exploitation) plus a code-interpreter-abuse category. It is deployed as two wrappers:
- Input Filter: Rejects or redirects unsafe user prompts.
- Output Filter: Inspects model outputs, applying refusal protocols to unsafe generations.
Both filters use the same finetuned backbone but are calibrated for distinct data distributions. Empirical evaluation shows violation-rate reductions of 40–100%, at the cost of a 10–100% relative increase in false refusals. The classifier can be quantized to int8 with minimal degradation in precision or recall (Grattafiori et al., 2024).
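The two-wrapper protocol above can be sketched as a thin pipeline around generation (a minimal illustration; the function names and classifier interface are our own assumptions, not a real Llama Guard API):

```python
# Sketch of the input-filter / output-filter wrapper around generation.
# All names and the "safe"/"unsafe" label interface are illustrative.

def guarded_generate(prompt, generate, classify_input, classify_output,
                     refusal="I can't help with that."):
    if classify_input(prompt) == "unsafe":             # input filter
        return refusal
    response = generate(prompt)
    if classify_output(prompt, response) == "unsafe":  # output filter
        return refusal
    return response
```

The output classifier sees both the prompt and the generation, which is what lets the same backbone be calibrated separately for the two distributions.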
7. Summary and Empirical Significance
Llama 3 exemplifies a design philosophy prioritizing dense, stable scaling, efficient context management, and modular extensibility for multimodal research. The compositional adapter strategy allows specialized perception modules to be pretrained and fused with minimal architectural modifications. Systematic safety infrastructure addresses practical deployment requirements. Empirical evaluation indicates that Llama 3 achieves quality competitive with contemporary LLMs such as GPT-4 across diverse linguistic, multimodal, and reasoning benchmarks (Grattafiori et al., 2024).