
Gemma 3 Foundation Models Overview

Updated 15 January 2026
  • Gemma 3 foundation models are a family of open decoder-only transformers supporting vision-language multimodality, over 100 languages, and input contexts up to 128K tokens.
  • They incorporate innovative techniques like interleaved local/global attention, rotary embeddings, and quantization aware training to optimize memory efficiency and reasoning capabilities.
  • These models serve as adaptable backbones for diverse applications including document understanding, geolocation, code execution, and medical imaging in both research and industry.

Gemma 3 foundation models are a family of open, decoder-only transformer architectures supporting vision-language multimodality, multilinguality across 100+ languages, and long context windows of up to 128K tokens. Ranging from 1 to 27 billion parameters, Gemma 3 introduces novel architectural and training recipes to minimize memory consumption, enhance reasoning skills (math, instruction following, code), and enable deployment via a practical quantization pipeline. The models serve as general-purpose foundation backbones and are widely adapted for specialized tasks in research and industry.

1. Model Architecture and Scaling

Gemma 3 spans four primary model sizes: 1B, 4B, 12B, and 27B parameters. All variants share a decoder-only transformer architecture. Key details include:

  • Componentization: The 4B, 12B, and 27B models embed a frozen SigLIP ViT-400M vision encoder (417M parameters), absent in the 1B variant.
  • Token Embeddings: From 302M (1B) up to 1,416M (27B).
  • Non-embedding Transformer Weights: Range from 698M (1B) to 25,600M (27B).
  • Attention Design: All models implement a 5:1 ratio of local to global self-attention layers. Local attention utilizes a sliding window mask with span $W = 1024$, while global layers use full self-attention.
  • Context Capacity: The 1B variant supports 32K tokens; all larger variants support up to 128K tokens.
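The 5:1 local/global interleaving above can be sketched as a simple layer schedule. A minimal illustration, assuming every sixth layer is the global one (the text specifies only the ratio, not the exact placement):

```python
# Sketch of the 5:1 local/global attention layer pattern described above.
# The ratio follows the text; placing the global layer at every 6th
# position is an assumption about the exact interleaving.

def layer_pattern(n_layers: int, ratio: int = 5) -> list:
    """Return 'L' (local, sliding-window) or 'G' (global) per layer,
    with `ratio` local layers per global layer."""
    return ["G" if (i + 1) % (ratio + 1) == 0 else "L" for i in range(n_layers)]

pattern = layer_pattern(12)
# 12 layers -> 10 local, 2 global under the 5:1 ratio
```

Because only the global layers attend over the full context, the KV cache of most layers stays bounded by the window size regardless of sequence length.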

Distinct architectural features include Grouped Query Attention (GQA), both pre-norm and post-norm transformer block variants, RMSNorm, QK normalization, and the extensive use of rotary positional embeddings (RoPE), with the base frequency raised from 10k to 1M on global attention layers to support long contexts (Team et al., 25 Mar 2025).

2. Multimodal and Multilingual Capabilities

Vision Integration

Gemma 3 incorporates vision via a frozen SigLIP ViT-400M encoder, which converts each $896\times896$ input image into a sequence of 256 patch embeddings of dimension $d_v$: $E_\text{siglip}(I) \in \mathbb{R}^{256\times d_v}$. These are linearly projected into the model's hidden dimension ($d_\text{model}$) as "soft tokens" and concatenated with text tokens. The architecture supports "Pan-and-Scan," which tiles high-resolution or non-square images into $N$ crops (up to a fixed maximum), concatenating all resulting tokens at inference.
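The soft-token pathway above amounts to a linear projection of the 256 patch embeddings into the model's width, followed by concatenation with text embeddings. A toy-dimensional sketch (all sizes and weights illustrative, not the real model's):

```python
# Minimal sketch of the "soft token" pathway: 256 patch embeddings of
# width d_v are linearly projected to d_model and concatenated with
# text token embeddings. Dimensions here are toys, not Gemma 3's.

def project_soft_tokens(patches, W):
    """patches: 256 x d_v list of lists; W: d_v x d_model projection."""
    d_v, d_model = len(W), len(W[0])
    return [[sum(p[i] * W[i][j] for i in range(d_v)) for j in range(d_model)]
            for p in patches]

d_v, d_model = 4, 6                       # toy sizes
patches = [[1.0] * d_v for _ in range(256)]
W = [[0.1] * d_model for _ in range(d_v)]
soft = project_soft_tokens(patches, W)    # 256 image "soft tokens"
text = [[0.0] * d_model for _ in range(10)]
sequence = soft + text                    # joint image+text sequence
```

Keeping the vision encoder frozen means only this projection (and the language model) sees gradient updates during multimodal training.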

Language Coverage

Pretraining data span over 100 languages, utilizing a balanced monolingual and parallel corpus mixture. Language sampling weights follow a UniMax fair-mix policy, reducing bias toward data-rich languages. Tokenization uses a multilingual SentencePiece vocabulary of 262K tokens, combining byte-level encoding, digit splitting, and whitespace preservation.

Multilingual Performance

Significant improvement over prior Gemma models is observed, particularly at 27B scale:

  • Global-MMLU-Lite: 68.6 → 75.1 (+6.5pts)
  • XQuAD: 73.9 → 76.8
  • FLoRes: 44.3 → 48.8
  • IndicGenBench: 62.1 → 63.4 (Team et al., 25 Mar 2025)

3. Attention, Memory, and Long-Context Efficiency

Local vs. Global Attention

Alternating blocks implement local attention (window $W = 1024$) and global attention (full context). Local block attention is masked as:

$$M_{ij} = \begin{cases} 0 & \text{if } |i-j| \leq W/2 \\ -\infty & \text{otherwise} \end{cases}$$

This design substantially reduces key-value (KV) cache requirements. For example, at 32K context length, the KV cache comprises only 15% of total memory in Gemma 3 (compared to ≈60% for global-only architectures). Overall, the model achieves up to $4\times$ lower memory growth in KV storage with near-linear scaling in context length.
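The sliding-window mask above can be written out directly. A small sketch (causal masking, which decoder layers also apply, is omitted for brevity):

```python
# The local-attention mask from the text: positions farther than W/2
# apart are masked with -inf; everything else passes through unchanged.
# Causal masking is omitted to keep the sketch minimal.

NEG_INF = float("-inf")

def local_mask(seq_len: int, window: int = 1024):
    half = window // 2
    return [[0.0 if abs(i - j) <= half else NEG_INF
             for j in range(seq_len)]
            for i in range(seq_len)]

M = local_mask(8, window=4)   # tiny example: span 4 -> half-width 2
```

Adding this mask to the attention scores before the softmax zeroes out the probability of any position outside the window, which is what bounds the per-layer KV cache.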

Rotary Embeddings

Global attention layers use a high base frequency (1M) for RoPE, extended via positional interpolation to 128K tokens; local layers retain a 10k RoPE base (Team et al., 25 Mar 2025).
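The two base frequencies can be made concrete with a standard pairwise-rotation RoPE sketch; the rotation form is the usual RoPE construction, and the head size here is illustrative:

```python
import math

# Sketch of rotary embeddings with the two base values named above:
# 10_000 for local layers, 1_000_000 for global layers. A larger base
# slows the rotation frequencies, so distant positions stay
# distinguishable over longer contexts.

def rope_freqs(head_dim: int, base: float):
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, pos: int, base: float = 10_000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    out = list(vec)
    for i, f in enumerate(rope_freqs(len(vec), base)):
        theta = pos * f
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i], out[2 * i + 1] = x * c - y * s, x * s + y * c
    return out

q_local = apply_rope([1.0, 0.0, 1.0, 0.0], pos=5, base=10_000.0)
q_global = apply_rope([1.0, 0.0, 1.0, 0.0], pos=5, base=1_000_000.0)
```

Note the rotation is norm-preserving, so RoPE encodes position without changing the magnitude of queries and keys.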

Quantization

Quantization Aware Training (QAT) enables int4 and switched fp8 (SFP8) inference. The 27B model at 32K context compresses from 72.7 GB (bf16) to 32.8 GB (int4), facilitating deployment on commercial GPUs and TPUs.
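The weight-memory side of those figures follows from bytes per parameter. A back-of-envelope sketch (weights only; the quoted 72.7 GB and 32.8 GB totals also include the 32K-token KV cache, so this reproduces the rough ratio, not the exact numbers):

```python
# Back-of-envelope check of the quantization savings described above:
# bf16 stores 16 bits per parameter, int4 stores 4. KV-cache and
# activation memory are ignored, so only the ~4x weight ratio is shown.

def weight_gib(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 8 / 2**30

p27b = 27e9
bf16 = weight_gib(p27b, 16)   # ~50.3 GiB of raw weights
int4 = weight_gib(p27b, 4)    # ~12.6 GiB
```

The 4x reduction in weight memory is what moves the 27B model from datacenter-only into the range of commercial accelerators.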

4. Pretraining and Post-training Regimes

Pretraining

Pretraining token budgets increase with model scale (2T for 1B, up to 14T for 27B, including both text and image tokens). Sequence-level distillation is used: for each token, a set of 256 logits is sampled from a teacher, and the student minimizes a cross-entropy over these.
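The sampled-logit distillation step can be sketched as cross-entropy against a teacher distribution restricted to the 256 sampled vocabulary entries. A toy-sized illustration; renormalizing both distributions over the sampled subset is an assumption about the exact recipe:

```python
import math

# Sketch of distillation over a sampled logit subset: the student
# minimizes cross-entropy H(teacher, student) computed only over the
# sampled vocabulary ids. Toy logits; renormalization over the subset
# is an assumption, not a detail stated in the text.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sampled_distill_loss(student_logits, teacher_logits, sampled_ids):
    """Cross-entropy H(teacher, student) over the sampled vocab slice."""
    t = softmax([teacher_logits[i] for i in sampled_ids])
    s = softmax([student_logits[i] for i in sampled_ids])
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher = [2.0, 1.0, 0.5, -1.0]
student = [1.5, 1.2, 0.3, -0.5]
loss = sampled_distill_loss(student, teacher, sampled_ids=[0, 1, 2])
```

Restricting the loss to 256 entries keeps per-token distillation cheap even with a 262K-token vocabulary.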

Data filtering policies enforce safety, personal information removal, and quality reweighting.

Quantization Aware Training

All models are briefly fine-tuned (~5K steps) to support robust int4 and SFP8 weight formats.

Post-training (Instruction Tuning and RL)

A two-stage process is followed:

  1. Supervised Distillation: Imitating a large instruction-tuned teacher via cross-entropy on logits.
  2. Reinforcement Learning Fine-tuning: Combines BOND, WARM, and WARP objectives with reward functions reflecting human feedback, code execution correctness, and ground-truth math checking. A PPO-style loss regularizes the policy away from divergence:

$$J_{RL}(\theta) = \mathbb{E}_{y\sim\pi_\theta}[R(x, y)] - \beta\,\mathrm{KL}[\pi_\theta \,\|\, \pi_0]$$
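The KL-regularized objective above can be estimated per sample as reward minus a penalty on the log-probability ratio between the policy and the reference. A single-token toy sketch (the BOND/WARM/WARP specifics are not reproduced here):

```python
import math

# Per-sample estimate of the objective from the text: R(x, y) minus
# beta times log(pi_theta / pi_0) for a sampled completion y. Values
# are illustrative; this is the standard single-sample KL estimator,
# not the exact training recipe.

def kl_regularized_objective(reward, logp_theta, logp_ref, beta=0.1):
    """Estimate of R(x, y) - beta * KL[pi_theta || pi_0] from one sample."""
    return reward - beta * (logp_theta - logp_ref)

j = kl_regularized_objective(reward=1.0,
                             logp_theta=math.log(0.6),
                             logp_ref=math.log(0.5))
# raising probability above the reference costs beta * log-ratio
```

The β term is what keeps the fine-tuned policy from drifting arbitrarily far from the pretrained distribution while chasing reward.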

Data are filtered to exclude unsafe and hallucination-prone examples and to promote hedging or refusals where appropriate (Team et al., 25 Mar 2025).

5. Empirical Benchmarks and Application Domains

General and Specialized Evaluation

Zero-shot Instruction Tuning Benchmarks (27B-IT)

  • MMLU-Pro: 67.5
  • LiveCodeBench: 29.7
  • Bird-SQL: 54.4
  • MATH: 89.0
  • Global MMLU-Lite: 75.1
  • MMMU (val): 64.9

Pretraining Benchmarks

  • MMLU: 78.6 (zero-shot, 27B)
  • GSM8K: 82.6
  • COCO Caption (val Cider): 144

Human/Automatic Rating

  • Chatbot Arena Elo (27B-IT): 1,338 (vs. Gemini-1.5-Pro: 1,302; GPT-4.5-Preview: 1,411)

Long-context Tasks

Performance is sustained for RULER and MRCR benchmarks up to 128K tokens, demonstrating robust ultra-long-context reasoning (Team et al., 25 Mar 2025).

Practical Deployment

Gemma 3 models run efficiently on multi-A100/TPUv5p hardware. Quantized weights and local-global attention enable inference on consumer GPUs at smaller scales. Typical applications include document understanding (128K tokens), code execution with long in-context chains, multilingual support, and retrieval-augmented tasks.

The architecture enables state-of-the-art document-level processing and is adopted as a backbone for domain-specific models, such as GeoLocSFT (visual geolocation), modular "internal world" models for tabular prediction tasks, and domain-adapted MedGemma foundation models for healthcare (Yi et al., 2 Jun 2025, Jadouli et al., 20 Apr 2025, Sellergren et al., 7 Jul 2025).

6. Adaptation and Domain-specific Extensions

Visual Geolocation (GeoLocSFT)

GeoLocSFT demonstrates efficient adaptation of Gemma 3 via LoRA-based supervised fine-tuning. Using only 2,700 geo-captioned image/GPS pairs, the instruction-tuned 27B variant achieves leading Acc@R performance on Im2GPS-3k, YFCC-4k, and MR40k, showing that Gemma 3's unified vision-language backbone can absorb complex geo-reasoning signals with minimal additional data. Multi-candidate LLM self-consistency improves performance only marginally over single-pass SFT, underlining the strength of Gemma 3's representation (Yi et al., 2 Jun 2025).
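The LoRA recipe behind this adaptation adds a trainable low-rank update to each frozen weight matrix. A minimal, pure-Python sketch with toy shapes (rank, scaling, and dimensions are illustrative):

```python
# Minimal sketch of the LoRA idea used for the fine-tuning above: the
# frozen weight W is augmented with (alpha/r) * B @ A, and only the
# low-rank factors A and B are trained. Shapes here are toys.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha=16.0, r=2):
    """y = x @ (W + (alpha/r) * B @ A), with W held frozen."""
    scale = alpha / r
    BA = matmul(B, A)                       # d_in x d_out low-rank update
    W_eff = [[w + scale * d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, BA)]
    return matmul(x, W_eff)
```

With A and B initialized so their product is zero, the adapted model starts exactly at the pretrained behavior, which is part of why so few labeled pairs suffice.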

Modular Internal World (Tabular Prediction)

Frozen Gemma 3 transformer layers are re-used as a “pretrained internal world” for tabular data tasks. Custom input/output projection heads are trained while keeping Gemma layers fixed, drastically reducing trainable parameters to ≈5.6% of total and yielding competitive F1 and recall versus fully fine-tuned baselines on wildfire prediction. Layer freezing stabilizes calibration and interpretability, and the methodology generalizes to other domains where labeled data is limited (Jadouli et al., 20 Apr 2025).
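The parameter-efficiency claim above reduces to a simple ratio of trainable head parameters to total parameters. A sketch with illustrative counts (the real split is whatever yields the reported ≈5.6%):

```python
# Sketch of the "frozen internal world" accounting: only the custom
# input/output projection heads are trained, so the trainable fraction
# is heads / (backbone + heads). Parameter counts here are toys chosen
# to land near the reported ~5.6%, not the paper's actual sizes.

def trainable_fraction(frozen_params: int, head_params: int) -> float:
    return head_params / (frozen_params + head_params)

backbone = 1_000_000_000      # frozen Gemma layers (toy count)
heads = 59_000_000            # custom projection heads (toy count)
frac = trainable_fraction(backbone, heads)   # ~0.056
```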

Medical Vision-Language Modeling (MedGemma)

MedGemma is a family of medical vision-LLMs derived from Gemma 3 4B and 27B checkpoints, using the same backbone but domain-adapted pre- and post-training. The MedSigLIP encoder is tuned on 33M medical image-text pairs, boosting radiological VQA (SLAKE +32.1 F1), chest X-ray classification (+15.5 F1 on CheXpert), and EHR retrieval (>+10% gains). Generalist strengths are largely retained, with only small drops on MMLU-Pro and MMMU (Sellergren et al., 7 Jul 2025).

7. Impact and Best Practices

Gemma 3 extends the paradigm of open, efficient, and widely deployable foundation models. Key innovations—local-global attention interleaving, RoPE scaling, robust quantization, and flexible vision-language integration—enable practical research and application at multiple scales. The family is broadly applied as a “backbone” for downstream SFT or modular plug-in approaches, thereby accelerating progress in global geolocation, environmental forecasting, and medical artificial intelligence.

Best practices highlighted include matching hidden sizes for adapters, maximizing the number of frozen weights to encourage data efficiency, leveraging mixed precision and checkpointing for memory flexibility, and investing in high-quality, small SFT datasets rather than scaling weakly labeled collections (Team et al., 25 Mar 2025, Yi et al., 2 Jun 2025, Jadouli et al., 20 Apr 2025, Sellergren et al., 7 Jul 2025).
