Gemma 3n is a series of open-source, decoder-only Transformer LLMs designed for efficient multilingual and multimodal processing.
The models integrate interleaved local/global attention and quantization-aware training to reduce memory overhead and enhance performance.
Optimized with LoRA adapters and Darija instruction tuning, they achieve significant gains in low-resource dialect tasks and cross-lingual benchmarks.
Gemma 3n refers to the Gemma 3 series of LLMs, notably including the Gemma 3–4B and Gemma 3–27B variants, characterized by efficient open-source decoder‐only Transformer architectures, multilingual and multimodal capabilities, and innovations in memory, alignment, and training protocols. These models have demonstrated state-of-the-art performance in both English and low-resource dialects such as Moroccan Arabic (Darija), while enabling scalable and energy-efficient tuning for domain-specific tasks (Skiredj et al., 20 May 2025, Team et al., 25 Mar 2025).
1. Architectural Foundations
Gemma 3 models maintain a decoder-only Transformer backbone with Grouped-Query Attention and RMSNorm. A distinguishing innovation is the interleaved local/global attention scheme: for every global self-attention layer, five local self-attention layers are interposed, with local layers attending to a sliding window of $S = 1024$ tokens. Global attention layers compute full-sequence attention over length $N$, while local layers restrict attention to the most recent $S$ tokens.
Consequently, the total KV-cache memory per token is reduced from $M_{\text{global-only}} \propto N d L$ to $M_{\text{Gemma3}} \propto N d L_g + S d L_\ell$, with $L_g = L/6$ global and $L_\ell = 5L/6$ local layers. Empirical analysis shows that KV-cache overhead at $N = 32\text{k}$ drops from ∼60% (Gemma 2) to under 15% in Gemma 3, with negligible impact on perplexity (Team et al., 25 Mar 2025).
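As a back-of-envelope check, the KV-cache saving of the 5:1 interleaving can be computed directly. A minimal sketch, where the total layer count is an illustrative assumption rather than a released configuration:

```python
# Back-of-envelope sketch of the KV-cache saving from Gemma 3's 5:1
# local/global interleaving. The layer count L is an illustrative
# assumption, not a released configuration.

def kv_entries(n_ctx, n_layers, window=None, local_ratio=5):
    """Cached (token, layer) entries per K/V tensor, per head dimension."""
    if window is None:                        # all-global baseline
        return n_ctx * n_layers
    n_global = n_layers // (local_ratio + 1)  # 1 global per 6 layers
    n_local = n_layers - n_global
    # Global layers cache the full context; local layers keep only a
    # sliding window of `window` tokens.
    return n_ctx * n_global + min(n_ctx, window) * n_local

N, S, L = 32_000, 1024, 48                    # context, window, layers
saving = kv_entries(N, L) / kv_entries(N, L, window=S)
print(f"KV-cache reduction at N={N}: {saving:.1f}x")
```

With these toy numbers the interleaved scheme caches roughly a fifth of the entries of an all-global stack, in line with the overhead drop reported above.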
2. Model Variant Specifications
Gemma 3 is released at several parameter scales: 1B, 4B, 12B, and 27B, supporting context windows up to 128,000 tokens via RoPE positional-embedding rescaling (except the 1B model, which is limited to 32K). The rescaling on global layers applies a factor $f = 4$ so that tokens beyond 32K receive valid positional signals. Both the Gemma 3–4B and Gemma 3–27B variants adopt BF16 precision and a SentencePiece tokenizer with 262,000 entries, and display strong zero-shot capabilities across languages, mathematics, science, and commonsense reasoning (Team et al., 25 Mar 2025).
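The effect of linear position rescaling can be sketched in a few lines. This assumes the factor simply divides positions uniformly (head dimension and base frequency are illustrative defaults; the exact Gemma 3 scheme may differ):

```python
# Sketch of linear RoPE position rescaling: dividing positions by a
# factor f lets angles trained on 32K positions cover 128K tokens.
# dim and base are illustrative defaults, not released hyperparameters.

def rope_angles(pos, dim=8, base=10_000.0, scale=1.0):
    # One rotation angle per channel pair; `scale` compresses positions
    # back into the trained range.
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

# With f = 4, position 128k maps onto the trained angles for position 32k.
assert rope_angles(128_000, scale=4.0) == rope_angles(32_000)
```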
3. Training and Instruction Tuning Protocols
Pretraining follows a distilled student protocol: mixing text and images (2 T–14 T tokens, increasing with size), quality re-weighting, and safety filtering. Distillation samples m=256 logits per token from the teacher distribution pT, zeros unsampled logits, renormalizes, and minimizes the cross-entropy loss,
$$\mathcal{L}_{\text{distill}} = -\sum_{\ell \in S} p_T(\ell) \log p_S(\ell).$$
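A minimal sketch of this sampled-logit distillation objective, using toy distributions in place of real model outputs (the protocol samples m = 256 logits; here m = 2):

```python
# Sketch of the sampled-logit distillation loss: keep the sampled
# teacher logits, zero the rest, renormalize, then take cross-entropy
# against the student. Distributions here are toy values.
import math

def distill_loss(p_teacher, p_student, sampled_ids):
    # Restrict the teacher distribution to the sampled token ids and
    # renormalize so it sums to 1 over that subset.
    z = sum(p_teacher[i] for i in sampled_ids)
    return -sum((p_teacher[i] / z) * math.log(p_student[i])
                for i in sampled_ids)

p_t = [0.6, 0.25, 0.1, 0.05]   # toy teacher distribution
p_s = [0.5, 0.3, 0.15, 0.05]   # toy student distribution
loss = distill_loss(p_t, p_s, sampled_ids=[0, 1])  # m = 2 sampled logits
```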
Post-training involves instruction tuning ("Gemma3-IT") via supervised distillation (best-of-N, BOND) and RL fine-tuning with reward models (WARM, WARP) targeting factuality, helpfulness, code execution, mathematical correctness, multilinguality, and safety.
4. Darija Instruction Tuning and Data Curation
A quality-over-quantity alignment strategy surfaces latent proficiency in Darija, a marginalized Moroccan Arabic dialect. The pipeline translates and filters three prominent instruction suites (LIMA 1K, DEITA 6K, and TULU 50K) into Arabic-script Darija using the Gemini 2.0 Flash API, with prompt engineering and preservation of code blocks and LaTeX. For cross-lingual robustness and to prevent catastrophic forgetting, roughly 20–30% of each suite's data is retained in English.
| Suite | Total Samples | Darija (≈ %) | English (≈ %) |
|---|---|---|---|
| LIMA 1K | 1,000 | 700 (70%) | 300 (30%) |
| DEITA 6K | 5,000 | 3,700 (74%) | 1,300 (26%) |
| TULU 50K | 46,000 | 33,000 (72%) | 13,000 (28%) |
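The retention split in the table above can be sketched as a simple deterministic partition. The suite contents and the helper are illustrative assumptions, not the released pipeline:

```python
# Sketch of the suite-mixing step: retain a fraction of each
# instruction suite in English and mark the rest for translation to
# Darija. Sample names and the split helper are illustrative.
import random

def split_suite(samples, english_frac, seed=0):
    """Return (to_translate, keep_english) as a reproducible partition."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * english_frac)
    return shuffled[k:], shuffled[:k]

suite = [f"instruction_{i}" for i in range(1000)]       # e.g. LIMA 1K
darija, english = split_suite(suite, english_frac=0.30)
print(len(darija), len(english))
```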
Adopting parameter-efficient LoRA adapters, the 4B model employs LoRA rank $r = 32$, $\alpha = 64$, while the 27B model uses $r = 16$, $\alpha = 32$. Training proceeds on mixed Darija/English data for 15 epochs (LIMA), 6 (DEITA), and 3 (TULU), with learning rates of $4\times10^{-4}$ for LIMA/DEITA and $1\times10^{-4}$ for TULU. Compute remains minimal: under 100 GPU·h and under $100 in cloud cost (Skiredj et al., 20 May 2025).
5. Quantitative Performance and Scaling Effects
Gemma 3 models demonstrate substantial gains in Darija proficiency post-LoRA tuning:
| Model/Benchmark | Size | DarijaMMLU | HellaSwag | GSM8K@5 | MMLU |
|---|---|---|---|---|---|
| Gemma 3–4B (untuned) | 4B | 32.8 | 36.3 | 74.8 | 51.1 |
| + LoRA LIMA 1K | 4B | 34.9 | 39.3 | 51.2 | 29.3 |
| + LoRA DEITA 6K | 4B | 42.7 | 44.3 | 53.2 | 51.4 |
| + LoRA TULU 50K | 4B | 47.5 | 47.1 | 56.0 | 54.1 |

| Model/Benchmark | Size | DarijaMMLU | HellaSwag | GSM8K@5 | MMLU |
|---|---|---|---|---|---|
| Atlas-Chat-27B | 27B | 61.9 | 48.4 | 82.0 | 72.1 |
| GemMaroc-27B (LoRA TULU 50K) | 27B | 61.6 | 60.5 | 84.2 | 73.6 |

Post-alignment, the 4B model achieves a $\Delta$DarijaMMLU of +14.73 pp and a $\Delta$HellaSwag of +10.83 pp; the 27B model matches or exceeds the prior state-of-the-art on DarijaMMLU and shows +12.1 pp over Atlas-Chat-27B on Darija HellaSwag. Cross-lingual and mathematical benchmark retention is also achieved: GSM8K moves from 82.0 to 84.2 at 27B, and the 4B model posts roughly a 1.45× gain on DarijaMMLU and 1.28× on HellaSwag, with negligible English regression (Skiredj et al., 20 May 2025).
6. Multimodal and Multilingual Capabilities
Gemma 3 integrates vision-language support through a frozen 400 M-parameter SigLIP Vision Transformer, encoding images into patch embeddings that are average-pooled to a 16×16 grid of 256 soft tokens. Inference uses Pan-and-Scan tiling for artifact-free high-resolution image handling, yielding up to +17 points on document VQA tasks and achieving 85.6 CIDEr (COCO Caption) and 59.4 ANLS (InfoVQA) for the 4B model. Multilingual performance is enhanced by UniMax-inspired data mixing, with the 27B model attaining 75.7% GMMLU and 76.8 F1 on XQuAD, outperforming Gemma 2 (Team et al., 25 Mar 2025).
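The average-pooling step that compresses a patch grid into 256 soft tokens can be illustrated with a minimal sketch. The input grid size and pooling window are assumptions, and scalars stand in for the encoder's learned embedding vectors:

```python
# Sketch: average-pooling a 2D grid of vision-encoder patch values
# down to a fixed 16x16 = 256 soft tokens. The 64x64 input grid and
# scalar "embeddings" are illustrative simplifications.

def avg_pool_tokens(grid, out_side=16):
    """Pool an n x n grid (n divisible by out_side) to out_side^2 tokens."""
    n = len(grid)
    k = n // out_side                 # pooling window per side
    pooled = []
    for r in range(out_side):
        for c in range(out_side):
            window = [grid[r * k + i][c * k + j]
                      for i in range(k) for j in range(k)]
            pooled.append(sum(window) / len(window))
    return pooled

tokens = avg_pool_tokens([[1.0] * 64 for _ in range(64)])
print(len(tokens))  # a fixed-length soft-token sequence
```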
7. Energy Efficiency and Green AI Implications
Fine-tuning Gemma 3 with LoRA adapters and curated Darija instruction suites requires only 58 GPU·h (10 GPU·h on A100 for the 4B model, 48 GPU·h on H100 for the 27B), for a total energy consumption of 32 kWh (~13 kg CO₂e). This contrasts sharply with the Atlas-Chat-27B full fine-tune at 1.4 MWh (>610 kg CO₂e), representing over a 48× reduction in energy and 98% lower emissions. The recipe demonstrates a Green AI pathway: inclusive, sustainable, low-resource dialect tuning without sacrificing performance (Skiredj et al., 20 May 2025).
8. Practical Deployment and Model Release
All bfloat16 and quantized checkpoints are released under an open license. Quantized weights and KV-cache at 32 K context reduce the memory footprint to 1.4–7.3 GB. Code, model cards, and formatting scripts are available to facilitate further research, educational, public service, and everyday digital applications centered on dialect inclusivity and computational efficiency (Team et al., 25 Mar 2025, Skiredj et al., 20 May 2025).
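A back-of-envelope weight-memory calculation broadly consistent with the quoted 1.4–7.3 GB range; the bit-widths are common quantization choices, not the exact released formats, and KV-cache and activation memory are ignored:

```python
# Rough checkpoint size: parameter count x bytes per weight.
# Bit-widths are illustrative (bfloat16 vs. a 4-bit quantized format).

def weight_gib(n_params, bits):
    return n_params * bits / 8 / 1024**3

for n_params, label in [(1e9, "1B"), (4e9, "4B")]:
    for bits in (16, 4):
        print(f"{label} @ {bits}-bit: {weight_gib(n_params, bits):.2f} GiB")
```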