
Gemma 2 MITRA-E: Cross-Lingual Semantic Model

Updated 13 January 2026
  • Gemma 2 MITRA-E is a large-scale multilingual semantic embedding model that leverages contrastive learning to improve retrieval accuracy across classical Asian texts.
  • It employs a 9B-parameter decoder-only Transformer architecture with 42 layers and BPE tokenization, achieving state-of-the-art performance on cross-lingual benchmarks.
  • Domain-adapted pretraining with curated multilingual corpora and supervised retrieval finetuning enables precise semantic matching across Sanskrit, Pāḷi, Buddhist Chinese, Tibetan, and English texts.

Gemma 2 MITRA-E is a large-scale, multilingual semantic embedding model derived from the Gemma 2 architecture and incorporated into the MITRA framework for advanced cross-lingual information retrieval spanning Sanskrit, Pāḷi, Buddhist Chinese, Tibetan, and English. It couples domain-adapted pretraining with a contrastive embedding learning recipe, targeting semantic similarity and retrieval across Buddhist and classical Asian texts. MITRA-E builds on a highly specialized adaptation of the Gemma 2 LLM, establishing new state-of-the-art retrieval accuracy among open-source systems for classical language corpora (Nehrdich et al., 10 Jan 2026, Zhang et al., 8 Apr 2025).

1. Model Architecture

Gemma 2 MITRA-E uses a decoder-only Transformer architecture with 9 billion parameters, directly based on Gemma 2: 42 Transformer layers, each with 16 attention heads and a model dimension of 3,584, matching the Gemma 2 9B configuration tabulated in Section 5. Tokenization is performed with the byte-pair-encoding (BPE) tokenizer shared with Gemma 2. Sentence encoding is achieved by prepending a prompt so that the highest-layer hidden state of a designated token ("<EMBED>" or "<CLS>") is extracted as the final embedding, which is then $\ell_2$-normalized to unit length.

The embedding function thus maps each sequence to a dense vector $z \in \mathbb{R}^D$, where $D$ equals the model dimension, such that $\|z\|_2 = 1$. During retrieval, the cosine similarity $z \cdot z'$ is computed between queries and candidates. This design supports a unified interface for multilingual and cross-lingual semantic search (Nehrdich et al., 10 Jan 2026).
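The normalize-then-dot-product interface described above can be sketched as follows; this is a minimal NumPy illustration where the actual model forward pass that produces raw embeddings is elided, and the function names are hypothetical:

```python
import numpy as np

def l2_normalize(z: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit hypersphere, so ||z||_2 = 1."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query_emb: np.ndarray, candidate_embs: np.ndarray, top_k: int = 5):
    """Rank candidates by cosine similarity z . z'.

    Because both sides are unit-normalized, the plain dot product
    equals the cosine similarity.
    """
    q = l2_normalize(query_emb)          # (D,)
    c = l2_normalize(candidate_embs)     # (N, D)
    scores = c @ q                       # (N,) cosine similarities
    order = np.argsort(-scores)          # descending
    return order[:top_k], scores[order[:top_k]]
```

In practice the candidate embeddings would be precomputed and cached, so each query costs one forward pass plus one matrix–vector product.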

2. Data and Training Regimens

2.1 Domain Pretraining

Continued domain pretraining of Gemma 2 MITRA used a curated blend of monolingual and parallel corpora totaling 4.4 billion tokens over two epochs (maximum sequence length 1,024 tokens; DeepSpeed ZeRO-3; fp16 on 8×A100 GPUs; effective batch size 2M tokens per step):

  • Monolingual:
    • 40% English (academic translations, OCR, deduplication)
    • 20% Sanskrit + Pāḷi (digital sources)
    • 15% Buddhist Chinese (CBETA XML)
    • 5% Tibetan (ACIP)
  • Parallel:
    • 20% total: 1.74M Sanskrit⟷Chinese⟷Tibetan (MITRA-parallel), 1M Sanskrit⟷English (forthcoming), 2M Tibetan⟷English (monlam.ai), 149K Pāḷi⟷English, ≈41K Tibetan⟷Chinese (Kumarajiva), 31K Sanskrit⟷Chinese
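The blend shares above translate into approximate per-source token budgets out of the 4.4B-token total; the following is simple arithmetic over the stated percentages, not additional data from the source:

```python
TOTAL_TOKENS = 4.4e9  # pretraining corpus size stated in the recipe above

blend = {
    "English": 0.40,
    "Sanskrit + Pali": 0.20,
    "Buddhist Chinese": 0.15,
    "Tibetan": 0.05,
    "Parallel data": 0.20,
}

# e.g. English receives roughly 1.76B tokens, parallel data roughly 0.88B
token_budget = {name: share * TOTAL_TOKENS for name, share in blend.items()}

assert abs(sum(blend.values()) - 1.0) < 1e-9  # shares sum to 100%
```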

2.2 Semantic Retrieval Finetuning

Finetuning for MITRA-E targeted semantic retrieval with approximately 430,000 supervised pairs across the following cohorts:

  • Original human-curated aligned translations:
    • English→{Chinese, Pāḷi, Sanskrit, Tibetan}: 50,000 each
    • Various Pāḷi, Sanskrit, Chinese, and Tibetan translation pairs (e.g., Pāḷi⟷Chinese 4,809; Sanskrit⟷Tibetan 50,000; Tibetan⟷Chinese 50,000)
  • Synthetic instruction-mined pairs:
    • English/Same-language keywords (47,223/47,997)
    • English questions (43,321), English summaries (38,882), sentence-based retrieval (51,382)

Key hyperparameters for retrieval adaptation:

  • Learning rate: not specified (typical range $1 \times 10^{-5}$–$5 \times 10^{-5}$)
  • Batch size: not specified (typically 128–512)
  • Temperature $\tau$: tuned on a held-out set (often 0.05)
  • Optimizer: AdamW (weight decay $\approx 0.01$)
  • Epochs: 3–4 (4 for MT; likely similar for retrieval)
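Collected into one place, the reported and typical values above might look like the following configuration sketch; every value marked "assumed" is an illustrative choice within the stated typical range, not one confirmed by the source:

```python
# Illustrative retrieval-finetuning configuration for MITRA-E.
# Values marked "assumed" are typical choices, not reported by the authors.
retrieval_config = {
    "learning_rate": 2e-5,   # assumed; typical range 1e-5 to 5e-5
    "batch_size": 256,       # assumed; typically 128-512
    "temperature": 0.05,     # commonly used; tuned on a held-out set
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "epochs": 4,             # 4 reported for MT; likely similar for retrieval
    "max_seq_len": 1024,     # from the domain-pretraining setup above
}
```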

3. Contrastive Objective and Embedding Space

The retrieval head is trained using an in-batch InfoNCE loss. For a batch of $N$ query–positive pairs $(q_i, p_i)$, with $z_i = f_\theta(q_i)$ and $z_i^+ = f_\theta(p_i)$, and all other $p_j$ ($j \neq i$) serving as in-batch negatives:

$$L_{\mathrm{InfoNCE}} = -\log \frac{\exp(z_i \cdot z_i^+ / \tau)}{\sum_{j=1}^{N} \exp(z_i \cdot z_j^+ / \tau)}$$

with $\tau$ a learnable or preset temperature. All embedding vectors are $\ell_2$-normalized onto the unit hypersphere, guaranteeing $\|z\|_2 = 1$ for all samples.
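A minimal NumPy sketch of the in-batch InfoNCE objective as written above; the function name and the fixed default temperature are illustrative, not taken from the authors' code:

```python
import numpy as np

def info_nce_loss(z_q: np.ndarray, z_p: np.ndarray, tau: float = 0.05) -> float:
    """In-batch InfoNCE over (N, D) arrays of l2-normalized embeddings.

    Row i's positive is z_p[i]; the other N-1 rows of z_p serve as
    negatives, exactly as in the displayed equation.
    """
    sims = z_q @ z_p.T / tau  # (N, N) similarity logits
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the loss averages over all rows at once, one batch of $N$ pairs yields $N(N-1)$ implicit negatives for free, which is what makes in-batch contrastive training sample-efficient.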

4. Evaluation Benchmarks and Empirical Results

MITRA-E was benchmarked on a seven-task multilingual semantic similarity suite, pitted against BM25 (with English MT pivot), LaBSE, and BGE-M3 (base/fine-tuned). Retrieval is quantified via Precision@1 (P@1), @5, @10. Summary P@1 results are as follows:

| Retrieval Task | MITRA-E | BGE-M3 ft | LaBSE | BM25 |
|---|---|---|---|---|
| En→Skt | 95% | 84% | 49% | 33% |
| En→Tib | 95% | 76% | 73% | 38% |
| En→Chi | 90% | 69% | 53% | 23% |
| En→Pāḷi | 86% | 60% | 30% | 28% |
| Skt⟷Tib | 93% | 77% | 54% | 42% |
| Skt⟷Chi | 79% | 40% | 19% | 14% |
| Chi⟷Tib | 72% | 46% | 32% | 17% |
| Skt→BGh commentary | 89% | 60% | 31% | 29% |
| Chi→T1604 commentary | 88% | 44% | 38% | 14% |
| Cross-lingual QA (Chi ans.) | 57% | 25% | 12% | 2% |
| Cross-lingual QA (Skt ans.) | 56% | 46% | 28% | 3% |
| Cross-lingual QA (Tib ans.) | 49% | 28% | 15% | 2% |
| Cross-lingual QA (Pāḷi ans.) | 36% | 21% | 13% | 2% |

Across all settings, MITRA-E achieves a new state-of-the-art for open-source models in cross-lingual match and hard negative settings, with substantial improvements in low-resource classical languages (Nehrdich et al., 10 Jan 2026).
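Precision@k over such a benchmark can be computed as follows, under the assumed convention that query i's gold match is candidate i; this is a minimal sketch for exposition, not the authors' evaluation code:

```python
import numpy as np

def precision_at_k(sim_matrix: np.ndarray, k: int = 1) -> float:
    """sim_matrix[i, j] holds the similarity of query i to candidate j.

    Assumed gold convention: candidate i is the correct match for query i.
    Returns the fraction of queries whose gold candidate ranks in the top k.
    """
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]        # top-k ids per query
    gold = np.arange(sim_matrix.shape[0])[:, None]        # (N, 1)
    return float((top_k == gold).any(axis=1).mean())
```

With one gold target per query, P@1 is simply top-1 accuracy; P@5 and P@10 relax the rank cutoff.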

5. Adaptation to Encoder-Decoder Architectures

The broader "MITRA-E" interface, as discussed by Zhang et al., involves conversion of Gemma 2 decoder-only models to efficient encoder-decoder variants. The encoder is constructed by copying $K$ layers from the original decoder and modifying self-attention to be bidirectional. The decoder mirrors Gemma 2 with an inserted cross-attention sublayer drawing from encoder outputs. Cross-attention is initialized from decoder self-attention (balanced) or randomly (unbalanced), with a cross-attention warmup phase in the latter case.
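The central change when turning copied decoder layers into encoder layers is swapping the causal self-attention mask for a bidirectional one; the weights themselves are reused unchanged. A minimal sketch of the two mask patterns (an illustration, not the actual Gemma 2 implementation):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-style mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-style mask: every position attends to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

# Converting a copied decoder layer into an encoder layer amounts to using
# bidirectional_mask(L) instead of causal_mask(L) in its self-attention,
# per the adaptation recipe described above.
```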

A table of relevant model configurations is as follows:

| Model | Layers | $d_\mathrm{model}$ | $d_\mathrm{ffn}$ | Heads (Q/KV) | Params |
|---|---|---|---|---|---|
| Gemma 2 2B | 26 | 2,304 | 18,432 | 8/4 | 2.0 B |
| Gemma 2 9B | 42 | 3,584 | 28,672 | 16/8 | 8.3 B |
| MITRA-E 2B–2B | 26–26 | 2,304 | 18,432 | 8/4 | 2.0 B |
| MITRA-E 9B–2B | 42–26* | 3,584–2,304 | 28,672–18,432 | — | — |

*Unbalanced: encoder–decoder layer counts/sizes differ.

Adaptation requires far less compute (≤2T tokens) than pretraining from scratch (8T tokens), with performance after 30–50B adaptation tokens on par with or exceeding the original decoder-only models. Bidirectional encoder attention and careful cross-attention initialization are empirically necessary for efficient adaptation (Zhang et al., 8 Apr 2025).

6. Comparative Efficiency and Ablation Insights

Balanced encoder–decoder MITRA-E variants match original Gemma 2 latency and FLOPs for equivalent total sequence lengths (e.g., input 4096 + output 4096 vs. 8192 token decoder-only), while unbalanced adapters (e.g., 9B–2B) offer the wall-clock inference performance of a 2B model with quality approaching that of a 9B model.
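The "equivalent total sequence length" comparison can be made concrete by counting how many tokens each architecture attends over while generating one output token; this is a back-of-the-envelope sketch with illustrative function names, ignoring feed-forward cost and constant factors:

```python
def decoder_only_context(prefix_len: int, step: int) -> int:
    """Tokens attended to when a decoder-only model generates output
    token number `step` after a `prefix_len`-token input: the whole
    prefix plus all previously generated tokens."""
    return prefix_len + step

def encoder_decoder_context(src_len: int, step: int) -> int:
    """Same count for the encoder-decoder variant: cross-attention
    covers the encoded source; self-attention covers prior outputs."""
    return src_len + step
```

For a 4096-token input, both architectures attend over the same number of tokens per generated token, which is why a balanced encoder-decoder split of 4096 + 4096 is compared against an 8192-token decoder-only pass; the unbalanced 9B–2B adapter then spends 2B-scale per-token decoder compute on that same context.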

Ablation studies reveal:

  • Keeping encoder attention causal severely degrades downstream task accuracy for both pretrained (PT) and instruction-tuned (IT) models.
  • Proper cross-attention warmup (optimal at $K_0 = 1000$ steps) is critical for unbalanced adapters.
  • Grouped-query attention (GQA) is retained in the encoder, since it yields comparable or superior task performance.

Mixing PrefixLM and UL2 objectives via weight averaging or staged training yields mixed or negative outcomes, indicating incompatible loss landscapes. At scale, adaptation consistently outperforms pretraining from scratch on downstream metrics.

7. Significance and Application Context

Gemma 2 MITRA-E demonstrates that moderate-scale domain-adapted LLMs, when combined with prompt-based contrastive finetuning on semantically curated and constructed datasets, can outperform larger and more general models on specialized cross-lingual retrieval tasks in low-resource settings. The released weights and benchmarks enable both NLP research and philological investigation of ancient Buddhist texts, facilitating scalable, accurate discovery of parallel passages, translations, commentary, and answers in classical Asian languages.

The MITRA-E architecture adaptation further generalizes the utility of Gemma 2 by providing an encoder-decoder pathway with improved downstream and few/zero-shot performance, compute and inference efficiency, and ease of transfer across model sizes (Nehrdich et al., 10 Jan 2026, Zhang et al., 8 Apr 2025).
