
Gemma 2 MITRA-E: Cross-Lingual Semantic Model

Updated 13 January 2026
  • Gemma 2 MITRA-E is a large-scale multilingual semantic embedding model that leverages contrastive learning to improve retrieval accuracy across classical Asian texts.
  • It employs a 9B-parameter decoder-only Transformer architecture with 42 layers and BPE tokenization, achieving state-of-the-art performance on cross-lingual benchmarks.
  • Domain-adapted pretraining with curated multilingual corpora and supervised retrieval finetuning enables precise semantic matching across Sanskrit, Pāḷi, Buddhist Chinese, Tibetan, and English texts.

Gemma 2 MITRA-E is a large-scale, multilingual semantic embedding model derived from the Gemma 2 architecture and incorporated into the MITRA framework for advanced cross-lingual information retrieval spanning Sanskrit, Pāḷi, Buddhist Chinese, Tibetan, and English. It couples domain-adapted pretraining with a contrastive embedding learning recipe, targeting semantic similarity and retrieval across Buddhist and classical Asian texts. MITRA-E builds on a highly specialized adaptation of the Gemma 2 LLM, establishing new state-of-the-art retrieval accuracy among open-source systems for classical language corpora (Nehrdich et al., 10 Jan 2026, Zhang et al., 8 Apr 2025).

1. Model Architecture

Gemma 2 MITRA-E uses a decoder-only Transformer architecture with 9 billion parameters, directly based on Gemma 2: 42 Transformer layers, each with 16 attention heads and a model dimension of 3,584, matching the Gemma 2 9B configuration tabulated in Section 5. Tokenization is performed with the byte-pair-encoding (BPE) tokenizer shared with Gemma 2. Sentence encoding is achieved by prepending a prompt so that the highest-layer hidden state of a designated token ("<EMBED>" or "<CLS>") is extracted as the final embedding, which is then $\ell_2$-normalized to unit length.

The embedding function thus maps each sequence to a dense vector $z \in \mathbb{R}^D$, where $D$ equals the model dimension, such that $\|z\|_2 = 1$. During retrieval, the cosine similarity $z \cdot z'$ is computed between queries and candidates. This design supports a unified interface for multilingual and cross-lingual semantic search (Nehrdich et al., 10 Jan 2026).
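The normalize-then-dot-product interface described above can be sketched as follows; this is a minimal NumPy illustration where the actual model forward pass that produces raw embeddings is elided, and the function names are hypothetical:

```python
import numpy as np

def l2_normalize(z: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit hypersphere, so ||z||_2 = 1."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query_emb: np.ndarray, candidate_embs: np.ndarray, top_k: int = 5):
    """Rank candidates by cosine similarity z . z'.

    Because both sides are unit-normalized, the plain dot product
    equals the cosine similarity.
    """
    q = l2_normalize(query_emb)          # (D,)
    c = l2_normalize(candidate_embs)     # (N, D)
    scores = c @ q                       # (N,) cosine similarities
    order = np.argsort(-scores)          # descending
    return order[:top_k], scores[order[:top_k]]
```

In practice the candidate embeddings would be precomputed and cached, so each query costs one forward pass plus one matrix–vector product.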

2. Data and Training Regimens

2.1 Domain Pretraining

Continued domain pretraining of Gemma 2 MITRA used a curated blend of monolingual and parallel corpora totaling 4.4 billion tokens over two epochs (maximum sequence length 1,024 tokens; DeepSpeed ZeRO-3; fp16 on 8×A100 GPUs; effective batch size 2M tokens per step):

  • Monolingual:
    • 40% English (academic translations, OCR, deduplication)
    • 20% Sanskrit + Pāḷi (digital sources)
    • 15% Buddhist Chinese (CBETA XML)
    • 5% Tibetan (ACIP)
  • Parallel:
    • 20% total: 1.74M Sanskrit⟷Chinese⟷Tibetan (MITRA-parallel), 1M Sanskrit⟷English (forthcoming), 2M Tibetan⟷English (monlam.ai), 149K Pāḷi⟷English, ≈41K Tibetan⟷Chinese (Kumarajiva), 31K Sanskrit⟷Chinese
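The blend shares above translate into approximate per-source token budgets out of the 4.4B-token total; the following is simple arithmetic over the stated percentages, not additional data from the source:

```python
TOTAL_TOKENS = 4.4e9  # pretraining corpus size stated in the recipe above

blend = {
    "English": 0.40,
    "Sanskrit + Pali": 0.20,
    "Buddhist Chinese": 0.15,
    "Tibetan": 0.05,
    "Parallel data": 0.20,
}

# e.g. English receives roughly 1.76B tokens, parallel data roughly 0.88B
token_budget = {name: share * TOTAL_TOKENS for name, share in blend.items()}

assert abs(sum(blend.values()) - 1.0) < 1e-9  # shares sum to 100%
```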

2.2 Semantic Retrieval Finetuning

Finetuning for MITRA-E targeted semantic retrieval with approximately 430,000 supervised pairs across the following cohorts:

  • Original human-curated aligned translations:
    • English→{Chinese, Pāḷi, Sanskrit, Tibetan}: 50,000 each
    • Various Pāḷi, Sanskrit, Chinese, and Tibetan translation pairs (e.g., Pāḷi⟷Chinese 4,809; Sanskrit⟷Tibetan 50,000; Tibetan⟷Chinese 50,000)
  • Synthetic instruction-mined pairs:
    • English/Same-language keywords (47,223/47,997)
    • English questions (43,321), English summaries (38,882), sentence-based retrieval (51,382)

Key hyperparameters for retrieval adaptation:

  • Learning rate: not specified (typical range $1 \times 10^{-5}$–$5 \times 10^{-5}$)
  • Batch size: not specified (typically 128–512)
  • Temperature $\tau$: tuned on a held-out set (often 0.05)
  • Optimizer: AdamW (weight decay $\approx 0.01$)
  • Epochs: 3–4 (4 for MT; likely similar for retrieval)
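Collected into one place, the reported and typical values above might look like the following configuration sketch; every value marked "assumed" is an illustrative choice within the stated typical range, not one confirmed by the source:

```python
# Illustrative retrieval-finetuning configuration for MITRA-E.
# Values marked "assumed" are typical choices, not reported by the authors.
retrieval_config = {
    "learning_rate": 2e-5,   # assumed; typical range 1e-5 to 5e-5
    "batch_size": 256,       # assumed; typically 128-512
    "temperature": 0.05,     # commonly used; tuned on a held-out set
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "epochs": 4,             # 4 reported for MT; likely similar for retrieval
    "max_seq_len": 1024,     # from the domain-pretraining setup above
}
```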

3. Contrastive Objective and Embedding Space

The retrieval head is trained using an in-batch InfoNCE loss. For a batch of $N$ query–positive pairs $(q_i, p_i)$, with $z_i = f_\theta(q_i)$ and $z_i^+ = f_\theta(p_i)$, and all other $p_j$ ($j \neq i$) serving as in-batch negatives:

$$L_{\mathrm{InfoNCE}} = -\log \frac{\exp(z_i \cdot z_i^+ / \tau)}{\sum_{j=1}^{N} \exp(z_i \cdot z_j^+ / \tau)}$$

with $\tau$ a learnable or preset temperature. All embedding vectors are $\ell_2$-normalized onto the unit hypersphere, guaranteeing $\|z\|_2 = 1$ for all samples.
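A minimal NumPy sketch of the in-batch InfoNCE objective as written above; the function name and the fixed default temperature are illustrative, not taken from the authors' code:

```python
import numpy as np

def info_nce_loss(z_q: np.ndarray, z_p: np.ndarray, tau: float = 0.05) -> float:
    """In-batch InfoNCE over (N, D) arrays of l2-normalized embeddings.

    Row i's positive is z_p[i]; the other N-1 rows of z_p serve as
    negatives, exactly as in the displayed equation.
    """
    sims = z_q @ z_p.T / tau  # (N, N) similarity logits
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the loss averages over all rows at once, one batch of $N$ pairs yields $N(N-1)$ implicit negatives for free, which is what makes in-batch contrastive training sample-efficient.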

4. Evaluation Benchmarks and Empirical Results

MITRA-E was benchmarked on a seven-task multilingual semantic similarity suite, pitted against BM25 (with English MT pivot), LaBSE, and BGE-M3 (base/fine-tuned). Retrieval is quantified via Precision@1 (P@1), @5, @10. Summary P@1 results are as follows:

| Retrieval Task | MITRA-E | BGE-M3 ft | LaBSE | BM25 |
|---|---|---|---|---|
| En→Skt | 95% | 84% | 49% | 33% |
| En→Tib | 95% | 76% | 73% | 38% |
| En→Chi | 90% | 69% | 53% | 23% |
| En→Pāḷi | 86% | 60% | 30% | 28% |
| Skt⟷Tib | 93% | 77% | 54% | 42% |
| Skt⟷Chi | 79% | 40% | 19% | 14% |
| Chi⟷Tib | 72% | 46% | 32% | 17% |
| Skt→BGh commentary | 89% | 60% | 31% | 29% |
| Chi→T1604 commentary | 88% | 44% | 38% | 14% |
| Cross-lingual QA (Chi ans.) | 57% | 25% | 12% | 2% |
| Cross-lingual QA (Skt ans.) | 56% | 46% | 28% | 3% |
| Cross-lingual QA (Tib ans.) | 49% | 28% | 15% | 2% |
| Cross-lingual QA (Pāḷi ans.) | 36% | 21% | 13% | 2% |

Across all settings, MITRA-E achieves a new state-of-the-art for open-source models in cross-lingual match and hard negative settings, with substantial improvements in low-resource classical languages (Nehrdich et al., 10 Jan 2026).
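Precision@k over such a benchmark can be computed as follows, under the assumed convention that query i's gold match is candidate i; this is a minimal sketch for exposition, not the authors' evaluation code:

```python
import numpy as np

def precision_at_k(sim_matrix: np.ndarray, k: int = 1) -> float:
    """sim_matrix[i, j] holds the similarity of query i to candidate j.

    Assumed gold convention: candidate i is the correct match for query i.
    Returns the fraction of queries whose gold candidate ranks in the top k.
    """
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]        # top-k ids per query
    gold = np.arange(sim_matrix.shape[0])[:, None]        # (N, 1)
    return float((top_k == gold).any(axis=1).mean())
```

With one gold target per query, P@1 is simply top-1 accuracy; P@5 and P@10 relax the rank cutoff.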

5. Adaptation to Encoder-Decoder Architectures

The broader "MITRA-E" interface, as discussed by Zhang et al., involves conversion of Gemma 2 decoder-only models to efficient encoder-decoder variants. The encoder is constructed by copying $K$ layers from the original decoder and modifying self-attention to be bidirectional. The decoder mirrors Gemma 2 with an inserted cross-attention sublayer drawing from encoder outputs. Cross-attention is initialized from decoder self-attention (balanced) or randomly (unbalanced), with a cross-attention warmup phase in the latter case.
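The central change when turning copied decoder layers into encoder layers is swapping the causal self-attention mask for a bidirectional one; the weights themselves are reused unchanged. A minimal sketch of the two mask patterns (an illustration, not the actual Gemma 2 implementation):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-style mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-style mask: every position attends to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

# Converting a copied decoder layer into an encoder layer amounts to using
# bidirectional_mask(L) instead of causal_mask(L) in its self-attention,
# per the adaptation recipe described above.
```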

A table of relevant model configurations is as follows:

| Model | Layers | $d_\mathrm{model}$ | $d_\mathrm{ffn}$ | Heads (Q/KV) | Params |
|---|---|---|---|---|---|
| Gemma 2 2B | 26 | 2,304 | 18,432 | 8/4 | 2.0 B |
| Gemma 2 9B | 42 | 3,584 | 28,672 | 16/8 | 8.3 B |
| MITRA-E 2B–2B | 26–26 | 2,304 | 18,432 | 8/4 | 2.0 B |
| MITRA-E 9B–2B | 42–26* | 3,584–2,304 | 28,672–18,432 | — | — |

*Unbalanced: encoder–decoder layer counts/sizes differ.

Adaptation requires far less compute (≤2T tokens) than pretraining from scratch (8T tokens), with performance after 30–50B adaptation tokens on par with or exceeding the original decoder-only models. Bidirectional encoder attention and careful cross-attention initialization are empirically necessary for efficient adaptation (Zhang et al., 8 Apr 2025).

6. Comparative Efficiency and Ablation Insights

Balanced encoder–decoder MITRA-E variants match original Gemma 2 latency and FLOPs for equivalent total sequence lengths (e.g., input 4096 + output 4096 vs. 8192 token decoder-only), while unbalanced adapters (e.g., 9B–2B) offer the wall-clock inference performance of a 2B model with quality approaching that of a 9B model.
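The "equivalent total sequence length" comparison can be made concrete by counting how many tokens each architecture attends over while generating one output token; this is a back-of-the-envelope sketch with illustrative function names, ignoring feed-forward cost and constant factors:

```python
def decoder_only_context(prefix_len: int, step: int) -> int:
    """Tokens attended to when a decoder-only model generates output
    token number `step` after a `prefix_len`-token input: the whole
    prefix plus all previously generated tokens."""
    return prefix_len + step

def encoder_decoder_context(src_len: int, step: int) -> int:
    """Same count for the encoder-decoder variant: cross-attention
    covers the encoded source; self-attention covers prior outputs."""
    return src_len + step
```

For a 4096-token input, both architectures attend over the same number of tokens per generated token, which is why a balanced encoder-decoder split of 4096 + 4096 is compared against an 8192-token decoder-only pass; the unbalanced 9B–2B adapter then spends 2B-scale per-token decoder compute on that same context.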

Ablation studies reveal:

  • Keeping encoder attention causal severely degrades downstream task accuracy for both pretrained (PT) and instruction-tuned (IT) models.
  • Proper cross-attention warmup (optimal at $K_0 = 1000$ steps) is critical for unbalanced adapters.
  • Grouped-query attention (GQA) is retained in the encoder, since it yields comparable or superior task performance.

Mixing PrefixLM and UL2 objectives via weight averaging or staged training yields mixed or negative outcomes, indicating incompatible loss landscapes. At scale, adaptation consistently outperforms pretraining from scratch on downstream metrics.

7. Significance and Application Context

Gemma 2 MITRA-E demonstrates that moderate-scale domain-adapted LLMs, when combined with prompt-based contrastive finetuning on semantically curated and constructed datasets, can outperform larger and more general models on specialized cross-lingual retrieval tasks in low-resource settings. The released weights and benchmarks enable both NLP research and philological investigation of ancient Buddhist texts, facilitating scalable, accurate discovery of parallel passages, translations, commentary, and answers in classical Asian languages.

The MITRA-E architecture adaptation further generalizes the utility of Gemma 2 by providing an encoder-decoder pathway with improved downstream and few/zero-shot performance, compute and inference efficiency, and ease of transfer across model sizes (Nehrdich et al., 10 Jan 2026, Zhang et al., 8 Apr 2025).
