Gemma 2 MITRA-MT: Buddhist Text Translation
- Gemma 2 MITRA-MT is a 9-billion-parameter decoder-only Transformer, built on Gemma 2 9B, designed for translating ancient Buddhist texts from Sanskrit, Pāḷi, Buddhist Chinese, and Tibetan into English.
- It employs a two-phase training pipeline, combining continuous pretraining on a 4.4B-token corpus with instruction-based fine-tuning on curated parallel corpora to optimize translation performance.
- Benchmark evaluations demonstrate that Gemma 2 MITRA-MT outperforms baseline models by up to 15 GEMBA points, establishing it as a state-of-the-art system for under-resourced ancient languages.
Gemma 2 MITRA-MT is a 9-billion-parameter, domain-specialized decoder-only Transformer model developed for state-of-the-art machine translation from Sanskrit, Pāḷi, Buddhist Chinese, and Tibetan into English, with a focus on ancient Buddhist literature and related multilingual parallel corpora. A product of the MITRA framework, it combines large-scale continuous pretraining on Buddhist corpora with instruction-based fine-tuning to achieve superior translation quality over baseline open models and smaller, domain-specific neural machine translation systems (Nehrdich et al., 10 Jan 2026).
1. Model Architecture
Gemma 2 MITRA-MT builds on the Gemma 2 9B base architecture, a decoder-only Transformer. Key characteristics include:
- Approximately 9 billion total parameters.
- 42 Transformer layers, with a model (hidden) dimension of 3,584 and a GeGLU feed-forward dimension of 14,336.
- Grouped-query attention with 16 query heads and 8 key–value heads per layer, using rotary (RoPE) positional embeddings.
- Architectural features inherited from Gemma 2, including interleaved local sliding-window and global attention and attention-logit soft-capping; training additionally uses optimized attention kernels, fused MLPs, and mixed-precision (fp16) arithmetic.
- Continuous domain adaptation follows the "Tower recipe," with no structural modifications to the Transformer beyond this adaptation strategy.
2. Pretraining and Fine-Tuning Data
The training protocol includes two main phases: continuous pretraining and instruction-based fine-tuning for machine translation (MT).
2.1 Continuous Pretraining
- The total pretraining corpus comprises approximately 4.4 billion tokens.
- Data composition:
- Monolingual: 40% English academic and translated Buddhist texts, 20% Sanskrit and Pāḷi, 15% Buddhist Chinese, 5% Tibetan.
- Parallel data (20%): 1.74M sentence pairs from MITRA-parallel (combinations of Sanskrit, Chinese, Tibetan); additional parallel pairs include 1M Sanskrit↔English, 2M Tibetan↔English, 41K Tibetan↔Chinese (Kumarajiva), 31K Sanskrit↔Chinese, and 149K Pāḷi↔English.
- MITRA-parallel cleaning and filtering pipeline:
- Source documents translated to English using a domain-tuned MT model.
- BGE M3 embeddings and k-NN aggregation to form coarse parallel clusters.
- Sentence-level alignment performed with BERTAlign and LaBSE embeddings, applying a moving-average cosine score threshold.
- Manual spot checks indicate 73% “perfect,” 16% “partly correct,” and 11% “wrong” alignments.
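The moving-average cosine thresholding step above can be illustrated with a minimal sketch. This is not the paper's implementation: the similarity scores would in practice come from LaBSE embeddings of BERTAlign candidate pairs, and the window size and threshold here are illustrative assumptions.

```python
import numpy as np

def moving_average_filter(sim_scores, window=3, threshold=0.6):
    """Keep candidate sentence alignments whose smoothed cosine
    similarity stays above a threshold.

    sim_scores: cosine similarities of consecutive candidate pairs,
    e.g. from an embedding model such as LaBSE.
    window, threshold: illustrative values, not the paper's settings.
    """
    scores = np.asarray(sim_scores, dtype=float)
    kernel = np.ones(window) / window
    # Centered moving average; boundary positions are zero-padded,
    # so scores at the edges are slightly attenuated.
    smoothed = np.convolve(scores, kernel, mode="same")
    keep = smoothed >= threshold
    return keep, smoothed
```

Smoothing suppresses isolated high-similarity pairs inside otherwise misaligned runs, which is the point of using a moving average rather than a per-pair cutoff.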
2.2 Machine-Translation Instruction Fine-Tuning
The fine-tuning dataset consists of 10,000 multi-direction short pairs and 30,979 document-level examples, mined using the Claude 3.5 Sonnet API (September 2024).
- The model is fine-tuned for four epochs on this set.
3. Training Procedure and Hyperparameters
3.1 Continuous Pretraining
- Optimizer: AdamW, with fp16 enabled, implemented using DeepSpeed ZeRO Stage 3.
- Batch size: approximately 2 million tokens per optimizer step, with a sequence length of 1,024 tokens.
- Total duration: 2 epochs over the entire 4.4 billion token corpus, requiring approximately four weeks on 8 × NVIDIA A100 (40GB) GPUs.
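The reported setup (AdamW, fp16, DeepSpeed ZeRO Stage 3, ~2M tokens per step at length 1,024 on 8 GPUs) might be expressed as a DeepSpeed configuration along the following lines. This is a sketch, not the authors' config: the micro-batch size, accumulation steps, learning rate, and weight decay are assumptions chosen so the token budget works out.

```python
# Minimal DeepSpeed-style configuration sketch. Micro-batch,
# accumulation, lr, and weight decay are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumption
    "gradient_accumulation_steps": 64,     # assumption
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.01},  # assumptions
    },
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Sanity check: tokens per optimizer step across 8 GPUs.
seq_len, gpus = 1024, 8
tokens_per_step = (ds_config["train_micro_batch_size_per_gpu"]
                   * ds_config["gradient_accumulation_steps"]
                   * seq_len * gpus)
# 4 * 64 * 1024 * 8 = 2,097,152 tokens, i.e. ~2M per step.
```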
3.2 MT Fine-Tuning
- Objective: standard token-level cross-entropy (negative log-likelihood).
- Optimization and scheduling follow Gemma 2 defaults (AdamW with linear warmup and linear decay).
- Batch size: approximately 512 token pairs per step.
- Four epochs on the combined instruction dataset.
- Dropout regularization set to 0.1; no additional label smoothing.
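The cross-entropy objective above can be written out concretely. The sketch below assumes, as is typical for instruction-tuned MT (the summary does not spell this out), that loss is computed only on target-side tokens, with prompt/source positions masked out.

```python
import numpy as np

def mt_nll(logits, targets, loss_mask):
    """Masked token-level negative log-likelihood.

    logits:    (T, V) unnormalized next-token scores.
    targets:   (T,) gold next-token ids.
    loss_mask: (T,) 1.0 for target-side tokens, 0.0 for prompt/source
               tokens, which are excluded from the loss (an assumption
               about the setup, not stated in the summary).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    tok_nll = -logp[np.arange(len(targets)), targets]
    return (tok_nll * loss_mask).sum() / loss_mask.sum()
```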
4. Evaluation Benchmarks and Performance
4.1 Benchmark Datasets
Benchmark evaluations target four language pairs, using the following test sets:
- Buddhist Chinese→English: 2,662 sentence pairs (Nehrdich et al., 2025).
- Sanskrit→English: 5,552 pairs drawn from sūtra and Vedic domains.
- Tibetan→English: 4,053 random pairs.
- Pāḷi→English: 1,900 non-canonical examples.
4.2 Metrics
- Primary: GEMBA, using Gemini 2.0 Flash as judge.
- Secondary: chrF and BLEURT (especially for Chinese→English).
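GEMBA-style evaluation prompts an LLM judge to score each translation on a 0–100 scale and parses a number from the reply. The sketch below follows the general GEMBA direct-assessment recipe; the exact prompt used with Gemini 2.0 Flash is not given in this summary, so the template wording and the parsing logic are illustrative.

```python
import re

# Illustrative GEMBA-style direct-assessment template (assumption;
# the actual judge prompt is not specified in the summary).
GEMBA_PROMPT = (
    'Score the following translation from {src_lang} to {tgt_lang} on a '
    'continuous scale from 0 to 100, where 0 means "no meaning preserved" '
    'and 100 means "perfect meaning and grammar".\n\n'
    '{src_lang} source: "{source}"\n'
    '{tgt_lang} translation: "{hypothesis}"\n'
    'Score:'
)

def build_gemba_prompt(source, hypothesis, src_lang, tgt_lang):
    return GEMBA_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                               source=source, hypothesis=hypothesis)

def parse_gemba_score(judge_reply):
    """Extract the first number in [0, 100] from the judge's reply."""
    m = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if not m:
        return None
    score = float(m.group())
    return score if 0 <= score <= 100 else None
```

Corpus-level GEMBA is then the mean of per-segment scores over the test set.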
4.3 Comparative Performance
Gemma 2 MITRA-MT shows clear gains over competing model families and previous domain-specific systems, with substantial margins in GEMBA scores:
| Model | Chi→En | Tib→En | San→En | Pāḷi→En | Avg |
|---|---|---|---|---|---|
| Mistral 7B IT | 56 | 50 | 68 | 45 | 55 |
| Llama 3 8B IT | 72 | 78 | 82 | 70 | 75 |
| Qwen 2.5 7B | 60 | 45 | 65 | 55 | 56 |
| Gemma 2 9B IT | 85 | 82 | 88 | 80 | 84 |
| Gemma 3 12B IT | 88 | 85 | 90 | 82 | 86 |
| Gemma 3 27B IT | 89 | 86 | 91 | 83 | 87 |
| Gemma 2 MITRA-MT | 96 | 98 | 98 | 90 | 95 |
For Chinese→English, detailed comparison shows:
| Model | chrF | BLEURT | GEMBA |
|---|---|---|---|
| MITRA NMT ZH→EN | 32.14 | 0.551 | 67.41 |
| Gemma 2 MITRA-MT | 36.59 | 0.579 | 82.78 |
Gemma 2 MITRA-MT exceeds the best open-model baseline (Gemma 3 27B) by 7–12 GEMBA points per source language (+8 on average), and it outperforms MITRA NMT ZH→EN by +15 GEMBA, +4.45 chrF, and +0.028 BLEURT (Nehrdich et al., 10 Jan 2026).
5. Implementation and Availability
Training and deployment leverage modern open-source and scalable deep learning infrastructure:
- Pretraining executed on 8 × NVIDIA A100 (40GB) GPUs with fp16 precision and DeepSpeed ZeRO 3.
- Inference is supported via HuggingFace Transformers coupled with DeepSpeed inference, enabling a throughput of approximately 1,000 tokens/second per A100 GPU at batch size 1.
- The in-memory footprint is under 14 GB per instance (fp16).
- All associated code, training scripts, and model weights are released on GitHub: https://github.com/dharmamitra/mitra-parallel.
- Interactive, online serving is accessible through the Dharma Nexus search interface (https://dharmanexus.org).
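A minimal inference sketch with Hugging Face Transformers might look as follows. The model identifier and the prompt template are placeholders (assumptions): the released weights and the model's actual prompt format are distributed via the dharmamitra GitHub repository. The heavy imports are kept inside `translate()` so the prompt helper works standalone.

```python
def build_translation_prompt(source_text, src_lang="Sanskrit"):
    # Placeholder instruction template; the released model's exact
    # prompt format is defined in the dharmamitra repository.
    return (f"Translate the following {src_lang} passage into English.\n\n"
            f"{source_text}\n\nEnglish:")

def translate(source_text, src_lang="Sanskrit",
              model_name="dharmamitra/gemma-2-mitra-mt"):  # hypothetical id
    # Imported lazily so the prompt helper above has no dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")
    prompt = build_translation_prompt(source_text, src_lang)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```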
6. Domain Significance and Research Context
Gemma 2 MITRA-MT represents a significant advance for computational linguistics in Buddhist and classical Asian literature. By leveraging a domain-tuned, large-scale parallel corpus and a high-capacity neural architecture, it addresses the longstanding challenge of translating under-resourced ancient languages with extensive textual parallels. This enables downstream research in philology, comparative literature, and multilingual semantic retrieval in these domains. The openly released resources support robustness, reproducibility, and further domain adaptation (Nehrdich et al., 10 Jan 2026).