Gemma 2 MITRA-MT: Buddhist Text Translation
- Gemma 2 MITRA-MT is a 9-billion-parameter decoder-only Transformer, built on Gemma 2 9B, designed for translating ancient Buddhist texts from Sanskrit, Pāḷi, Buddhist Chinese, and Tibetan into English.
- It employs a two-phase training pipeline, combining continuous pretraining on a 4.4B-token corpus with instruction-based fine-tuning on curated parallel corpora to optimize translation performance.
- Benchmark evaluations demonstrate that Gemma 2 MITRA-MT outperforms baseline models by up to 15 GEMBA points, establishing it as a state-of-the-art system for under-resourced ancient languages.
Gemma 2 MITRA-MT is a 9-billion-parameter, domain-specialized decoder-only Transformer model developed for state-of-the-art machine translation from Sanskrit, Pāḷi, Buddhist Chinese, and Tibetan into English, with a focus on ancient Buddhist literature and related multilingual parallel corpora. A product of the MITRA framework, it combines large-scale continuous pretraining on Buddhist corpora with instruction-based fine-tuning to achieve superior translation quality over baseline open models and smaller, domain-specific neural machine translation systems (Nehrdich et al., 10 Jan 2026).
1. Model Architecture
Gemma 2 MITRA-MT builds on the Gemma 2 9B base architecture, a decoder-only Transformer. Key characteristics include:
- Approximately 9 billion total parameters.
- 42 Transformer layers, with a model (hidden) dimension of 3,584 and a GeGLU feed-forward dimension of 14,336.
- Grouped-query attention with 16 query heads and 8 key–value heads per layer, using rotary (RoPE) positional embeddings.
- Architectural features inherited from Gemma 2, including interleaved local sliding-window and global attention and attention-logit soft-capping; training additionally uses optimized attention kernels, fused MLPs, and mixed-precision (fp16) arithmetic.
- Continuous domain adaptation follows the "Tower recipe," with no structural modifications to the Transformer beyond this adaptation strategy.
2. Pretraining and Fine-Tuning Data
The training protocol includes two main phases: continuous pretraining and instruction-based fine-tuning for machine translation (MT).
2.1 Continuous Pretraining
- The total pretraining corpus comprises approximately 4.4 billion tokens.
- Data composition:
- Monolingual: 40% English academic and translated Buddhist texts, 20% Sanskrit and Pāḷi, 15% Buddhist Chinese, 5% Tibetan.
- Parallel data (20%): 1.74M sentence pairs from MITRA-parallel (combinations of Sanskrit, Chinese, Tibetan); additional parallel pairs include 1M Sanskrit↔English, 2M Tibetan↔English, 41K Tibetan↔Chinese (Kumarajiva), 31K Sanskrit↔Chinese, and 149K Pāḷi↔English.
- MITRA-parallel cleaning and filtering pipeline:
- Source documents translated to English using a domain-tuned MT model.
- BGE M3 embeddings and k-NN aggregation to form coarse parallel clusters.
- Sentence-level alignment performed with BERTAlign and LaBSE embeddings, applying a moving-average cosine score threshold.
- Manual spot checks indicate 73% “perfect,” 16% “partly correct,” and 11% “wrong” alignments.
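The moving-average cosine thresholding step above can be illustrated with a minimal sketch. This is not the paper's implementation: the similarity scores would in practice come from LaBSE embeddings of BERTAlign candidate pairs, and the window size and threshold here are illustrative assumptions.

```python
import numpy as np

def moving_average_filter(sim_scores, window=3, threshold=0.6):
    """Keep candidate sentence alignments whose smoothed cosine
    similarity stays above a threshold.

    sim_scores: cosine similarities of consecutive candidate pairs,
    e.g. from an embedding model such as LaBSE.
    window, threshold: illustrative values, not the paper's settings.
    """
    scores = np.asarray(sim_scores, dtype=float)
    kernel = np.ones(window) / window
    # Centered moving average; boundary positions are zero-padded,
    # so scores at the edges are slightly attenuated.
    smoothed = np.convolve(scores, kernel, mode="same")
    keep = smoothed >= threshold
    return keep, smoothed
```

Smoothing suppresses isolated high-similarity pairs inside otherwise misaligned runs, which is the point of using a moving average rather than a per-pair cutoff.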
2.2 Machine-Translation Instruction Fine-Tuning
The fine-tuning dataset consists of 10,000 multi-direction short pairs and 30,979 document-level examples, mined using the Claude 3.5 Sonnet API (September 2024).
- The model is fine-tuned for four epochs on this set.
3. Training Procedure and Hyperparameters
3.1 Continuous Pretraining
- Optimizer: AdamW, with fp16 enabled, implemented using DeepSpeed ZeRO Stage 3.
- Batch size: approximately 2 million tokens per optimizer step, with a sequence length of 1,024 tokens.
- Total duration: 2 epochs over the entire 4.4 billion token corpus, requiring approximately four weeks on 8 × NVIDIA A100 (40GB) GPUs.
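The reported setup (AdamW, fp16, DeepSpeed ZeRO Stage 3, ~2M tokens per step at length 1,024 on 8 GPUs) might be expressed as a DeepSpeed configuration along the following lines. This is a sketch, not the authors' config: the micro-batch size, accumulation steps, learning rate, and weight decay are assumptions chosen so the token budget works out.

```python
# Minimal DeepSpeed-style configuration sketch. Micro-batch,
# accumulation, lr, and weight decay are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumption
    "gradient_accumulation_steps": 64,     # assumption
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.01},  # assumptions
    },
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Sanity check: tokens per optimizer step across 8 GPUs.
seq_len, gpus = 1024, 8
tokens_per_step = (ds_config["train_micro_batch_size_per_gpu"]
                   * ds_config["gradient_accumulation_steps"]
                   * seq_len * gpus)
# 4 * 64 * 1024 * 8 = 2,097,152 tokens, i.e. ~2M per step.
```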
3.2 MT Fine-Tuning
- Objective: standard token-level cross-entropy (negative log-likelihood).
- Optimization and scheduling follow Gemma 2 defaults (AdamW with linear warmup and linear decay).
- Batch size: approximately 512 token pairs per step.
- Four epochs on the combined instruction dataset.
- Dropout regularization set to 0.1; no additional label smoothing.
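The cross-entropy objective above can be written out concretely. The sketch below assumes, as is typical for instruction-tuned MT (the summary does not spell this out), that loss is computed only on target-side tokens, with prompt/source positions masked out.

```python
import numpy as np

def mt_nll(logits, targets, loss_mask):
    """Masked token-level negative log-likelihood.

    logits:    (T, V) unnormalized next-token scores.
    targets:   (T,) gold next-token ids.
    loss_mask: (T,) 1.0 for target-side tokens, 0.0 for prompt/source
               tokens, which are excluded from the loss (an assumption
               about the setup, not stated in the summary).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    tok_nll = -logp[np.arange(len(targets)), targets]
    return (tok_nll * loss_mask).sum() / loss_mask.sum()
```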
4. Evaluation Benchmarks and Performance
4.1 Benchmark Datasets
Benchmark evaluations target four language pairs, using the following test sets:
- Buddhist Chinese→English: 2,662 sentence pairs (Nehrdich et al., 2025).
- Sanskrit→English: 5,552 pairs drawn from sūtra and Vedic domains.
- Tibetan→English: 4,053 random pairs.
- Pāḷi→English: 1,900 non-canonical examples.
4.2 Metrics
- Primary: GEMBA, using Gemini 2.0 Flash as judge.
- Secondary: chrF and BLEURT (especially for Chinese→English).
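GEMBA-style evaluation prompts an LLM judge to score each translation on a 0–100 scale and parses a number from the reply. The sketch below follows the general GEMBA direct-assessment recipe; the exact prompt used with Gemini 2.0 Flash is not given in this summary, so the template wording and the parsing logic are illustrative.

```python
import re

# Illustrative GEMBA-style direct-assessment template (assumption;
# the actual judge prompt is not specified in the summary).
GEMBA_PROMPT = (
    'Score the following translation from {src_lang} to {tgt_lang} on a '
    'continuous scale from 0 to 100, where 0 means "no meaning preserved" '
    'and 100 means "perfect meaning and grammar".\n\n'
    '{src_lang} source: "{source}"\n'
    '{tgt_lang} translation: "{hypothesis}"\n'
    'Score:'
)

def build_gemba_prompt(source, hypothesis, src_lang, tgt_lang):
    return GEMBA_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                               source=source, hypothesis=hypothesis)

def parse_gemba_score(judge_reply):
    """Extract the first number in [0, 100] from the judge's reply."""
    m = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if not m:
        return None
    score = float(m.group())
    return score if 0 <= score <= 100 else None
```

Corpus-level GEMBA is then the mean of per-segment scores over the test set.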
4.3 Comparative Performance
Gemma 2 MITRA-MT shows clear gains over competing model families and previous domain-specific systems, with substantial margins in GEMBA scores:
| Model | Chi→En | Tib→En | San→En | Pāḷi→En | Avg |
|---|---|---|---|---|---|
| Mistral 7B IT | 56 | 50 | 68 | 45 | 55 |
| Llama 3 8B IT | 72 | 78 | 82 | 70 | 75 |
| Qwen 2.5 7B | 60 | 45 | 65 | 55 | 56 |
| Gemma 2 9B IT | 85 | 82 | 88 | 80 | 84 |
| Gemma 3 12B IT | 88 | 85 | 90 | 82 | 86 |
| Gemma 3 27B IT | 89 | 86 | 91 | 83 | 87 |
| Gemma 2 MITRA-MT | 96 | 98 | 98 | 90 | 95 |
For Chinese→English, detailed comparison shows:
| Model | chrF | BLEURT | GEMBA |
|---|---|---|---|
| MITRA NMT ZH→EN | 32.14 | 0.551 | 67.41 |
| Gemma 2 MITRA-MT | 36.59 | 0.579 | 82.78 |
Gemma 2 MITRA-MT exceeds the best open-model baseline (Gemma 3 27B) by 7–12 GEMBA points per source language (+8 on average), and it outperforms MITRA NMT ZH→EN by +15 GEMBA, +4.45 chrF, and +0.028 BLEURT (Nehrdich et al., 10 Jan 2026).
5. Implementation and Availability
Training and deployment leverage modern open-source and scalable deep learning infrastructure:
- Pretraining executed on 8 × NVIDIA A100 (40GB) GPUs with fp16 precision and DeepSpeed ZeRO 3.
- Inference is supported via HuggingFace Transformers coupled with DeepSpeed inference, enabling a throughput of approximately 1,000 tokens/second per A100 GPU at batch size 1.
- The in-memory footprint is under 14 GB per instance (fp16).
- All associated code, training scripts, and model weights are released on GitHub: https://github.com/dharmamitra/mitra-parallel.
- Interactive, online serving is accessible through the Dharma Nexus search interface (https://dharmanexus.org).
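A minimal inference sketch with Hugging Face Transformers might look as follows. The model identifier and the prompt template are placeholders (assumptions): the released weights and the model's actual prompt format are distributed via the dharmamitra GitHub repository. The heavy imports are kept inside `translate()` so the prompt helper works standalone.

```python
def build_translation_prompt(source_text, src_lang="Sanskrit"):
    # Placeholder instruction template; the released model's exact
    # prompt format is defined in the dharmamitra repository.
    return (f"Translate the following {src_lang} passage into English.\n\n"
            f"{source_text}\n\nEnglish:")

def translate(source_text, src_lang="Sanskrit",
              model_name="dharmamitra/gemma-2-mitra-mt"):  # hypothetical id
    # Imported lazily so the prompt helper above has no dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")
    prompt = build_translation_prompt(source_text, src_lang)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```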
6. Domain Significance and Research Context
Gemma 2 MITRA-MT represents a significant advance for computational linguistics in Buddhist and classical Asian literature. By leveraging a domain-tuned, large-scale parallel corpus and a high-capacity neural architecture, it addresses the longstanding challenge of translating under-resourced ancient languages with extensive textual parallels. This enables downstream research in philology, comparative literature, and multilingual semantic retrieval in these domains. The openly released resources support robustness, reproducibility, and further domain adaptation (Nehrdich et al., 10 Jan 2026).