
ByT5-Sanskrit: Byte-Level Model for Sanskrit

Updated 13 February 2026
  • ByT5-Sanskrit is a byte-level Transformer model designed for Sanskrit, leveraging raw UTF-8 tokenization to capture its unique orthographic and morphological features.
  • It employs span corruption pretraining on over 5.43 billion IAST-byte characters, achieving superior performance in tasks like word segmentation, OCR correction, and dependency parsing.
  • Its multitask fine-tuning across diverse Sanskrit NLP tasks enables robust generalization, making it valuable for applications such as accent restoration, lemmatization, and digital philology.

ByT5-Sanskrit is a byte-level, encoder–decoder Transformer LLM specifically pretrained and fine-tuned for natural language processing tasks in Sanskrit, a morphologically rich and historically significant, but digitally low-resource, Indo-European language. Leveraging raw UTF-8 byte tokenization and large-scale romanized (IAST/SLP1) Sanskrit corpora, ByT5-Sanskrit provides a unified, deployment-efficient foundation for a spectrum of downstream computational linguistics tasks: from word segmentation, lemmatization, and morphosyntactic tagging to dependency parsing, OCR post-correction, accent restoration in Vedic texts, and poetry-to-prose transformation. The model and its variant baselines establish new state-of-the-art results by avoiding the limitations of subword tokenization and lexicon-driven approaches, demonstrating that byte-level models are particularly well suited for morphologically complex, highly compounding, inflectional languages such as Sanskrit (Nehrdich et al., 2024, P et al., 28 Nov 2025, Maheshwari et al., 2022, Das et al., 11 Nov 2025).

1. Model Architecture and Pretraining

ByT5-Sanskrit adopts the “ByT5-Base” Transformer specification (12 encoder and 12 decoder layers, d_model = 768, d_ff = 3072, 12 attention heads, learned positional embeddings), with ∼582M parameters (Nehrdich et al., 2024). It processes raw Unicode byte sequences (a 256-token vocabulary, no subword splits), which allows complete coverage of Sanskrit’s orthographic and phonetic diversity. Pretraining employs the “span corruption” (denoising/infilling) objective, where random spans are replaced by sentinels and the model reconstructs the original sequence:

$$\mathcal{L}_{\mathrm{pretrain}} = -\,\mathbb{E}_{x}\,\mathbb{E}_{\tilde{x}\sim \mathrm{corrupt}(x)}\;\sum_{t=1}^{L}\log p_{\theta}\bigl(x_t \mid \tilde{x}\bigr)$$

The pretraining corpus consists of ≈5.43 billion IAST-byte characters, sourced from IndicLLMSuite, GRETIL, and the Digital Sanskrit Buddhist Canon (Nehrdich et al., 2024). Preprocessing includes script transliteration to IAST (or SLP1 for some baselines), exclusion of synthetic splits, and strict byte-encoding, with Unicode normalization (NFD) for diacritic–combining mark separation where required (P et al., 28 Nov 2025).
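The span-corruption objective above can be sketched directly on raw bytes. The sentinel id values and span-sampling scheme below are simplified illustrations, not ByT5's exact vocabulary layout or corruption recipe:

```python
import random

def span_corrupt(byte_ids, span_len=3, n_spans=2, sentinel_start=259, seed=0):
    """Toy span-corruption objective: replace random non-overlapping spans
    of byte ids with sentinel ids; the target lists each sentinel followed
    by the bytes it replaced. Sentinel ids here are illustrative and do
    not reproduce ByT5's actual special-token layout."""
    rng = random.Random(seed)
    ids = list(byte_ids)
    gap = len(ids) // n_spans
    # one span per segment keeps the sampled spans non-overlapping
    starts = [i * gap + rng.randrange(gap - span_len) for i in range(n_spans)]
    corrupted, target, prev = [], [], 0
    for k, s in enumerate(starts):
        sentinel = sentinel_start + k
        corrupted += ids[prev:s] + [sentinel]   # drop the span, mark with sentinel
        target += [sentinel] + ids[s:s + span_len]  # target reproduces the span
        prev = s + span_len
    corrupted += ids[prev:]
    return corrupted, target

byte_ids = list("dharmakṣetre kurukṣetre".encode("utf-8"))
inp, tgt = span_corrupt(byte_ids)
```

The decoder is trained to emit the dropped spans, each introduced by its sentinel, which is exactly the denoising/infilling loss written above.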

2. Tokenization Strategies and Phonetic Normalization

All input strings are romanized to IAST or SLP1 before byte-tokenization. The SLP1-based variant (ByT5+SLP1) is notable for post-OCR correction: Devanāgarī input is transliterated into an ASCII-encoded SLP1 sequence that makes all phonetic/orthographic distinctions explicit, then split into bytes (Maheshwari et al., 2022). For accent restoration in Vedic Sanskrit, Unicode Normalization Form D (NFD) separates base characters from accent-mark combining bytes, enabling the model to learn accent placement in a “mark-aware” fashion (P et al., 28 Nov 2025). This pure byte-level approach eliminates out-of-vocabulary and byte-length variability, supporting high type-token ratios and cross-domain robustness (Nehrdich et al., 2024, Maheshwari et al., 2022).
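The effect of NFD normalization on byte-level input can be seen directly with Python's standard library; the example word is illustrative:

```python
import unicodedata

def to_bytes_nfd(text):
    """Normalize to NFD so precomposed characters split into base
    character + combining mark, then encode to raw UTF-8 bytes -- the
    only "tokens" a byte-level model ever sees."""
    return list(unicodedata.normalize("NFD", text).encode("utf-8"))

word = unicodedata.normalize("NFC", "agní")   # precomposed: í is one code point
nfc_bytes = list(word.encode("utf-8"))        # í encodes as 2 UTF-8 bytes
nfd_bytes = to_bytes_nfd(word)                # i (1 byte) + combining acute (2 bytes)
```

Under NFD the accent mark becomes its own byte subsequence, so the model can learn to predict accent placement separately from the base character, which is the "mark-aware" behavior described above.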

3. Multitask, Multidomain Fine-Tuning

ByT5-Sanskrit is fine-tuned jointly across tasks including word segmentation (splitting compounds and sandhi), lemmatization, and morphosyntactic tagging, using data from the Digital Corpus of Sanskrit (601,403 sentences, with domain diversity spanning philosophy, mathematics, Vedic ritual, and more) (Nehrdich et al., 2024). Each task is serialized with a unique prefix and processed as a byte sequence. Fine-tuning setups employ AdamW (β₁=0.9, β₂=0.999, ε=1e-6), learning rate 3e-4 with warmup/decay, 512-sample batches, mixed precision, and task-balanced sampling for multitask scenarios.
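The task-prefix serialization can be sketched as follows. The exact prefix strings are hypothetical, since the source only specifies that each task is serialized with a unique prefix before byte encoding:

```python
# Hypothetical prefix strings -- the paper specifies unique per-task
# prefixes but these exact labels are illustrative.
TASK_PREFIXES = {
    "segmentation": "S ",
    "lemmatization": "L ",
    "tagging": "T ",
}

def serialize(task, sentence):
    """Prepend the task prefix and encode the whole string to UTF-8
    bytes, mirroring the multitask fine-tuning input format."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return list((TASK_PREFIXES[task] + sentence).encode("utf-8"))

ex = serialize("segmentation", "dharmakṣetre kurukṣetre")
```

Because the prefix is just more bytes, one shared encoder–decoder handles all tasks without task-specific heads.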

For poetry-to-prose (anvaya) tasks, task-specific prefixes and domain-driven parallel corpora are used (Das et al., 11 Nov 2025); for Vedic accent restoration, a parallel corpus of accented–unaccented ślokas is utilized, with full fine-tuning or LoRA-based parameter-efficient adjustment (P et al., 28 Nov 2025).
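The LoRA alternative mentioned above can be illustrated with a minimal NumPy sketch (not the actual fine-tuning code or the peft library's API): a frozen weight matrix is augmented with a trainable low-rank update, so far fewer parameters are updated than in full fine-tuning:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen weight W plus a trainable low-rank
    update B @ A, scaled by alpha / r. With B initialized to zero, the
    layer starts out identical to the frozen pretrained layer."""
    def __init__(self, W, r=2, alpha=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))  # trainable down-projection
        self.B = np.zeros((W.shape[0], r))               # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

W = np.eye(8)                  # stand-in for a pretrained weight
layer = LoRALinear(W, r=2)
x = np.ones((1, 8))
y = layer(x)                   # identical to the frozen layer before training
```

Only A and B (2 × r × d parameters per matrix pair) are updated, which is the memory-efficiency trade-off against full fine-tuning noted in the accent-restoration experiments.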

4. Evaluation Metrics and Empirical Results

Table 1 details representative benchmark results for ByT5-Sanskrit on core Sanskrit NLP and related MRL tasks.

| Task/Benchmark | ByT5-Sanskrit | Comparison Baseline | Metric(s) | Source |
|---|---|---|---|---|
| Word Segmentation (DCS 2018) | 90.11% PM | rcNN-SS: 85.2% | Perfect Match (%) | (Nehrdich et al., 2024) |
| Hackathon Segmentation | 94.29% PM | TransLIST: 85.47% | Perfect Match (%) | (Nehrdich et al., 2024) |
| OCR Post-correction (Maheshwari) | 2.69% CER | ByT5-Small: 2.98% | CER, WER | (Nehrdich et al., 2024) |
| Vedic Dependency Parsing (ALL) | 89.04% UAS | Biaffine: 86.86% | UAS, LAS | (Nehrdich et al., 2024) |
| Poetry-to-Prose BLEU (MBh Test) | 38.63 | Phi4-14B (IFT): 33.12 | BLEU, Kendall's Tau | (Das et al., 11 Nov 2025) |
| Accent Placement DER (Rigveda) | 0.0685 | BiLSTM-CRF: 0.3197 | DER, CER, WER | (P et al., 28 Nov 2025) |

ByT5-Sanskrit consistently attains state-of-the-art results: in word segmentation, it surpasses rcNN-SS and matches or exceeds lexicon-driven models; in OCR post-correction, it achieves a relative 23% drop in both CER and WER compared to the best previous byte-level baselines (Maheshwari et al., 2022, Nehrdich et al., 2024). In Rigvedic accent restoration, full ByT5 fine-tuning yields the lowest diacritic error rate (DER = 0.0685) and outperforms both LoRA-based parameter-efficient variants and BiLSTM-CRF approaches (P et al., 28 Nov 2025). For poetry-to-prose conversion, ByT5-Sanskrit outperforms instruction-tuned LLMs (e.g., Phi4-14B) by ≈5 BLEU points, and cross-domain generalization remains robust, indicating genuine rule assimilation (Das et al., 11 Nov 2025).
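The character and word error rates quoted above are edit-distance-based; a minimal implementation of both, as commonly defined:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions,
    substitutions), via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: character edits / reference length."""
    return levenshtein(list(hyp), list(ref)) / len(ref)

def wer(hyp, ref):
    """Word error rate: word-level edits / reference word count."""
    return levenshtein(hyp.split(), ref.split()) / len(ref.split())
```

The papers' exact tokenization of "characters" (e.g., bytes vs. code points vs. akṣaras) may differ; this sketch uses Unicode code points for CER and whitespace tokens for WER.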

5. Analysis: Error Patterns, Generalization, and Ablation

Comprehensive error analysis reveals the following:

  • In OCR correction, ByT5+SLP1 substantially reduces boundary (−44.45%), mātra/diacritic (−37.74%), and character confusion (−14.16%) errors compared to raw OCR (Maheshwari et al., 2022).
  • For multitask fine-tuning on segmentation, lemmatization, and morphosyntax, ablation shows that removing multitask signals drops sentence-level perfect match scores by 1–2 points, confirming beneficial cross-task transfer (Nehrdich et al., 2024).
  • On the DCS, 53.3% of analyzed errors are attributable to gold-data faults or ambiguities, with ByT5-Sanskrit able to “correct” some ground truth annotations, thus serving as an implicit validator (Nehrdich et al., 2024).
  • Accent placement models confirm that context-dependent accent diacritics require large context and mark-aware input, as BiLSTM-CRF has moderate WER/CER but fails on DER (P et al., 28 Nov 2025).
  • In poetry-to-prose, Kendall’s Tau strongly correlates with human expert judgments (ρ ≈ 0.82), while BLEU is less aligned with structural fidelity (Das et al., 11 Nov 2025).
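Kendall's Tau for word-order evaluation can be computed over position pairs as below; this is one standard formulation, and the paper's exact variant (e.g., tie handling) may differ:

```python
def kendall_tau(order):
    """Kendall's tau between a predicted word order and the reference:
    order[i] is the reference position of the i-th output word.
    tau = (concordant - discordant) / total pairs, ranging over [-1, 1]."""
    n = len(order)
    pairs = n * (n - 1) // 2
    discordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if order[i] > order[j]
    )
    return (pairs - 2 * discordant) / pairs

# a perfectly ordered prediction scores 1.0; a fully reversed one, -1.0
```

Because it scores relative ordering rather than n-gram overlap, it is better suited than BLEU to measuring the structural fidelity of an anvaya rearrangement.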

6. Impact and Downstream Applications

ByT5-Sanskrit is integral to a range of Sanskrit NLP pipelines, from word segmentation, lemmatization, and morphosyntactic tagging to OCR post-correction, Vedic accent restoration, and digital philology workflows.

Practical recommendations for deployment include Unicode-normalization (NFD) for accent-mark preservation, adoption of pure byte-level tokenization to avoid OOV and byte-length inconsistencies, and preference for full fine-tuning in accuracy-critical tasks, while LoRA tuning offers memory-efficient alternatives (P et al., 28 Nov 2025). The open-source release of code, pretrained models, and multitask checkpoints facilitates reproducibility (Nehrdich et al., 2024).

7. Significance for MRLs and General NLP

ByT5-Sanskrit demonstrates that byte-level pretrained models outperform subword-tokenizer-based systems and match or exceed lexicon-dependent models for languages with high inflection, compounding, and phoneme–grapheme transparency (Nehrdich et al., 2024, Maheshwari et al., 2022). The unified architecture supports robust generalization across scripts, genres, and annotation schemes, and its design philosophy—eschewing external tokenizers and lexica—positions it as a transferable paradigm for other under-resourced, morphologically rich languages. Its success in joint multitask settings, cross-linguistic lemmatization, and structure-sensitive generation tasks further establishes the byte-level Seq2Seq approach as foundational for modern NLP in complex language environments (Nehrdich et al., 2024, Das et al., 11 Nov 2025).
