
ByT5-Sanskrit: Byte-Level Model for Sanskrit

Updated 13 February 2026
  • ByT5-Sanskrit is a byte-level Transformer model designed for Sanskrit, leveraging raw UTF-8 tokenization to capture its unique orthographic and morphological features.
  • It employs span corruption pretraining on over 5.43 billion IAST-byte characters, achieving superior performance in tasks like word segmentation, OCR correction, and dependency parsing.
  • Its multitask fine-tuning across diverse Sanskrit NLP tasks enables robust generalization, making it valuable for applications such as accent restoration, lemmatization, and digital philology.

ByT5-Sanskrit is a byte-level, encoder–decoder Transformer LLM specifically pretrained and fine-tuned for natural language processing tasks in Sanskrit, a morphologically rich and historically significant, but digitally low-resource, Indo-European language. Leveraging raw UTF-8 byte tokenization and large-scale romanized (IAST/SLP1) Sanskrit corpora, ByT5-Sanskrit provides a unified, deployment-efficient foundation for a spectrum of downstream computational linguistics tasks: from word segmentation, lemmatization, and morphosyntactic tagging to dependency parsing, OCR post-correction, accent restoration in Vedic texts, and poetry-to-prose transformation. The model and its variant baselines establish new state-of-the-art results by avoiding the limitations of subword tokenization and lexicon-driven approaches, demonstrating that byte-level models are particularly well suited for morphologically complex, highly compounding, inflectional languages such as Sanskrit (Nehrdich et al., 2024, P et al., 28 Nov 2025, Maheshwari et al., 2022, Das et al., 11 Nov 2025).

1. Model Architecture and Pretraining

ByT5-Sanskrit adopts the “ByT5-Base” Transformer specification (12 encoder and 12 decoder layers, d_model = 768, d_ff = 3072, 12 attention heads, learned positional embeddings), with ∼582M parameters (Nehrdich et al., 2024). It processes raw Unicode byte sequences (a 256-token vocabulary, no subword splits), which allows complete coverage of Sanskrit’s orthographic and phonetic diversity. Pretraining employs the “span corruption” (denoising/infilling) objective, where random spans are replaced by sentinels and the model reconstructs the original sequence:

$$\mathcal{L}_{\mathrm{pretrain}} = -\,\mathbb{E}_{x}\,\mathbb{E}_{\tilde{x}\sim \mathrm{corrupt}(x)}\;\sum_{t=1}^{L}\log p_{\theta}\bigl(x_t \mid \tilde{x}\bigr)$$

The pretraining corpus consists of ≈5.43 billion IAST-byte characters, sourced from IndicLLMSuite, GRETIL, and the Digital Sanskrit Buddhist Canon (Nehrdich et al., 2024). Preprocessing includes script transliteration to IAST (or SLP1 for some baselines), exclusion of synthetic splits, and strict byte-encoding, with Unicode normalization (NFD) for diacritic–combining mark separation where required (P et al., 28 Nov 2025).
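The span-corruption objective above can be sketched directly on raw bytes. The sentinel id values and span-sampling scheme below are simplified illustrations, not ByT5's exact vocabulary layout or corruption recipe:

```python
import random

def span_corrupt(byte_ids, span_len=3, n_spans=2, sentinel_start=259, seed=0):
    """Toy span-corruption objective: replace random non-overlapping spans
    of byte ids with sentinel ids; the target lists each sentinel followed
    by the bytes it replaced. Sentinel ids here are illustrative and do
    not reproduce ByT5's actual special-token layout."""
    rng = random.Random(seed)
    ids = list(byte_ids)
    gap = len(ids) // n_spans
    # one span per segment keeps the sampled spans non-overlapping
    starts = [i * gap + rng.randrange(gap - span_len) for i in range(n_spans)]
    corrupted, target, prev = [], [], 0
    for k, s in enumerate(starts):
        sentinel = sentinel_start + k
        corrupted += ids[prev:s] + [sentinel]   # drop the span, mark with sentinel
        target += [sentinel] + ids[s:s + span_len]  # target reproduces the span
        prev = s + span_len
    corrupted += ids[prev:]
    return corrupted, target

byte_ids = list("dharmakṣetre kurukṣetre".encode("utf-8"))
inp, tgt = span_corrupt(byte_ids)
```

The decoder is trained to emit the dropped spans, each introduced by its sentinel, which is exactly the denoising/infilling loss written above.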

2. Tokenization Strategies and Phonetic Normalization

All input strings are romanized to IAST or SLP1 before byte-tokenization. The SLP1-based variant (ByT5+SLP1) is notable for post-OCR correction: Devanāgarī input is transliterated into an ASCII-encoded SLP1 sequence that makes all phonetic/orthographic distinctions explicit, then split into bytes (Maheshwari et al., 2022). For accent restoration in Vedic Sanskrit, Unicode Normalization Form D (NFD) separates base characters from accent-mark combining bytes, enabling the model to learn accent placement in a “mark-aware” fashion (P et al., 28 Nov 2025). This pure byte-level approach eliminates out-of-vocabulary and byte-length variability, supporting high type-token ratios and cross-domain robustness (Nehrdich et al., 2024, Maheshwari et al., 2022).
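The effect of NFD normalization on byte-level input can be seen directly with Python's standard library; the example word is illustrative:

```python
import unicodedata

def to_bytes_nfd(text):
    """Normalize to NFD so precomposed characters split into base
    character + combining mark, then encode to raw UTF-8 bytes -- the
    only "tokens" a byte-level model ever sees."""
    return list(unicodedata.normalize("NFD", text).encode("utf-8"))

word = unicodedata.normalize("NFC", "agní")   # precomposed: í is one code point
nfc_bytes = list(word.encode("utf-8"))        # í encodes as 2 UTF-8 bytes
nfd_bytes = to_bytes_nfd(word)                # i (1 byte) + combining acute (2 bytes)
```

Under NFD the accent mark becomes its own byte subsequence, so the model can learn to predict accent placement separately from the base character, which is the "mark-aware" behavior described above.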

3. Multitask, Multidomain Fine-Tuning

ByT5-Sanskrit is fine-tuned jointly across tasks including word segmentation (splitting compounds and sandhi), lemmatization, and morphosyntactic tagging, using data from the Digital Corpus of Sanskrit (601,403 sentences, with domain diversity spanning philosophy, mathematics, Vedic ritual, and more) (Nehrdich et al., 2024). Each task is serialized with a unique prefix and processed as a byte sequence. Fine-tuning setups employ AdamW (β₁=0.9, β₂=0.999, ε=1e-6), learning rate 3e-4 with warmup/decay, 512-sample batches, mixed precision, and task-balanced sampling for multitask scenarios.
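The task-prefix serialization can be sketched as follows. The exact prefix strings are hypothetical, since the source only specifies that each task is serialized with a unique prefix before byte encoding:

```python
# Hypothetical prefix strings -- the paper specifies unique per-task
# prefixes but these exact labels are illustrative.
TASK_PREFIXES = {
    "segmentation": "S ",
    "lemmatization": "L ",
    "tagging": "T ",
}

def serialize(task, sentence):
    """Prepend the task prefix and encode the whole string to UTF-8
    bytes, mirroring the multitask fine-tuning input format."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return list((TASK_PREFIXES[task] + sentence).encode("utf-8"))

ex = serialize("segmentation", "dharmakṣetre kurukṣetre")
```

Because the prefix is just more bytes, one shared encoder–decoder handles all tasks without task-specific heads.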

For poetry-to-prose (anvaya) tasks, task-specific prefixes and domain-driven parallel corpora are used (Das et al., 11 Nov 2025); for Vedic accent restoration, a parallel corpus of accented–unaccented ślokas is utilized, with full fine-tuning or LoRA-based parameter-efficient adjustment (P et al., 28 Nov 2025).
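The LoRA alternative mentioned above can be illustrated with a minimal NumPy sketch (not the actual fine-tuning code or the peft library's API): a frozen weight matrix is augmented with a trainable low-rank update, so far fewer parameters are updated than in full fine-tuning:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen weight W plus a trainable low-rank
    update B @ A, scaled by alpha / r. With B initialized to zero, the
    layer starts out identical to the frozen pretrained layer."""
    def __init__(self, W, r=2, alpha=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))  # trainable down-projection
        self.B = np.zeros((W.shape[0], r))               # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

W = np.eye(8)                  # stand-in for a pretrained weight
layer = LoRALinear(W, r=2)
x = np.ones((1, 8))
y = layer(x)                   # identical to the frozen layer before training
```

Only A and B (2 × r × d parameters per matrix pair) are updated, which is the memory-efficiency trade-off against full fine-tuning noted in the accent-restoration experiments.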

4. Evaluation Metrics and Empirical Results

Table 1 details representative benchmark results for ByT5-Sanskrit on core Sanskrit NLP and related MRL tasks.

| Task/Benchmark | ByT5-Sanskrit | Comparison Baseline | Metric(s) | Source |
|---|---|---|---|---|
| Word Segmentation (DCS 2018) | 90.11% PM | rcNN-SS: 85.2% | Perfect Match (%) | (Nehrdich et al., 2024) |
| Hackathon Segmentation | 94.29% PM | TransLIST: 85.47% | Perfect Match (%) | (Nehrdich et al., 2024) |
| OCR Post-correction (Maheshwari) | 2.69% CER | ByT5-Small: 2.98% | CER, WER | (Nehrdich et al., 2024) |
| Vedic Dependency Parsing (ALL) | 89.04% UAS | Biaffine: 86.86% | UAS, LAS | (Nehrdich et al., 2024) |
| Poetry-to-Prose BLEU (MBh Test) | 38.63 | Phi4-14B (IFT): 33.12 | BLEU, Kendall's Tau | (Das et al., 11 Nov 2025) |
| Accent Placement DER (Rigveda) | 0.0685 | BiLSTM-CRF: 0.3197 | DER, CER, WER | (P et al., 28 Nov 2025) |

ByT5-Sanskrit consistently attains state-of-the-art results: in word segmentation, it surpasses rcNN-SS and matches or exceeds lexicon-driven models; in OCR post-correction, it achieves a relative 23% drop in both CER and WER compared to the best previous byte-level baselines (Maheshwari et al., 2022, Nehrdich et al., 2024). In Rigvedic accent restoration, full ByT5 fine-tuning yields the lowest diacritic error rate (DER = 0.0685) and outperforms both LoRA-based parameter-efficient variants and BiLSTM-CRF approaches (P et al., 28 Nov 2025). For poetry-to-prose conversion, ByT5-Sanskrit outperforms instruction-tuned LLMs (e.g., Phi4-14B) by ≈5 BLEU points, and cross-domain generalization remains robust, indicating genuine rule assimilation (Das et al., 11 Nov 2025).
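The character and word error rates quoted above are edit-distance-based; a minimal implementation of both, as commonly defined:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions,
    substitutions), via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: character edits / reference length."""
    return levenshtein(list(hyp), list(ref)) / len(ref)

def wer(hyp, ref):
    """Word error rate: word-level edits / reference word count."""
    return levenshtein(hyp.split(), ref.split()) / len(ref.split())
```

The papers' exact tokenization of "characters" (e.g., bytes vs. code points vs. akṣaras) may differ; this sketch uses Unicode code points for CER and whitespace tokens for WER.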

5. Analysis: Error Patterns, Generalization, and Ablation

Comprehensive error analysis reveals the following:

  • In OCR correction, ByT5+SLP1 substantially reduces boundary (−44.45%), mātra/diacritic (−37.74%), and character confusion (−14.16%) errors compared to raw OCR (Maheshwari et al., 2022).
  • For multitask fine-tuning on segmentation, lemmatization, and morphosyntax, ablation shows that removing multitask signals drops sentence-level perfect match scores by 1–2 points, confirming beneficial cross-task transfer (Nehrdich et al., 2024).
  • On the DCS, 53.3% of analyzed errors are attributable to gold-data faults or ambiguities, with ByT5-Sanskrit able to “correct” some ground truth annotations, thus serving as an implicit validator (Nehrdich et al., 2024).
  • Accent placement models confirm that context-dependent accent diacritics require large context and mark-aware input, as BiLSTM-CRF has moderate WER/CER but fails on DER (P et al., 28 Nov 2025).
  • In poetry-to-prose, Kendall’s Tau strongly correlates with human expert judgments (ρ ≈ 0.82), while BLEU is less aligned with structural fidelity (Das et al., 11 Nov 2025).
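Kendall's Tau for word-order evaluation can be computed over position pairs as below; this is one standard formulation, and the paper's exact variant (e.g., tie handling) may differ:

```python
def kendall_tau(order):
    """Kendall's tau between a predicted word order and the reference:
    order[i] is the reference position of the i-th output word.
    tau = (concordant - discordant) / total pairs, ranging over [-1, 1]."""
    n = len(order)
    pairs = n * (n - 1) // 2
    discordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if order[i] > order[j]
    )
    return (pairs - 2 * discordant) / pairs

# a perfectly ordered prediction scores 1.0; a fully reversed one, -1.0
```

Because it scores relative ordering rather than n-gram overlap, it is better suited than BLEU to measuring the structural fidelity of an anvaya rearrangement.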

6. Impact and Downstream Applications

ByT5-Sanskrit is integral to a range of Sanskrit NLP pipelines, from word segmentation, lemmatization, and morphosyntactic tagging to OCR post-correction, Vedic accent restoration, and digital philology workflows.

Practical recommendations for deployment include Unicode-normalization (NFD) for accent-mark preservation, adoption of pure byte-level tokenization to avoid OOV and byte-length inconsistencies, and preference for full fine-tuning in accuracy-critical tasks, while LoRA tuning offers memory-efficient alternatives (P et al., 28 Nov 2025). The open-source release of code, pretrained models, and multitask checkpoints facilitates reproducibility (Nehrdich et al., 2024).

7. Significance for MRLs and General NLP

ByT5-Sanskrit demonstrates that byte-level pretrained models outperform subword-tokenizer-based systems and match or exceed lexicon-dependent models for languages with high inflection, compounding, and phoneme–grapheme transparency (Nehrdich et al., 2024, Maheshwari et al., 2022). The unified architecture supports robust generalization across scripts, genres, and annotation schemes, and its design philosophy—eschewing external tokenizers and lexica—positions it as a transferable paradigm for other under-resourced, morphologically rich languages. Its success in joint multitask settings, cross-linguistic lemmatization, and structure-sensitive generation tasks further establishes the byte-level Seq2Seq approach as foundational for modern NLP in complex language environments (Nehrdich et al., 2024, Das et al., 11 Nov 2025).
