Sanskrit Word Segmentation (SWS)
- Sanskrit Word Segmentation (SWS) is the process of restoring word boundaries in sandhied Sanskrit text by undoing phonological transformations (sandhi).
- Techniques range from lexicon-driven FST candidate generation and morphological validation to advanced transformer models achieving up to 94% perfect-match rates.
- Effective segmentation underpins downstream NLP tasks such as morphological analysis, syntactic parsing, and digital manuscript processing for robust language technology.
Sanskrit Word Segmentation (SWS) is the computational task of restoring word boundaries in sandhied Sanskrit text, where classical phonological and morphophonemic processes (sandhi) obscure or erase explicit word delimiters. Accurate SWS is crucial as a precursor to morphological analysis, syntactic parsing, information retrieval, and downstream NLP tasks. This article surveys the formal problem definition, linguistic complexity, algorithmic paradigms, evaluation standards, state-of-the-art architectures—including recent transformer innovations—and ongoing challenges in the field.
1. Linguistic Background and Problem Characterization
Written Sanskrit lacks obligatory word boundary markers; phonological rules (sandhi) at morpheme and word junctions induce insertions, deletions, fusions, and substitutions of phonemes. For example, rāma + iti → rāmeti, vidyā + ālayaḥ → vidyālayaḥ, or punaḥ + api → punarapi. Over 281 classical sandhi rules have been cataloged. This leads to three principal challenges for automatic word segmentation:
- Boundary Obfuscation: Sandhi may merge or split characters at the boundary, so word boundaries are neither explicit nor token-invariant.
- Combinatorial Explosion: A single phonological surface may admit a combinatorially large number of valid segmentations (e.g., “gardabhaśca” yields 625 candidate splits).
- Semantic Ambiguity: Many segmentations are locally valid but syntactically or semantically ill-formed in context.
SWS explicitly requires not only detecting boundaries but also “undoing” sandhi, i.e., restoring the original lexical forms and syntactic units from surface-merged phoneme strings (Sandhan et al., 2022, Sandhan, 2023).
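As a minimal illustration of why undoing sandhi enlarges the search space, candidate splits can be enumerated by inverting sandhi rules at every position. The tiny rule table below is a hypothetical toy, not the full cataloged inventory, and real systems additionally validate every chunk against a lexicon:

```python
# Toy sandhi inversion: enumerate candidate two-way splits of a surface
# string by undoing a tiny, illustrative subset of vowel-sandhi rules.
# Each entry maps a surface phoneme to possible (left-word ending,
# right-word beginning) pairs that could have fused into it.
INVERSE_RULES = {
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],  # a/ā + i/ī -> e
    "o": [("a", "u"), ("a", "ū"), ("ā", "u"), ("ā", "ū")],  # a/ā + u/ū -> o
}

def candidate_splits(surface):
    """Return (left_word, right_word) candidates for a single sandhi locus."""
    out = []
    for pos, ch in enumerate(surface):
        # Plain split with no phonological change at this boundary.
        if pos > 0:
            out.append((surface[:pos], surface[pos:]))
        # Splits that undo a fused vowel at this position.
        for left_end, right_start in INVERSE_RULES.get(ch, []):
            out.append((surface[:pos] + left_end, right_start + surface[pos + 1:]))
    return out

# rāma + iti -> rāmeti: the correct split appears among the candidates,
# alongside spurious but phonetically valid ones like (rām, eti).
cands = candidate_splits("rāmeti")
print(("rāma", "iti") in cands)  # True
```

Even this two-rule toy produces several candidates per locus; with hundreds of rules and multiple loci per sentence, the candidate set grows multiplicatively, which is exactly the combinatorial explosion described above.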
2. Classical FST and Rule-based Segmentation
Prior to data-driven methods, the predominant SWS techniques were lexicon-driven, finite-state transducer (FST) systems, exemplified by the Sanskrit Heritage Reader (SHR) (Krishnan et al., 2020). The process is as follows:
- Candidate Generation: The FST enumerates all phonetically valid splits by inverting sandhi rules at every locus in the surface input, yielding a set of candidate chunkings.
- Morphological Validation: Each candidate chunk is checked for validity using a morphological lexicon; only splits where all chunks are valid forms survive.
- Transition and Phase Constraints: Each chunk is tagged with phase features (e.g., compound component type) and sandhi transition types.
- Ranking: Updated FST-based systems employ statistical ranking schemes. Krishnan & Kulkarni introduce a product-of-products (POP) model, assigning to each segmentation sequence $w_1, \dots, w_n$ the score $\prod_i P_w(w_i) \cdot \prod_j P_t(t_j)$, where $P_w$, $P_t$ are word- and transition-probabilities computed from frequency lists (Krishnan et al., 2020).
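A product-of-products ranker of this kind can be sketched in a few lines. The frequency-derived probability tables below are hypothetical stand-ins for real corpus counts, and scores are computed in the log domain to avoid underflow:

```python
import math

# Sketch of product-of-products (POP) ranking: score a candidate segmentation
# as (product of word probabilities) x (product of transition probabilities).
# The probability tables are hypothetical stand-ins for real frequency lists.
WORD_PROB = {"rāma": 0.02, "iti": 0.05, "rām": 0.001, "eti": 0.004}
TRANS_PROB = {("a+i", "e"): 0.6, ("none", "none"): 0.3}  # sandhi-transition types

def pop_score(words, transitions, floor=1e-9):
    """Log-domain POP score: sum of log word- and transition-probabilities."""
    s = sum(math.log(WORD_PROB.get(w, floor)) for w in words)
    s += sum(math.log(TRANS_PROB.get(t, floor)) for t in transitions)
    return s

# The linguistically plausible split outscores the implausible one.
good = pop_score(["rāma", "iti"], [("a+i", "e")])
bad = pop_score(["rām", "eti"], [("none", "none")])
print(good > bad)  # True
```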
Lexicon-driven segmenters achieve high coverage and interpretability but are brittle on out-of-vocabulary (OOV) items and lack robustness to corpus or genre variation. Typical top-1 accuracy ranges from 53.5% to 89.3% (POP ranking), and top-3 accuracy reaches 98.3% (Krishnan et al., 2020).
3. Data-Driven and Neural Models
The 2010s saw a paradigm shift to data-driven neural models that treat SWS as a sequence transduction problem:
- Sequence-to-Sequence (Seq2Seq) Models: LSTM encoder-decoder frameworks map sandhied (input) to unsandhied, space-delimited (output) sequences. Inputs and outputs are over subword units (via SentencePiece) or character sequences, with the decoder learning to emit boundaries and reverse sandhi (Reddy et al., 2018, Dave et al., 2020).
- Edit-operation Sequence Labeling: Predicts, for each character, a label (copy/insert boundary/apply sandhi repair rule), allowing joint modeling of segmentation and sandhi reversal (Li et al., 2022).
- Double Decoder Architectures: Predicts split locations and then reconstructs each constituent morpheme, decoupling the two sub-tasks (location and rewrite) (Aralikatte et al., 2018).
- Energy-Based Graph Models: Constructs a segmentation graph from all SHR-generated candidates; edge scores are learned via path ranking or deep energy-based models, supporting global inference and integration of morphological cues (Krishna et al., 2018).
Neural approaches no longer require explicit sandhi rule inversion at inference and are more robust to OOV items and genre drift. State-of-the-art F1 scores have exceeded 90% for token-level boundaries (Reddy et al., 2018), with further gains from transformer components.
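The edit-operation sequence-labeling view can be made concrete with a toy decoder. The label set here (COPY, SPLIT, and rewrite-and-split labels such as "SPLIT:a+i") is an illustrative assumption, not the inventory of any published model:

```python
# Sketch of the edit-operation sequence-labeling view of SWS: each input
# character receives a label such as COPY (emit as-is), SPLIT (insert a
# boundary before the character), or a rewrite label like "SPLIT:a+i"
# (undo sandhi on this character and insert a boundary between the parts).
# The label inventory is illustrative; a trained tagger would predict it.

def apply_labels(surface, labels):
    """Reconstruct unsandhied, space-delimited text from per-character labels."""
    out = []
    for ch, lab in zip(surface, labels):
        if lab == "COPY":
            out.append(ch)
        elif lab == "SPLIT":
            out.append(" " + ch)
        elif lab.startswith("SPLIT:"):          # e.g. "SPLIT:a+i" rewrites ch
            left, right = lab.split(":")[1].split("+")
            out.append(left + " " + right)
        else:
            raise ValueError(f"unknown label {lab}")
    return "".join(out)

# rāmeti -> "rāma iti": the fused vowel 'e' carries the rewrite-and-split label.
labels = ["COPY", "COPY", "COPY", "SPLIT:a+i", "COPY", "COPY"]
print(apply_labels("rāmeti", labels))  # rāma iti
```

This framing turns joint segmentation and sandhi reversal into ordinary per-character classification, which is what makes it amenable to standard sequence-labeling architectures.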
4. Transformer-Based and Lexicon-Augmented Approaches
Recent breakthroughs are attributed to transformer-based segmenters, in particular the hybrid strategies represented by TransLIST and pure byte-level models such as ByT5-Sanskrit and CharSS:
4.1 TransLIST: Transformer + Linguistic Lattice
- Architecture: A transformer encoder augmented with soft-masked attention (SMA), biasing attention towards SHR’s candidate spans (“latent words”) (Sandhan et al., 2022, Sandhan, 2023, Sandhan et al., 2023).
- Input: Hybrid representation combining input characters and candidate word spans from SHR or, where not available, all character n-grams up to length 4.
- Soft-Masked Attention: Attention weights encourage focus on candidate spans overlapping each query position, via learned or positional encodings.
- Path Ranking: After contextual encoding, a path-level Viterbi-style search scores all valid candidate segmentations, with a post-processing function combining model log-likelihoods and a character-level LM perplexity penalty.
- Evaluation: Achieves Perfect-Match (PM) rates up to 93.97% on the SIGHUM benchmark, a +7.2 percentage point gain over prior baselines (Sandhan et al., 2022).
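The soft-masking idea can be sketched schematically in NumPy: instead of hard-masking, attention logits receive an additive bias toward key positions that fall inside a candidate span covering the query. The bias value, span format, and shapes below are simplifying assumptions, not the published TransLIST parameterization:

```python
import numpy as np

# Schematic sketch of soft-masked attention (SMA): raw attention logits are
# biased (not hard-masked) toward candidate word spans overlapping the query
# position, so lexical hypotheses guide but do not constrain attention.

def soft_masked_attention(q, k, v, spans, bias=2.0):
    """q, k, v: (T, d) arrays; spans: list of (start, end) candidate spans."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)            # standard scaled dot-product logits
    soft_mask = np.zeros((T, T))
    for s, e in spans:                       # for queries inside a span, boost
        for i in range(s, e):                # attention to keys in that span
            soft_mask[s:e, i] = bias
    weights = np.exp(logits + soft_mask)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over key positions
    return weights @ v

rng = np.random.default_rng(0)
T, d = 6, 8
q = k = v = rng.normal(size=(T, d))
out = soft_masked_attention(q, k, v, spans=[(0, 4), (4, 6)])  # e.g. SHR spans
print(out.shape)  # (6, 8)
```

Because the bias is additive rather than a hard mask, positions outside candidate spans still receive nonzero weight, which preserves robustness when the candidate lattice is incomplete.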
4.2 Byte-Level Transformers (ByT5, CharSS)
- ByT5-Sanskrit: Pretrained byte-level T5 model, fully data-driven, fine-tuned on IAST transliterations. Achieves 90.11%, 93.83%, and 94.29% PM on DCS, SIGHUM, and Hackathon splits, respectively (Nehrdich et al., 2024).
- CharSS (ByT5 base): Fine-tuned on raw SLP1 bytes with standard cross-entropy; reaches a location accuracy (LPA) of 97.2 and split accuracy (SPAcc) of 93.5 on UoH+SandhiKosh, and a PM of 93.78% on SIGHUM (J et al., 2024).
- Lexicon Integration: Linguistically-informed prefixing—passing candidates from SHR as prompts—improves performance further, matching or surpassing TransLIST PM (J et al., 2024, Sandhan, 2023).
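Byte-level modeling sidesteps Sanskrit-specific tokenization entirely: the model consumes raw UTF-8 bytes, so IAST diacritics simply become multi-byte sequences. A quick illustration of what such an input looks like (illustrative only, not the actual ByT5 preprocessing code):

```python
# ByT5-style byte-level input: the model operates on raw UTF-8 bytes, so no
# Sanskrit-specific tokenizer or fixed vocabulary is required. Characters
# with diacritics simply expand to multi-byte sequences.
text = "rāmeti"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # 'ā' (U+0101) expands to two bytes: 196, 129
print(len(byte_ids))  # 7 bytes for 6 characters
```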
Compared with earlier strategies, these transformer models combine strong recall and robust generalization (to OOV items, noise, and domain shift) with, when hybridized, the precise hypotheses generated by linguistic analyzers.
5. Evaluation Standards and Benchmarking
SWS systems are consistently evaluated using:
- Token-level Precision/Recall/F₁: Measures alignment of predicted and gold word boundaries.
- Perfect-Match (PM): Fraction of sentences where every boundary is predicted exactly.
- Split-accuracy and Location-accuracy: Especially relevant for sandhi splitting on compounds (Aralikatte et al., 2018, Dave et al., 2020).
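These metrics are straightforward to compute. In the sketch below, token-level scoring uses exact multiset overlap between predicted and gold tokens, a simplifying assumption; published evaluations typically align tokens positionally:

```python
# Sketch of standard SWS metrics: token-level precision/recall/F1 over
# predicted words, and sentence-level Perfect Match (PM).
from collections import Counter

def token_prf(pred_tokens, gold_tokens):
    """Precision, recall, F1 via exact token multiset overlap."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    p = overlap / len(pred_tokens) if pred_tokens else 0.0
    r = overlap / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def perfect_match(preds, golds):
    """Fraction of sentences whose entire segmentation is exactly right."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

p, r, f1 = token_prf(["rāma", "eti"], ["rāma", "iti"])
print(round(f1, 2))  # 0.5
pm = perfect_match([["rāma", "iti"], ["vidyā", "ālayaḥ"]],
                   [["rāma", "iti"], ["vidyālayaḥ"]])
print(pm)  # 0.5
```

Note how much stricter PM is than token F1: a single wrong boundary anywhere in a sentence zeroes out that sentence's contribution, which is why PM numbers in the table below run well below the corresponding token-level scores.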
Typical datasets include the Digital Corpus of Sanskrit (DCS) main split (Dave et al., 2020, Reddy et al., 2018), the SIGHUM and Hackathon splits (≈90k–100k sentences) (Sandhan et al., 2022, Sandhan, 2023, J et al., 2024, Nehrdich et al., 2024), and compound-specific corpora for sandhi splitting.
Recent leaders, their main approaches, and best PM results are summarized:
| Model | Method | PM (SIGHUM) | PM (Hackathon) | PM (DCS) |
|---|---|---|---|---|
| rcNN-SS [2018] | Char RNN/CNN | 87.08% | 77.62% | 85.2% |
| TransLIST | Transformer+SHR | 93.97% | 85.47% | – |
| ByT5-Sanskrit | Byte-level Trf. | 93.83% | 94.29% | 90.11% |
| CharSS (ByT5 base) | Byte-level Trf. | 93.78% | 87.7% | – |
(Sandhan et al., 2022, Sandhan, 2023, Nehrdich et al., 2024, J et al., 2024)
6. Error Analysis, Ablations, and Open Challenges
Error analysis reveals that state-of-the-art models are limited primarily by:
- Ambiguity: Multiple grammatically valid segmentations exist for a given surface string. Data-driven models tend to select the statistically most frequent, leading to genuine disagreements with gold annotations (Nehrdich et al., 2024, J et al., 2024).
- Rare or Archaic Forms: Underrepresented splits or rare sandhi/morpheme sequences are a persistent error category.
- Sentence Length: Sequential (recurrent) models degrade on long inputs, whereas transformer and energy-based graph models degrade more gracefully (Sandhan et al., 2022).
- Lexical/OOV: Byte-level models are robust to new character strings, but lack explicit constraints to ensure all predicted segments are lexically attested, which hybrid architectures such as TransLIST enforce via lattice restriction (J et al., 2024, Sandhan et al., 2022).
- Component Importance: Ablation studies confirm that soft-masked attention (SMA), candidate lattice inputs (LIST), and post-hoc path ranking are all critical; removing any of these reduces PM by up to 10 points (Sandhan, 2023, Sandhan et al., 2022).
Persistent open challenges include handling extended compounds (internal sandhi chains), integrating semantic plausibility, extending to non-Devanagari scripts, and improving cross-lingual transfer, especially for low-resource Indic relatives (Sandhan et al., 2022, J et al., 2024).
7. Practical Usage, Applications, and Resources
SWS forms the backbone of modern Sanskrit NLP pipelines for morphological parsing, syntactic analysis, translation, and digital manuscript processing. Modular open-source tools and APIs (e.g., SanskritShala (Sandhan et al., 2023), TransLIST (Sandhan et al., 2022), ByT5-Sanskrit (Nehrdich et al., 2024), CharSS (J et al., 2024)) facilitate:
- Web Interfaces: Real-time segmentation via REST APIs, with user overrides and annotation modes (Sandhan et al., 2023).
- Corpus Preprocessing: Building and validating gold-tagged resources, handling OCR noise, and segmenting text for linguistic annotation (Krishnan et al., 2020, Nehrdich et al., 2024).
- Downstream NLP: Input to joint models for dependency parsing, lemmatization, machine translation, and technical content adaptation across Indian languages (J et al., 2024, Sandhan, 2023).
Best-practice workflows combine high-coverage candidate generation (rule/FST), neural re-ranking or decoding, and lexicon-informed postprocessing. Code, pretrained models, and benchmarks are generally available for all recent systems.
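The best-practice workflow just described can be sketched as a three-stage pipeline. All component functions here are hypothetical placeholders standing in for SHR, a trained neural re-ranker, and a morphological lexicon, respectively, not the APIs of any real tool:

```python
# Hypothetical pipeline sketch of the best-practice SWS workflow:
# rule/FST candidate generation -> neural scoring -> lexicon-aware selection.

def generate_candidates(surface):
    """Stage 1: high-coverage candidate segmentations (stand-in for an FST)."""
    return [["rāma", "iti"], ["rām", "eti"]]

def neural_score(candidate):
    """Stage 2: stand-in for a neural re-ranker's log-probability."""
    toy_lm = {"rāma": -1.0, "iti": -1.2, "rām": -4.0, "eti": -3.5}
    return sum(toy_lm.get(w, -10.0) for w in candidate)

def lexicon_filter(candidates, lexicon):
    """Stage 3: keep only candidates whose every chunk is lexically attested."""
    return [c for c in candidates if all(w in lexicon for w in c)]

def segment(surface, lexicon):
    cands = lexicon_filter(generate_candidates(surface), lexicon)
    return max(cands, key=neural_score) if cands else [surface]

print(segment("rāmeti", {"rāma", "iti", "rām", "eti"}))  # ['rāma', 'iti']
```

The division of labor mirrors the survey's conclusion: the rule/FST stage guarantees coverage and interpretability, the neural stage supplies contextual disambiguation, and the lexicon stage enforces that every emitted segment is attested.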
References:
- (Sandhan et al., 2022) TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer
- (Sandhan, 2023) Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit
- (Sandhan et al., 2023) SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes
- (Nehrdich et al., 2024) One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks
- (J et al., 2024) LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages
- (Krishnan et al., 2020) Sanskrit Segmentation Revisited
- (Krishnan et al., 2020) Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus
- (Aralikatte et al., 2018) Sanskrit Sandhi Splitting using seq2(seq)²
- (Reddy et al., 2018) Building a Word Segmenter for Sanskrit Overnight
- (Krishna et al., 2018) Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit
- (Dave et al., 2020) Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language
- (Li et al., 2022) Word Segmentation and Morphological Parsing for Sanskrit