L3Cube-MahaNLP: Marathi NLP Ecosystem
- L3Cube-MahaNLP is an open-source ecosystem for Marathi NLP that consolidates extensive datasets, transformer models, and a modular Python library for research and deployment.
- Datasets span from monolingual corpora to code-mixed texts, supporting tasks like sentiment analysis, NER, QA, and news classification with rigorous benchmarks.
- The framework leverages efficiency techniques such as pruning, distillation, and domain adaptation to enable sustainable, high-performance NLP deployments.
L3Cube-MahaNLP is an open-source ecosystem encompassing large-scale datasets, pretrained and task-specific transformer models, and a modular Python library for Marathi Natural Language Processing. Conceived to address the persistent resource scarcity for Marathi—a major Indic language—MahaNLP integrates comprehensive data curation, efficient modeling, and practical deployment strategies, with a focus on both end-user usability and extensible research infrastructure. It is underpinned by the L3Cube Pune mentorship program and is continuously expanded to support an increasing range of NLP tasks pertinent to the Marathi linguistic landscape.
1. Foundational Datasets and Benchmarks
The L3Cube-MahaNLP framework systematically targets the full spectrum of supervised and unsupervised NLP tasks through meticulously curated datasets:
- MahaCorpus: The core monolingual corpus for pretraining consists of 24.8 million sentences (289M tokens), expanded further to 57.2M sentences (752M tokens) by incorporating additional sources. Newswire and web texts are combined, deduplicated, Unicode-normalized, and cleaned to ensure broad lexical and topical coverage (Joshi, 2022).
- MahaSent/MahaHate/MahaNER/MahaParaphrase: Supervised datasets for sentiment (MahaSent: 15.8K tweets), hate/offense (MahaHate: 25K tweets, 4 labels), NER (MahaNER: 25K sentences, 8 entity types in both IOB and non-IOB labeling), and paraphrase detection (MahaParaphrase: 8K sentence pairs labeled paraphrase/non-paraphrase, P/NP) (Joshi, 2022, Velankar et al., 2022, Patil et al., 2022, Jadhav et al., 2025).
- MahaNews: The largest supervised news classification resource (108,643 records across headline, paragraph, and full-document variants with 12 fine-grained categories) facilitates research into length-dependent behaviors of transformer models (Mittal et al., 2024).
- MahaSQuAD: A full-scale SQuAD 2.0 translation (118,516/11,873/11,803 for train/dev/test; plus a 500-sample gold test set) with rigorous span mapping and answer alignment, enabling extractive QA in Marathi (Ghatage et al., 2024).
- Code-mixed and Social Media Resources: MeCorpus (10M Marathi-English sentences), MeSent, MeHate, MeLID, and MahaSocialNER datasets support code-mixing and informal-domain processing (Chavan et al., 2023, Chaudhari et al., 2023). HateEval-Mr and SHC/LDC/LPC news splits support benchmarking for hate detection and domain adaptation.
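The corpus cleaning steps described for MahaCorpus (Unicode normalization, whitespace cleanup, deduplication) can be sketched in a few lines; `clean_corpus` below is an illustrative helper, not the published pipeline:

```python
import unicodedata

def clean_corpus(sentences):
    """Illustrative corpus cleaning: Unicode NFC normalization,
    whitespace collapsing, and exact-duplicate removal."""
    seen = set()
    cleaned = []
    for s in sentences:
        s = unicodedata.normalize("NFC", s)  # canonical Devanagari form
        s = " ".join(s.split())              # collapse internal whitespace
        if s and s not in seen:              # drop blanks and exact repeats
            seen.add(s)
            cleaned.append(s)
    return cleaned

raw = ["नमस्कार  जग", "नमस्कार जग", "  ", "मराठी भाषा"]
print(clean_corpus(raw))
```

Production-scale pipelines typically add near-duplicate detection and language filtering on top of this exact-match pass.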
These datasets underpin the rigorous evaluation of monolingual, bilingual, and multilingual models, as well as code-mixed pipelines. Annotation protocols emphasize majority voting, inter-annotator adjudication, and, when possible, domain-specific guideline adaptation. For stopword curation, TF-IDF-based frequency analysis combined with human assessment yields a validated 400-word list integrated into preprocessing (Chavan et al., 2024).
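The TF-IDF-based stopword curation can be sketched as a frequency-weighted ranking: words with high corpus frequency but low inverse document frequency surface as stopword candidates. The scoring function below is illustrative only; the released 400-word list additionally passed human assessment:

```python
import math
from collections import Counter

def stopword_candidates(docs, top_k=3):
    """Rank words so that high-frequency, low-IDF words
    (i.e., words spread across most documents) score highest."""
    n_docs = len(docs)
    tf = Counter(w for doc in docs for w in doc.split())   # corpus frequency
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))                        # document frequency
    # +1.0 keeps the denominator positive when idf is zero (word in every doc)
    scores = {w: tf[w] / (math.log(n_docs / df[w]) + 1.0) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = ["आणि अ ब", "आणि क", "आणि ड इ"]   # "आणि" (and) appears in every document
print(stopword_candidates(docs, top_k=1))
```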
2. Model Architectures, Pretraining, and Efficient Inference
L3Cube-MahaNLP develops and evaluates a diverse ensemble of transformer- and shallow-model architectures:
- Monolingual Transformers: MahaBERT (BERT-base, 12 layers × 768 hidden, ~110–238M params), MahaRoBERTa, MahaALBERT, and a GPT-2-style model (MahaGPT) are pretrained on MahaCorpus via masked or causal language modeling. Monolingual models systematically outperform multilingual baselines (mBERT, IndicBERT, XLM-RoBERTa, MuRIL) across classification, NER, and QA, with gains of 0.5–3% in macro-F₁ or accuracy (Joshi, 2022, Patil et al., 2022, Ghatage et al., 2024).
- Code-mixed and Tweet-Domain Models: MeBERT, MeRoBERTa (for code-mixed text) and MahaTweetBERT (on 40M tweets) employ further MLM adaptation, achieving high macro-F₁ on hate detection and sentiment (Chavan et al., 2023, Gokhale et al., 2022).
- Shallow Baselines: FastText-based CNN/LSTM/BiLSTM architectures serve as computationally efficient alternatives for all tasks. Injecting BERT subword tokenization into these shallow models yields measurable F₁ improvements (CNN + MahaBERT tokenizer: +2.6 F₁ over a word-based CNN for NER) (Chaudhari et al., 2023).
- Paraphrase and QA Models: Pairwise transformers (sentence pairs encoded as [CLS] s₁ [SEP] s₂) are fine-tuned with cross-entropy for paraphrase detection; QA models fine-tuned on MahaSQuAD perform span prediction with a softmax over context positions (Ghatage et al., 2024, Jadhav et al., 2025).
- Pruning, Distillation, and Mixed Precision: Efficiency optimizations combine block movement pruning (typically 25–50% sparsity), knowledge distillation (teacher: full MahaBERT, student: pruned/dense), and AMP-based mixed-precision training. On MahaNews, 25% pruning plus distillation yields 2.56× speedup, 55% CO₂e emission reduction, and ≥99.7% baseline accuracy with only a 0.25% drop (Mirashi et al., 2024).
| Model Variant | Params (M) | Speedup × | Accuracy (%) | CO₂e (kg) | Pruning | KD |
|---|---|---|---|---|---|---|
| Baseline | 238 | 1.00 | 92.43 | 0.00614 | - | - |
| 25% prune + distil | 223 | 2.56 | 92.18 | 0.00346 | 25% | yes |
| 50% prune + distil | 209 | 2.32 | 91.49 | 0.00330 | 50% | yes |
| 75% prune | 195 | 2.25 | 90.11 | 0.00134 | 75% | no |
The trade-off curve enables SLA-driven deployment: up to 50% block sparsity with distillation maximizes efficiency while retaining robust accuracy (Mirashi et al., 2024).
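The distillation objective behind these numbers can be sketched as the standard soft-target formulation: a temperature-scaled KL term against the teacher plus a hard-label cross-entropy term. The temperature and weighting values below are illustrative defaults, not the hyperparameters reported by Mirashi et al.:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Weighted sum of soft-target KL divergence (teacher vs. student at
    temperature T, scaled by T^2) and hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * (T * T) * kl + (1 - alpha) * ce
```

When the (pruned) student matches the teacher's logits exactly, the KL term vanishes and only the weighted hard-label loss remains; during training the KL term pulls the student toward the teacher's full output distribution.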
3. Task Coverage and Evaluation Protocols
The mahaNLP ecosystem provides standardized pipelines and benchmarks for:
- Sentiment Analysis: Three-way (+1, 0, -1) classification leveraging MahaSent, tested via macro-F₁/accuracy.
- Hate Speech Detection: Four-way and binary (hate vs. other) using MahaHate, HateEval-Mr, and HASOC, evaluated with per-class F₁, confusion matrices, and error diagnostics. Domain adaptation and annotation diversity (hateful, non-hateful, random pretraining subsets) are empirically compared (Gokhale et al., 2022, Velankar et al., 2022).
- Named Entity Recognition: 8-type IOB/non-IOB schemes on MahaNER and MahaSocialNER; transformers surpass shallow models by ~6 F₁, and fine-tuning on in-domain social media NER data is critical for informal/colloquial text (Patil et al., 2022, Chaudhari et al., 2023).
- News Topic Classification: SHC, LPC, LDC in MahaNews offer 12-way headline–paragraph–article tasks; MahaBERT leads multilingual baselines by up to 2% macro-F₁ (Mittal et al., 2024).
- Paraphrase Detection: Out-of-domain and lexical-overlap-bucketed evaluation of sentence pairs, with MahaBERT achieving 88.7% F₁ (Jadhav et al., 2025).
- QA: EM, F₁, and BLEU on MahaSQuAD; span alignment and translation-specific challenges addressed by a string similarity algorithm for cross-lingual answer mapping (Ghatage et al., 2024).
- Information Retrieval: TF-IDF and stopword curation, validated on downstream classification and sentiment benchmarks (Chavan et al., 2024).
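The cross-lingual answer mapping for QA can be illustrated with a window-based string-similarity matcher that locates the best-matching span of the translated answer inside the translated context. This is a sketch using `difflib`, not the exact MahaSQuAD algorithm:

```python
from difflib import SequenceMatcher

def align_answer(context, answer):
    """Slide a window of len(answer) over the context and return the
    (start, end, score) of the most similar character span."""
    n = len(answer)
    if n == 0 or n > len(context):
        return -1, -1, 0.0
    best_score, best_start = 0.0, 0
    for start in range(len(context) - n + 1):
        score = SequenceMatcher(None, context[start:start + n], answer).ratio()
        if score > best_score:
            best_score, best_start = score, start
    return best_start, best_start + n, best_score

ctx = "मुंबई ही महाराष्ट्राची राजधानी आहे"
ans = "महाराष्ट्राची"
start, end, score = align_answer(ctx, ans)
```

A fixed-width window misses spans whose translated length differs from the answer's; a fuller implementation would also search nearby window sizes.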
All metrics are provided as reproducible Python scripts and/or integrated Hugging Face pipelines.
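As a reference point, the macro-F₁ metric used throughout these benchmarks can be reproduced in a few lines; the minimal implementation below matches scikit-learn's `f1_score(average='macro')` in the common case where labels are taken from the union of true and predicted values:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally regardless of support, macro-F₁ penalizes models that ignore rare classes, which matters for imbalanced label sets such as MahaHate's four-way scheme.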
4. Software Architecture and Integration
The mahaNLP Python library is partitioned into user-centric “standard flow” APIs (simple, high-level) and “model flow” practitioner APIs (with full Hugging Face/torch control). Key modules and usage patterns:
- Preprocessing: Dedicated modules for tokenizer, sentence segmentation, Unicode normalization, contraction handling, and stopword removal based on the curated list (Magdum et al., 2023, Chavan et al., 2024).
- Dataset and Model Loaders: Datasets exposed via Pandas/🤗 Datasets interfaces. Model wrappers expose task-specific methods and device/batch configuration.
- ML Pipelines: SentimentAnalyzer, HateModel, NERModel, GPTModel, MaskFillModel class wrappers. Examples span both plug-and-play utility and advanced hyperparameter tuning/configuration (Magdum et al., 2023).
- Extensibility: New tasks/modules added by subclassing TransformerPipeline; all source code released at https://github.com/l3cube-pune/MarathiNLP.
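The subclassing pattern for adding a task can be sketched as follows. `TransformerPipelineBase`, `ToyTopicPipeline`, and the keyword-rule "model" are stand-ins for illustration only; the actual mahaNLP base class wraps a fine-tuned Hugging Face transformer:

```python
class TransformerPipelineBase:
    """Stand-in for a pipeline base class: holds a model identifier
    and defines the predict() contract subclasses must implement."""
    def __init__(self, model_name):
        self.model_name = model_name

    def predict(self, text):
        raise NotImplementedError

class ToyTopicPipeline(TransformerPipelineBase):
    """Illustrative new task added by subclassing; a keyword rule
    stands in for a fine-tuned classification head."""
    KEYWORDS = {"क्रिकेट": "Sports", "निवडणूक": "Politics"}

    def predict(self, text):
        for kw, label in self.KEYWORDS.items():
            if kw in text:
                return label
        return "Other"

pipe = ToyTopicPipeline("marathi-topic-model")  # hypothetical model id
```

The point of the pattern is that downstream code depends only on the `predict()` contract, so swapping the toy rule for a real transformer head changes nothing at call sites.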
5. Domain Adaptation, Cross-Lingual, and Code-Mixed Modeling
MahaNLP is explicitly designed for robust performance under realistic usage conditions:
- Code-mixed: MeCorpus, MeSent, MeHate, MeLID and the associated MeBERT/MeRoBERTa models handle mixed Marathi–English text, encompassing both Devanagari and Roman script. They outperform generic and even abusive/pseudolabel-based MuRIL and IndicBERT baselines (Chavan et al., 2023).
- Domain transfer: Social NER (MahaSocialNER) and hate benchmarks confirm that zero-shot performance on informal registers is substantially degraded unless in-domain data is included in fine-tuning (Chaudhari et al., 2023).
- Paraphrase and QA transfer: Span alignment and cross-lingual techniques provide a general recipe for porting English resources to Marathi (and other Indic languages) (Ghatage et al., 2024).
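Handling both scripts in code-mixed text starts with knowing which script each token is in; this can be illustrated with a character-range check (a heuristic sketch for preprocessing, not the MeLID classifier itself):

```python
def script_of(token):
    """Classify a token as Devanagari, Roman, or Other using
    Unicode character ranges (Devanagari block: U+0900..U+097F)."""
    if any('\u0900' <= ch <= '\u097F' for ch in token):
        return "Devanagari"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "Roman"
    return "Other"

tokens = "mala he गाणं khup आवडलं".split()
tags = [script_of(t) for t in tokens]
```

Romanized Marathi tokens ("mala", "khup") are tagged Roman here just like English ones; distinguishing the two languages within Roman script is exactly the harder problem the MeLID dataset targets.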
6. Impact, Best Practices, and Future Directions
L3Cube-MahaNLP sets the state of the art for Marathi NLP by providing a unified resource and benchmarking hub. Monolingual transformer models consistently surpass multilingual analogs due to vocabulary fit, domain adaptation, and parameter focus; code-mixed models leverage both scripts and lead on code-mixed and cross-lingual benchmarks.
- Best practices: Prefer monolingual MahaBERT/RoBERTa for in-domain tasks; employ shallow+tokenizer hybrids for low-latency/edge; prune/distil for efficient deployment; validate preprocessing via downstream benchmarks.
- Sustainability: Efficiency techniques can cut compute cost and carbon footprint by over 50% at trivial accuracy loss (Mirashi et al., 2024).
- Extensibility: The framework enables easy expansion to other Indic languages, longer-input modeling, joint multitask learning, and semi-supervised or crowd-sourced data growth.
The platform’s release, permissive licensing, and Hugging Face integration lower the entry barrier for both academic and applied research in Marathi NLP, laying the groundwork for continued development in machine translation, dialog systems, summarization, and low-resource cross-lingual approaches (Magdum et al., 2023, Mittal et al., 2024).