GanitLLM: Bengali Mathematical Reasoning LLM
- GanitLLM is a Bengali LLM for mathematical reasoning that employs chain-of-thought supervised fine-tuning to generate native, multi-step solutions.
- It integrates Group Relative Policy Optimization with LoRA and curriculum-based difficulty sampling to optimize both numeric accuracy and language fidelity.
- The model is trained on a rigorously curated Bengali math corpus, achieving state-of-the-art accuracy while reducing verbose, translation-based outputs.
GanitLLM is a Bengali mathematical reasoning LLM and training corpus, jointly designed to advance mathematical problem solving in Bengali through a rigorous difficulty-aware dataset and a curriculum-based policy optimization pipeline. Built on the Qwen3 series of transformer models, it achieves state-of-the-art performance on multi-step Bengali math tasks by explicitly optimizing for native-language reasoning fidelity, concise stepwise explanations, and generalization under sparse rewards in a low-resource setting. GanitLLM marks the first demonstration, at small and mid-scale parameter counts, of substantial improvements in both end-to-end accuracy and the proportion of native-language tokens for a low-resource mathematics LLM (Dipta et al., 11 Jan 2026).
1. Model Architecture and Modifications
GanitLLM is constructed atop the Qwen3 family of transformer LLMs, initially centered on the Qwen3-4B variant but demonstrated across smaller (0.6B, 1.7B) and larger (8B–32B) configurations. Three principal architectural and training augmentations are introduced:
- Chain-of-Thought Supervised Fine-Tuning (CoT-SFT): The model is fine-tuned on stepwise Bengali mathematical explanations, facilitating alignment of its chain-of-thought (CoT) outputs with native-language reasoning patterns.
- Group Relative Policy Optimization (GRPO) with LoRA: GRPO reinforcement learning is adapted to low-resource settings by incorporating Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, targeting both numerical correctness and Bengali language fidelity.
- Curriculum-Based Difficulty-Aware Sampling: Training is scheduled from easiest to hardest problems using difficulty labels, addressing the sparse-reward challenge in low-resource domains.
This design enables GanitLLM to produce concise, Bengali-native, multi-step solutions, in contrast to earlier LLMs that predominantly reason in English or generate verbose, translation-based traces (Dipta et al., 11 Jan 2026).
2. Bengali Math Corpus Construction
The Ganit corpus underpins GanitLLM's training pipeline and is assembled through a multi-stage process:
- Seed Data Sources: Aggregates ∼1.5 million Bengali math items from sources such as SOMADHAN, DL Sprint 3.0, mCoT-MATH-bn, NuminaMath-CoT-bn, s1k-Bangla, and community contributions.
- Manual and Rule-Based Filtering: Only items with >95% human-verified correctness are retained; filtering eliminates multiple-choice and non-numeric-answer samples, and ensures ≥99% Bengali script content.
- Deduplication and Decontamination: Two-stage fuzzy string matching (3-gram, 70% threshold) and MinHash (50% threshold) remove duplicates and prevent leakage from evaluation benchmarks.
- Difficulty Tagging (pass@k): Each item is solved by Qwen3-32B (32 samples, T=0.7). Problems are binned by number of successes (k): Easy (25–32), Medium (17–24), Hard (9–16), Olympiad (1–8), producing objective and automatic difficulty buckets.
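The first-stage fuzzy matcher described above (3-gram, 70% threshold) can be sketched as a greedy character-n-gram Jaccard filter; this is a minimal illustration of the stated thresholds, not the paper's exact implementation (which also applies MinHash in a second stage):

```python
def char_ngrams(text, n=3):
    """Set of character n-grams for a whitespace-normalized string."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(items, threshold=0.70, n=3):
    """Greedy first-stage dedup: drop any item whose 3-gram Jaccard
    similarity to an already-kept item exceeds the 70% threshold."""
    kept, kept_grams = [], []
    for item in items:
        grams = char_ngrams(item, n)
        if all(jaccard(grams, g) <= threshold for g in kept_grams):
            kept.append(item)
            kept_grams.append(grams)
    return kept
```

The same similarity check against held-out benchmark items would implement the decontamination step.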
Data splits for training, RL verification, and a held-out development set are configured to balance difficulty, ensuring robust generalization beyond trivial or memorized items (Dipta et al., 11 Jan 2026).
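The pass@k binning above maps directly to a lookup over success counts; a minimal sketch (the `"Unsolved"` label for zero-success items is an assumption, since the four published buckets start at k=1):

```python
def difficulty_bucket(successes: int, samples: int = 32) -> str:
    """Map the number of correct Qwen3-32B samples (out of 32 at T=0.7)
    to the Ganit difficulty buckets: Easy (25-32), Medium (17-24),
    Hard (9-16), Olympiad (1-8)."""
    if not 0 <= successes <= samples:
        raise ValueError("successes must lie in [0, samples]")
    if successes >= 25:
        return "Easy"
    if successes >= 17:
        return "Medium"
    if successes >= 9:
        return "Hard"
    if successes >= 1:
        return "Olympiad"
    return "Unsolved"  # assumption: zero-success items fall outside the buckets
```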
3. Curriculum-GRPO Training Pipeline
The training procedure is a two-stage pipeline:
- CoT-SFT Stage: Full-parameter supervised fine-tuning (50 epochs, batch size 64) on step-annotated Bengali math data, targeting explicit, interpretable CoT solutions.
- GRPO Reinforcement Learning: LoRA-based policy optimization (rank 16, 5 epochs, group size 8, DAPO loss, temperature 1.0) further adapts the model, informed by token-level length regularization and output-formatting constraints.
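GRPO's defining step is computing advantages relative to a sampling group rather than a learned value function: each prompt is answered by a group of completions (group size 8 here), and each completion's reward is z-scored against its own group. A minimal sketch of that normalization, under the standard GRPO formulation:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own sampling group (size 8 in
    GanitLLM's configuration); eps guards the zero-variance case."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When all completions in a group earn the same reward (the sparse-reward failure mode on hard problems), every advantage is zero and the group contributes no gradient signal, which is exactly the "cold-start" problem the curriculum addresses.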
Curriculum-based sampling proceeds from easy to hard: 60% of samples per batch are drawn from the current difficulty bucket, 40% from others, with a soft ordering to maintain reward signal while preventing overfitting (Dipta et al., 11 Jan 2026).
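The 60/40 soft curriculum described above can be sketched as a per-batch sampler; the function and bucket names are illustrative, not from the paper:

```python
import random

def sample_curriculum_batch(buckets, current, batch_size=64,
                            in_bucket_frac=0.60, rng=None):
    """Draw a training batch with ~60% of items from the current
    difficulty bucket and ~40% spread over the remaining buckets,
    preserving reward signal while avoiding overfitting to one level."""
    rng = rng or random.Random()
    n_current = round(batch_size * in_bucket_frac)
    others = [x for name, items in buckets.items()
              if name != current for x in items]
    batch = rng.choices(buckets[current], k=n_current)
    batch += rng.choices(others, k=batch_size - n_current)
    rng.shuffle(batch)
    return batch
```

Advancing `current` from Easy through Olympiad over training recovers the easy-to-hard schedule.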
Reward Function:
Each solution receives a reward $R = r_{\text{format}} + r_{\text{answer}} + r_{\text{lang}}$, where:
- $r_{\text{format}}$: correct <answer> tag formatting
- $r_{\text{answer}}$: correct numeric answer (+1 bonus if given in Bengali)
- $r_{\text{lang}}$: ≥80% of reasoning characters in Bengali script
This explicit, verifiable reward structure fosters solutions that are factually, linguistically, and formally correct (Dipta et al., 11 Jan 2026).
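The three components above can be checked programmatically; the sketch below assumes unit weights for each component (the published weights did not survive extraction) and treats everything outside the <answer> tag as reasoning text:

```python
import re

BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")  # Unicode Bengali block
BN_DIGITS = str.maketrans("0123456789", "০১২৩৪৫৬৭৮৯")

def reward(solution: str, gold: str) -> float:
    """Minimal sketch of the three-part verifiable reward,
    assuming a weight of 1.0 per component."""
    r = 0.0
    m = re.search(r"<answer>(.*?)</answer>", solution, re.DOTALL)
    if m:
        r += 1.0  # r_format: well-formed <answer> tag
        pred = m.group(1).strip()
        # r_answer: numeric match, +1 bonus for Bengali numerals
        if pred in (gold, gold.translate(BN_DIGITS)):
            r += 1.0
            if BENGALI_CHAR.search(pred):
                r += 1.0
    # r_lang: >=80% of reasoning letters in Bengali script
    reasoning = re.sub(r"<answer>.*?</answer>", "", solution, flags=re.DOTALL)
    letters = [c for c in reasoning if c.isalpha()]
    if letters and sum(bool(BENGALI_CHAR.match(c)) for c in letters) / len(letters) >= 0.8:
        r += 1.0
    return r
```

Because every component is a deterministic string check, the reward is fully verifiable: no learned reward model is needed, which is what makes it usable in a low-resource RL loop.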
4. Evaluation, Benchmarking, and Performance
Evaluation is conducted on Bn-MGSM and Bn-MSVAMP, with metrics spanning exact accuracy, average solution length, and Bengali script percentage. Results (Qwen3-4B base vs. GanitLLM-4B):
| Metric | Qwen3-4B | GanitLLM-4B | Δ |
|---|---|---|---|
| Bn-MGSM Accuracy | 69.2% | 76.8% | +7.6 |
| Bn-MSVAMP Accuracy | 70.5% | 76.4% | +5.9 |
| Avg. Words/Solution | 943 | 193 | –79.5% |
| Bengali Reasoning Tokens | 14.8% | 88.7% | +73.9 |
Smaller variants (1.7B, 0.6B) outperform their Qwen3 counterparts and reach >50% accuracy on Bn-MSVAMP, demonstrating scalability and effectiveness in low-resource, low-parameter regimes (Dipta et al., 11 Jan 2026).
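The three reported metrics (exact accuracy, average words per solution, Bengali-script share) can be aggregated with a short evaluation helper; function names and the letters-only definition of script share are illustrative assumptions:

```python
def bengali_char_share(text: str) -> float:
    """Fraction of alphabetic characters in the Unicode Bengali block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0980" <= c <= "\u09FF" for c in letters) / len(letters)

def evaluate(predictions, answers, outputs):
    """Aggregate the benchmark metrics over a set of solutions:
    predictions/answers are final numeric strings, outputs are
    the full generated solution texts."""
    acc = sum(p == a for p, a in zip(predictions, answers)) / len(answers)
    avg_words = sum(len(o.split()) for o in outputs) / len(outputs)
    bn_share = sum(bengali_char_share(o) for o in outputs) / len(outputs)
    return {"accuracy": acc, "avg_words": avg_words, "bengali_share": bn_share}
```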
5. Comparative Analysis, Insights, and Limitations
Several ablations and analysis points are reported:
- CoT-SFT vs. GRPO: SFT alone yields high Bengali fidelity but modest accuracy; GRPO alone boosts accuracy but reverts to verbose, partially English reasoning. Combining the two achieves both high task accuracy and concise, Bengali-native outputs.
- Curriculum Efficiency: Curriculum-GRPO converges 3.8–5.6× faster than vanilla GRPO, mitigating the "cold-start" reward collapse typical when sampling hard, unlearnable problems early in RL training.
- Robust Generalization: Improvements are observed across all held-out difficulty buckets. The approach generalizes to unseen, challenging Bengali math problems beyond the specific data used for tuning.
Limitations include:
- Language-specific recipes: The methods are optimized for Bengali, and their cross-linguistic portability is not established.
- Heuristic thresholds: Difficulty tagging (pass@k) and Bengali proportion criteria may introduce subtle biases.
- Reward function simplicity: More nuanced or learned difficulty/reasoning metrics could further advance model robustness.
- Unexplored extensions: Dynamic curricula and adaptive reward scaling remain unexplored and are identified as future research directions (Dipta et al., 11 Jan 2026).
6. Significance and Broader Impact
GanitLLM is the first Bengali LLM to demonstrate substantial (+7–8 point) accuracy improvements over strong base models while nearly eliminating verbose English translation artifacts and achieving >88% Bengali-native reasoning tokens. It addresses key challenges unique to low-resource mathematical reasoning, such as reward sparsity, format adherence, and language fidelity, through a principled, reproducible training and evaluation pipeline. GanitLLM’s methodological contributions—including curriculum-based GRPO and verifiable reward design—inform not only Bengali NLP but also broader efforts in multilingual and low-resource LLM alignment (Dipta et al., 11 Jan 2026).