Bengali Math Corpus: Low-Resource Benchmark

Updated 18 January 2026

Bengali Math Corpus is a rigorously curated dataset of math word problems and solutions in Bengali, annotated with difficulty levels.
It blends human-authored and LLM-translated problems from multiple sources and applies stringent linguistic and numeric filtering protocols.
The corpus underpins Bengali-native LLM training and curriculum-based reinforcement learning, enhancing mathematical reasoning in low-resource settings.

A Bengali Math Corpus is a rigorously curated, large-scale dataset of mathematical word problems and their solutions written in Bengali, accompanied by automatic or human-generated difficulty labels and filtered for linguistic and numeric correctness. Such corpora are foundational for developing models that reason natively in Bengali rather than resorting to translation or code-switching. They serve as benchmarks and training data for low-resource mathematical language modeling, mathematical education tools, and automated assessment research.

1. Motivation and Significance

Robust Bengali math corpora have become essential due to the limitations of existing LLMs when applied to mathematical reasoning in underrepresented languages. Pretrained LLMs exhibit strong performance on English mathematical tasks, yet fail on multi-step Bengali mathematical queries, particularly when reinforcement learning pipelines optimized for high-resource languages break down under reward sparsity (Dipta et al., 11 Jan 2026). The need for high-fidelity Bengali math datasets arises from:

The desire to evaluate and advance Bengali-native mathematical reasoning abilities in LLMs.
The requirements of curriculum-based training pipelines where difficulty-aware sampling maximizes learning efficiency.
The necessity to disentangle true model reasoning capability from translation artifacts.

A comprehensive Bengali math corpus not only addresses computational linguistics research for low-resource settings but enables advancements in educational technology, automated tutoring, and assessment for the Bengali-speaking community.

2. Corpus Construction and Quality Control

The development of a Bengali Math Corpus necessitates a multi-step pipeline with stringent quality control:

2.1 Data Sources

Human-authored datasets: e.g., DL Sprint 3.0, Shomikoron, Bangla-Math.
Human- or LLM-translated versions of popular English math datasets: mCoT-MATH-bn, NuminaMath-CoT-bn, s1k-Bangla, SOMADHAN (GSM8K→bn).
LLM-translated data using systems such as NuminaMath-Bangla.
Synthetic augmentation: paraphrased variants for robustness.

2.2 Filtering and Decontamination

Retention of only those subsets achieving >95% correctness on 100-sample manual evaluations.
Rule-based linguistic filtering: ≥99% Bengali characters in the problem statement, exclusion of multiple-choice forms.
Numeric verifiability: only problems possessing definite numeric gold answers are retained.
Deduplication through fuzzy string matching (70% Levenshtein, 3-gram) and MinHash fingerprinting.
Decontamination against important evaluation sets (e.g., Bn-MGSM, Bn-MSVAMP) to eliminate test-train leakage.
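
The rule-based steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for any of the cited corpora: the record fields (`problem`, `answer`), the helper names, the way the Bengali-character ratio is counted, and the use of character 3-gram Jaccard similarity as the fuzzy-matching measure are all assumptions made for the example. Multiple-choice detection and MinHash fingerprinting are left out.

```python
import math

BN_TO_ASCII_DIGITS = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")


def in_bengali_block(ch: str) -> bool:
    """True if the character lies in the Unicode Bengali block (U+0980 to U+09FF)."""
    return 0x0980 <= ord(ch) <= 0x09FF


def bengali_ratio(text: str) -> float:
    """Share of script characters that are Bengali.

    Assumption: whitespace, punctuation, and ASCII digits are ignored, so only
    letters (any script) and Bengali-block characters are counted.
    """
    relevant = [ch for ch in text if ch.isalpha() or in_bengali_block(ch)]
    if not relevant:
        return 0.0
    return sum(in_bengali_block(ch) for ch in relevant) / len(relevant)


def parse_numeric(answer: str):
    """Return the answer as a finite float, or None if it is not a plain number."""
    cleaned = answer.strip().translate(BN_TO_ASCII_DIGITS).replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    return len(a & b) / len(a | b)


def filter_and_dedup(records, min_bengali=0.99, dup_threshold=0.70):
    """Keep Bengali, numerically verifiable, non-duplicate problems.

    The pairwise comparison is quadratic; at corpus scale a fingerprinting
    scheme such as MinHash would replace it.
    """
    kept = []  # list of (record, n-gram set) pairs
    for rec in records:
        if bengali_ratio(rec["problem"]) < min_bengali:
            continue
        if parse_numeric(rec["answer"]) is None:
            continue
        grams = char_ngrams(rec["problem"])
        if any(jaccard(grams, seen) >= dup_threshold for _, seen in kept):
            continue
        kept.append((rec, grams))
    return [rec for rec, _ in kept]
```

Decontamination against an evaluation set could reuse the same similarity check by comparing each candidate against the n-gram sets of the benchmark problems instead of the already-kept records.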

2.3 Dataset Statistics

An example from the Ganit corpus (Dipta et al., 11 Jan 2026):

Dataset	Size (samples)	Human Eval Correctness (%)
mCoT-MATH-bn	580,000	100
NuminaMath-CoT-bn	859,000	97
s1k-Bangla	1,000	96
DL Sprint 3.0 (Olympiad)	200	96
SOMADHAN (GSM8K→bn)	8,700	96

This multi-source approach yields initial seed sizes in excess of 1.5 million samples, which are subsequently filtered and partitioned for training and development.

3. Difficulty-Aware Annotation

A distinguishing feature of state-of-the-art Bengali math corpora is the integration of automatic difficulty tags. The Ganit dataset employs the following methodology (Dipta et al., 11 Jan 2026):

For each problem, a high-capacity evaluator LLM (Qwen3-32B) generates 32 completions. The number of successful solutions $c \in \{0, 1, \dots, 32\}$ defines the problem's difficulty.
Problems are bucketed as:
- Olympiad: $1 \leq c \leq 8$
- Hard: $9 \leq c \leq 16$
- Medium: $17 \leq c \leq 24$
- Easy: $25 \leq c \leq 32$
Problems for which $c=0$ (unsolved by all generations) are discarded from the training pool.
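
The bucketing rule above maps directly to a small function. The sketch below assumes the success count $c$ has already been computed for each problem; the function name and the choice to return `None` for discarded problems are illustrative, not taken from the Ganit codebase.

```python
from typing import Optional

NUM_COMPLETIONS = 32  # completions generated per problem by the evaluator LLM


def difficulty_bucket(c: int) -> Optional[str]:
    """Map the number of successful completions c to a difficulty label.

    Returns None when c == 0, meaning the problem is discarded from the
    training pool.
    """
    if not 0 <= c <= NUM_COMPLETIONS:
        raise ValueError(f"c must be between 0 and {NUM_COMPLETIONS}, got {c}")
    if c == 0:
        return None
    if c <= 8:
        return "Olympiad"
    if c <= 16:
        return "Hard"
    if c <= 24:
        return "Medium"
    return "Easy"
```

For example, a problem solved in 5 of 32 completions is labelled Olympiad, while one solved in 30 of 32 is labelled Easy.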

The resulting distributions (e.g., in Ganit-Dev: 28.74% Easy, 26.03% Medium, 24.31% Hard, 21.26% Olympiad) provide an explicit curriculum for difficulty-aware training and support robust stratified evaluation.
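
One way the difficulty labels can support stratified evaluation is to draw the same fraction of problems from every bucket, so a subset keeps the bucket proportions of the full set. The sketch below assumes each problem is a dictionary with a `bucket` field; that field name, the function name, and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_sample(problems, fraction, seed=0):
    """Draw roughly `fraction` of the problems from each difficulty bucket."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    by_bucket = defaultdict(list)
    for problem in problems:
        by_bucket[problem["bucket"]].append(problem)

    sample = []
    for bucket in sorted(by_bucket):
        items = by_bucket[bucket]
        # At least one problem per bucket, never more than the bucket holds.
        k = min(len(items), max(1, round(len(items) * fraction)))
        sample.extend(rng.sample(items, k))
    return sample
```

The same grouping could drive a training curriculum by iterating over the buckets in a chosen order instead of sampling from them.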

4. Integration with Model Training and Evaluation

High-quality Bengali math corpora underpin modern Bengali math LLMs, such as GanitLLM (Dipta et al., 11 Jan 2026) and Bangla Math AI (Tabib et al., 8 Jan 2025). Key roles include:

Supervised Fine-tuning: Step-by-step chain-of-thought annotated solutions ground the model’s reasoning in Bengali, reducing reliance on translation and code-switching.
Curriculum-based Reinforcement Learning: Difficulty-aware problem buckets control the progression of training, mitigating reward sparsity and supporting robust convergence through pipelines such as Curriculum-GRPO.
Reward Signal Computation: Numeric verifiability and language-specific correctness (e.g., output in Bengali digits) are calculated programmatically for policy optimization.
Benchmarks: Corpora filtered for decontamination and difficulty serve as fair evaluation sets (e.g., Bn-MGSM, Bn-MSVAMP, Ganit-Dev).
Error Analysis: Tracked failure cases (e.g., invalid diagrams, multi-step reasoning breakdowns, numeral mistranslation) expose weaknesses in both data and model.
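
The programmatic reward described above could look roughly like the following sketch. It is not the reward used by GanitLLM: the three components, their weights, the 0.8 Bengali-ratio threshold, and all function names are assumptions chosen to illustrate how numeric verifiability and language checks can be combined into a scalar.

```python
import math

BN_DIGITS = "০১২৩৪৫৬৭৮৯"
BN_TO_ASCII_DIGITS = str.maketrans(BN_DIGITS, "0123456789")


def bengali_ratio(text: str) -> float:
    """Share of letters and Bengali-block characters that are Bengali."""
    def in_block(ch):
        return 0x0980 <= ord(ch) <= 0x09FF
    relevant = [ch for ch in text if ch.isalpha() or in_block(ch)]
    if not relevant:
        return 0.0
    return sum(in_block(ch) for ch in relevant) / len(relevant)


def parse_numeric(text: str):
    """Parse a number written in Bengali or ASCII digits; None on failure."""
    cleaned = text.strip().translate(BN_TO_ASCII_DIGITS).replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def uses_bengali_digits(text: str) -> bool:
    """True if the text contains digits and all of them are Bengali digits."""
    digits = [ch for ch in text if ch.isdigit()]
    return bool(digits) and all(ch in BN_DIGITS for ch in digits)


def compute_reward(reasoning: str, final_answer: str, gold: float,
                   min_bengali: float = 0.8) -> float:
    """Combine correctness and language checks (weights are hypothetical)."""
    predicted = parse_numeric(final_answer)
    correct = predicted is not None and math.isclose(predicted, gold)
    in_bengali = bengali_ratio(reasoning) >= min_bengali
    bn_digits = uses_bengali_digits(final_answer)
    return 1.0 * correct + 0.25 * in_bengali + 0.25 * bn_digits
```

Under these hypothetical weights, a correct answer reasoned in English still earns the correctness term, while the language terms reward Bengali reasoning and Bengali-digit answers separately.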

The result is a significant increase in Bengali reasoning token share (from ≈14% to ≈89%), a reduction in verbose, English-dominated outputs, and substantial improvements in exact-match accuracy: +7.6 and +5.9 percentage points on Bn-MGSM and Bn-MSVAMP, respectively (Dipta et al., 11 Jan 2026).

5. Comparison with General Multilingual Math Datasets

Compared to general multilingual math datasets, Bengali math corpora show several distinctive elements:

Fine-Grained Filtering: Emphasis on numeric, format, and linguistic correctness far exceeds general math translation efforts.
Difficulty Tagging: Automated Pass@k-derived labels enable curriculum learning, rarely found in high-resource language datasets.
Low-Resource Focus: The corpus design directly addresses reinforcement learning collapse in low-resource mathematical domains.
Integration with Tool-Integrated Reasoning (TIR): In competing architectures, such as those by Tabib & Deedar (Tabib et al., 8 Jan 2025), datasets are augmented through paraphrasing, synthetic translation, and retrieval-augmented generation (RAG), which leverages a structured Bengali math corpus both as prompt conditioning and as a few-shot demonstration bank.
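
To illustrate how a corpus can act as a few-shot demonstration bank, the sketch below retrieves the most similar solved problems for a query and assembles them into a prompt. It is a toy under stated assumptions, not the retrieval system of Tabib & Deedar: character 3-gram Jaccard similarity stands in for whatever retriever is actually used, and the `problem`/`solution` fields, function names, and prompt layout are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    return len(a & b) / len(a | b)


def retrieve_demonstrations(query: str, corpus: list, k: int = 2) -> list:
    """Return the k corpus entries whose problems are most similar to the query."""
    query_grams = char_ngrams(query)
    ranked = sorted(
        corpus,
        key=lambda ex: jaccard(query_grams, char_ngrams(ex["problem"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, demonstrations: list) -> str:
    """Lay out solved demonstrations followed by the unsolved query."""
    parts = [
        f"সমস্যা: {demo['problem']}\nসমাধান: {demo['solution']}"
        for demo in demonstrations
    ]
    parts.append(f"সমস্যা: {query}\nসমাধান:")
    return "\n\n".join(parts)
```

A production retriever would more likely use dense embeddings, and when the bank overlaps with evaluation data the query itself must be excluded from the retrieved demonstrations.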

A plausible implication is that these methodologies could be extended to other low-resource mathematical domains where similar reward sparsity or code-switching issues hamper the development of local-language mathematical reasoning systems.

6. Limitations and Prospective Directions

Bengali Math Corpora, despite their scale and filtering, face several limitations (Dipta et al., 11 Jan 2026):

Evaluator Dependency: Difficulty buckets and correctness checking rely on a strong evaluator LLM, potentially cementing evaluator-specific biases in the corpus and downstream models.
Reward Heuristics: Character-ratio thresholds used to enforce Bengali reasoning can unfairly penalize mixed-language or transliterated solutions.
Domain Coverage: Focus is restricted to Bengali math word problems; generalization to symbolic math, diagram-based geometry, and other mathematical subfields remains open.
Human Annotation Scarcity: Automatic methods are dominant; increased human annotation, especially for Olympiad and hard buckets, may yield further improvements.

Suggested future directions include:

Developing stronger reward models that handle mixed-language and transliteration scenarios.
Expanding to diagram-assisted tasks, leveraging visual-generation multi-agent frameworks (Lee et al., 2024).
Porting methodologies to other underrepresented languages and mathematical domains.

7. Conclusion

A Bengali Math Corpus is an essential resource for the development, training, and evaluation of Bengali mathematical reasoning LLMs. The data curation, rigorous filtering, decontamination, synthetic augmentation, and difficulty-aware annotation yield a high-quality foundation upon which curriculum-guided policy optimization can flourish, raising the standard for low-resource language mathematical AI (Dipta et al., 11 Jan 2026, Tabib et al., 8 Jan 2025). The integration of these corpora with model pipelines produces state-of-the-art performance in Bengali reasoning, driving educational and research applications in the Bengali-speaking world.