
Curriculum-GRPO: RL for Bengali Math

Updated 18 January 2026
  • Curriculum-GRPO is a reinforcement learning methodology that fuses curriculum design with Group Relative Policy Optimization to improve Bengali mathematical reasoning.
  • It leverages a rigorously tagged Ganit corpus and a composite reward structure to ensure efficient, native Bengali chain-of-thought output.
  • Empirical results demonstrate notable accuracy improvements, concise solutions, and robust performance across various model scales.

Curriculum-GRPO is a reinforcement learning (RL) training methodology specifically designed for low-resource language mathematical reasoning, with a focus on Bengali. The approach integrates curriculum learning principles with Group Relative Policy Optimization (GRPO), dynamically structures RL fine-tuning based on rigorously tagged problem difficulty, and applies verifiable, multi-component rewards to promote both solution correctness and native language reasoning. The method is central to the GanitLLM pipeline for Bengali mathematical reasoning, leading to substantial performance improvements and the production of compact, efficient LLMs capable of chain-of-thought problem solving in Bengali (Dipta et al., 11 Jan 2026).

1. Motivation and Foundational Challenges

Curriculum-GRPO was conceived in response to several empirical barriers in mathematical NLP for low-resource languages:

  • Standard RLHF and reward maximization strategies—extensively utilized in large English-centric LLM training—fail to adapt in low-resource domains due to severe reward sparsity, particularly when models are exposed to challenging ("hard" or "Olympiad") math problems without sufficient grounding in the language or problem-solving conventions.
  • Typical Bengali LLMs either reason in English and post-hoc translate, or collapse on multi-step math problems due to ineffective reward signals and lack of stepwise Bengali supervision.
  • Classical curriculum learning, where tasks are ordered from easy to hard, showed empirical promise in upstream tasks, but lacked integration with verifiable, token-level or output-structure-directed RL for code or math domains.

A direct implication is that a principled curriculum design, paired with RL that rewards both problem-specific outputs and native-language reasoning, is essential to bridge the performance and fluency gap for mathematical LLMs in Bengali (Dipta et al., 11 Jan 2026).

2. Ganit Corpus and Problem Difficulty Tagging

Curriculum-GRPO uses the Ganit Bengali Math Corpus, a rigorously filtered and decontaminated dataset of ~19,000 problems, as its foundation (Dipta et al., 11 Jan 2026):

  • Source Diversity: Problems are pooled from human-annotated, human-translated, and LLM-translated sources, including DL-Sprint 3.0, mCoT-MATH-bn (580,000 problems), NuminaMath-CoT-bn (859,000 problems), Somadhan (8,700 problems), and s1k-Bangla, with expert filtering for ≥95% correctness.
  • Filtering: Rules restrict problems to those with verifiable numeric answers, ≥99% Bengali text, and no multiple-choice format; deduplication is performed via Levenshtein 3-gram matching and MinHash.
  • Difficulty Stratification: Problems are tagged for difficulty via automated pass@k sampling with a strong Bengali mathematical evaluator (Qwen3-32B-Instruct). For each problem, 32 completions are drawn at temperature T = 0.7; the number c of correct generations determines the difficulty bucket:
    • Olympiad: c ∈ [1, 8]
    • Hard: c ∈ [9, 16]
    • Medium: c ∈ [17, 24]
    • Easy: c ∈ [25, 32]

Problems unsolved by the evaluator (c = 0) are excluded. Final training splits (CoT-SFT: 11,023; RLVR: 7,328) and the evaluation set (Dev: 776, balanced across buckets) ensure broad difficulty coverage.
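The bucketing rule above can be sketched as a small helper (the function name is illustrative, not taken from the released code):

```python
def difficulty_bucket(c: int):
    """Map the number of correct completions c (out of 32 samples at T=0.7)
    to a difficulty bucket; problems the evaluator never solves are dropped."""
    if c == 0:
        return None          # unsolved by the evaluator: excluded from training
    if c <= 8:
        return "olympiad"    # c in [1, 8]
    if c <= 16:
        return "hard"        # c in [9, 16]
    if c <= 24:
        return "medium"      # c in [17, 24]
    return "easy"            # c in [25, 32]
```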

3. Multi-Stage Training Pipeline

The training methodology comprises two main phases:

  • Supervised Fine-Tuning (SFT): Models are first grounded via supervised learning on CoT-annotated examples in Bengali, ensuring that generated reasoning chains are natively in Bengali rather than defaulting to English.
  • Curriculum-GRPO Reinforcement Learning: Fine-tuning is continued by group-based RL with explicit curriculum-informed sampling. The training maintains a mix of easy-to-hard samples, empirically set to a 60% easy / 40% hard bucket distribution (a ratio derived to counteract cold-start zero-reward collapse).

Group Relative Policy Optimization is executed with LoRA adapters (rank 16, α = 32) over 5 RL epochs. Context length is capped at 4,096 tokens for SFT and 2,500 for RL, and a DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) loss with KL penalty (β = 0.1) is employed to stabilize the policy.
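The two core mechanics of this phase can be sketched as follows. Function names are hypothetical, and the advantage normalization shown (per-group standardization) is the common GRPO formulation; implementations differ in details such as whether the standard deviation is used:

```python
import random
from statistics import mean, pstdev

def sample_curriculum_batch(easy_pool, hard_pool, batch_size, easy_frac=0.6,
                            rng=random):
    """Draw a curriculum batch: ~60% easy, ~40% hard problems, then shuffle.
    The 60/40 split counteracts cold-start zero-reward collapse."""
    n_easy = round(batch_size * easy_frac)
    batch = rng.sample(easy_pool, n_easy) + rng.sample(hard_pool, batch_size - n_easy)
    rng.shuffle(batch)
    return batch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group
    (all completions for the same prompt), so no learned critic is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A rollout of 8 completions per prompt (as in the paper's RL settings) would yield 8 rewards per group, each normalized against its group's mean before entering the clipped policy-gradient loss.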

4. Reward Structure and Optimization Criteria

Curriculum-GRPO introduces a composite, verifiable reward function, applied per completion, that reflects the unique demands of mathematical RL in Bengali. Each completion is scored by three orthogonal reward terms:

  1. Format Correctness (R_fmt ∈ {0, 1}): 1 if the generation begins with full chain-of-thought reasoning and marks the final answer with <answer> tags.
  2. Numerical Correctness (R_corr ∈ {0, 1, 2}): 1 for a numerically correct answer, with a bonus +1 if the answer inside <answer> is rendered in Bengali numerals.
  3. Bengali Reasoning (R_bn ∈ {0, 1}): 1 if ≥80% of the characters in the reasoning portion fall in Unicode block U+0980–U+09FF (Bengali script).

The total reward is R = R_fmt + R_corr + R_bn, bounded in [0, 4]. This reward is used directly in the policy gradient, subject to KL regularization, and overlength responses are penalized through dynamic filtering and token-level loss adjustment (Dipta et al., 11 Jan 2026).
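A minimal sketch of the three reward terms, assuming a numeric gold answer and simple tag matching (the exact parsing and answer-normalization rules in the released code may differ):

```python
import re

BENGALI_DIGITS = "০১২৩৪৫৬৭৮৯"

def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in Unicode block U+0980-U+09FF."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0980" <= ch <= "\u09ff" for ch in chars) / len(chars)

def composite_reward(completion: str, gold: float) -> int:
    """Composite verifiable reward R = R_fmt + R_corr + R_bn, bounded in [0, 4]."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    r_fmt = 1 if m else 0                       # answer marked with <answer> tags

    r_corr = 0
    if m:
        ans = m.group(1).strip()
        # Normalize Bengali numerals to ASCII digits before comparing.
        ascii_ans = ans.translate(str.maketrans(BENGALI_DIGITS, "0123456789"))
        try:
            if abs(float(ascii_ans) - gold) < 1e-6:
                r_corr = 1
                if any(d in ans for d in BENGALI_DIGITS):
                    r_corr += 1                 # +1 bonus for Bengali numerals
        except ValueError:
            pass                                # non-numeric answer: no credit

    reasoning = completion[:m.start()] if m else completion
    r_bn = 1 if bengali_ratio(reasoning) >= 0.8 else 0
    return r_fmt + r_corr + r_bn
```

For example, a completion that reasons in Bengali and answers ৪২ inside <answer> tags earns the full reward of 4, while an English chain-of-thought with the ASCII answer 42 earns only 2 (format plus correctness, no numeral bonus, no Bengali-reasoning credit).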

5. Empirical Impact and Benchmark Results

GanitLLM, trained with Curriculum-GRPO, demonstrates the following empirical outcomes:

  • Performance: On Bn-MGSM and Bn-MSVAMP benchmarks, GanitLLM-4B delivers +7.6 and +5.9 accuracy points over Qwen3-4B, with results of 76.8% and 76.4% respectively.
  • Reasoning Tokens: Proportion of Bengali script tokens in the chain-of-thought increases from 14% (base) to 88% (Curriculum-GRPO), indicating a shift to native-language reasoning.
  • Solution Length: Average solution length drops from 943 to 193 words, showing markedly improved conciseness.
  • Model Efficiency: The 4B-parameter GanitLLM outperforms Qwen3-8B and matches Qwen3-14B, indicating efficient parameter usage; similar gains are observed at the 1.7B and 0.6B scales.
  • Robustness: Early exposure to "hard" samples, without curriculum pacing, leads to reward collapse; curriculum-driven sampling is essential to prevent pathological RL dynamics.

A plausible implication is that curriculum-based RL is likely to be broadly beneficial in other low-resource mathematical domains with sparse reward support and severe language-specific fluency constraints (Dipta et al., 11 Jan 2026).

6. Implementation Details and Adaptation Challenges

The implementation prescribes:

  • Base architectures: Qwen3 models (0.6B, 1.7B, 4B; Apache 2.0)
  • SFT settings: 50 epochs, batch 64, bf16, learning rate 1×10⁻⁶, max 4,096 tokens
  • RL settings: 5 epochs, LoRA adapters, rollout 8, max 2,500 tokens
  • Hardware: 2 × A100 GPUs
  • Major challenges: RL cold-start (zero reward under hard samples), English pre-training bias requiring Bengali CoT SFT, and overlength responses controlled by DAPO loss.
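The reported hyperparameters can be consolidated into a single configuration sketch (key names are illustrative, not taken from the released code):

```python
# Hedged sketch of the hyperparameters reported in the paper,
# gathered into one config dict for reference.
TRAIN_CONFIG = {
    "base_models": ["Qwen3-0.6B", "Qwen3-1.7B", "Qwen3-4B"],
    "sft": {
        "epochs": 50,
        "batch_size": 64,
        "precision": "bf16",
        "lr": 1e-6,
        "max_tokens": 4096,
    },
    "rl": {
        "epochs": 5,
        "rollouts_per_prompt": 8,
        "max_tokens": 2500,
        "lora": {"rank": 16, "alpha": 32},
        "kl_beta": 0.1,
        "curriculum_mix": {"easy": 0.6, "hard": 0.4},
    },
}
```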

Curriculum-GRPO is made available at https://dipta007.github.io/GanitLLM/, with code, data splits, and checkpoints, facilitating reproducibility and extension to other languages or domains (Dipta et al., 11 Jan 2026).

7. Relationship to Broader Bengali Math and NLP Resources

Curriculum-GRPO interacts with other key Bengali math NLP datasets and pipelines:

  • SOMADHAN (8,792 Bengali MWPs, CoT solutions) supports supervised and LoRA-driven fine-tuning but does not implement curriculum-based RL with verifiable, multi-component rewards (Paul et al., 27 May 2025).
  • PatiGonit (10,000 problems): Used for seq2seq translation baselines and transformer pre-training; lacks RL pipeline or difficulty-stratification (Era et al., 5 Jan 2025).
  • BanglaMATH and BanglaSTEM: Serve as benchmark and translation/test resources, confirming the underperformance of general LLMs on Bengali mathematics and motivating curriculum- and reward-structured training approaches (Prama et al., 13 Oct 2025, Hasan et al., 5 Nov 2025).

This positioning underscores Curriculum-GRPO's distinctive contribution in aligning reward engineering, curriculum progression, and verifiable output structure to the demands of native-language math reasoning.
