DISTRACTMATH-BN: Robust Bangla MWP Benchmark

Updated 18 January 2026
  • DISTRACTMATH-BN is a benchmark that augments Bangla mathematical word problems with semantically coherent yet computationally irrelevant distractors to test model robustness.
  • It introduces the †DAGGER model, which uses explicit graph generation and distractor supervision through SFT and GRPO to filter noise and improve inference efficiency.
  • Empirical results demonstrate that structured, graph-based reasoning can greatly reduce token usage and error rates compared to standard chain-of-thought methods.

DISTRACTMATH-BN is a benchmark designed to systematically evaluate the robustness of mathematical reasoning models to semantically coherent but computationally irrelevant information ("distractors") in Bangla mathematical word problems (MWPs). The resource supports the development and assessment of architectures that solve MWPs in the presence of noisy but plausible natural-language context, provides critical insight into the fragility of current chain-of-thought (CoT) prompting approaches, and enables the study of methods for more efficient and reliable inference in low-resource settings (Nazi et al., 11 Jan 2026).

1. Benchmark Construction and Dataset Characteristics

DISTRACTMATH-BN is constructed by augmenting two foundational Bangla MWP datasets: MGSM-BN (250 problems, single- and multi-step arithmetic) and MSVAMP-BN (1,000 problems, unit conversions and multi-operation reasoning). Each original problem is enriched with artificially generated distractors following three distinct categories:

  • Related Entity Distractors (RED): Numerical facts about the same entity type but referring to different individuals or objects.
  • Orthogonal Attribute Distractors (OAD): Additional attributes of the main entities, frequently expressed in incompatible units (e.g., time versus weight).
  • Null-Effect Event Distractors (NEED): Actions or events explicitly marked as having a net-zero effect on the solution.

Distractor sentences are inserted at various points within the problem statement, ensuring (1) syntactic and semantic coherence, (2) computational irrelevance (the ground-truth solution function f is unchanged: f(S′) = f(S), where S′ is the distractor-augmented statement), and (3) the presence of negation or unit-mismatch indicators as required by the distractor type. All augmentations undergo automated verification (via GPT-4.1) and human review to guarantee quality.

Data statistics:

  • MGSM+Distractors: 738 problems, avg. 2.76 distractors/question, avg. 106.2 tokens/question
  • MSVAMP+Distractors: 2,947 problems, avg. 2.68 distractors/question, avg. 112.5 tokens/question ("Tokens" denotes Bangla word-piece tokens.)

2. Explicit Distractor-Aware Graph Generation Model (†DAGGER)

†DAGGER recasts each MWP as an executable directed acyclic graph (DAG) g = (V, E), where nodes represent constants, operations, or distractors, and edges denote computational dependencies:

  • Node attributes:
    • Unique id and op(v) specifying the operation (e.g., const, add, sub, …),
    • val(v) (for constant nodes),
    • Boolean distractor(v) flag,
    • Optional natural-language label.

A final answer node, v_final (with operation identity), is guaranteed in each graph. Graphs are serialized for model consumption and can be executed by a topological interpreter.

†DAGGER employs a transformer encoder–decoder to model p_θ(g | Q) as a sequential process, generating each node (its op, args, value, and distractor flag) conditioned on the question Q and the previously generated subgraph. Explicit distractor modeling is achieved by supervising the distractor attribute during training and reinforcing correct filtering through policy optimization; graphs that inadvertently propagate distractors into the answer path are penalized.

3. Training Regime: SFT and Group Relative Policy Optimization

†DAGGER uses a two-phase training approach:

  1. Supervised Fine-Tuning (SFT):
    • Trained on 3,000 non-distractor MWPs (1.5k from SOMADHAN, 1.5k from NuminaMath-CoT-BN) annotated with "gold" solution graphs.
    • Objective: minimize the negative log-likelihood of gold graphs, L_SFT = −∑_{(Q, g*)} log p_θ(g* | Q).
    • Implemented with Gemma-3 (4B/12B), LoRA adapters, the AdamW optimizer, and a standard batch size with a cosine learning-rate schedule.
  2. Group Relative Policy Optimization (GRPO):
    • Treats graph generation as a reinforcement learning problem, maximizing a composite reward R(g, y) = 0.5·1_fmt(g) + 0.5·1_exec(g) + 1_acc(exec(g), y), where the binary indicators reward valid graph format, successful execution, and answer correctness, respectively.
    • Initialized from the SFT checkpoint. Eight candidate graphs per prompt are generated via top-p sampling, and a binary-normalized policy-optimization (BNPO) loss is applied over each group to stabilize gradients.

Crucially, neither SFT nor GRPO is trained on distractor-augmented examples; robustness to DISTRACTMATH-BN emerges from the inductive bias provided by explicit graph and distractor modeling.

4. Empirical Results and Comparative Evaluation

Model categories tested:

  • Standard CoT LLMs: Qwen 2.5 (3B), Qwen 3 (4B/8B), LLAMA 3 (7B), Gemma-3 (4B/12B).
  • Reasoning-specialized variants (GRPO-trained) for Qwen 3 and Gemma-3.
  • †DAGGER: Graph generation models (Gemma-3 4B/12B, SFT→GRPO).

Core findings:

  • Accuracy drop under distractors:
    • Standard CoT models exhibit 18.1–40.7 percentage point drops (e.g., Qwen 2.5, 53.7%→13.0% on MSVAMP).
    • Reasoning-specialized models degrade by 14.2–19.9 points, despite producing ~5.2× more tokens.
    • RED distractors induce the most severe errors (up to 94% error rate).
  • †DAGGER performance:
    • Gemma-3-12B SFT→GRPO achieves 69.4% weighted accuracy on DISTRACTMATH-BN, close to the best reasoning-specialized model (71.4%), while emitting only 359 tokens per problem (an 89% reduction in output token cost).
    • Zero-shot †DAGGER (with GPT-4.1 few-shot prompts) outperforms CoT; cross-lingual gains are observed in Telugu and Thai.
  • Robustness without distractor-specific training:
    • †DAGGER's ability to filter distractors and maintain high accuracy is attributed to the structured graph representation and explicit distractor supervision, not to exposure to noisy contexts during training.

5. Analysis: Ablations, Efficiency, and Broader Implications

Effect of GRPO Initialization

GRPO effectiveness depends strongly on initialization: SFT→GRPO consistently outperforms training GRPO from scratch (Base→GRPO), with up to 15.6 percentage point gains under distractors (notably with Gemma-3 4B). SFT is essential to establish a valid solution-graph prior, enabling subsequent policy optimization to refine semantic parsing and distractor filtering.

Token Complexity Scaling

A marked linear correlation (R² ≈ 0.43–0.60) is observed between output token count and graph-operation complexity in †DAGGER. By contrast, free-form CoT, which lacks explicit structure, exhibits near-zero correlation and high variance. This confirms that †DAGGER's output token length reflects genuine computational need instead of verbosity, yielding dramatic inference efficiency gains.

Key Findings and Limitations

  • Structured intermediate representations with explicit distractor modeling confer robustness to contextually plausible noise in Bangla MWP solving.
  • Token efficiency makes †DAGGER up to 89% cheaper at inference than extended CoT reasoning.
  • The paradigm generalizes zero-shot to other low-resource languages and model sizes.
  • Persisting challenges include semantic disambiguation of entity-level distractors and extending the method to richer domains such as algebra and geometry.

6. Relation to Distractor Generation Research

DISTRACTMATH-BN fills a critical gap by quantifying the vulnerability of open-weight LLMs, both generic and reasoning-specialized, to distractor noise in mathematical reasoning, particularly for low-resource languages. It establishes that current CoT paradigms are susceptible to large performance drops and highlights the efficacy of explicit, executable computational-graph generation over free-form text solution modeling.

†DAGGER's inductive bias and explicit distractor labeling are complementary to techniques studied in multilingual and MCQ distractor-generation benchmarks. For MCQ datasets, approaches such as variational error modeling in DiVERT and retrieval-augmented in-context learning offer empirical guidance for adaptation (Fernandez et al., 2024; McNichols et al., 2023), but these have not demonstrated the same degree of distractor robustness as †DAGGER in the Bangla MWP setting.

7. Significance for Robust Mathematical Reasoning

By providing a large-scale, systematically distractor-augmented Bangla MWP benchmark and demonstrating the critical limitations of CoT prompting, DISTRACTMATH-BN establishes an essential resource for robust evaluation and model design. The findings support a shift toward explicit structured reasoning, such as executable graph-generation models with integrated distractor annotation, as key to overcoming the fragility of natural-language mathematical inference, especially in data-sparse, low-resource, and noisy environments (Nazi et al., 11 Jan 2026).
