
GSM-Symbolic: Evaluating LLM Mathematical Reasoning

Updated 5 February 2026
  • GSM-Symbolic is a research paradigm that uses parameterized symbolic templates to generate diverse, controlled math word problems and test LLMs’ reasoning.
  • It systematically instantiates templates with varied numerical and textual inputs to measure accuracy, variance, and performance drops under subtle problem changes.
  • The approach highlights LLMs' fragility by introducing clause modifications and distractor content, revealing a reliance on pattern matching over true symbolic reasoning.

GSM-Symbolic denotes a research paradigm and corresponding benchmark framework designed to evaluate and expose the structural limitations of mathematical reasoning in LLMs. Its methodology introduces symbolic templates that parameterize and systematically generate diverse, controlled variations of grade-school mathematical word problems, thus allowing for the precise measurement of reasoning robustness, variance, and fragility under minimal shifts in problem surface form, content, or logical complexity. GSM-Symbolic is centered on benchmarking the ability of LLMs to generalize their mathematical reasoning beyond memorized patterns, testing for true symbolic manipulation as opposed to superficial in-distribution pattern matching (Mirzadeh et al., 2024).

1. Symbolic Template Construction and Parameterization

The core innovation of GSM-Symbolic lies in the creation of symbolic templates from existing natural-language math questions (originally from GSM8K). Each question is abstracted into a parametrized skeleton that replaces concrete values, names, and objects with placeholders corresponding to:

  • Variables V = {v_1, …, v_k}, e.g., x (number of blocks), y (number of animals), z (number of rings), total, and ans.
  • Domains D_i for each variable v_i. Name variables are drawn from finite sets (e.g., sample(["Sophie", "Oliver", ...])), categorical variables (such as family member types) from their discrete categories, and numerical values from specified uniform integer ranges, such as x, y, z ~ UniformInt[5, 100], total ~ UniformInt[100, 500], ans ~ UniformInt[85, 200].
  • Constraints C to enforce arithmetic consistency among variables, such as x + y + z + ans = total.
  • Skeleton text T_0: a natural-language template into which sampled variables are substituted.

Template instantiation procedurally samples values for each variable from its domain while checking that all constraints are satisfied, thereby ensuring arithmetic validity. Placeholders in T_0 are replaced with the sampled values to yield a concrete math problem and its unique solution.
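This instantiation procedure can be sketched in Python. The template, domains, and skeleton text below are illustrative stand-ins for the paper's actual templates, not the benchmark's own code:

```python
import random

# Hypothetical template mirroring the variable/domain/constraint structure
# described above. The skeleton text and domains are invented for illustration.
TEMPLATE = {
    "skeleton": (
        "{name} has {total} toys in total: {x} blocks, {y} toy animals, "
        "{z} rings, and some marbles. How many marbles does {name} have?"
    ),
    "domains": {
        "name": ["Sophie", "Oliver", "Emma", "Liam"],
        "x": range(5, 101),
        "y": range(5, 101),
        "z": range(5, 101),
        "total": range(100, 501),
    },
    # ans is derived from the constraint x + y + z + ans == total
    # and must itself land in UniformInt[85, 200].
    "ans_range": range(85, 201),
}

def instantiate(template, rng):
    """Rejection-sample variable values until the constraint is satisfied,
    then substitute them into the skeleton text."""
    while True:
        vals = {k: rng.choice(list(d)) for k, d in template["domains"].items()}
        ans = vals["total"] - vals["x"] - vals["y"] - vals["z"]
        if ans in template["ans_range"]:
            return template["skeleton"].format(**vals), ans

rng = random.Random(0)
question, answer = instantiate(TEMPLATE, rng)
```

Rejection sampling keeps the sampler simple while guaranteeing that every emitted instance is arithmetically valid and has a unique answer.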

2. Workflow and Dataset Generation

The GSM-Symbolic pipeline comprises several structured phases:

  1. Template authoring: Manual conversion of GSM8K questions to skeleton format, with explicit V, D_i, and C definitions.
  2. Automated instantiation: Each template is sampled independently N = 50 times, yielding distinct but structurally equivalent questions via random variable assignment subject to all constraints.
  3. Quality control: Instantiations are validated to prevent leakage of original values, ensure well-posedness, and confirm solution consistency (including random manual review).
  4. Clause-count variants: Problem complexity is modulated:
    • M_0 (base): original clause count.
    • M_-1: one clause removed.
    • P_+1, P_+2: one or two additional (semantically relevant) clauses inserted.
  5. No-Op variants: Addition of a single, irrelevant but plausible clause to test distractor fragility.

Each of 100 templates yields 50 instantiations for each variant, producing 5,000 examples per type, allowing precise large-sample statistical analysis and fine-grained ablations.
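The clause-count and No-Op variants in steps 4–5 can be sketched as simple edits on a clause list. The clause texts below are invented for illustration and are not drawn from the benchmark:

```python
# A question is modeled as an ordered list of clauses plus a final query.
# All clause strings here are hypothetical examples.
base_clauses = [
    "Liam picked 40 apples on Monday.",
    "He picked 25 more apples on Tuesday.",
    "He gave 10 apples to his sister.",
]
query = "How many apples does Liam have now?"

extra_clause = "He then bought 12 apples at the market."              # P_+1
noop_clause = "Five of the apples were slightly smaller than average."  # No-Op

def build(clauses, query):
    """Join clauses and the final question into one problem statement."""
    return " ".join(clauses + [query])

variants = {
    "M_0": build(base_clauses, query),                    # base clause count
    "M_-1": build(base_clauses[:-1], query),              # one clause removed
    "P_+1": build(base_clauses + [extra_clause], query),  # one clause added
    "NoOp": build(base_clauses + [noop_clause], query),   # irrelevant clause
}
```

In the actual benchmark the added clauses are authored so that M_-1 and P_+1/P_+2 remain well-posed, while the No-Op clause is deliberately irrelevant to the solution.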

3. Evaluation Metrics and Statistical Analysis

GSM-Symbolic introduces robust metrics for performance assessment:

  • Accuracy of model M on benchmark D = {Q_1, …, Q_n}: A(M, D) = (1/n) Σ_{i=1}^{n} 1{M(Q_i) = ans_i}.
  • Variance and sample mean: Across 50 independent sets D_j, each of size 100, reporting the sample mean μ and sample variance s².
  • Performance drop: Δ = A(M, GSM8K) − μ quantifies the decrease relative to original GSM8K performance.
  • Statistical significance: One-sample t-tests assess whether observed drops from GSM8K are significant (p < 0.05).

This design distinguishes between variance induced by superficial variable name changes, value reassignment, and logical (clause) perturbations, supporting detailed error decomposition.
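These metrics are straightforward to compute once per-set accuracies are in hand. A minimal stdlib-only sketch with illustrative toy numbers (a p-value for the t-statistic would come from the t distribution with n−1 degrees of freedom, e.g. via scipy.stats.ttest_1samp):

```python
import math
import statistics

def accuracy(preds, answers):
    """A(M, D): fraction of exact-match answers."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

def drop_and_t(accs, gsm8k_acc):
    """Sample mean and variance across the instantiated sets, the drop
    Delta relative to original GSM8K accuracy, and the one-sample
    t-statistic against that baseline."""
    mu = statistics.mean(accs)
    s2 = statistics.variance(accs)  # sample variance (n - 1 denominator)
    delta = gsm8k_acc - mu
    t = (mu - gsm8k_acc) / math.sqrt(s2 / len(accs))
    return mu, s2, delta, t

# Toy per-set accuracies (made up, not real model results):
accs = [0.80, 0.78, 0.82, 0.76, 0.79]
mu, s2, delta, t = drop_and_t(accs, gsm8k_acc=0.85)
```

A strongly negative t indicates the mean accuracy on the symbolic variants sits significantly below the original GSM8K score.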

4. Fragility Experiments: Clause and Distractor Analysis

To isolate the causes of model brittleness, GSM-Symbolic systematically manipulates question complexity and irrelevant content injection:

  • Clause-Count Fragility: Problems are generated with varying clause counts (removal M_-1, base M_0, additions P_+1 and P_+2). Degradation is sharply super-linear: one added clause decreases accuracy by ~10 percentage points (pp), and two clauses by ~30 pp.
  • No-Op Distractor Fragility: Adding a single irrelevant clause ("No-Op") that is syntactically plausible but logically unused in the solution provokes catastrophic accuracy drops: Phi-3-mini falls from 88.0% to 22.4% (Δ ≈ 65.6 pp) and GPT-4o from 95.2% to 63.1% (Δ ≈ 32.1 pp).

Further, name-only ablations yield moderate variance (σ ≈ 2–6 points), while value-only resampling leads to greater variance (σ ≈ 3–8 points) and substantial mean drops (5–12 pp).

Variant                       | Typical performance drop | Variance (σ)
Names only                    | ~1–2 pp                  | 2–6
Values only                   | ~5–12 pp                 | 3–8
Add 1 clause (P_+1)           | ~10 pp                   | increases
Add 2 clauses (P_+2)          | ~30 pp                   | increases
No-Op distractor (Phi-3-mini) | ~65.6 pp                 | spike
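The No-Op drop Δ is computed from paired correctness records on base questions and their distractor variants. A minimal sketch with illustrative booleans (not real model outputs):

```python
def noop_drop(base_correct, noop_correct):
    """Accuracy drop, in percentage points, when the No-Op clause is added.
    Inputs are parallel lists of per-question correctness flags."""
    a_base = sum(base_correct) / len(base_correct)
    a_noop = sum(noop_correct) / len(noop_correct)
    return 100 * (a_base - a_noop)

# Toy data: 8/10 base questions correct, only 3/10 No-Op variants correct,
# i.e. a 50 pp drop.
drop = noop_drop([True] * 8 + [False] * 2, [True] * 3 + [False] * 7)
```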

5. Key Findings on LLM Reasoning Mechanisms

GSM-Symbolic establishes that current LLM mathematical reasoning is best characterized as in-distribution pattern matching, not genuine symbolic reasoning:

  • High sensitivity to value changes: Minor changes in numerical values, even with the underlying structure held constant, lead to significant accuracy swings (>10 pp), behavior inconsistent with true mathematical abstraction.
  • Conditional inconsistency: The probability that a model answers both the original and renamed/reshuffled problem correctly is much less than one, contradicting expectations for logically consistent reasoning.
  • Misinterpretation of irrelevant information: No-Op clauses, even when logically inert, are often incorrectly integrated into the solution steps, indicating models’ inability to filter non-contributing text.
  • Super-linear degradation with problem complexity: Accuracy collapses far more rapidly than would be expected from increased problem length/complexity, with small increases in clause count resulting in steep performance drops.

Collectively, these outcomes imply that LLMs overfit to superficial surface forms and fail to construct internal, verifiable proof chains characteristic of symbolic manipulation.
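The conditional-inconsistency finding rests on a joint-correctness statistic, sketched here with invented data:

```python
def joint_consistency(orig_correct, variant_correct):
    """P(model answers both the original and the perturbed instance
    correctly). For a reasoner unaffected by surface form this would stay
    close to its marginal accuracy; observed values are far lower.
    This helper is an assumed illustration, not the paper's code."""
    both = sum(o and v for o, v in zip(orig_correct, variant_correct))
    return both / len(orig_correct)

# Toy data: 8/10 originals correct, but only 5 of those survive renaming.
rate = joint_consistency(
    [True] * 8 + [False] * 2,
    [True] * 5 + [False] * 5,
)
```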

6. Extensions: Quasi-Symbolic Pipelines and Implications

Subsequent work leverages GSM-Symbolic as an adversarial benchmark for evaluating new reasoning paradigms, such as QuaSAR (Quasi-Symbolic Abstract Reasoning), which combines stepwise abstraction, symbolic variable instantiation, semi-formal formula composition, and explicit answer grounding. On GSM-Symbolic, introducing quasi-symbolic prompts improves both accuracy and robustness relative to standard chain-of-thought methods (e.g., GPT-4o shifts from 94.5% to 96.5% in accuracy, showing smaller accuracy drops under adversarial perturbations) (Ranaldi et al., 18 Feb 2025). Step ablation studies reveal that formalisation and explicit abstraction steps are crucial for performance and verification, and that removing such structure degrades consistency by several percentage points.

A plausible implication is that enhanced symbolic interleaving in prompting pipelines addresses some, but not all, of the fragilities exposed by GSM-Symbolic; ultimate robustness may require mechanisms for automatic irrelevant information filtering and stepwise proof verification.
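To make the QuaSAR-style pipeline concrete, here is a hedged sketch of what a quasi-symbolic prompt scaffold covering its four stages (abstraction, formalisation, explanation, answer grounding) might look like; the actual prompt wording used by Ranaldi et al. differs:

```python
# Hypothetical prompt scaffold; the step descriptions paraphrase the four
# QuaSAR stages named in the text and are not the paper's exact prompts.
QUASAR_PROMPT = """\
Problem: {question}

Step 1 (Abstraction): restate the problem, replacing concrete quantities
with symbolic variables (x, y, ...).
Step 2 (Formalisation): write semi-formal equations relating the variables.
Step 3 (Explanation): solve the equations step by step.
Step 4 (Answer grounding): substitute the concrete values back in and
state the final numeric answer as 'Answer: <number>'.
"""

prompt = QUASAR_PROMPT.format(question="Liam has 40 apples ...")
```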

7. Significance and Future Directions

GSM-Symbolic has reframed the evaluation of mathematical reasoning in LLMs by foregrounding the necessity of controlled, parameterized, and stress-tested benchmarking. By establishing that SOTA LLMs systematically fail under syntactic and semantic perturbations that do not alter problem logic, GSM-Symbolic calls into question the reliability of progress measured solely by surface-form benchmarks like GSM8K.

Future research trajectories suggested by these findings include: designing models and pipelines capable of explicit causal inference and formal reasoning, integrating symbolic solvers for modular subproblem resolution, and further adversarially probing the boundaries of LLM abstraction and compositionality (Mirzadeh et al., 2024, Ranaldi et al., 18 Feb 2025).
