MathCanvas-Bench: Multimodal Math Evaluation
- MathCanvas-Bench is a multimodal evaluation suite that tests LMMs on integrated visual and textual mathematical reasoning using 3,000 curated problems across diverse disciplines.
- It employs advanced automated grading with GPT-4.1, using metrics like Complete Accuracy and Weighted Score to ensure comprehensive evaluation of solution steps.
- Evaluations on the benchmark show that training for intrinsic visual chain-of-thought improves strategic reasoning and generalizes across domains, yielding actionable insights for enhancing LMMs.
MathCanvas-Bench is a rigorous multimodal evaluation suite designed to quantify the capabilities of Large Multimodal Models (LMMs) in generative, interleaved visual-textual mathematical reasoning tasks. Within the MathCanvas framework, MathCanvas-Bench occupies a pivotal role by enabling standardized benchmarking of models on problems demanding both textual and strategic diagrammatic solution steps. It consists of 3,000 problems covering a broad spectrum of mathematical disciplines, each constructed to require models to produce solutions that judiciously interleave explanatory text and visual diagrams, reflecting expert human problem-solving in domains—such as geometry—where visual aids are intrinsic to reasoning.
1. Construction and Design Principles
MathCanvas-Bench is sampled from a curated pool of 222,000 multimodal math problems (extracted from textbooks, exams, and web sources) that have undergone extensive filtering for image relevance, LaTeX standardization, and answer completeness. Multiple-choice questions are systematically omitted, making the suite strictly open-ended and generative. Sampling is weighted sub-linearly in each category's share of the pool, increasing the relative prevalence of rarer mathematical topics. Deduplication removes any candidate whose Jaccard similarity over 5-grams with a training example exceeds 0.4, preventing near-duplicate leakage from the training set and preserving the integrity of generalization measurements.
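The deduplication step can be sketched as follows. This is a minimal illustration of the Jaccard 5-gram filter; tokenization details (word-level versus character-level n-grams) are an assumption, as the summary does not specify them.

```python
def ngrams(text, n=5):
    # Word-level n-grams; the exact tokenization used by the benchmark
    # pipeline is an assumption here.
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=5):
    # Jaccard similarity between the n-gram sets of two problem texts.
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def is_near_duplicate(candidate, train_set, threshold=0.4):
    # Exclude a benchmark candidate if its similarity to any training
    # example exceeds the 0.4 threshold stated in the text.
    return any(jaccard(candidate, t) > threshold for t in train_set)
```

Filtering at the benchmark-construction stage, rather than at training time, ensures that no evaluation problem is a near-copy of anything the model may have seen.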
Final coverage spans eight key mathematical categories: algebra, analytic geometry, plane geometry, solid geometry, trigonometry, statistics, calculus/vector, transformational geometry. Each problem averages 1–2 sub-questions and mirrors the train set in length (text and image count), supporting consistency of evaluation.
2. Evaluation Protocol and Automated Grading
Evaluation is conducted via automated grading using GPT-4.1, which receives the question, ground-truth answers, and the LMM-generated solution (including both textual and visual outputs). The grader returns a JSON object specifying lists of ground-truth and predicted answers as well as correctness flags per sub-question.
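Consuming the grader's output is straightforward; the sketch below uses illustrative field names (`gt_answers`, `pred_answers`, `correct`), which are assumptions rather than the benchmark's exact schema.

```python
import json

# Illustrative grader response; the field names are assumptions,
# not the benchmark's documented schema.
raw = """{
  "gt_answers": ["x = 2", "area = 6"],
  "pred_answers": ["x = 2", "area = 5"],
  "correct": [true, false]
}"""

result = json.loads(raw)
flags = result["correct"]
print(f"{len(flags)} sub-questions, fully correct: {all(flags)}")
```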
Two principal metrics characterize performance on MathCanvas-Bench:
- Complete Accuracy (Acc_C): Yields 1 only if all sub-question correctness flags are true, enforcing a stringent requirement for holistic solution correctness.
- Weighted Score (Acc_W): Sub-question correctness flags are combined using weights that favor later reasoning steps, so the most advanced steps contribute most to the score; specific weight values are prescribed for each sub-question count.
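The two metrics can be sketched in code. The benchmark's exact weight table is not reproduced in this summary, so the position-proportional weights below are an illustrative assumption, not the prescribed values:

```python
def complete_accuracy(flags):
    # Acc_C: 1 only if every sub-question flag is true.
    return 1.0 if all(flags) else 0.0

def weighted_score(flags):
    # Acc_W: later sub-questions weighted more heavily.
    # Position-proportional weights (w_i = i / sum_j j) are an
    # assumption; the benchmark prescribes its own weight values.
    n = len(flags)
    total = n * (n + 1) / 2
    weights = [(i + 1) / total for i in range(n)]
    return sum(w for w, f in zip(weights, flags) if f)
```

Under this scheme a model that solves only the final sub-question of a two-part problem scores 0 on Acc_C but 2/3 on Acc_W, reflecting the premium placed on advanced steps.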
This protocol aligns evaluation with expert standards in mathematical problem-solving, emphasizing both step-by-step accuracy and strategic advancement.
3. Experimental Setup and Model Comparison
Twenty leading unified LMMs are evaluated on MathCanvas-Bench, including Gemini-2.5-Pro/Flash, GPT-4.1/4.1-mini/5, Claude-Sonnet-4, Seed-1.6, Qwen3-VL-Plus, Nano-Banana, Qwen-2.5-VL[7B/32B/72B], Gemma-3, InternVL3.5, Keye-VL-1.5, and the BAGEL family. Model inference is standardized using VLMEvalKit for consistent prompting and beam sizes across all runs; each model produces solutions for all 3,000 problems under controlled conditions.
Generated outputs (both diagrams and text) are submitted to the scoring pipeline, ensuring rigorous and reproducible evaluation against the benchmark.
4. Quantitative Performance Analysis
The benchmark highlights substantial differentiation among LMM architectures and training protocols. The MathCanvas-trained model, BAGEL-Canvas, achieves a Complete Accuracy of 21.9% and Weighted Score of 34.4%, representing +164% and +86% relative improvement, respectively, over the baseline BAGEL (8.3% / 18.5%). Subject-wise weighted gains are distributed as follows: Algebra +11.8, Analytic Geometry +14.1, Plane Geometry +19.2, Solid Geometry +12.3, Trigonometry +27.1, Statistics +9.9, Transformational Geometry +9.9, Calculus/Vector +0.8.
BAGEL-Canvas outperforms closed-source LMMs (e.g., Gemini 2.0-Flash and GPT-4.1) and narrows the gap to top proprietary systems. A plausible implication is that intrinsic visual chain-of-thought training frameworks substantially enhance both the fidelity and timing of diagrammatic reasoning steps.
Table: BAGEL-Canvas Performance on MathCanvas-Bench (Excerpt)
| Model | Complete (%) | Weighted (%) |
|---|---|---|
| BAGEL (base) | 8.3 | 18.5 |
| BAGEL-Canvas | 21.9 | 34.4 |
| Δ (relative) | +164% | +86% |
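The table's relative deltas follow directly from the absolute scores:

```python
def rel_gain(base, new):
    # Relative improvement in percent, rounded to the nearest integer.
    return round((new - base) / base * 100)

print(rel_gain(8.3, 21.9))   # Complete Accuracy: → 164
print(rel_gain(18.5, 34.4))  # Weighted Score:    → 86
```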
5. Generalization Beyond MathCanvas-Bench
Zero-shot evaluation reveals that MathCanvas-trained models generalize effectively to public mathematical reasoning benchmarks, even when restricted to text-only solutions. Key improvements include: MathVista-GPS +10.5 percentage points (68.8→79.3), MathVerse (Text Dominant/Lite) +16.2/17.9, MathVision +8.8 overall with category-specific improvements (e.g., Analytic Geometry +22.6, Algebra +13.0, Angle tasks +17.3).
This suggests that training on interleaved visual-text reasoning not only increases the model's proficiency in generating diagrams but also strengthens fundamental textual mathematical reasoning capabilities.
6. Significance and Implications
MathCanvas-Bench establishes a new standard for evaluating intrinsic visual chain-of-thought in LMMs, moving beyond conventional textual benchmarks into faithfully multimodal, generative problem settings. Its sampling strategy ensures robust coverage of mathematical subfields, while the grading protocol (via GPT-4.1) supplies both transparency and rigor. The benchmark is closely aligned with real-world expectations: complex, multi-step problems where strategic deployment of visual aids is not ancillary but essential.
A plausible implication is that MathCanvas-Bench—and similar rigorously-constructed benchmarks—will become indispensable tools for research into mathematical and multimodal reasoning, enabling reproducible comparison of advanced LMMs and informing further progress in both task design and model architecture.
7. Future Directions and Research Utility
The MathCanvas-Bench benchmark supports the ongoing development of intrinsic VCoT models, providing granular insight into model strengths and deficiencies across categories and solution components. With standardized protocols and categories, the resource is positioned for future expansion, facilitating investigations into curriculum scheduling, multimodal compositionality, and strategic reasoning. As multimodal model architectures continue to evolve, benchmarks of MathCanvas-Bench's fidelity are likely to serve as core evaluation suites driving progress in human-like mathematical reasoning with LMMs (Shi et al., 16 Oct 2025).