
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Published 10 Sep 2024 in cs.CL and cs.AI (arXiv:2409.13729v2)

Abstract: LLMs have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal LLMs (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.


Summary

  • The paper introduces MathGLM-Vision, a series of multi-modal large language models fine-tuned on the diverse MathVL dataset to solve complex mathematical problems involving visual information.
  • By constructing the MathVL dataset, which includes diverse math topics beyond geometry, the research provides a richer foundation for training MLLMs to handle varied visual-linguistic mathematical tasks.
  • Evaluations show MathGLM-Vision models achieve significant performance improvements on benchmarks, demonstrating the critical role of integrated visual inputs for effective mathematical reasoning.

The paper "MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model" introduces a novel approach to enhancing mathematical reasoning in multi-modal LLMs (MLLMs). The study addresses the limitations of existing models, which predominantly focus on solving geometric problems while neglecting the diversity and complexity of visual information required in other mathematical domains.

Summary of Key Contributions

  1. Introduction of MathVL Dataset: The authors construct a fine-tuning dataset called MathVL, which is designed to improve the mathematical reasoning abilities of MLLMs. MathVL is unique in its incorporation of diverse mathematical problems integrating both textual and visual data. It includes open-source datasets and newly curated Chinese K12 educational content, enhancing the scope beyond typical geometric problems to cover arithmetic, algebra, and statistics.
  2. Development of MathGLM-Vision Series: By fine-tuning on the MathVL dataset, the authors introduce a series of models referred to as MathGLM-Vision. These models are built on different parameter-scale backbones (GLM-4V-9B, CogVLM2, and CogVLM-32B) with the aim of achieving significant improvements in solving complex mathematical problems that include visual components.
  3. Evaluation and Results: The paper reports extensive evaluations across several public benchmarks alongside a newly created MathVL-test, consisting of 2,000 problems. MathGLM-Vision shows marked improvements over existing models, achieving significant relative performance boosts on benchmark datasets like MathVista-GPS. For instance, MathGLM-Vision-9B achieved a 39.68% improvement over its backbone model.
  4. Role of Visual Information: One of the critical insights from the experiments is the demonstrated significance of visual inputs. The results emphasize how integrating visual information dramatically enhances model performance in mathematical reasoning tasks, with appreciable declines in performance observed when visual inputs are excluded.
  5. Discussion on Limitations and Challenges: The authors bring attention to three primary challenges with current MLLMs:
    • Overemphasis on geometric problems.
    • Limited dataset diversity hindering model adaptability.
    • Lack of capability to process multiple image inputs simultaneously.
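The supervised fine-tuning setup described above can be illustrated with a minimal sketch. The record layout and field names below are assumptions for illustration only — the paper does not specify MathVL's exact schema — but they show how an image-grounded, step-by-step math problem might be serialized into an SFT (prompt, target) pair.

```python
from dataclasses import dataclass

@dataclass
class MathVLSample:
    """Hypothetical MathVL-style record; field names are illustrative, not from the paper."""
    image_path: str        # diagram, chart, or figure accompanying the problem
    question: str          # problem statement in text
    solution_steps: list   # step-by-step reasoning, as MathVL emphasizes
    answer: str            # final answer

def to_sft_example(sample: MathVLSample) -> dict:
    """Serialize one sample into a (prompt, target) pair for supervised fine-tuning.
    The target keeps the full reasoning chain, not just the final answer."""
    prompt = f"<image>{sample.image_path}</image>\nQuestion: {sample.question}"
    steps = "\n".join(
        f"Step {i}: {s}" for i, s in enumerate(sample.solution_steps, 1)
    )
    target = f"{steps}\nAnswer: {sample.answer}"
    return {"prompt": prompt, "target": target}

sample = MathVLSample(
    image_path="charts/bar_chart_042.png",
    question="By how much did sales grow from Q1 to Q2?",
    solution_steps=["Read Q1 sales from the chart: 120.",
                    "Read Q2 sales from the chart: 150.",
                    "Compute the difference: 150 - 120 = 30."],
    answer="30",
)
example = to_sft_example(sample)
```

Training on targets that include intermediate steps, rather than bare answers, is what lets the fine-tuned model emit the step-by-step reasoning the dataset is built around.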

Detailed Observations

  • Dataset Diversity: MathVL's inclusion of diverse subjects and problem types underpins the broad applicability of the MathGLM-Vision models. It captures the essential step-by-step reasoning that many existing datasets lack.
  • Model Architecture: By leveraging pre-trained MLLMs and adding layers for multi-modal integration, MathGLM-Vision models combine parameters from general LLMs with specialized vision encoders, facilitating improved comprehension of visual and textual information.
  • Experimental Setup: The evaluations include both closed-source and open-source models as baselines, ensuring robust validation of MathGLM-Vision's effectiveness. The models deliver substantial performance gains across competitive benchmarks.
  • Generalizability and Robustness: The combination of visual question answering datasets with MathVL ensures that MathGLM-Vision models are not just specialized, but retain robust general vision-language understanding skills.
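The architecture pattern described above — a general LLM combined with a specialized vision encoder — can be sketched abstractly. Nothing below reflects the actual MathGLM-Vision implementation; the encoder, dimensions, and fusion scheme are placeholder assumptions that only illustrate the common MLLM pattern of concatenating visual embeddings with text token embeddings before the language model processes them jointly.

```python
# Minimal sketch of the fusion step common to MLLM backbones: a vision
# encoder maps the image to a sequence of embedding vectors, which are
# prepended to the text token embeddings so the language model attends
# over both. All functions and dimensions here are illustrative stand-ins.

def toy_vision_encoder(image, num_patches=4, dim=8):
    # Stand-in for a ViT-style encoder: one embedding per image patch.
    return [[0.0] * dim for _ in range(num_patches)]

def toy_text_embedder(tokens, dim=8):
    # Stand-in for the LLM's token embedding table.
    return [[0.1] * dim for _ in tokens]

def fuse(image, tokens):
    """Concatenate visual and textual embeddings into one input sequence."""
    visual = toy_vision_encoder(image)
    textual = toy_text_embedder(tokens)
    return visual + textual  # the LLM attends over this joint sequence

seq = fuse(image="diagram.png", tokens=["What", "is", "angle", "ABC", "?"])
# 4 visual embeddings + 5 text embeddings = 9 positions in the joint sequence
```

In a real model the two embedding spaces are aligned by a learned projection, and fine-tuning (as in MathGLM-Vision's SFT stage) adapts the joint model to step-by-step mathematical reasoning over such fused sequences.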

In conclusion, the paper offers a comprehensive approach to enhancing mathematical reasoning in MLLMs through a well-curated dataset and careful model fine-tuning, addressing key shortcomings of current methodologies and setting a higher benchmark for multi-modal mathematical problem solving.
