MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

Published 24 Feb 2024 in cs.CL | (2402.15861v5)

Abstract: Math word problems are critical K-8 educational tools, but writing them is time consuming and requires extensive expertise. To be educational, problems must be solvable, have accurate answers, and, most importantly, be educationally appropriate. We propose that LLMs have potential to support K-8 math education by automatically generating word problems. However, evaluating educational appropriateness is hard to quantify. We fill this gap by having teachers evaluate problems generated by LLMs, who find existing models and data often fail to be educationally appropriate. We then explore automatically generating educational word problems, ultimately using our expert annotations to finetune a 70B LLM. Our model, MATHWELL, is the first K-8 word problem generator targeted at educational appropriateness. Further expert studies find MATHWELL generates problems far more solvable, accurate, and appropriate than public models. MATHWELL also matches GPT-4's problem quality while attaining more appropriate reading levels for K-8 students and avoiding generating harmful questions.


Summary

  • The paper introduces a novel method for generating K-8 math word problems using expert teacher annotations and finetuning of Llama-2.
  • The methodology involves a two-stage finetuning process: first with general math QA datasets, then with expert-curated outputs to ensure solvability, accuracy, and appropriateness.
  • Results show that 74% of MATHWELL-generated problems are human-verified as solvable, accurate, and appropriate, approximately 94.9% of GPT-4's performance, while better matching K-8 reading levels.


Introduction

The paper "MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations" proposes a novel approach to automatically generating educational math word problems that are suitable for K-8 students. The authors address the demand for customizable word problems in educational settings, emphasizing the importance of problems being solvable, accurate, and appropriate. Existing datasets lacked the necessary annotations for training word problem generators that meet these criteria, thus the authors curated a high-quality synthetic training dataset through expert annotations to finetune the Llama-2 (70B) LLM. This research culminated in the development of MATHWELL, categorized as a context-free word problem generator.

Methodology

The methodology centers on finetuning a pre-existing LLM to generate word problems that fulfill three critical criteria: solvability, accuracy, and appropriateness. The approach is split into two stages:

  1. Initial Finetuning: Utilization of existing datasets, unannotated for the tailored criteria but structured to handle math QA tasks, for a first round of finetuning Llama-2.
  2. Expert-Annotated Dataset: Generation of a second, more focused dataset from the initially finetuned model's outputs, keeping those that human evaluators deemed solvable, accurate, and appropriate. MATHWELL was then further finetuned on this dataset (a minimal finetuning sketch follows Figure 1).

    Figure 1: MATHWELL is a finetuned LLM that generates educational math word problems that are solvable, accurate, and appropriate.
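
The sketch below illustrates how the second, expert-annotated finetuning stage could be implemented with Hugging Face transformers and peft (QLoRA-style 4-bit finetuning). It assumes a JSONL file of prompt/completion pairs built from the expert-verified outputs; the file name, hyperparameters, and LoRA settings are illustrative assumptions, not the authors' exact configuration.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-70b-hf"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

# Load the base model in 4-bit and attach LoRA adapters so only a small
# fraction of parameters is trained.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical file: one JSON object per line with "prompt" and "completion"
# fields taken from the expert-verified problems and their solutions.
data = load_dataset("json", data_files="sgsm_expert_verified.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + example["completion"] + tok.eos_token
    return tok(text, truncation=True, max_length=1024)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mathwell-sft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=3,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm=False),
)
trainer.train()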

MATHWELL's design incorporates Program of Thought (PoT) solutions rather than Chain of Thought (CoT): each solution is expressed as executable Python code whose return value is the answer, rather than as natural-language reasoning steps. This decision stems from recent findings suggesting that PoT solutions enhance performance on open-ended questions, the category into which K-8 word problems fall.
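
As a concrete, hypothetical illustration of the PoT format: a generated item pairs a natural-language word problem with a short Python program whose return value is the answer, so correctness can be checked by executing the code rather than parsing a written explanation.

# Question (made up for illustration): "A baker makes 24 cupcakes and packs
# them into boxes of 6. How many boxes does she fill?"
def solution():
    cupcakes = 24          # total cupcakes baked
    per_box = 6            # cupcakes per box
    boxes = cupcakes // per_box
    return boxes

assert solution() == 4     # executing the program verifies the answer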

SGSM Dataset Generation

Following the second finetuning stage, MATHWELL was used to create the Synthetic Grade School Math (SGSM) dataset, consisting of 20,490 word problems. Human expert annotation confirmed that 74% of sampled problems simultaneously meet the criteria of solvability, accuracy, and appropriateness. A comprehensively labeled subset is also provided, which can reduce the human annotation labor required by future systems.

Figure 2: MATHWELL training and SGSM generation process. SFT denotes supervised finetuning and MaC refers to outputs meeting all criteria.
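
A rough sketch of how such a synthetic set could be generated and pre-filtered appears below, assuming the finetuned checkpoint is available locally. The checkpoint path, prompt, and execution-based solvability check are illustrative assumptions rather than the authors' documented pipeline; human annotation remains the final arbiter of the three criteria.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/mathwell-checkpoint"   # hypothetical local path
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto",
                                             torch_dtype=torch.bfloat16)

prompt = "Write a grade school math word problem and a Python function that solves it.\n"

def generate_candidate():
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.9)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def executes_to_number(text):
    # Crude automatic check: extract the generated solution() and run it.
    match = re.search(r"def solution\(\):.*", text, re.DOTALL)
    if not match:
        return False
    namespace = {}
    try:
        exec(match.group(0), namespace)   # only safe in a sandboxed environment
        result = namespace["solution"]()
    except Exception:
        return False
    return isinstance(result, (int, float))

candidates = [generate_candidate() for _ in range(8)]
kept = [c for c in candidates if executes_to_number(c)]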

Results and Evaluation

The paper reports that MATHWELL's generated problems achieve a high degree of compliance with the educational criteria, outperforming most open-source models and reaching approximately 94.9% of GPT-4's performance. In terms of readability, MATHWELL-generated questions align better with appropriate grade-level reading standards for K-8 students than some existing datasets.

Figure 3: Flesch-Kincaid grade level (FKGL) distribution of training datasets. Dotted lines show the mean for each dataset.

Figure 4: Flesch-Kincaid grade level (FKGL) distribution of model generations. Dotted lines show the mean for each model.

A comprehensive comparison of MATHWELL with other models indicates its strength in generating well-rounded educational content. These comparisons use metrics such as BERTScore, perplexity, and reading-level measures like Flesch-Kincaid Grade Level (FKGL) and New Dale-Chall (NDC), reinforcing MATHWELL's alignment with educational requirements.
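
For reference, the readability and similarity metrics named above are available in common Python packages; the snippet below is generic usage with made-up example sentences, not the paper's evaluation code.

import textstat
from bert_score import score

question = ("A baker makes 24 cupcakes and packs them into boxes of 6. "
            "How many boxes does she fill?")
reference = ("There are 24 cupcakes placed into boxes of 6 cupcakes each. "
             "How many boxes are needed?")

fkgl = textstat.flesch_kincaid_grade(question)          # Flesch-Kincaid grade level
ndc = textstat.dale_chall_readability_score(question)   # Dale-Chall readability
precision, recall, f1 = score([question], [reference], lang="en")  # BERTScore

print(f"FKGL: {fkgl:.1f}  NDC: {ndc:.2f}  BERTScore F1: {f1.item():.3f}")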

Discussion

The approach described in the paper not only offers a viable way to generate topic-specific math problems tailored to student interests but also lays the groundwork for future improvements in automated educational content generation. The framework's open release is a significant contribution to academia and the wider educational sector, enabling further research in context-free problem generation.

Conclusion

MATHWELL represents a significant advance in the automatic generation of educational content for younger students, demonstrating the value of targeted finetuning and expert annotation in building models that meet specific educational needs. Further research and optimization, particularly in aligning generations with finer grade-level distinctions and a broader range of mathematical topics, should remain a priority to increase the model's practical applicability in real-world educational settings.
