MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

Published 24 Feb 2024 in cs.CL | (2402.15861v5)

Abstract: Math word problems are critical K-8 educational tools, but writing them is time consuming and requires extensive expertise. To be educational, problems must be solvable, have accurate answers, and, most importantly, be educationally appropriate. We propose that LLMs have potential to support K-8 math education by automatically generating word problems. However, evaluating educational appropriateness is hard to quantify. We fill this gap by having teachers evaluate problems generated by LLMs, who find existing models and data often fail to be educationally appropriate. We then explore automatically generating educational word problems, ultimately using our expert annotations to finetune a 70B LLM. Our model, MATHWELL, is the first K-8 word problem generator targeted at educational appropriateness. Further expert studies find MATHWELL generates problems far more solvable, accurate, and appropriate than public models. MATHWELL also matches GPT-4's problem quality while attaining more appropriate reading levels for K-8 students and avoiding generating harmful questions.


Summary

  • The paper introduces a novel method for generating K-8 math word problems using expert teacher annotations and finetuning of Llama-2.
  • The methodology involves a two-stage finetuning process: first with general math QA datasets, then with expert-curated outputs to ensure solvability, accuracy, and appropriateness.
  • Results show that 74% of MATHWELL-generated problems are human-verified as solvable, accurate, and appropriate, approximately 94.9% of GPT-4's performance, while better matching K-8 reading levels.


Introduction

The paper "MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations" proposes a novel approach to automatically generating educational math word problems that are suitable for K-8 students. The authors address the demand for customizable word problems in educational settings, emphasizing the importance of problems being solvable, accurate, and appropriate. Existing datasets lacked the necessary annotations for training word problem generators that meet these criteria, thus the authors curated a high-quality synthetic training dataset through expert annotations to finetune the Llama-2 (70B) LLM. This research culminated in the development of MATHWELL, categorized as a context-free word problem generator.

Methodology

The methodology centers on finetuning a pre-existing LLM to generate word problems that fulfill three critical criteria: solvability, accuracy, and appropriateness. The approach is split into two stages:

  1. Initial Finetuning: Utilization of existing datasets, unannotated for the tailored criteria but structured to handle math QA tasks, for a first round of finetuning Llama-2.
  2. Expert-Annotated Dataset: Generation of a second, more focused dataset from the initially finetuned model's outputs, keeping those that human evaluators deemed solvable, accurate, and appropriate. MATHWELL was then further finetuned on this dataset (a minimal finetuning sketch follows Figure 1).

    Figure 1: MATHWELL is a finetuned LLM that generates educational math word problems that are solvable, accurate, and appropriate.
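
The sketch below illustrates how the second, expert-annotated finetuning stage could be implemented with Hugging Face transformers and peft (QLoRA-style 4-bit finetuning). It assumes a JSONL file of prompt/completion pairs built from the expert-verified outputs; the file name, hyperparameters, and LoRA settings are illustrative assumptions, not the authors' exact configuration.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-70b-hf"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

# Load the base model in 4-bit and attach LoRA adapters so only a small
# fraction of parameters is trained.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical file: one JSON object per line with "prompt" and "completion"
# fields taken from the expert-verified problems and their solutions.
data = load_dataset("json", data_files="sgsm_expert_verified.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + example["completion"] + tok.eos_token
    return tok(text, truncation=True, max_length=1024)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mathwell-sft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=3,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm=False),
)
trainer.train()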

MATHWELL's design incorporates Program of Thought (PoT) solutions rather than Chain of Thought (CoT): each solution is expressed as executable Python code whose return value is the answer, rather than as natural-language reasoning steps. This decision stems from recent findings suggesting that PoT solutions enhance performance on open-ended questions, the category into which K-8 word problems fall.
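
As a concrete, hypothetical illustration of the PoT format: a generated item pairs a natural-language word problem with a short Python program whose return value is the answer, so correctness can be checked by executing the code rather than parsing a written explanation.

# Question (made up for illustration): "A baker makes 24 cupcakes and packs
# them into boxes of 6. How many boxes does she fill?"
def solution():
    cupcakes = 24          # total cupcakes baked
    per_box = 6            # cupcakes per box
    boxes = cupcakes // per_box
    return boxes

assert solution() == 4     # executing the program verifies the answer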

SGSM Dataset Generation

Following the second finetuning stage, MATHWELL was used to create the Synthetic Grade School Math (SGSM) dataset, consisting of 20,490 word problems. Human expert annotation confirmed that 74% of sampled problems simultaneously meet the criteria of solvability, accuracy, and appropriateness. A comprehensively labeled subset is also provided, which can reduce the human annotation labor required by future systems.

Figure 2: MATHWELL training and SGSM generation process. SFT denotes supervised finetuning and MaC refers to outputs meeting all criteria.
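
A rough sketch of how such a synthetic set could be generated and pre-filtered appears below, assuming the finetuned checkpoint is available locally. The checkpoint path, prompt, and execution-based solvability check are illustrative assumptions rather than the authors' documented pipeline; human annotation remains the final arbiter of the three criteria.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/mathwell-checkpoint"   # hypothetical local path
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto",
                                             torch_dtype=torch.bfloat16)

prompt = "Write a grade school math word problem and a Python function that solves it.\n"

def generate_candidate():
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.9)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def executes_to_number(text):
    # Crude automatic check: extract the generated solution() and run it.
    match = re.search(r"def solution\(\):.*", text, re.DOTALL)
    if not match:
        return False
    namespace = {}
    try:
        exec(match.group(0), namespace)   # only safe in a sandboxed environment
        result = namespace["solution"]()
    except Exception:
        return False
    return isinstance(result, (int, float))

candidates = [generate_candidate() for _ in range(8)]
kept = [c for c in candidates if executes_to_number(c)]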

Results and Evaluation

The paper reports that MATHWELL's generated problems achieve a high degree of compliance with the educational criteria, outperforming most open-source models and reaching approximately 94.9% of GPT-4's performance. In terms of readability, MATHWELL-generated questions align better with appropriate grade-level reading standards for K-8 students than some existing datasets.

Figure 3: Flesch-Kincaid grade level (FKGL) distribution of training datasets. Dotted lines show the mean for each dataset.

Figure 4: Flesch-Kincaid grade level (FKGL) distribution of model generations. Dotted lines show the mean for each model.

A comprehensive comparison of MATHWELL with other models indicates its strength in generating well-rounded educational content. These comparisons use metrics such as BERTScore, perplexity, and reading-level measures like Flesch-Kincaid Grade Level (FKGL) and New Dale-Chall (NDC), reinforcing MATHWELL's alignment with educational requirements.
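
For reference, the readability and similarity metrics named above are available in common Python packages; the snippet below is generic usage with made-up example sentences, not the paper's evaluation code.

import textstat
from bert_score import score

question = ("A baker makes 24 cupcakes and packs them into boxes of 6. "
            "How many boxes does she fill?")
reference = ("There are 24 cupcakes placed into boxes of 6 cupcakes each. "
             "How many boxes are needed?")

fkgl = textstat.flesch_kincaid_grade(question)          # Flesch-Kincaid grade level
ndc = textstat.dale_chall_readability_score(question)   # Dale-Chall readability
precision, recall, f1 = score([question], [reference], lang="en")  # BERTScore

print(f"FKGL: {fkgl:.1f}  NDC: {ndc:.2f}  BERTScore F1: {f1.item():.3f}")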

Discussion

The approach described in the paper not only offers a viable way to generate topic-specific math problems tailored to student interests but also lays the groundwork for future improvements in automated educational content generation. The framework's open release is a significant contribution to academia and the wider educational sector, enabling further research in context-free problem generation.

Conclusion

MATHWELL represents a significant advance in the automatic generation of educational content for younger students, demonstrating the value of targeted finetuning and expert annotation in building models that meet specific educational needs. Further research and optimization, particularly in aligning generations with finer grade-level distinctions and a broader range of mathematical topics, should remain a priority to increase the model's practical applicability in real-world educational settings.
