
Adversarial Math Word Problem Generation

Published 27 Feb 2024 in cs.CL and cs.AI (arXiv:2402.17916v3)

Abstract: LLMs have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.


Summary

  • The paper introduces an adversarial attack that transforms math word problems into Python ASTs, systematically altering numeric values.
  • The method enforces constraints like positivity and integer preservation to maintain the original problem's structure and difficulty.
  • Experiments reveal significant accuracy drops in LLMs, with up to a 100% attack success rate (ASR) on weaker models and an average improvement of over 60 ASR points compared with rephrasing-based methods.


Abstract and Introduction

The paper addresses the challenge of fairly assessing students' problem-solving abilities in the presence of LLMs. It explores a method for generating adversarial math word problems (MWPs) that LLMs cannot solve, yet which preserve the original problems' structure and difficulty. The approach leverages abstract syntax trees (ASTs) to systematically alter the numeric values in problems, revealing shared vulnerabilities among LLMs, and proposes a cost-effective strategy for attacking high-cost models.

Methodology

Problem Transformation and Adversarial Generation

The proposed method converts solvable MWPs into Python code and then into AST representations. This transformation enables the generation of adversarial examples by altering numeric values under constraints that retain the problem's difficulty.

Figure 1: Method Overview Given a MWP that an LLM can correctly solve, our method first transforms it into Python code... Despite this, we find that the resulting adversarial examples cause LLMs to predict incorrect answers.

  • Code to AST Conversion: The Python-generated solution is transformed into an AST, where variable nodes represent numeric values from the problem. Adversarial examples are generated by modifying these nodes under a set of constraints.
  • Constraints Implementation: Boolean constraints such as positivity, integer type, and proper fraction preservation are employed to ensure adversarial examples maintain logical problem structure.
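The code-to-AST pipeline above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the paper's implementation: the `SOLUTION` snippet, the random value range, and the rejection-sampling loop are all assumptions; the concrete constraint checked here (the final answer stays a positive integer) stands in for the paper's fuller constraint set.

```python
import ast
import random

# Hypothetical Python solution for a MWP (the paper converts each
# problem into code like this before building the AST).
SOLUTION = """
apples = 12
eaten = 5
left = apples - eaten
"""

class NumericMutator(ast.NodeTransformer):
    """Replace integer literals in the AST with fresh random values."""

    def __init__(self, seed, low=1, high=50):
        self.rng = random.Random(seed)
        self.low, self.high = low, high

    def visit_Constant(self, node):
        # Only touch numeric literals; bool is a subclass of int, so exclude it.
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            new = ast.Constant(value=self.rng.randint(self.low, self.high))
            return ast.copy_location(new, node)
        return node

def mutate(source, seed):
    tree = NumericMutator(seed).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # back to runnable Python (3.9+)

def mutate_until_valid(source, result_var, tries=100):
    """Rejection-sample mutations until the executed result satisfies the
    constraints -- here, simply that the answer is a positive integer."""
    for seed in range(tries):
        candidate = mutate(source, seed)
        env = {}
        exec(candidate, env)
        if isinstance(env[result_var], int) and env[result_var] > 0:
            return candidate
    raise RuntimeError("no constraint-satisfying mutation found")
```

Executing the mutated code and checking the result is what lets structural constraints (positivity, integrality) be enforced without re-deriving the problem's logic.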

Results

Efficacy on Various LLMs

Experiments conducted on eight different LLMs, including GPT-4 and MetaMath 70B, demonstrated a significant attack success rate (ASR). Under the most restrictive generation method, M3, weaker models like Llama 2 13B consistently produced incorrect answers.


Figure 2: Human Evaluation (Left)... Transferability (Right)...

The study highlights a stark accuracy drop across all models when faced with adversarial examples, with Mistral 7B and CodeLlama 34B showcasing a 100% ASR. This suggests a profound vulnerability in LLMs’ numeric reasoning capabilities.
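The ASR figures above can be made concrete with a small helper. Note this uses the conventional definition (the fraction of originally solved problems that the attack flips to an incorrect answer); the paper's exact formula is an assumption here.

```python
def attack_success_rate(orig_correct, adv_correct):
    """ASR over the problems the model originally solved: the fraction
    that the adversarial version flips to an incorrect answer.
    Both arguments are parallel lists of booleans, one per problem."""
    flipped = sum(o and not a for o, a in zip(orig_correct, adv_correct))
    eligible = sum(orig_correct)  # only originally-solved problems count
    return flipped / eligible if eligible else 0.0
```

A 100% ASR thus means every problem the model could solve in its original form became unsolvable after the numeric edits.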

  • Baseline Comparison: The proposed method surpasses previous rephrasing attacks in degrading model performance by over 60 ASR points on average.
  • Efficient Targeting of High-Cost Models: The attack on GPT-4 achieved similar ASR while reducing API request calls by up to 90% using adversarial examples from cheaper models, indicating a practical approach to resource-intensive scenarios.
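The cost-saving idea in the last bullet can be sketched as a filtering loop: only candidates that already fool a cheap model are forwarded to the expensive one. The `cheap_solve` and `expensive_solve` callables below are placeholders standing in for API calls to the two models; the filtering logic is a simplified reading of the paper's strategy.

```python
def transfer_attack(candidates, gold_answers, cheap_solve, expensive_solve):
    """Screen adversarial candidates with a cheap model before spending
    queries on the expensive target model.

    candidates:   adversarial problem texts
    gold_answers: the correct numeric answer for each candidate
    *_solve:      placeholder callables mapping a problem to a model's answer
    Returns the candidates that fooled the expensive model, plus the
    number of expensive-model queries actually issued.
    """
    successes, expensive_calls = [], 0
    for problem, gold in zip(candidates, gold_answers):
        if cheap_solve(problem) == gold:
            continue  # cheap model solved it; unlikely to fool the target
        expensive_calls += 1
        if expensive_solve(problem) != gold:
            successes.append(problem)
    return successes, expensive_calls
```

Because vulnerabilities transfer between models, most candidates that survive the cheap filter also fool the expensive model, which is what lets the query budget shrink sharply.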

Analysis and Discussion

Human Evaluation and Feature Impact

The human evaluation confirmed that the adversarial examples generated by M3 preserve coherence and difficulty. Regression analysis further showed that LLM performance depends on specific numerical ranges and on operation complexity.

Transferability of Attacks

Weaker models shared common vulnerabilities, with consistent attack transferability observed between them. This points to universal weaknesses in LLMs' arithmetic problem-solving and highlights concrete targets for robustness improvements.

Conclusion

The paper presents a novel adversarial attack methodology to stress-test LLMs' mathematical capabilities, ensuring educational integrity through robust problem generation. This approach not only exposes inherent limitations in current AI models but also informs future developments aimed at enhancing model robustness against adversarial influences. Further work should explore adapting these methods for more complex problem types, potentially aligning LLM capabilities with human-like reasoning standards in educational settings.
