- The paper introduces a dataset (ProbleMathic) and an adversarial prompting framework for probing and improving LLM performance on math word problems.
- It employs a constrained additive approach to generate adversarial examples that add noise while preserving the original problem's solution.
- Fine-tuning models on these adversarial datasets improves LLM robustness by about 8%, enhancing their practical reliability in real-world scenarios.
Introduction
The paper explores the susceptibility of LLMs to irrelevant information when solving Math Word Problems (MWPs). Traditional datasets often lack the complexity of real-world problems, which include extraneous data that can mislead models. This research introduces a new dataset, ProbleMathic, and a novel prompting framework to examine and enhance the robustness of LLMs against such numerical noise.
Proposed Framework and Dataset
The ProbleMathic Dataset
To systematically analyze LLMs' vulnerability to irrelevant information, the paper introduces ProbleMathic, a dataset composed of both adversarial and non-adversarial MWPs. The dataset includes problems of varying difficulty — simple problems requiring basic arithmetic and complex problems involving rates and proportions. ProbleMathic challenges LLMs to discern and focus on relevant information while ignoring distractions.
Adversarial Data Generation
The paper details a constrained additive approach to create adversarial variants of MWPs by introducing noise while keeping the original solution intact. This adversarial augmentation involves adding unrelated numerical variables, ensuring they don't alter the problem's original query or solution. By doing so, the study evaluates an LLM's reasoning capabilities beyond simple numerical manipulation.
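The constrained additive idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's actual implementation: the template sentences, function name, and parameters below are hypothetical. The key constraint is that perturbation is purely additive, so the original facts and the question (and therefore the answer) survive verbatim.

```python
import random

# Hypothetical noise templates: each mentions an unrelated entity and a number,
# so the added quantity cannot participate in the original solution.
NOISE_TEMPLATES = [
    "Her neighbor owns {n} bicycles.",
    "A nearby shop sold {n} umbrellas that day.",
    "The town library holds {n} magazines.",
]

def add_numerical_noise(body: str, question: str,
                        rng: random.Random, k: int = 1) -> str:
    """Insert k irrelevant numeric sentences between the problem body
    and the final question.

    Constrained additive property: the body and question are kept
    verbatim, so the original query and its solution are unchanged.
    """
    noise = [rng.choice(NOISE_TEMPLATES).format(n=rng.randint(2, 99))
             for _ in range(k)]
    return " ".join([body] + noise + [question])

if __name__ == "__main__":
    rng = random.Random(0)
    body = "Maya has 4 boxes with 6 pencils in each box."
    question = "How many pencils does Maya have?"
    print(add_numerical_noise(body, question, rng, k=2))
```

Because the noise is appended rather than substituted, a correct solver should produce the same answer for the clean and adversarial variants, which is exactly the invariance the dataset is built to test.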
Experimental Setup and Results
Fine-tuning LLMs
The study fine-tunes models such as Qwen-2 and Mistral on the adversarial variants of ProbleMathic. After fine-tuning, accuracy on adversarial MWPs improves by roughly 8%, suggesting greater robustness to irrelevant data.
Generalization to the GSM-8K-Adv Benchmark
To assess the generalizability of their approach, the authors introduce GSM-8K-Adv, an adversarial counterpart to the GSM-8K benchmark. Evaluation on GSM-8K-Adv shows that LLMs suffer a performance drop in the presence of adversarial elements, further confirming that the models struggle with irrelevant information.
Practical Implications
The findings have significant implications for deploying LLMs in domains where precise problem-solving is critical, such as education or scientific computing. By improving LLMs' ability to ignore numerical noise, this research points towards more reliable LLMs that can operate proficiently in real-world scenarios laden with irrelevant data.
Conclusion
The paper presents a comprehensive framework for generating adversarial math problems and improving LLMs' robustness against them. Fine-tuning models with adversarial samples enhances their performance, indicating a viable path for developing more reliable AI systems capable of nuanced reasoning. The introduction of datasets like ProbleMathic and GSM-8K-Adv sets a new standard for evaluating and boosting the reasoning capabilities of LLMs in mathematics.