- The paper introduces a dataset (ProbleMathic) and an adversarial prompting framework for probing and improving LLM performance on math word problems.
- It employs a constrained additive approach to generate adversarial examples that add noise while preserving the original problem's solution.
- Fine-tuning models on these adversarial datasets improves LLM robustness by about 8%, enhancing their practical reliability in real-world scenarios.
Introduction
The paper explores the susceptibility of LLMs to irrelevant information when solving Math Word Problems (MWPs). Traditional datasets often lack the complexity of real-world problems, which include extraneous data that can mislead models. This research introduces a new dataset, ProbleMathic, and a novel prompting framework to examine and enhance the robustness of LLMs against such numerical noise.
Proposed Framework and Dataset
The ProbleMathic Dataset
To systematically analyze LLMs' vulnerability to irrelevant information, the paper introduces ProbleMathic, a dataset composed of both adversarial and non-adversarial MWPs. The dataset includes problems of varying difficulty — simple problems requiring basic arithmetic and complex problems involving rates and proportions. ProbleMathic challenges LLMs to discern and focus on relevant information while ignoring distractions.
Adversarial Data Generation
The paper details a constrained additive approach to create adversarial variants of MWPs by introducing noise while keeping the original solution intact. This adversarial augmentation involves adding unrelated numerical variables, ensuring they don't alter the problem's original query or solution. By doing so, the study evaluates an LLM's reasoning capabilities beyond simple numerical manipulation.
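The constrained additive idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's actual implementation: the template sentences, function name, and parameters below are hypothetical. The key constraint is that perturbation is purely additive, so the original facts and the question (and therefore the answer) survive verbatim.

```python
import random

# Hypothetical noise templates: each mentions an unrelated entity and a number,
# so the added quantity cannot participate in the original solution.
NOISE_TEMPLATES = [
    "Her neighbor owns {n} bicycles.",
    "A nearby shop sold {n} umbrellas that day.",
    "The town library holds {n} magazines.",
]

def add_numerical_noise(body: str, question: str,
                        rng: random.Random, k: int = 1) -> str:
    """Insert k irrelevant numeric sentences between the problem body
    and the final question.

    Constrained additive property: the body and question are kept
    verbatim, so the original query and its solution are unchanged.
    """
    noise = [rng.choice(NOISE_TEMPLATES).format(n=rng.randint(2, 99))
             for _ in range(k)]
    return " ".join([body] + noise + [question])

if __name__ == "__main__":
    rng = random.Random(0)
    body = "Maya has 4 boxes with 6 pencils in each box."
    question = "How many pencils does Maya have?"
    print(add_numerical_noise(body, question, rng, k=2))
```

Because the noise is appended rather than substituted, a correct solver should produce the same answer for the clean and adversarial variants, which is exactly the invariance the dataset is built to test.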
Experimental Setup and Results
Fine-tuning LLMs
The study fine-tunes models such as Qwen-2 and Mistral on the adversarial variants of ProbleMathic. After fine-tuning, accuracy on adversarial MWPs improves by roughly 8%, suggesting greater robustness to irrelevant data.
Generalization to the GSM-8K-Adv Benchmark
To assess the generalizability of their approach, the authors introduce GSM-8K-Adv, an adversarial counterpart to the GSM-8K benchmark. Evaluation on GSM-8K-Adv shows that LLMs suffer a performance drop in the presence of adversarial elements, further confirming that the models struggle with irrelevant information.
Practical Implications
The findings have significant implications for deploying LLMs in domains where precise problem-solving is critical, such as education or scientific computing. By improving LLMs' ability to ignore numerical noise, this research points towards more reliable LLMs that can operate proficiently in real-world scenarios laden with irrelevant data.
Conclusion
The paper presents a comprehensive framework for generating adversarial math problems and improving LLMs' robustness against them. Fine-tuning models with adversarial samples enhances their performance, indicating a viable path for developing more reliable AI systems capable of nuanced reasoning. The introduction of datasets like ProbleMathic and GSM-8K-Adv sets a new standard for evaluating and boosting the reasoning capabilities of LLMs in mathematics.