
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

Published 9 Aug 2024 in cs.LG and cs.AI | (2408.07089v1)

Abstract: Recent advancements in Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT) methods have greatly enhanced LLMs' mathematical reasoning capabilities, facilitating their integration into instruction tuning datasets with LLMs. However, existing methods for large-scale dataset creation require substantial seed data and high computational costs for data synthesis, posing significant challenges for scalability. We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning. The construction pipeline emphasizes decoupling numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependency on specific numerical values. Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH. These fine-tuned models showed significant relative improvements on both in-domain and out-of-domain benchmarks, ranging from 184.7% to 514.3% on average. Additionally, these models exhibited high robustness on the GSM8K+ and MATH+ benchmarks, which are enhanced versions of the test sets with simple number variations. InfinityMATH ensures that models are more versatile and effective across a broader range of mathematical problems. The data is available at https://huggingface.co/datasets/flagopen/InfinityMATH.


Summary

  • The paper presents InfinityMATH, a novel dataset that decouples numerical values from problems to create universal templates for improved LLM reasoning.
  • The paper details a multi-step pipeline leveraging models like GPT-4 to efficiently generate and augment instruction tuning data while reducing computational costs.
  • The paper demonstrates robust improvements, with fine-tuned models achieving up to 316% better performance on GSM8K and substantial gains on out-of-domain benchmarks.

InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

The paper "InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning" presents a novel approach for improving the mathematical reasoning capabilities of LLMs through the introduction of a scalable dataset, InfinityMATH. The authors offer a precise methodology that addresses existing limitations in data scalability and computational efficiency.

Problem Statement

The advancement of Chain-of-Thought (CoT) and Program-of-Thought (PoT) methodologies has significantly improved LLMs' mathematical reasoning abilities. However, existing methods for large-scale dataset creation are constrained by substantial seed data requirements and high computational costs. This paper introduces InfinityMATH, a dataset designed to enable scalable instruction tuning for programmatic mathematical reasoning.

Methodology

Data Synthesis

The key innovation in building InfinityMATH is decoupling numerical values from mathematical problems to create "universal templates." This approach allows the synthesis of number-independent programs, which can then be populated with varied numerical values to create extensive datasets efficiently. The authors utilized a multi-step pipeline leveraging LLMs like GPT-4 to generate programs from generalized mathematical problems. Each data point in InfinityMATH includes a generalized problem statement and a corresponding programmatic solution, formatted as function call-based programs with detailed docstrings and comments.
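To make the idea concrete, here is a minimal sketch of what a number-independent "universal template" might look like. The problem family and all names are illustrative assumptions, not actual entries from the InfinityMATH dataset:

```python
# Hypothetical universal template: the problem's numeric values are lifted
# into function parameters, so the program logic is number-independent.

def solution(apples_start: int, apples_eaten: int, apples_bought: int) -> int:
    """How many apples remain after eating some and buying more?

    Because the numbers are arguments rather than literals, the same
    program solves every instance of this problem family.
    """
    remaining = apples_start - apples_eaten  # apples left after eating
    total = remaining + apples_bought        # apples after buying more
    return total

# One concrete instantiation of the template:
print(solution(23, 20, 6))  # 23 - 20 + 6 = 9
```

A template like this pairs naturally with a generalized problem statement whose numbers are placeholders, matching the dataset's function call-based format with docstrings and comments.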

Scaling with Data Augmentation

InfinityMATH's construction pipeline allows for significant scaling without further dependence on LLMs. By abstracting variables and iterating over multiple numerical substitutions, the dataset effectively transforms smaller sets of problems into a large and varied dataset. This approach is not only cost-efficient but also enhances the robustness of the trained models by exposing them to diverse numerical variations.

Experimental Results

The effectiveness of InfinityMATH was benchmarked using fine-tuning experiments on LLMs such as Llama2, CodeLlama, and Aquila2. These fine-tuned models were evaluated on both in-domain and out-of-domain benchmarks, including GSM8K, MATH, and several others. The models showed significant performance improvements:

  • On GSM8K, models fine-tuned with InfinityMATH exhibited improvements ranging from 150.40% to 316.44%.
  • On out-of-domain evaluation sets like SVAMP and SimulEq, improvements were also substantial, with some reaching over 1000%.
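For clarity, the figures above are relative improvements, i.e. the gain over the baseline expressed as a fraction of the baseline. A quick sketch (the accuracies below are illustrative, not scores from the paper):

```python
def relative_improvement(baseline: float, tuned: float) -> float:
    """Relative gain in percent: 100 * (tuned - baseline) / baseline."""
    return 100.0 * (tuned - baseline) / baseline

# Illustrative: a baseline of 14.6% accuracy rising to 60.8% accuracy
# corresponds to roughly a 316% relative improvement.
print(round(relative_improvement(14.6, 60.8), 1))

# Doubling a baseline score is a 100% relative improvement, which is why
# relative gains "over 1000%" are possible when the baseline is very low.
print(relative_improvement(2.0, 24.0))
```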

Implications and Future Directions

Practical Implications

The practical implications of this paper are manifold. The ability to scale mathematical reasoning datasets efficiently allows academic and industry researchers to train more robust LLMs without prohibitive computational costs. The robustness of models trained on InfinityMATH ensures that they perform well across a broader range of mathematical problems, contributing to their applicability in various real-world scenarios requiring complex mathematical reasoning.

Theoretical Implications

Theoretically, InfinityMATH addresses a critical gap in the logical consistency of LLM-generated programs. By decoupling numerical values, the paper reveals how slight numerical variations can disrupt program logic, thereby highlighting the importance of robust dataset design. This approach not only enhances the model's accuracy but also provides a framework for future datasets aiming to address similar issues in other domains.
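The failure mode behind the GSM8K+/MATH+ number-variation tests can be illustrated as follows (a constructed example, not from the paper): a solution with numbers baked into its logic answers only the original instance, while a number-decoupled template transfers to any perturbed instance.

```python
def brittle_solution() -> int:
    # Numbers hard-coded into the logic: correct only for the
    # original problem instance.
    return 23 - 20 + 6

def template_solution(start: int, eaten: int, bought: int) -> int:
    # Same logic with numbers decoupled: correct for any instance
    # of the problem family, including number-varied test versions.
    return start - eaten + bought

original = (23, 20, 6)
perturbed = (31, 14, 9)   # a number-varied version of the same problem

print(brittle_solution() == template_solution(*original))   # True
print(brittle_solution() == template_solution(*perturbed))  # False
```

Training on many numerical instantiations of each template pushes models toward the second behavior, which is consistent with the robustness the paper reports on GSM8K+ and MATH+.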

Future Developments

Future research could explore the extension of InfinityMATH's methodology to other domains, such as scientific computing or financial modeling. Additionally, integrating advanced data augmentation techniques could further enhance the dataset's diversity and robustness. Continuous evaluation and enhancement of logical consistency in the generated programs would remain a critical focus area.

Conclusion

In conclusion, the paper "InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning" introduces a novel and practical approach to improving LLMs' mathematical reasoning capabilities. Through innovative data synthesis and augmentation techniques, InfinityMATH addresses scalability challenges and enhances the robustness of programmatic solutions. The significant performance improvements demonstrated in fine-tuning experiments underscore the dataset's effectiveness and potential for broader applications in AI and machine learning.
