
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts

Published 12 Feb 2024 in cs.CL, cs.AI, and cs.LG (arXiv:2402.07625v6)

Abstract: We present Autonomous Data Selection (AutoDS), a method that leverages base LLMs themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.


Summary

  • The paper introduces a meta-prompted zero-shot verification methodology that autonomously selects high-quality mathematical texts without human-annotated data.
  • It employs a novel score function based on softmax outputs over tokens to evaluate the mathematical rigor and educational value of content.
  • Empirical results with a 7B parameter Mistral model show significant improvements on the MATH dataset, demonstrating efficient and effective data curation.

Enhancing Mathematical Reasoning in AI through Autonomous Data Selection: Insights from AutoMathText

Introduction to AutoMathText

In the evolving landscape of language modeling, the capability to infuse domain-specific knowledge into AI systems represents a crucial frontier. This is particularly salient in fields such as mathematics, where precise and accurate reasoning is essential. The AutoMathText initiative offers a methodology for autonomously curating high-quality mathematical texts for training purposes. By using meta-prompted LLMs in a zero-shot capacity as verifiers, this approach sidesteps the need for supervised fine-tuning or for classifiers trained on human-annotated data to evaluate content.

Methodological Underpinnings

A primary contribution of the AutoMathText project is its use of base LLMs equipped with meta-prompts for autonomous data evaluation, challenging the traditional reliance on binary classification for data curation. The strategy centers on a score function computed from the softmax output over specific tokens, which yields a continuous, granular assessment of the mathematical quality and educational value of each passage. This allows a curation strategy that goes beyond rudimentary binary filtering and offers a framework for improving the mathematical reasoning capabilities of AI models without extensive human intervention.
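The score function described above can be sketched as a softmax over the logits a base model assigns to affirmative versus negative continuations of a meta-prompt. This is a minimal illustration, not the paper's released code: the function names, the use of exactly two tokens ("YES"/"NO"), and the multiplicative combination of per-question scores are assumptions for clarity.

```python
import math

def lm_score(logit_yes: float, logit_no: float) -> float:
    """Softmax over the logits the base LM assigns to a 'YES' and a 'NO'
    continuation of the meta-prompt, producing a continuous quality score
    in (0, 1) rather than a hard binary label."""
    m = max(logit_yes, logit_no)        # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def composite_score(question_scores: list[float]) -> float:
    """Combine per-question scores (e.g., 'is this mathematical?' and
    'is this educational?') into one value; a product is one natural choice,
    shown here as an illustrative assumption."""
    result = 1.0
    for s in question_scores:
        result *= s
    return result
```

For instance, logits of 2.0 for "YES" and 0.0 for "NO" give a score of about 0.88, while equal logits give exactly 0.5, so borderline passages land in the middle of the scale instead of being forced to one side.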

Empirical Validation

The efficacy of the AutoMathText approach is substantiated through comprehensive experimentation with the 7-billion-parameter Mistral LLM. Significantly, when continually pretrained on the AutoMathText dataset, the model exhibited notable improvements in downstream tasks, specifically on the MATH dataset. This was achieved with far fewer tokens than previous pretraining efforts (the abstract reports roughly a twofold gain in pretraining token efficiency over strong baselines), underscoring the method's efficiency and the high quality of the autonomously selected data.
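The curation step feeding this continual-pretraining pipeline can be sketched as a simple threshold filter over passages that have already been scored by the zero-shot generative classifier. The data layout (a list of text/score pairs) and the 0.6 cutoff are purely illustrative assumptions, not the paper's actual setting:

```python
def select_passages(scored_passages, threshold=0.6):
    """Keep passages whose model-assigned quality score clears a threshold.

    `scored_passages` is a list of (text, score) pairs, where each score is
    a zero-shot generative-classifier output in (0, 1). Because the score is
    continuous, the threshold can be tuned to trade corpus size against
    quality, which a hard binary filter cannot do.
    """
    return [text for text, score in scored_passages if score > threshold]
```

A stricter threshold yields a smaller but denser training corpus, which is the lever behind the token-efficiency gains reported for continual pretraining.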

Theoretical and Practical Implications

From a theoretical standpoint, the use of meta-prompted zero-shot verifiers for autonomous data selection represents a paradigm shift in the pretraining of LLMs for specialized tasks. The method not only enhances pretraining by focusing on the most informative data points but also introduces a scalable mechanism for data evaluation that is free from the human biases that might otherwise influence content selection. Practically, the development and open-sourcing of the AutoMathText dataset catalyze further advances in AI models' ability to comprehend and solve complex mathematical tasks, delineating a path toward more intelligent and autonomous learning systems.

Future Trajectories

While AutoMathText's current implementation focuses on mathematical reasoning, its underlying methodology holds promise for broader applications across various specialized domains. Future explorations could extend this autonomous data selection framework to fields beyond STEM, potentially encompassing literature, history, or even nuanced interdisciplinary studies. Such endeavors would not only broaden the implications of this work but also contribute to the evolution of AI systems capable of engaging with and contributing knowledge across a wide spectrum of human intellectual pursuits.

In sum, AutoMathText marks a significant step forward in the quest to enhance the domain-specific reasoning capabilities of AI systems. By marrying the intrinsic capabilities of LLMs with sophisticated autonomous content evaluation strategies, this approach sets a new benchmark for the development of intelligent systems adept in specialized fields such as mathematics. Moving forward, the broader implications of this methodology for autonomous data curation and model training in varied domains beckon further inquiry and exploration.
