Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

Published 18 Feb 2025 in cs.AI, cs.CL, cs.LG, and cs.PL | (2502.15795v1)

Abstract: Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs, remains a difficult task for state-of-the-art large language models (LLMs). Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of LLMs, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the Autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.

Summary

  • The paper demonstrates that backtranslation with curated prompts significantly improves autoformalization performance on mathematical proofs.
  • Methodologies like on-the-fly and distilled backtranslation generate high-fidelity proof data efficiently, reducing token usage compared to large datasets.
  • Empirical results on the ProofNet benchmark indicate that quality data leads to practical performance gains over traditional, diverse multilingual approaches.

Lean-ing on Quality: High-Quality Data in AutoFormalization

Introduction to AutoFormalization

Autoformalization aims to automate the translation of informal mathematical statements into formal proofs and specifications. LLMs have traditionally been applied to translation tasks because of their strong performance on linguistic problems, yet they often struggle with complex mathematical syntax, which hampers their efficacy in formal theorem proving. The paper presents an approach that applies backtranslation with curated prompts to overcome the scarcity of formal-informal paired datasets, prioritizing data quality over quantity. This methodology enhances LLMs' capabilities by generating high-fidelity proof data, outperforming models fine-tuned on large multilingual datasets such as MMA.
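
The backtranslation idea described above can be sketched in a few lines. The round trip takes a formal Lean statement, renders it into natural language, and translates it back, with the round-trip result compared against the original as a fidelity signal. The `informalize` and `formalize` functions below are hypothetical stand-ins for prompted LLM calls, and exact string match is the crudest possible fidelity check; they are not the paper's implementation.

```python
def informalize(formal_stmt: str) -> str:
    """Hypothetical stand-in for an LLM call that renders Lean into English."""
    # In the paper's setup this would be a prompted model; here we fake it
    # with a reversible transformation so the round trip is demonstrable.
    return f"Natural-language rendering of: {formal_stmt}"

def formalize(informal_stmt: str) -> str:
    """Hypothetical stand-in for an LLM call that renders English back into Lean."""
    return informal_stmt.removeprefix("Natural-language rendering of: ")

def backtranslation_pair(formal_stmt: str) -> tuple[str, str, bool]:
    """Produce an (informal, reconstructed-formal) pair plus a fidelity flag."""
    informal = informalize(formal_stmt)
    reconstructed = formalize(informal)
    # A real pipeline would score fidelity with a learned loss or by proof
    # checking; exact string equality is only the simplest possible signal.
    return informal, reconstructed, reconstructed == formal_stmt

original = "theorem add_comm (a b : ℕ) : a + b = b + a"
informal, back, ok = backtranslation_pair(original)
```

In the on-the-fly variant, the divergence between `back` and `original` would drive a weight update rather than a boolean check.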

Methodologies for Data Generation

The research outlines three primary strategies to increase quality without increasing quantity:

  1. On-The-Fly Backtranslation: This technique dynamically generates paired data during training by translating formal language (FL) examples into informal language (IL) and then back to FL. By iteratively updating model weights based on the divergence between the generated and original FL, it effectively self-generates training data, circumventing the data-scarcity issue. While efficient, it plateaus due to limits in the generating model's capacity.
  2. Distilled Backtranslation: Distilled backtranslation uses a powerful pretrained model, GPT-4, to generate synthetic IL from the FL dataset. Here, few-shot amplification via rich prompts improves informalization quality, producing competitive results even with far fewer tokens than traditional datasets. Two methods are discussed: translating entire theorem proofs, and informalizing individual tactic steps by analyzing proof states before and after tactic application.
  3. Regex-Based Data Capture: Employing regular expressions allows mining specific tactics in Lean code for rudimentary informalization. This process generates large datasets cheaply and increases transparency but compromises depth in informalization quality compared to more sophisticated methods.
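
The regex-based strategy (item 3) can be illustrated concretely. The sketch below mines tactic invocations from a Lean proof script and emits templated English informalizations; the patterns, templates, and example proof are invented for illustration and are not the paper's actual ones.

```python
import re

# Illustrative Lean proof script to mine (not from the paper).
LEAN_PROOF = """
theorem add_zero' (n : ℕ) : n + 0 = n := by
  rw [Nat.add_zero]
  exact rfl
"""

# Hypothetical per-tactic templates for rudimentary informalization.
TACTIC_TEMPLATES = {
    "rw": "Rewrite the goal using the lemma {arg}.",
    "exact": "Close the goal by supplying the term {arg}.",
}

def mine_tactics(proof: str) -> list[str]:
    """Extract tactic lines via regex and render a templated English step for each."""
    steps = []
    for tactic, template in TACTIC_TEMPLATES.items():
        # Match e.g. `rw [Nat.add_zero]` or `exact rfl` at the start of a line.
        for m in re.finditer(rf"^\s*{tactic}\s+(\S.*)$", proof, re.MULTILINE):
            arg = m.group(1).strip().strip("[]")
            steps.append(template.format(arg=arg))
    return steps

steps = mine_tactics(LEAN_PROOF)
```

This kind of template filling is cheap and transparent, which matches the trade-off the paper describes: large volumes of shallow informalizations rather than the richer prose a strong model produces.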

Autoformalization Performance Evaluation

Performance improvements were assessed using the ProofNet benchmark, focusing on fine-tuning a GPT-2 model:

  • Few-Shot Prompting: The GPT-4 MathLib4 dataset, created with few-shot prompts, outperformed the much larger MMA dataset while using a fraction of the tokens, demonstrating that enriched data prompts matter more than dataset size alone.
  • Tactic-Based Method: Informalizing individual tactics demonstrated strong performance, despite cost constraints, highlighting the advantage of modeling proof steps explicitly.
  • On-The-Fly and Regex-Based Results: These methods showed only modest improvements due to simpler informalizations and smaller model capacity, advocating the use of larger models for best results.
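
The few-shot prompting referenced above amounts to prepending a handful of curated (formal, informal) exemplars to each new formal statement before querying the teacher model. The following sketch shows one way such a prompt could be assembled; the exemplars and instruction wording are invented for illustration, not taken from the paper.

```python
# Hypothetical curated (Lean, English) exemplar pairs for few-shot amplification.
FEW_SHOT_EXEMPLARS = [
    ("theorem two_pos : 0 < 2", "The number two is positive."),
    ("theorem one_ne_zero : (1 : ℕ) ≠ 0", "One is not equal to zero."),
]

def build_informalization_prompt(formal_stmt: str) -> str:
    """Prepend curated exemplars to a new formal statement for the teacher model."""
    parts = ["Translate each Lean statement into natural language.\n"]
    for lean, english in FEW_SHOT_EXEMPLARS:
        parts.append(f"Lean: {lean}\nEnglish: {english}\n")
    # The final entry leaves the English side open for the model to complete.
    parts.append(f"Lean: {formal_stmt}\nEnglish:")
    return "\n".join(parts)

prompt = build_informalization_prompt("theorem zero_add (n : ℕ) : 0 + n = n")
```

Curating the exemplars by hand is precisely where the paper locates the quality gain over simply scaling up the dataset.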

Implications and Future Directions

The paper advocates quality-focused data approaches to autoformalization, positing that rich informalization exploits advanced models' strengths better than diverse, large-scale datasets alone. Future research could benefit from deploying larger models, validating full autoformalization (the conversion of entire proofs end to end), and quantifying improvements within interactive theorem-proving environments to verify that generated proofs compile.

Conclusion

Ultimately, by prioritizing data quality derived from strategic prompting and synthesis rather than sheer quantity, the paper demonstrates a resource-efficient pathway toward improving mathematical capability in AI. This approach could substantially reduce the resources required to formalize mathematical proofs, pointing to significant potential for AI in the mathematical sciences.
