Mathesis: Towards Formal Theorem Proving from Natural Languages

Published 8 Jun 2025 in cs.AI | (2506.07047v1)

Abstract: Recent advances in LLMs show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.

Abstract PDF Upgrade to Chat

Summary

The paper introduces an end-to-end theorem proving pipeline that autoformalizes natural language into formal Lean code using reinforcement learning.
It employs innovative evaluation methods with LeanScorer, combining syntactic checks and fuzzy semantic assessments for improved accuracy.
The system demonstrates state-of-the-art performance on benchmarks like Gaokao-Formal, achieving a 22% improvement in pass rates.

Mathesis: Towards Formal Theorem Proving from Natural Languages

The paper "Mathesis: Towards Formal Theorem Proving from Natural Languages" addresses the challenge of automating theorem proving from natural language statements by introducing an end-to-end pipeline combining autoformalization with formal theorem proving techniques. This represents a significant advancement in the field by attempting to bridge the gap between informal problem statements and formal theorem proving systems.

End-to-End Theorem Proving Pipeline

The proposed pipeline begins with an autoformalization stage using Mathesis-Autoformalizer, which translates informal natural language problem statements into formal Lean code. This component is novel in that it utilizes reinforcement learning to enhance the translation quality dynamically, leveraging both syntactic correctness and semantic preservation as reward signals.

Figure 1: Performance of end-to-end theorem proving on the MiniF2F test set.

The system progresses to the validation phase, where formalized statements undergo syntactic checks with Lean and semantic assessments using the LeanScorer framework, ensuring both mechanical validity and alignment with the original problem intent. The final stage involves a prover generating formal proofs from the validated formal statements, utilizing Mathesis-Prover, trained with expert iteration to enhance its problem-solving capabilities. The integration of LeanScorer for nuanced evaluation represents an innovative approach to assessing autoformalization correctness.

Mathesis-Autoformalizer: Advancing Autoformalization with Reinforcement Learning

Mathesis-Autoformalizer leverages Group Relative Policy Optimization (GRPO) to refine the translation process continuously, using rewards from syntactic checks and semantic evaluations. This strategy introduces a dynamic learning framework that allows the autoformalizer to incrementally improve its capabilities, taking advantage of both syntactic verification and semantic alignment.

The Hierarchical Preference Optimization technique further refines the translation by aligning both local and global objectives. By incorporating Direct Preference Optimization (DPO) in its training, Mathesis-Autoformalizer improves task-grounded output preferences, further enhancing translation fidelity and capability.

LeanScorer: Nuanced Evaluation of Autoformalization Correctness

LeanScorer introduces a robust evaluation system for formalizations, focusing on semantic integrity. Utilizing a two-stage process, it first decomposes problem statements into subtasks and then employs a three-tier rating by LLM for assessment. The Sugeno Fuzzy Integral aggregates these categorical ratings, providing a continuous correctness score that offers flexibility in evaluation precision, accommodating nuanced LLM judgments while enforcing rigorous standards.

Figure 2: Overview of our LeanScorer evaluation method.

Benchmark: Gaokao-Formal

To evaluate the efficacy of autoformalization and end-to-end theorem proving systems, the paper introduces the Gaokao-Formal benchmark. It comprises 488 complex problems from China's National Gaokao exam, providing a rigorous testbed for assessing autoformalization complexity and capability across diverse domains.

Experimental Results

The Mathesis pipeline achieves state-of-the-art performance on Gaokao-Formal, demonstrating significant improvements over existing baselines. The autoformalizer outperforms previous models by 22% on Gaokao-Formal pass rates, while the end-to-end system achieves 64% accuracy on MiniF2F with pass@32 and reaches a benchmark of 18% on Gaokao-Formal.

Figure 3: Performance improvement by category.

Conclusion

The paper succeeds in advancing end-to-end theorem proving by integrating formal reasoning capabilities from informal natural language inputs, establishing a new benchmark with Gaokao-Formal, and delivering enhanced performance measures with Mathesis-Autoformalizer and Mathesis-Prover. This work paves the way for future research into complex mathematical problem solving directly from natural language inputs, addressing both theoretical and practical challenges.

Markdown Report Issue