
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Published 5 Aug 2025 in cs.LG and cs.AI | (2508.03613v1)

Abstract: We introduce Goedel-Prover-V2, a series of open-source LLMs that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

Summary

  • The paper introduces scaffolded data synthesis and verifier-guided self-correction to iteratively repair and refine proofs in the Lean system.
  • It employs an innovative training pipeline combining long chain-of-thought reasoning, self-correction using Lean compiler feedback, and model averaging to boost performance.
  • Empirical results show a significant accuracy boost, reaching 88.1% pass@32 on MiniF2F (90.4% with self-correction), achieving state-of-the-art performance with reduced model size and compute.

Goedel-Prover-V2: Advancing Formal Theorem Proving via Scaffolded Data Synthesis and Self-Correction

Introduction

Goedel-Prover-V2 presents a significant advancement in automated formal theorem proving, specifically targeting the Lean proof assistant. The model series leverages a combination of long chain-of-thought (CoT) reasoning, scaffolded data synthesis, verifier-guided self-correction, and model averaging to achieve state-of-the-art performance on standard benchmarks, notably MiniF2F and PutnamBench, with substantially reduced model and compute requirements compared to prior systems. The approach demonstrates that high-level formal reasoning can be achieved without resorting to extremely large models or excessive inference budgets.

Framework Innovations

Verifier-Guided Self-Correction

A central innovation is the integration of Lean compiler feedback into the proof generation loop. After an initial proof attempt, the model receives error messages from the Lean verifier, which are then incorporated into the input for subsequent rounds of proof repair. This iterative self-correction process enables the model to diagnose and fix errors in its own outputs, substantially improving synthesis accuracy and sample efficiency. The ablation studies confirm that explicit error messages and retention of previous CoT traces are critical for effective revision.
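The revision loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_proof` and `check_with_lean` are hypothetical stand-ins for the LLM call and the Lean compiler, here replaced by stubs so the control flow is runnable.

```python
# Sketch of the verifier-guided self-correction loop.
# `generate_proof` and `check_with_lean` are hypothetical stand-ins: a real
# system would call the prover LLM and invoke a Lean toolchain here.

def generate_proof(statement, prior_attempt=None, errors=None):
    """Stand-in prover: revises the prior attempt when error feedback exists."""
    if errors is None:
        return "attempt_0"           # first, uncorrected attempt
    return prior_attempt + "+fix"    # revise using compiler feedback

def check_with_lean(proof):
    """Stand-in verifier: accepts only a twice-revised proof."""
    if proof == "attempt_0+fix+fix":
        return True, []
    return False, [f"error: unsolved goals in {proof}"]

def self_correct(statement, max_rounds=3):
    """Generate, verify, and revise until success or the round budget ends."""
    proof, errors = None, None
    for _ in range(max_rounds):
        proof = generate_proof(statement, proof, errors)
        ok, errors = check_with_lean(proof)
        if ok:
            return proof
    return None

print(self_correct("a + b = b + a"))  # -> attempt_0+fix+fix
```

The key design point, per the ablations, is that each revision round receives both the compiler's error messages and the previous chain-of-thought, rather than restarting from scratch.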

Scaffolded Data Synthesis

The training pipeline is augmented by scaffolded data synthesis, which generates synthetic problems of varying difficulty to provide dense learning signals. Two complementary strategies are employed:

  • Formal-based synthesis: When a proof attempt fails, the `extract_goal` tactic in Lean is used to extract unsolved subgoals, which are then formalized as new training statements. Negations of these statements are also included to teach the model to recognize false propositions.
  • Informal-based synthesis: LLMs are prompted to generate simpler subproblems or harder variants in natural language, which are subsequently formalized and filtered for correctness and non-triviality. This process is highly parallelized and leverages majority voting for quality control.

    Figure 1: The informal-based scaffolded data synthesis pipeline, comprising statement generation, formalization and quality checking, and negation/difficulty filtering.
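To make the formal-based strategy concrete, here is an illustrative use of `extract_goal`, a Mathlib tactic that prints the current goal as a standalone theorem statement. The example itself is ours, not from the paper; the printed statement is the kind of artifact the pipeline would collect as a new training problem.

```lean
import Mathlib

-- `extract_goal` prints the goal at this point as a self-contained theorem,
-- which the synthesis pipeline can harvest as a new training statement.
example (a b : ℕ) : a + b = b + a := by
  extract_goal  -- prints something like: theorem extracted_1 (a b : ℕ) : a + b = b + a := sorry
  exact Nat.add_comm a b
```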

Model Averaging

To counteract diversity collapse in later training stages, model averaging is applied after both SFT and RL. The final model parameters are computed as $(1-\alpha)\theta_0 + \alpha\theta$, where $\theta_0$ is the base model and $\theta$ is the fine-tuned model. Empirical results show that this technique improves pass@N metrics by enhancing output diversity, especially in the context of self-correction.
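The interpolation is applied per parameter tensor. A minimal sketch, using toy two-parameter "checkpoints" in place of full model weight dictionaries:

```python
import numpy as np

def average_checkpoints(theta0, theta, alpha):
    """Per-parameter interpolation of base weights theta0 with fine-tuned
    weights theta: result = (1 - alpha) * theta0 + alpha * theta."""
    return {name: (1.0 - alpha) * theta0[name] + alpha * theta[name]
            for name in theta0}

# Toy two-parameter "models" (real checkpoints are full weight dicts).
base  = {"w": np.array([0.0, 2.0]), "b": np.array([1.0])}
tuned = {"w": np.array([4.0, 2.0]), "b": np.array([3.0])}

merged = average_checkpoints(base, tuned, alpha=0.5)
print(merged["w"])  # -> [2. 2.]
print(merged["b"])  # -> [2.]
```

Setting alpha closer to 1 keeps more of the fine-tuned behavior; the paper reports that an intermediate coefficient maximizes pass@N by restoring output diversity.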

Training Pipeline

The overall workflow consists of sequential SFT, scaffolded data synthesis, RL, and model averaging. The process is multi-stage:

  1. Initial SFT on a dataset generated by DeepSeek-Prover-V2.
  2. Annotation and incorporation of self-correction data.
  3. Scaffolded data synthesis to further expand the training set.
  4. RL with a hybrid GRPO-based algorithm, optimizing for both whole-proof generation and self-correction.
  5. Model averaging at each stage to maximize diversity and generalization.

    Figure 2: The complete model training workflow, including SFT, scaffolded data synthesis, RL, and model averaging.

Empirical Results

Benchmark Performance

Goedel-Prover-V2-32B achieves 88.1% pass@32 on MiniF2F, rising to 90.4% with self-correction, outperforming DeepSeek-Prover-V2-671B (82.4%) and Kimina-Prover-72B (84.0%) at a fraction of the parameter count. The 8B variant matches or exceeds the performance of much larger models. On PutnamBench, the 32B model solves 86 problems under pass@184 with self-correction, compared to DeepSeek-Prover-V2's 47 solves under pass@1024.

Figure 3: Performance of Goedel-Prover-V2 on different benchmarks under pass@32.

Sample Efficiency and Scaling

Goedel-Prover-V2 demonstrates strong sample efficiency, achieving high pass@N with minimal inference overhead. Scaling analysis shows that performance gains persist across increasing inference budgets, with self-correction providing consistent improvements. Extended context and additional revision iterations further boost accuracy, with pass@32 reaching 92.7% on MiniF2F using a 128k token context and 5 rounds of revision.
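For reference, pass@k figures like those above are conventionally computed with the unbiased estimator from the code-generation literature (the summary does not specify the paper's exact evaluation code, so this is standard practice rather than a claim about it): draw n samples, count c successes, and estimate the probability that at least one of k samples succeeds.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: with c correct proofs out of n attempts,
    the chance that a size-k subset contains at least one correct proof is
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 64 attempts with 8 correct proofs, evaluated at k = 32:
print(pass_at_k(64, 8, 32))
```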

RL and Model Averaging Effects

Systematic evaluation of RL steps and model averaging ratios reveals that pass@1 increases with RL progression, while pass@N is maximized at an optimal averaging coefficient. Correction settings benefit more from RL and averaging, indicating that self-correction is particularly effective when combined with diverse model outputs.

Data Curation and Benchmarking

The formalization pipeline is rigorously validated, outperforming previous autoformalizers in both syntactic and semantic fidelity. MathOlympiadBench is introduced as a new benchmark, with careful human verification to ensure alignment between informal and formal statements. Distributional analysis confirms broad coverage across mathematical domains.

Figure 4: Distribution of problems in MathOlympiadBench by category.

Theoretical and Practical Implications

The results demonstrate that formal theorem proving can be advanced without reliance on massive models or proprietary infrastructure. The integration of verifier-guided self-correction with long CoT reasoning establishes a scalable paradigm for formal mathematics, enabling efficient, high-accuracy proof synthesis. The scaffolded data synthesis pipeline provides a blueprint for curriculum learning in formal domains, and model averaging offers a practical solution to diversity collapse in RL and SFT.

Future Directions

Potential avenues for further research include:

  • Extending self-correction to multi-turn RL and tool-use frameworks for even greater sample efficiency.
  • Exploring retrieval-augmented generation and hybrid proof search strategies to further improve correctness and generalization.
  • Applying the scaffolded synthesis and self-correction framework to other formal systems (e.g., Coq, Isabelle).
  • Investigating amortized inference-time repair strategies to optimize compute-resource scaling.

Conclusion

Goedel-Prover-V2 sets a new standard for open-source formal theorem proving, achieving state-of-the-art results with modest model sizes and compute budgets. The combination of scaffolded data synthesis, verifier-guided self-correction, and model averaging enables efficient, high-accuracy proof generation. The open release of models, code, and data is poised to accelerate progress in formal reasoning and provide a robust platform for future innovation in AI-driven mathematics.
