
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Published 31 Jul 2025 in cs.AI and cs.CL (arXiv:2507.23726v2)

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine, Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.

Summary

  • The paper introduces a lemma-centric, iterative proof generation method that uses formal Lean verification to achieve state-of-the-art performance on mathematical benchmarks.
  • It employs multi-stage reinforcement learning and diverse prompting to refine proofs iteratively, greatly improving accuracy and efficiency in complex theorem proving tasks.
  • The integration of Seed-Geometry overcomes Lean’s geometry limitations, providing a neuro-symbolic engine that speeds up proof generation and enhances domain coverage.

Introduction and Motivation

The paper introduces Seed-Prover, a formal reasoning system that leverages LLMs for automated theorem proving (ATP) in Lean 4, with a particular focus on deep and broad mathematical reasoning. The work addresses the limitations of natural language-based LLMs in mathematical domains, where the absence of verifiable supervision signals impedes reinforcement learning (RL) and reliable proof generation. By utilizing formal languages such as Lean, Seed-Prover enables precise, compiler-verified feedback, facilitating effective RL and scalable proof search. The system is complemented by Seed-Geometry, a neuro-symbolic geometry engine designed to overcome Lean's historical deficiencies in geometry support.

System Architecture and Methodology

Lemma-Style Whole-Proof Generation

Seed-Prover departs from prior step-level and whole-proof LLM provers by adopting a lemma-centric proof paradigm. Instead of generating a monolithic proof, the model first proposes and proves intermediate lemmas, which are then composed to construct the main theorem. This modular approach enables independent verification, reuse, and combination of lemmas across different inference trajectories, enhancing both proof robustness and search efficiency.

Figure 1: An example of whole proof and lemma-style proof in Lean 4.
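
Figure 1 itself is not reproduced on this page. As a rough, hypothetical illustration of the contrast (not the paper's actual example), a whole proof closes the goal in one shot, while a lemma-style proof factors out reusable intermediate facts:

```lean
-- Whole proof: a single monolithic step.
theorem whole (a b c : Nat) : a + b + c = c + b + a := by
  omega

-- Lemma-style: prove a reusable intermediate fact first,
-- then compose it in the main theorem.
lemma swap (a b : Nat) : a + b = b + a := Nat.add_comm a b

theorem composed (a b c : Nat) : a + b + c = c + (b + a) := by
  rw [swap a b, swap (b + a) c]
```

In Seed-Prover, lemmas like `swap` can be verified independently by Lean, stored, and reused across many proof attempts rather than being re-derived inside each monolithic proof.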

The lemma pool, a central data structure, stores all generated lemmas, their proofs, dependencies, and difficulty metrics. This facilitates retrieval, sampling, and recombination during iterative proof refinement.

Iterative Proof Refinement and Conjecture Proposing

Seed-Prover employs an iterative refinement loop, where proof attempts are repeatedly updated based on Lean compiler feedback, previously proved lemmas, and self-summarization. The system is trained to propose conjectures—potentially useful properties or subgoals—prior to attempting the main proof, enabling broad exploration of the problem space. This conjecture pool is dynamically expanded and filtered based on proof success rates and semantic relevance.
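
The refinement loop can be sketched as follows; `prove` and `lean_check` are hypothetical stand-ins for the LLM call and the Lean compiler, and the interface is an assumption rather than the paper's actual API:

```python
def refine_proof(statement, prove, lean_check, max_rounds=8):
    """Iteratively refine a proof attempt using verifier feedback.

    `prove` and `lean_check` are hypothetical stand-ins for the model
    and the Lean compiler. Each round feeds the previous error and a
    self-summary back into the next attempt.
    """
    feedback, summary = None, None
    for _ in range(max_rounds):
        attempt = prove(statement, feedback=feedback, summary=summary)
        ok, feedback = lean_check(attempt)
        if ok:
            return attempt          # compiler-verified proof
        summary = f"last error: {feedback}"  # self-summary carried forward
    return None                     # budget exhausted
```

Conjecture proposing then plugs into this loop: each conjecture is just another statement to refine, and those that verify are added to the lemma pool.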

Multi-Stage RL Training and Diverse Prompting

The model is trained via multi-stage, multi-task RL, using VAPO as the RL backbone. The reward structure is binary, based on successful formal proof verification, with additional penalties to enforce lemma-first proof structure. The training corpus is a mixture of open-source and in-house formalized problems, augmented with easier variants generated by the model itself. Prompts are diversified to include natural language hints, failed attempts, summaries, and Lean feedback, enhancing the model's adaptability and robustness.
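
A minimal sketch of the reward shape described above; the paper specifies a binary verification reward plus penalties enforcing lemma-first structure, but the penalty magnitude and signature here are assumptions:

```python
def reward(proof_verified: bool, has_lemmas: bool,
           lemma_penalty: float = 0.2) -> float:
    """Binary verification reward with a structure penalty.

    The 0.2 penalty magnitude is a hypothetical value, not the
    paper's; only the binary-plus-penalty shape comes from the text.
    """
    r = 1.0 if proof_verified else 0.0
    if proof_verified and not has_lemmas:
        r -= lemma_penalty  # discourage monolithic, lemma-free proofs
    return r
```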

Test-Time Inference Strategies

Seed-Prover introduces a three-tiered inference strategy to balance depth and breadth of reasoning, adapting to problem difficulty and available computational budget:

  • Light Setting: Iterative refinement of single-pass proofs, leveraging Lean feedback and self-summarization. Each attempt is refined up to 8–16 times, yielding significant improvements over naive Pass@k sampling.
  • Medium Setting: Nested refinement, where difficult lemmas generated during outer refinement are themselves refined using the light setting. This enables handling of lengthy, structurally complex proofs.
  • Heavy Setting: Large-scale conjecture generation and proof attempts, with thousands of conjectures proposed and filtered. The lemma pool is populated with nontrivial facts, which are then integrated into the main proof using the medium setting.

    Figure 2: The workflows of single-pass whole proof generation, light, and medium inference settings.

    Figure 3: The workflow of heavy inference setting.
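
The heavy setting's broad-then-deep workflow can be sketched as below; `propose`, `try_prove`, and `is_trivial` are hypothetical stand-ins for the conjecturing model, the medium-setting prover, and a triviality filter:

```python
def heavy_setting(problem, propose, try_prove, is_trivial, n=1000):
    """Sketch of the heavy setting: broad conjecturing, then filtering.

    All three callables are hypothetical interfaces: `propose` generates
    candidate conjectures, `try_prove` attempts a proof (optionally with
    a lemma pool), and `is_trivial` filters out uninteresting facts.
    """
    pool = []
    for conjecture in propose(problem, n):         # broad exploration
        proof = try_prove(conjecture)
        if proof is not None and not is_trivial(conjecture):
            pool.append((conjecture, proof))       # keep nontrivial lemmas
    return try_prove(problem, lemmas=pool)         # final deep attempt
```

The breadth comes from the thousands of proposed conjectures; the depth comes from the final attempt, which runs the medium setting with the harvested lemma pool in context.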

Seed-Geometry: Fast and Scalable Geometry Reasoning

Seed-Geometry is a neuro-symbolic geometry engine that addresses Lean's lack of geometry support. It features:

  • An extended domain-specific language (DSL) for concise geometric constructions, including composite actions for common but complex constructions.
  • A C++ backend for the reasoning engine, yielding a 100x speedup over previous Python implementations.
  • A high-performing Seed-family LLM, trained on 38B tokens of geometry data, using step-by-step beam search in a distributed setup.
  • Efficient, distributed search with asynchronous reasoning and inference, supporting large-scale problem generation and solution.
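
The step-by-step beam search used for construction proposals follows the standard pattern; `expand` and `score` below are hypothetical stand-ins for the model's proposal and scoring steps, not the engine's actual interfaces:

```python
import heapq

def beam_search(initial, expand, score, width=4, depth=3):
    """Generic beam search: keep the `width` best partial candidates
    at each step, expanding only those.

    `expand` maps a state to its successor states; `score` ranks
    states. Both are placeholders for the model-driven steps.
    """
    beam = [initial]
    for _ in range(depth):
        candidates = [c for state in beam for c in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(width, candidates, key=score)
    return max(beam, key=score)
```

In a distributed setup, each beam expansion can be fanned out across workers, which is where the asynchronous reasoning/inference split pays off.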

Empirical Results

Seed-Prover and Seed-Geometry achieve state-of-the-art results across multiple formal mathematics benchmarks:

  • IMO 2025: Fully proved 5 out of 6 problems, with 4/6 within the official contest deadline.
  • Past IMO Problems: 78.1% success rate (121/155), with strong performance across all difficulty levels and subject areas.
  • MiniF2F: 100% on MiniF2F-valid and 99.6% on MiniF2F-test, surpassing previous SOTA by a significant margin.

    Figure 4: Growth in MiniF2F-Test performance over time.

  • PutnamBench: 331/657 problems solved, a nearly 4x improvement over previous SOTA.
  • CombiBench: 30% success rate, tripling the previous best, though combinatorics remains a relative weakness.
  • MiniCTX-v2: 81.8% success rate, demonstrating strong generalization to context-rich, real-world formalization tasks.

Seed-Geometry outperforms AlphaGeometry 2 on both standard IMO and IMO shortlist geometry problems, solving 43/50 and 22/39 respectively, and solves the IMO 2025 geometry problem in under 2 seconds.

Implementation Considerations and Trade-offs

  • Resource Requirements: The heavy inference setting is computationally intensive, requiring days of distributed inference and large-scale beam search. The light and medium settings are more tractable, completing in hours.
  • Scalability: The modular lemma pool and distributed search architecture enable scaling to harder problems and larger search spaces, but memory and compute constraints remain significant for the heaviest settings.
  • Formal Verification: By relying on Lean's compiler for proof checking, the system ensures correctness and avoids the pitfalls of unverifiable natural language proofs.
  • Prompt Engineering: The diverse prompting strategy is critical for robustness but increases the complexity of both training and inference pipelines.
  • Geometry Support: The integration of Seed-Geometry is essential for full coverage of mathematical domains, as Lean's native geometry capabilities are insufficient for IMO-level problems.

Theoretical and Practical Implications

The results demonstrate that formal language-based LLMs, when combined with lemma-centric reasoning, iterative refinement, and scalable search, can achieve high levels of performance on challenging mathematical benchmarks. The modularity of lemma-style proofs facilitates knowledge reuse and compositionality, which are essential for scaling ATP to more complex domains. The success of Seed-Geometry highlights the importance of domain-specific neuro-symbolic systems for areas where formal libraries are underdeveloped.

The strong empirical results—particularly the ability to fully prove 5/6 IMO 2025 problems and saturate MiniF2F—underscore the viability of LLM-based formal provers as practical tools for mathematical research and education. The system's reliance on formal verification ensures reliability and trustworthiness, addressing a key limitation of natural language-based approaches.

Future Directions

Potential avenues for further research include:

  • Integration of Natural Language and Formal Reasoning: Bridging the gap between informal mathematical discourse and formal proof generation.
  • Automated Formalization: Extending the system to automatically translate natural language problem statements into formal Lean code.
  • Open Conjectures: Applying the system to open mathematical problems, leveraging the compositionality and scalability of the lemma pool.
  • Resource Optimization: Developing more efficient search and refinement strategies to reduce computational overhead, particularly in the heavy inference setting.
  • Enhanced Domain Coverage: Expanding the formal libraries and neuro-symbolic engines to cover additional mathematical domains beyond geometry.

Conclusion

Seed-Prover and Seed-Geometry represent a significant advancement in automated formal reasoning, combining LLMs, formal verification, and scalable search to achieve state-of-the-art performance on a range of mathematical benchmarks. The lemma-style, whole-proof paradigm, iterative refinement, and domain-specific neuro-symbolic engines collectively enable deep and broad reasoning capabilities. The demonstrated results on IMO 2025 and other benchmarks establish a new standard for LLM-based ATP systems, with substantial implications for the future of AI-assisted mathematics.

Explain it Like I'm 14

Seed‑Prover: Deep and Broad Reasoning for Automated Theorem Proving — Explained Simply

What is this paper about?

This paper shows how a team built two smart systems that can solve very hard math problems (like those from the International Mathematical Olympiad, IMO) by writing computer-checked proofs. The two systems are:

  • Seed‑Prover: an AI that writes full proofs in a precise math language called Lean.
  • Seed‑Geometry: a special tool that focuses on geometry problems.

The big idea is to combine long, careful thinking (like a detailed plan) with strict, automatic checking (so there are no mistakes). This lets the AI handle problems that are far too tricky to judge in everyday words.

What questions were the authors trying to answer?

In simple terms:

  • How can we teach AIs to write correct math proofs when normal sentences are too vague to check automatically?
  • Can we get better results by breaking big proofs into smaller “mini‑proofs” (called lemmas) and improving them step by step?
  • Can we make a geometry engine that fills in helpful constructions (like extra lines or points) and proves geometry results fast?
  • If we let the AI “think” longer and in smarter ways at test time, does it solve more problems?

How did they do it?

They used a few key ideas and tools. Here are the tricky parts explained with everyday analogies:

  • Formal proofs in Lean: Think of Lean as a super strict math teacher who checks every tiny step of your solution instantly. If Lean accepts your proof, it’s guaranteed correct.
  • Lemma‑style proving (building with Lego bricks): Instead of trying to solve the whole problem in one go, Seed‑Prover first proves small, useful facts (lemmas). These lemmas are like Lego pieces—easy to reuse, combine, and track. The system keeps a “lemma pool” (a library) with names, statements, how hard they were, and how they depend on each other.
  • Iterative refinement (fix it and try again): When Lean points out errors (like syntax mistakes or missing steps), the model fixes them and refines the proof, several times if needed. It also writes quick summaries for itself to stay organized.
  • Conjecture proposing (smart “what‑if” guesses): Before diving deep, the model brainstorms many possible properties that might be true (for example, “maybe this function is one‑to‑one,” or “maybe the sequence is periodic”). It then tries to prove or disprove these. The ones that succeed become extra Lego pieces (lemmas) in the pool.
  • Test‑time scaling (giving the AI time to think):
    • Light: Make a proof attempt and refine it several times using Lean’s feedback.
    • Medium: Like Light, but also pause to separately prove any hard lemmas the main attempt created.
    • Heavy: Start by generating thousands of conjectures, try to prove lots of small facts, collect the best lemmas, then finish the main proof using those lemmas.
  • Seed‑Geometry (a geometry problem solver):
    • Forward‑chaining engine: Imagine a rulebook like “if A and B are true, then C must be true.” The engine keeps applying all matching rules to discover everything it can from the diagram until nothing new appears.
    • Better language for constructions: Instead of using long, step-by-step “ruler-and-compass” instructions, they use compact “combo moves” (like “isogonal conjugate” or “exsimilitude center”). This makes problem descriptions short and easier for the AI and engine.
    • Much faster backend: Rewriting the engine in C++ made it about 100× faster, which is crucial when exploring lots of possibilities.
    • Beam search (maze exploration): The system explores multiple promising paths at once, keeping the most likely ones, like trying several routes in a maze and expanding the best ones.
  • Training (practice with rewards): The AI learns by reinforcement learning (RL): it gets a reward of 1 if Lean accepts the proof and 0 if it fails. During training, they sometimes include hints, previous failed attempts, summaries, and Lean’s error messages in the prompt so the model learns to use all kinds of help. They also gradually increase problem difficulty and output length.
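
The forward-chaining idea above fits in a few lines of code. The rules here are toy placeholders (letters standing for facts), not the engine's real geometric rule set:

```python
def forward_chain(facts: set, rules: list) -> set:
    """Apply rules (premises -> conclusion) until nothing new appears.

    Each rule is a (frozenset_of_premises, conclusion) pair; the rules
    in the test below are toy stand-ins for geometric deduction rules
    like "if A, B true then C must be true".
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)   # new fact discovered
                changed = True
    return facts                        # fixed point: nothing new to add
```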

What did they find?

The systems performed extremely well on many tough benchmarks:

  • IMO 2025: During the actual contest, they fully solved 4 out of 6 problems by the deadline; afterward, they reached 5 out of 6.
  • Past IMO problems (formalized): Seed‑Prover solved 78.1%.
  • MiniF2F (a popular formal math benchmark): Near 100% solved.
  • PutnamBench (undergraduate math): 331 out of 657 problems—far higher than previous systems.
  • CombiBench (combinatorics): 30%—better than earlier methods, though combinatorics remains challenging.
  • MiniCTX‑v2 (context-heavy, real-world formalization): 81.8%.

For geometry, Seed‑Geometry outperformed earlier engines:

  • It solved more IMO geometry problems (e.g., 43 on the IMO-AG-50 list) than the previous best system.
  • It also set a new state of the art on difficult IMO shortlist geometry problems.
  • It solved the IMO 2025 geometry problem in about 2 seconds (after the problem was formalized).

Why is this important? Because all these results are checked automatically by Lean, they aren’t guesses. They’re genuinely correct proofs—a big step toward trustworthy AI reasoning in math.

Why does this matter?

  • Reliable reasoning: Formal proofs are like airtight arguments. This reduces the risk of AI “sounding convincing” but being wrong.
  • Solving really hard problems: With step-by-step feedback, lemma libraries, and smart test-time “thinking,” the AI can tackle problems even expert humans find tough.
  • Geometry boost: Many proof assistants don’t handle geometry easily. Seed‑Geometry fills that gap, showing how specialized engines can expand what AIs can do.

What could this lead to?

  • Better math tools: Imagine homework helpers or study aids that don’t just give answers but produce guaranteed‑correct proofs.
  • Stronger software checking: Formal reasoning can also verify computer programs and systems—important for safety and reliability.
  • Research assistance: As these systems improve, they could help explore new math, organize big libraries of lemmas, and maybe even assist with open problems.

In short, the paper shows that combining long, careful thinking with strict, automatic checking can make AI provers both powerful and trustworthy—and that’s a big deal for the future of math and beyond.

