Lean Proof Corpora for Formal Mathematics
- Lean proof corpora are large, structured datasets of Lean formal proofs, including mathlib, LEAN-GitHub, synthetic NL-to-Lean datasets, and LLM-generated conjectures.
- They employ advanced extraction methodologies—such as AST traversal and metaprogramming APIs—to capture detailed proof structures, dependencies, and metadata.
- These corpora enable effective machine learning applications like rapid premise selection and neural theorem synthesis, enhancing automated reasoning research.
Lean proof corpora are large, structured datasets composed of formalized mathematical theorems and their proofs represented in the Lean proof assistant. These corpora serve as both repositories of verified mathematical knowledge and as critical training and benchmarking resources for automated reasoning, machine learning-driven theorem proving, and retrieval-augmented generation systems. Canonical examples include the Lean mathematical library (mathlib), large-scale extractions from public Lean codebases, synthetic corpora constructed via NL-to-Lean translation, and datasets of novel conjectures and proofs generated automatically via LLMs.
1. Major Varieties and Composition of Lean Proof Corpora
Several distinct categories of Lean proof corpora are prominent in current research:
- Community-driven foundational libraries: The prime example is mathlib, Lean’s de facto standard library for formal mathematics. As of late 2019, mathlib comprised 140,085 lines of Lean code with 34,168 declarations, distributed across algebra, topology, analysis, logic, set theory, and more (Community, 2019). The library uses a dependently typed foundation, classical axioms, and extensive automation, making it a comprehensive resource for both mathematics and programming.
- Public codebase aggregations: The LEAN-GitHub corpus extracts formal items from nearly all Lean 4 repositories on GitHub, yielding 28,597 theorems with tactic proofs and 218,866 proof commands across 6,352 compiled files, with coverage from undergraduate to Olympiad-level mathematics (Wu et al., 2024). This corpus is particularly notable for its breadth, extracting data from domains underrepresented in mathlib.
- Synthetic formalization datasets: Efforts to mass-produce Lean statements by translating the Google MATH dataset (≈50–100k problems across 56 math subfields) yield corpora of theorem stubs for use in retrieval-augmented LLM pipelines. Here, each entry is a single Lean theorem stub, typically marked with `:= by sorry` and accompanied by dense text embeddings but no structured dependency information (Zayyad et al., 2024).
- Automatically generated conjectures and proofs: New theorems and proofs created in Lean by LLM-driven conjecture–proving pipelines, such as the Conjecturing-Proving Loop (CPL), result in “proof corpora” emphasizing novelty and in-context learning. The CPL corpus contains 269 verified Lean theorems in topology, each with a full tactic-level proof and detailed metadata (Kasaura et al., 16 Sep 2025).
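A stub entry of this kind is simply a statement whose proof is deferred; a hypothetical example follows (the theorem name and statement are illustrative, not drawn from any cited corpus):

```lean
-- Hypothetical synthetic-corpus entry: the statement is formalized,
-- but the proof is left as a `sorry` placeholder to be filled in later.
theorem add_comm' (a b : Nat) : a + b = b + a := by
  sorry
```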
Comparative Table of Sample Lean Corpora
| Corpus | Theorems | Proof content | Structure/Metadata |
|---|---|---|---|
| Mathlib (Community, 2019) | 34,168 | Fully proved, layered | Type class hierarchy |
| LEAN-GitHub (Wu et al., 2024) | 28,597 | Full tactic traces | repo/file/proof meta |
| FL-RAG (Zayyad et al., 2024) | ~50k–100k (implied) | Stubs (`:= by sorry`) | Embeddings, no deps |
| CPL (Kasaura et al., 16 Sep 2025) | 269 | Full tactics (per file) | Meta: steps, deps |
2. Extraction Methodologies and Data Schemas
Extraction and preprocessing vary by corpus:
- Mathlib and LEAN-GitHub: Extraction involves traversing Lean projects’ environments or ASTs using Lean’s metaprogramming API. In mathlib, constants and their proof terms are swept using `Lean.Environment.foldl` and traversed via `Expr` walks to collect all used premises, while LEAN-GitHub additionally records step-wise tactic scripts, intermediate goal states, and normalizes declarations for deduplication (Piotrowski et al., 2023, Wu et al., 2024).
- Synthetic translation sets: For datasets such as FL-RAG, translation from natural language to Lean is performed by fine-tuning LLMs on (NL, Lean) pairs, passing Google MATH problems through a translation prompt, and appending `:= by sorry` as a stub proof. Outputs are stored as plain-text Lean files and embedded as dense vectors for retrieval, without fine-grained dependency extraction (Zayyad et al., 2024).
- LLM-generated conjecture/proof corpora: The CPL pipeline creates proofs incrementally, generating and verifying conjectures, and maintaining detailed metadata (proof length, step count, dependencies, file structure) per theorem (Kasaura et al., 16 Sep 2025).
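The stub-construction step of such a synthetic pipeline can be sketched as follows. This is a minimal illustration: `translate_to_lean` is a hypothetical stand-in for a fine-tuned NL→Lean model, not an API from any cited work.

```python
# Sketch of the stub-construction step in a synthetic NL-to-Lean pipeline.
# `translate_to_lean` is a hypothetical placeholder for a fine-tuned LLM.

def translate_to_lean(nl_problem: str) -> str:
    # A real pipeline would call a translation model here; we return a
    # fixed illustrative statement instead.
    return "theorem stub (a b : Nat) : a + b = b + a"

def make_stub(nl_problem: str) -> str:
    """Translate an NL problem and append a `sorry` placeholder proof."""
    statement = translate_to_lean(nl_problem).rstrip()
    return statement + " := by sorry"

stub = make_stub("Show that addition of natural numbers is commutative.")
print(stub)
```

The resulting plain-text Lean file is then embedded and indexed; no dependency extraction is attempted at this stage.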
3. Feature Engineering and Machine Learning Readiness
Lean proof corpora are engineered not merely as archives but as structured learning datasets:
- Feature extraction: Mathlib-based ML systems featurize theorem statements using sparse binary encodings over asserted constant names, AST bigrams, and trigrams. Features are weighted by rarity (an IDF-style weight) across the corpus:

  w(f) = log(|C| / |T_f|),

  where C is the corpus and T_f is the set of theorems containing feature f. The similarity metric for k-NN is a weighted Jaccard (Piotrowski et al., 2023).
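The rarity weighting and weighted-Jaccard similarity can be sketched in a few lines. This is a toy illustration with made-up feature sets, not the cited implementation:

```python
import math

def idf_weights(corpus):
    """IDF-style rarity weight w(f) = log(|C| / |T_f|) per feature."""
    n = len(corpus)
    df = {}
    for features in corpus:
        for f in set(features):
            df[f] = df.get(f, 0) + 1
    return {f: math.log(n / k) for f, k in df.items()}

def weighted_jaccard(a, b, w):
    """Weighted Jaccard similarity between two feature sets."""
    inter = sum(w.get(f, 0.0) for f in a & b)
    union = sum(w.get(f, 0.0) for f in a | b)
    return inter / union if union else 0.0

# Toy corpus of three "theorems", each a set of symbol features.
corpus = [{"Nat", "add"}, {"Nat", "mul"}, {"Real", "add"}]
w = idf_weights(corpus)
# Rare features ("mul", "Real") carry more weight than common ones.
print(weighted_jaccard({"Nat", "add"}, {"Nat", "mul"}, w))
```

A k-NN premise selector then ranks corpus theorems by this similarity to the goal's feature set.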
- Proof-step granularity: LEAN-GitHub and CPL datasets capture full sequences of tactic commands and the evolving proof state, enabling state-action supervised learning and step-wise proof synthesis (Wu et al., 2024, Kasaura et al., 16 Sep 2025).
- Premise labeling: Labels such as “all” (all proved dependencies), “source” (lemmas explicitly used in the source script), and “math” (lemmas from whitelisted libraries) allow experiments varying recall and relevance in proof retrieval (Piotrowski et al., 2023).
- Embeddings and retrieval: FL-RAG encodes each Lean statement as an embedding via text-embedding-ada-002 and uses a dense vector index for nearest-neighbor retrieval (Zayyad et al., 2024).
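Dense nearest-neighbor retrieval of this kind reduces to ranking corpus entries by cosine similarity to the query embedding. A minimal sketch, assuming embeddings have already been computed (the toy three-dimensional vectors below stand in for real embedding-model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k corpus statements whose embeddings best match the query."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["statement"] for e in ranked[:k]]

# Toy index: in a real pipeline the vectors would come from an embedding model.
index = [
    {"statement": "theorem A ...", "vec": [1.0, 0.0, 0.0]},
    {"statement": "theorem B ...", "vec": [0.9, 0.1, 0.0]},
    {"statement": "theorem C ...", "vec": [0.0, 1.0, 0.0]},
]
print(retrieve([1.0, 0.05, 0.0], index, k=2))
```

Production systems replace the linear scan with an approximate nearest-neighbor index, but the ranking criterion is the same.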
4. Practical Applications
Lean proof corpora underpin a range of research and engineering applications:
- Machine-learned premise selection: The mathlib corpus feeds fast, in-prover ML models (custom random forests and k-NN) to rank useful premises, yielding rapid and effective suggestions via the `suggest_premises` tactic. Empirical results show Cover scores of 0.29 for random forests and 0.25 for k-NN with names and bigrams (Piotrowski et al., 2023).
- Formal language knowledge retrieval for LLMs: Embedding-based retrieval from large Lean corpora allows concatenation of relevant theorems to natural language queries, yielding a 19-percentage-point improvement (from 54% to 73% correctness) on the “hard” split of Google MATH (Zayyad et al., 2024).
- Training and evaluating neural theorem provers: Large, diverse proof corpora (LEAN-GitHub, mathlib) enable fine-tuning LLMs for proof synthesis. Models trained on these datasets achieve state-of-the-art or near-SOTA accuracy on benchmarks such as miniF2F (Lean version: Pass@1 = 48.8%, Pass@64 = 54.5%), ProofNet, and Putnam (Wu et al., 2024).
- Conjecture generation and in-context learning: Automatically constructed corpora (e.g., CPL) provide empirical evidence that context-inclusion of previous proofs in the prompt substantially increases the success of generating and verifying new theorems. For the key α-open intersection theorem, success rates reached ≈60% with context, 0% without (Kasaura et al., 16 Sep 2025).
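Pass@k figures like those quoted above are conventionally computed with the unbiased estimator below (drawing k samples from n generated attempts, c of which verify); whether the cited evaluations use exactly this formula is an assumption of this sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), i.e. the
    probability that at least one of k samples drawn (without
    replacement) from n attempts, c of which are correct, verifies.
    Computed in the numerically stable product form."""
    if n - c < k:
        return 1.0  # fewer failures than samples: a success is guaranteed
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# 64 sampled proof attempts, 10 verified: chance a budget of 8 hits one.
print(round(pass_at_k(64, 10, 8), 3))
```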
5. Structure, Metadata, and Accessibility
Corpus data formats and structure are tailored to research needs and integration requirements:
- JSON and vector formats: Large-scale corpora (LEAN-GitHub) use line-delimited JSON with fields for proof, file origin, tactics, and goals. Synthetic NL→Lean sets use vector database indices and plain Lean files for statements (Wu et al., 2024, Zayyad et al., 2024).
- Proof script granularity: Human and synthetic corpora increasingly provide both statement-level and tactic-level decomposition, supporting stepwise proof prediction and reward signals for RL approaches (Wu et al., 2024, Kasaura et al., 16 Sep 2025).
- License and access: Datasets such as LEAN-GitHub are openly available under permissive licenses and distributed via platforms such as Hugging Face, facilitating reproducibility and downstream application (Wu et al., 2024).
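Line-delimited records of this shape can be consumed directly for state–action supervision. The sketch below uses illustrative field names mirroring the metadata described above (proof text, file origin, tactics, goal states), not the exact LEAN-GitHub schema:

```python
import json

# Illustrative JSONL records; field names are hypothetical, chosen to
# mirror the kinds of metadata such corpora carry.
raw = """\
{"theorem": "thm_a", "file": "Repo/Basic.lean", "tactics": ["intro h", "exact h"], "goals": ["p -> p", "p"]}
{"theorem": "thm_b", "file": "Repo/Order.lean", "tactics": ["simp"], "goals": ["a <= a"]}
"""

records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Pair each tactic with the goal state it was applied to, yielding
# (state, action) examples for supervised proof-step prediction.
pairs = [(g, t) for r in records for g, t in zip(r["goals"], r["tactics"])]
print(len(records), len(pairs))
```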
6. Corpus Size, Diversity, and Impact on Downstream Tasks
Empirical studies confirm the significance of both corpus size and diversity:
- Scaling effects: Larger and more diverse corpora yield better coverage of symbol combinations and greater representation of rare but discriminative features, directly improving premise selection and decision-tree learning. As mathlib and public code corpora grow, both the variety and utility of features increase (Piotrowski et al., 2023, Wu et al., 2024).
- Benchmarking and ablations: Incorporating diverse human- and auto-formalized data leads to measurable accuracy gains in model evaluations (e.g., LEAN-GitHub addition: +11–15% on miniF2F) (Wu et al., 2024).
- Domain transfer: Corpora covering broad mathematical domains support higher generalizability. For instance, FL-RAG’s largest correctness improvements occur in symbolic-reasoning tasks (algebra, calculus), while domain-restricted synthetic corpora (CPL in general topology) serve as test beds for studying in-context proof strategies (Zayyad et al., 2024, Kasaura et al., 16 Sep 2025).
- Proof quality metrics: Step count, token count, dependency depth, and re-proving rates characterize corpus contents. The CPL corpus reports average proof lengths of 26 tactic steps (32 lines), with re-proving rates of 99% with context (Kasaura et al., 16 Sep 2025).
7. Integration and Future Directions
Lean proof corpora are increasingly leveraged in hybrid workflows:
- Embedded training and inference: Systems such as machine-learned premise selection operate natively within Lean, streamlining integration and user interactivity (Piotrowski et al., 2023).
- Retrieval-augmented generation: Corpus design for LLM-augmented reasoning experiments favors stubs (for mass coverage in FL-RAG) or richly annotated records (as in LEAN-GitHub and CPL) to support various forms of retrieval and context injection (Zayyad et al., 2024, Wu et al., 2024).
- Ongoing expansion: Future efforts target inclusion of Lean 3 repositories ported to Lean 4, normalization and extraction of all definitions and theories for improved lemma retrieval, and the blend of human-authored, benchmark, and synthetic proofs to cover a wider spectrum of mathematical knowledge and proof techniques (Wu et al., 2024).
A plausible implication is that the development of more comprehensive, richly annotated Lean proof corpora—encompassing both human and machine-generated mathematics—will continue to accelerate progress in both neural and symbolic theorem proving, methodology benchmarking, and integration of formal mathematics into downstream computational tasks.