Mathematicians in the age of AI
Abstract: Recent developments show that AI can prove research-level theorems in mathematics, both formally and informally. This essay urges mathematicians to stay up-to-date with the technology, to consider the ways it will disrupt mathematical practice, and to respond appropriately to the challenges and opportunities we now face.
Explain it Like I'm 14
What is this paper about?
This essay explains how fast AI is learning to do real mathematics, including writing and checking research-level proofs. The author, a mathematician who helps run a new institute on computer‑aided reasoning, argues that mathematicians should keep up with these tools, think carefully about how they will change math, and lead the way in using them well.
The main questions the paper asks
- What can today’s AI really do in mathematics—both in writing human‑style arguments and in producing computer‑checked proofs?
- How will this change the everyday work of mathematicians, teaching, and the reasons universities support math?
- What problems could this cause (credit, motivation, jobs), and how can the math community respond wisely?
How the author looks at the topic
The paper is an essay, not a lab experiment. It surveys recent progress and tells two detailed stories (“case studies”) from 2024–2026 to show what AI can already do.
To make the stories clearer, here are a few terms in everyday language:
- Formal proof: A proof written in very precise “code‑like” math that a computer can check step by step. Think of it like compiling a program: if the checker accepts every step, the proof is guaranteed correct.
- Informal proof: The usual human‑written explanation you see in math papers—clear to experts, but not checked by a computer.
- Proof assistant (like Lean) and libraries (like Mathlib): Software and shared “toolboxes” that help you build and check formal proofs.
- Automated reasoning/agents: Programs that search for proofs, sometimes mixing symbolic logic with language‑model “thinking.”
- “Drive‑by proving”: When an outside group swoops in, uses lots of computing power to finish a proof, and announces a splashy result without really being part of the original team’s process.
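To make “formal proof” concrete, here is a tiny illustrative example in Lean 4 using Mathlib (the proof assistant and library named above). The theorem name is invented for illustration; the point is that if Lean accepts the file, every step has been machine‑checked:

```lean
-- A tiny formal proof, checked by Lean 4 with Mathlib.
-- The statement: addition of natural numbers is commutative.
import Mathlib.Tactic

theorem my_add_comm (a b : ℕ) : a + b = b + a := by
  ring  -- Lean verifies this step; if it fails, the file does not "compile"
```

Research‑level formalizations, like the sphere‑packing project described below, consist of thousands of such statements built on top of each other.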
The two case studies:
- Formal proving story: A team began formally verifying a famous result about packing spheres in 8 dimensions (based on Maryna Viazovska’s work). They carefully planned and built lots of supporting code. Later, an AI company used a proving agent (“Gauss”) to complete the formal proof quickly (and then used it to finish the 24‑dimensional case too). This raised questions about credit, collaboration, and whether the AI‑generated code would be useful and well‑structured for future work. After discussion, the company agreed to collaborate openly and improve the code with the team.
- Informal proving story: Eleven mathematicians posted ten fresh research‑level problems and asked AI systems to try. Early public chatbots did poorly, but specialized agents did much better. One OpenAI model produced a “beautiful” correct solution to one problem; DeepMind’s agent “Aletheia” solved six out of ten correctly. This surprised many mathematicians.
The main findings and why they matter
Here are the central takeaways the author emphasizes:
- AI can now prove serious theorems in two ways: by writing human‑style arguments and by producing fully formal, computer‑checked proofs. This was not true just a few years ago.
- Progress is extremely fast. Since late 2022, systems went from making lots of obvious math mistakes to solving Olympiad, Putnam, and even research‑level problems.
- Collaboration and credit need new norms. In the formal sphere‑packing project, much of the real work was planning, organizing, and building reusable libraries. An AI’s “finish” depends on that groundwork. Communities will need fair ways to recognize both the human scaffolding and AI assistance.
- Teaching and jobs could be disrupted. If AI does routine homework and even advanced problem‑solving, we must rethink how we teach. In fields like engineering and data science, students will be expected to use AI, so math courses must train them to use it well—or risk becoming less relevant.
- But mathematics still gives us control. Formal methods and precise reasoning let us demand verifiable, auditable explanations from AI. That keeps humans “in the loop,” able to check, question, and understand results instead of blindly trusting a black box.
What this could mean going forward
The author’s message is not to panic, but to lead:
- Embrace AI as a tool to do better mathematics—faster proofs, bigger theories, and help on very hard problems (think Goldbach, Langlands, or P vs NP).
- Keep humans at the center by using formal checking and clear standards, so results are trustworthy and understandable.
- Teach students to use AI wisely while still building deep intuition, problem‑solving skills, and good taste in mathematics.
- Shape the tools and the culture. Mathematicians should help design, test, and refine AI for math, set norms for credit and collaboration, and build libraries that make future work easier.
In short: AI is already good at math and getting better quickly. If mathematicians engage, set the rules, and use these tools to enhance understanding—not replace it—then mathematics can thrive and have even greater impact on science, engineering, and society.
Knowledge Gaps
Below is a consolidated list of knowledge gaps, limitations, and open questions, framed to be actionable for future research and community practice:
- Benchmarking: Create rigorous, multi-subfield benchmarks for AI theorem proving that track not only correctness but also readability, modularity, library reuse, maintainability, and human time saved, with open leaderboards and standardized reporting.
- Authorship and credit: Define protocols for credit attribution and authorship in human–AI collaborations (including model/version citation), especially for large formalization repositories and mixed informal–formal workflows.
- Proof quality and refactoring: Develop automated methods to compress, refactor, and modularize AI-generated formal proofs; quantify improvements via library-level metrics (e.g., lemma reuse, API coherence, duplication reduction).
- Formal–informal fidelity: Build tools to verify that formalized statements faithfully capture the original informal theorems (spec-equivalence checks, test suites that detect theorem weakening/trivialization).
- Reproducibility standards: Establish archival requirements for prompts, seeds, model checkpoints, toolchains, library versions, and compute budgets to make AI-assisted proofs reproducible and auditable.
- Project governance: Design governance models to prevent “drive-by proving” harms (e.g., embargo windows, contribution norms, compute fairness rules, escalation paths for conflicts between corporate and academic contributors).
- Supply-chain security: Implement signed artifacts, trusted kernels, CI gates, and regression/spec-drift tests to safeguard formal libraries from malicious or accidental corruption via automated contributions.
- Pedagogy with AI: Run controlled trials of AI-integrated math instruction to determine what preserves conceptual understanding and problem-solving skill; produce assessment designs that reward reasoning over tool use.
- Educational economics: Collect causal evidence on how AI affects math enrollment, course demand, and downstream labor-market outcomes to guide departmental planning and funding models.
- Environmental and cost accounting: Quantify carbon and monetary costs per theorem/proof pipeline; develop compute-efficient proving strategies and cost-aware decision tools for project leaders.
- Legal/IP frameworks: Clarify ownership, licensing, and training-data provenance for AI-generated proofs and libraries; propose community-standard licenses and contributor agreements.
- Challenge suites for research-level math: Maintain standardized, blinded, and reproducible challenge sets that measure insight generation (e.g., key lemmas, new definitions) rather than just existence of a proof.
- Conjecture generation: Create methods and evaluation protocols for AI-driven conjecturing and theory-building (metrics for novelty, truth rate, downstream utility); curate datasets of open problems with resolution timelines.
- Low-resource domains: Devise techniques (transfer learning, curriculum learning, few-shot formalization) enabling AI systems to perform in subfields with sparse formal corpora.
- Human factors: Study how AI-rich workflows affect mathematicians’ motivation, identity, and well-being; test interventions that sustain enjoyment and a sense of agency.
- Mixed-mode tooling: Build IDEs and agents that integrate informal plans, invariants, and proof intent with formal checking; conduct user studies to measure productivity and understanding gains.
- Editorial policy: Define reviewer and journal policies for AI-assisted submissions, including disclosure requirements, independent verification pipelines, and the role of formal artifacts in publication.
- Error risk in informal outputs: Develop automated sanity checks for AI-generated informal proofs (counterexample search, semantic consistency with known results) to curb plausible-but-wrong arguments.
- Provenance in AI outputs: Create mechanisms to trace influences to sources/training data and to attribute reused ideas/lemmas; integrate provenance metadata into proofs and libraries.
- Capability limits: Map which tasks remain hard for current AI (e.g., inventing foundational definitions/frameworks); produce taxonomies and hardness benchmarks to target research investments.
- Cost–benefit in practice: Conduct time–motion and case studies of large formalizations to quantify tedium reduction versus added curation/refactoring cost, informing when and how to deploy AI.
- Shared infrastructure: Specify and build community resources (prover farms, dataset registries, model zoos, open agents) that reduce compute inequities between academia and industry.
- Proof quality metrics: Operationalize “explanatory quality” (clarity, modularity, didactic value) and train automatic metrics that correlate with expert judgments for both formal and informal proofs.
- Library maintenance at scale: Design strategies and automated linting to manage refactoring debt, deduplication, naming conventions, and stable APIs after rapid AI-driven growth.
- Trusted bases under scale: Stress-test proof assistants’ kernels and trusted libraries at AI-driven scales; cross-check key results across multiple systems (Lean/Coq/Isabelle/HOL) to detect latent issues.
- Student assessment: Pilot assessment schemes (oral defenses, authentic projects, collaborative grading rubrics) that integrate AI while measuring individual understanding and reasoning.
- End-to-end certification: Integrate SAT/SMT, CAS, and LLM agents into verifiable pipelines with explicit failure-mode taxonomies and certificates bridging numeric, symbolic, and formal steps.
- Determinism and replay: Standardize practices for making agentic proving runs replayable (seed control, event logs, determinized toolchains) and archivable over time.
- Subfield heterogeneity: Map readiness and impact across areas (PDEs, algebraic geometry, combinatorics, logic) and produce field-specific capability roadmaps and best practices.
- Communication ethics: Develop community guidelines for timing, framing, and crediting in public announcements of AI-assisted results to avoid discouraging contributors and misrepresenting human effort.
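The “Error risk in informal outputs” item above mentions automated counterexample search as a sanity check on plausible‑but‑wrong arguments. A minimal, stdlib‑only sketch of the idea, using Goldbach’s conjecture (mentioned elsewhere in the essay) as the toy target; all function names here are hypothetical:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def goldbach_witness(n: int):
    """Return a pair (p, q) of primes with p + q == n, or None if none exists."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

def search_counterexamples(limit: int):
    """Check every even n in [4, limit]; return the list of failures."""
    return [n for n in range(4, limit + 1, 2) if goldbach_witness(n) is None]

print(search_counterexamples(10_000))  # -> [] (no counterexample below 10,000)
```

A real pipeline would swap the brute-force loop for solver-guided or model-guided search, but the role is the same: cheap falsification before anyone invests effort in checking a proof.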
Practical Applications
Immediate Applications
The paper highlights concrete ways current AI for mathematics (formal proof assistants, automated reasoning, and neural agents) can be put to work now across sectors.
- Academia — AI‑assisted formalization workflows: Use “blueprint → Lean/Mathlib → AI code suggestions” pipelines to formalize existing results faster, improve libraries, and produce machine‑checkable artifacts for teaching and research. Tools/workflows: Lean + Mathlib, SAT/SMT backends, AI “proving copilots.” Dependencies/assumptions: mature proof libraries, community maintainers, compute budgets, and contributor guidelines that ensure quality and credit.
- Publishing/peer review — Verification‑augmented review: Invite or require machine‑checked formal artifacts for selected theorems, replication of combinatorial searches, or SAT/SMT certificates alongside manuscripts. Tools: proof assistants (Lean/Isabelle/Coq), certificate checkers. Dependencies/assumptions: reviewer expertise/time, journal policy changes, and scope limited to feasible subresults.
- Software/Hardware (EDA, aerospace, automotive, crypto) — Spec and design verification: Embed automated reasoning (SAT/SMT) and proof assistants to verify properties of protocols, circuits, and algorithms; leverage AI agents to draft and discharge lemmas. Tools: SAT/SMT solvers, theorem provers, AI proof agents integrated into CI. Dependencies/assumptions: formal specs exist; integration with existing toolchains; staff skilled in formal methods.
- Operations/Logistics/Networks — Automated search for combinatorial objects: Apply ML + SAT/SMT to find counterexamples, extremal constructions, or optimal schedules/layouts. Tools: solver‑guided search, neural heuristics. Dependencies/assumptions: problems can be encoded cleanly; acceptance of solver certificates as evidence.
- Engineering/Finance/Energy — Neural PDE/parameter search in simulation: Use neural solvers and agentic hyperparameter search to explore regimes, accelerate design cycles, or calibrate models (e.g., option pricing PDEs, materials, CFD). Tools: physics‑informed neural nets, Bayesian search, verification of constraints with symbolic/solver checks. Dependencies/assumptions: well‑posed boundary/initial conditions; validation datasets; guardrails for numerical stability.
- Knowledge management — Curated formal math libraries: Expand Mathlib and related libraries to serve as shared, queryable knowledge bases that link informal literature to formalized results. Tools: library curation workflows; semantic links between informal blueprints and formal code. Dependencies/assumptions: sustained maintainer effort; open licenses; cross‑tool interoperability.
- Education — AI‑integrated curricula and assessment: Redesign problem sets and exams to require explainable, checkable artifacts (formal proofs, solver certificates) and oral/problem‑solving defenses; deploy AI tutors that produce step‑checked reasoning. Tools: Lean‑based coursework, LMS plugins, sandboxed LLMs. Dependencies/assumptions: faculty development; student/device access; academic integrity policies aligned with AI use.
- Research culture — Collaboration and credit protocols for human–AI projects: Adopt governance for “drive‑by proving” scenarios (contribution logging, open development windows, code quality standards) to protect early‑career researchers while enabling AI assistance. Tools: contribution charters, CRediT‑style roles, artifact DOIs. Dependencies/assumptions: community buy‑in; funder and publisher recognition.
- Benchmarking/Evaluation — Domain‑relevant challenge sets: Maintain openly scored challenge problems (like the ten‑problem experiment) to measure research utility of AI systems and steer model development. Tools: curated benchmarks, leaderboards, community audits (e.g., ICARM Zulip). Dependencies/assumptions: high‑quality ground truth; transparent evaluation; responsible disclosure.
- Safety‑critical policy — Checkable artifacts for assurance: Encourage regulators/procurers to request machine‑checkable proofs/certificates of critical properties in AI‑enabled systems (e.g., medical devices, AV control invariants). Tools: proof‑carrying code, certified solver outputs. Dependencies/assumptions: standards bodies and regulators adopt formats; vendors expose specs.
- Open science infrastructure — Institute‑led coordination: Use the ICARM model to host discussions, artifacts, and training that accelerate responsible adoption of AI reasoning tools across mathematics and adjacent fields. Tools: shared forums, artifact repositories, training programs. Dependencies/assumptions: stable funding; inclusive community governance.
- Enterprise/Legaltech/Finance — Rule and contract consistency checking: Encode policies/contracts and use SMT/proof systems to detect inconsistencies or unintended interactions; AI agents assist in drafting and certifying changes. Tools: SMT‑based policy engines, DSLs for contracts. Dependencies/assumptions: formalizable rule sets; change‑management processes.
- Competition/training — AI coaches for problem solving: Deploy near‑Putnam/IMO‑level tutors that propose hints, verify steps formally, and adapt to learner gaps. Tools: LLM agents + proof checkers. Dependencies/assumptions: access to capable models; pedagogy that balances assistance with skill development.
- Funding and hiring — Recognize formalization/tooling as research contributions: Update grant and hiring criteria to credit library building, proof engineering, and artifact curation on par with traditional publications. Tools: artifact metrics, citation practices. Dependencies/assumptions: departmental and agency policy revisions.
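The “Automated search for combinatorial objects” item above describes using solvers to find constructions or counterexamples. As a toy illustration (exhaustive search standing in for a real SAT encoding; function names are hypothetical), the classical fact R(3,3) = 6 can be rediscovered by searching edge 2‑colorings of small complete graphs for one with no monochromatic triangle:

```python
from itertools import combinations, product

def has_mono_triangle(n: int, coloring: dict) -> bool:
    """coloring maps each edge (a, b), a < b, of K_n to a color 0 or 1.
    Return True if some triangle has all three edges the same color."""
    for a, b, c in combinations(range(n), 3):
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            return True
    return False

def find_good_coloring(n: int):
    """Exhaustively search for a 2-coloring of K_n's edges with no
    monochromatic triangle; return one, or None. (Small n only:
    the search space has 2^(n(n-1)/2) colorings.)"""
    edges = list(combinations(range(n), 2))
    for bits in product((0, 1), repeat=len(edges)):
        coloring = dict(zip(edges, bits))
        if not has_mono_triangle(n, coloring):
            return coloring
    return None

print(find_good_coloring(5) is not None)  # True: K5 admits such a coloring
print(find_good_coloring(6) is None)      # True: none exists for K6, so R(3,3) = 6
```

Production systems replace the brute-force loop with a SAT encoding whose unsatisfiability certificate doubles as checkable evidence, which is why such searches can be accepted as proof.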
Long‑Term Applications
As capabilities scale from “research‑level” problem solving to sustained autonomy and integration, broader transformations emerge.
- All sectors — AI proves deep conjectures with formal certification: Resolution of major problems (e.g., the Langlands correspondences, Goldbach's conjecture, P vs. NP) drives advances in cryptography, algorithms, and modeling. Tools: hybrid informal/formal proving agents; large formal libraries. Dependencies/assumptions: continued model advances; scalable proof search; community validation.
- Academia/Industry — Autonomous research agents: Systems that propose conjectures, generate informal exposition, produce formal proofs, and run computational experiments end‑to‑end; humans guide agendas and interpret significance. Tools: multi‑agent research stacks tied to proof checkers and databases. Dependencies/assumptions: robust planning, verification, and value alignment; evaluation and credit frameworks.
- Safety‑critical AI — End‑to‑end mathematically verified ML pipelines: Formal specs for data preprocessing, model architectures, training procedures, and safety properties with certified guarantees (robustness, monotonicity, fairness constraints). Tools: differentiable programming with proof annotations; certifiable training objectives. Dependencies/assumptions: tractable formal semantics for stochastic training; scalable certifiers.
- Engineering/EDA/CAD — Math‑aware design automation: Real‑time co‑design tools that synthesize designs and produce correctness/optimality proofs (or tight certificates) as part of the artifact. Tools: embedded provers, constraint solvers, and generative optimizers. Dependencies/assumptions: performance at industrial scales; standard proof interfaces.
- Education (K‑12 to PhD) — Widespread adoption of proof assistants: Reasoning‑centric curricula where students routinely produce machine‑checkable solutions; assessments emphasize understanding and explanation alongside formal artifacts. Tools: classroom‑ready proof UIs, aligned content standards. Dependencies/assumptions: teacher training at scale; infrastructure funding; equitable access.
- Publishing — Formal‑first publication norms: Major theorems routinely published with linked, audited formal proofs in open libraries; journals integrate continuous proof checking and library maintenance credit. Tools: journal–prover CI, artifact portals. Dependencies/assumptions: tool maturity; incentives for upkeep; cross‑assistant interoperability.
- Law/Policy — Authorship, IP, and accountability frameworks for AI mathematics: Clear standards for attribution (human vs. AI), licensing of formal libraries, and responsibility for errors in AI‑assisted results. Tools: provenance tracking, cryptographic signing of artifacts. Dependencies/assumptions: legislative action; harmonized international norms.
- Knowledge infrastructure — Global mathematical knowledge graph: Unified, cross‑assistant ontology connecting informal literature, formal proofs, datasets, and solver certificates to enable semantic search and automated discovery of links. Tools: interoperable kernels/IRs, converters across Lean/Coq/Isabelle, graph query APIs. Dependencies/assumptions: community standards; sustained curation.
- Software supply chain — Marketplaces for verified components: Catalogs of formally verified algorithms, protocols, and models with machine‑checkable proofs integrated into package managers and build systems. Tools: proof‑carrying packages, certification tiers. Dependencies/assumptions: vendor incentives; third‑party certification bodies.
- Cross‑disciplinary research — Math‑mediated AI in sciences and humanities: Routine use of mathematical artifacts to audit AI‑generated analyses in economics, linguistics, and policy modeling, enhancing transparency and trust. Tools: domain DSLs with formal semantics tied to provers. Dependencies/assumptions: translational tooling; domain‑specific formalization.
- Energy/Climate — Proof‑guided AI optimization of critical infrastructure: Grid control, demand response, and market mechanisms designed with AI optimizers constrained by provable safety and efficiency properties. Tools: control‑theory specifications, certifiable optimizers. Dependencies/assumptions: data sharing; regulatory acceptance.
- Workforce/culture — New roles and training paths (proof engineers, math‑ops, library curators): Professional tracks focused on building, verifying, and maintaining mathematical infrastructure that underpins AI‑enabled R&D. Tools: certification programs; career ladders. Dependencies/assumptions: funding lines; recognition in academia and industry.
Glossary
- Aletheia: A specialized AI proving agent developed by Google DeepMind that can autonomously produce mathematical proofs. "A model produced by OpenAI offered a solution to one problem that experts judged to be 'completely correct and also quite beautiful.' The best performance was by a proving agent, Aletheia, developed by Google DeepMind, which produced correct results for six out of the ten problems."
- agentic systems: AI systems that plan and execute multi-step reasoning or actions autonomously to achieve goals, often orchestrating tools and models. "Sometimes this involves using LLMs and agentic systems to write informal mathematical arguments,"
- automated reasoning: The use of algorithms and software to perform logical inference and prove theorems without human intervention. "Automated reasoning tools such as SAT solvers have solved open problems in combinatorics, algebra, number theory, and discrete geometry."
- Big Proof conference: A research meeting focused on large-scale, collaborative, or computer-assisted proofs in mathematics. "Viazovska gave a progress report in a talk at the Big Proof conference at the Isaac Newton Institute in Cambridge, UK, in July 2025,"
- blueprint: In formalization, a detailed informal plan of a proof that guides and structures the subsequent mechanized formal proof. "they wrote a detailed informal blueprint to guide the formalization."
- combinatorial objects: Discrete mathematical structures (such as graphs, sets, or designs) studied in combinatorics and used as witnesses or counterexamples. "They have also been used to find combinatorial objects of interest, including counterexamples to conjectures."
- counterexamples to conjectures: Explicit constructions showing that a proposed general statement (conjecture) is false. "They have also been used to find combinatorial objects of interest, including counterexamples to conjectures."
- discrete geometry: A field studying geometric objects and properties in discrete settings, often with combinatorial methods. "Automated reasoning tools such as SAT solvers have solved open problems in combinatorics, algebra, number theory, and discrete geometry."
- drive-by proving: A practice where an external party quickly completes a proof for publicity or credit without sustained collaboration or integration with a project’s goals. "an act which Matt Ballard has aptly described as a 'drive-by proving.'"
- E8 lattice: A highly symmetric, exceptional lattice in eight dimensions, notable for optimal sphere packing. "a project to formalize Viazovska's proof of the optimality of the E8 lattice for 8-dimensional sphere packing."
- formal checker: A software component that mechanically verifies the correctness of a formal proof or derivation. "leveraging insights from conventional mathematical literature and the strong signal for correctness provided by a formal checker."
- formal library: A curated repository of machine-verified definitions, theorems, and proofs used by proof assistants. "To date, early adopters of formal methods have contributed to a formal library for mathematics, Mathlib,"
- formal methods: Mathematically rigorous techniques and tools for specification, verification, and reasoning about systems and proofs. "To date, early adopters of formal methods have contributed to a formal library for mathematics, Mathlib,"
- formal proof: A proof written in a formal language that can be checked by a machine for correctness. "decided to show off the abilities of its newest proving agent, Gauss, by completing a formal proof of the final result."
- formalization: The process of translating informal mathematical arguments into a formal language suitable for machine checking. "The formalization, on its own, is close to worthless, since the correctness of Viazovska's result was never in doubt."
- Goldbach's conjecture: The famous unsolved claim that every even integer greater than 2 is the sum of two primes. "If AI can help us realize the Langlands program, prove Goldbach's conjecture, and resolve the P vs. NP problem,"
- ICARM: The Institute for Computer-Aided Reasoning in Mathematics, an NSF-backed institute promoting AI and formal methods in math. "it managed to launch the Institute for Computer-Aided Reasoning in Mathematics (ICARM),"
- informal theorem proving: Producing human-readable mathematical proofs in natural language, without formal machine verification. "the mathematics community witnessed a complementary experiment in informal theorem proving."
- International Mathematical Olympiad: The premier international mathematics competition for pre-college students, often used as a benchmark for problem-solving ability. "Last summer, four companies claimed gold-medal performance on the International Mathematical Olympiad competition problems,"
- Langlands program: A broad, influential web of conjectures linking number theory, representation theory, and geometry. "If AI can help us realize the Langlands program, prove Goldbach's conjecture, and resolve the P vs. NP problem,"
- Lean: An interactive theorem prover (proof assistant) and ecosystem for developing machine-checked mathematics. "construct formal proofs whose correctness is verified by a proof-checker like Lean."
- Mathlib: The main community-developed library of formalized mathematics for Lean. "To date, early adopters of formal methods have contributed to a formal library for mathematics, Mathlib,"
- PDEs: Partial differential equations, which involve multivariable functions and their partial derivatives, central in modeling continuous phenomena. "Neural networks are also used to compute solutions to PDEs,"
- proof assistant: Software that helps users interactively build and verify formal proofs. "including formalization and proof assistants, symbolic AI and automated reasoning, and machine learning and neural networks."
- proof-checker: A program that automatically verifies the correctness of a complete formal proof. "construct formal proofs whose correctness is verified by a proof-checker like Lean."
- proving agent: An AI system designed to autonomously plan, generate, and verify mathematical proofs, often coordinating multiple tools. "decided to show off the abilities of its newest proving agent, Gauss,"
- Putnam benchmarks: Benchmark sets derived from the William Lowell Putnam Mathematical Competition, used to assess problem-solving performance. "The best systems are now well on their way to saturating Putnam benchmarks."
- SAT solvers: Algorithms and tools that determine the satisfiability of propositional logic formulas, often used to tackle hard combinatorial problems. "Automated reasoning tools such as SAT solvers have solved open problems in combinatorics, algebra, number theory, and discrete geometry."
- sphere packing: The problem of arranging non-overlapping spheres to maximize density in a given dimension. "a project to formalize Viazovska's proof of the optimality of the E8 lattice for 8-dimensional sphere packing."
- symbolic AI: AI approaches based on explicit symbols and formal logic, as opposed to purely statistical methods. "many of them involving neural and symbolic AI."