First Proof
Abstract: To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.
Explain it Like I'm 14
Overview
This paper is about testing how well today’s AI systems can solve real, research-level math problems on their own. Instead of using contest-style questions or puzzles, the authors collected ten genuine math questions that came up during their own research. They’ve kept the official answers secret for a short time so people can try using AI to solve them without accidentally finding the solutions online.
The title “First Proof” is a playful nod to baking: in bread-making, the “first proof” is when dough is left to rise. Similarly, the authors are letting this idea “rise” in the math and AI community before building a more polished, formal test.
Objectives and Research Questions
The paper focuses on simple but important goals:
- Figure out how to fairly test whether AI can solve real math research questions without human help.
- Use questions that are truly like what mathematicians work on, not simplified problems designed just for grading.
- Avoid “data contamination,” meaning the answers aren’t already on the internet where an AI could just find them.
- Learn how people and AI interact when solving such problems, and what good testing rules should look like.
The actual math questions cover many areas of advanced math. You don’t need to know the details; the idea is that they are the kinds of questions professional mathematicians actually face, such as in algebra, geometry, probability, and graph theory.
Approach and Methods
Here’s how the authors set up their test:
- They collected ten math questions that arose naturally during their research. Each one has a short, human-written proof (no more than about five pages), but the answers haven't been posted online.
- They encrypted the answers and posted them at https://1stproof.org. The answers will be released on February 13, 2026.
- They invited the community to try solving the questions with AI and to share complete transcripts of their attempts. This helps everyone learn what prompts, formats, and grading methods work best.
- They allowed AI systems to use outside resources (like internet searches) to simulate real-life conditions—but since the answers aren’t public, the AI can’t just copy a solution.
- They did small, preliminary tests using advanced public AI models (like GPT-5.2 Pro and Gemini 3.0 Deepthink) in a “one-shot” manner: they asked each model to produce an answer once, without back-and-forth hints. The goal was to keep things fair and simple.
To reduce the chance of training on their data, they disabled data-sharing settings. The authors also explain that human experts will need to grade the proofs, because the final answers aren’t neat numbers or short expressions—there can be multiple correct proofs or counterexamples.
Main Findings and Why They Matter
While this paper doesn't present final scores or a formal benchmark yet, it shares early observations:
- In one-shot tests, top public AI systems struggled with many of these real research questions.
- The design of their test reduces the chance that AIs are just “searching the web” for answers, since the answers aren’t online.
- The questions are varied and realistic, coming from mathematics as it’s actually done, not from contests. This better reflects creative problem solving in math.
- Human grading is necessary because research problems often have more than one correct solution and require reasoning, not just a final number.
This matters because it helps the community understand the current limits of AI in creative, high-level math. It also points the way to building stronger, fairer tests in the future.
Implications and Potential Impact
The project aims to guide how we evaluate and improve AI for math research:
- It could lead to a new kind of benchmark: fewer questions, but more realistic, carefully curated, and human-graded.
- It may help mathematicians see where AI can be useful today (for tasks like searching papers, checking for mistakes, writing code) and where it still needs improvement (creating and proving novel results).
- By sharing experiences from many attempts, the community can learn better prompting strategies, grading standards, and ways to avoid data contamination.
- The team plans to release a second set of questions and possibly work with AI labs to test models before public release, helping ensure fair, informative evaluations.
In short, this paper starts a careful, community-driven effort to measure and improve AI’s ability to tackle the kind of deep, original thinking that real math research requires.
Knowledge Gaps
Below is a single, consolidated list of concrete gaps the paper leaves unresolved, framed to guide future research and benchmark design.
- Benchmark scope and statistical power: With only 10 problems, there is no analysis of how many items are required to detect meaningful differences between models, estimate confidence intervals, or support statistical comparisons across systems and versions.
- Difficulty calibration and metadata: The paper does not quantify per-problem difficulty, prerequisites, or domain coverage, nor provide metadata (e.g., topic tags, proof length, technique type) to enable stratified evaluation or difficulty-adjusted scoring.
- Grading rubric and inter-rater reliability: No formal criteria for correctness, completeness, rigor, or partial credit are specified, and no plan is provided to measure inter-rater agreement (e.g., kappa) or to resolve disagreements among expert graders (a small worked kappa example follows this list).
- Solution format specification: The acceptable format for answers (full proof, counterexample, computational verification, outline with lemmas) is not defined, nor are standards for referencing external results, using computer algebra, or including auxiliary computations.
- Separation of “search” vs “solve” capabilities: The methodology allows unfettered internet access but does not provide controlled variants (e.g., closed-book/offline settings, restricted search) to disentangle retrieval and synthesis from genuine problem-solving.
- One-shot vs interactive protocols: The initial tests were one-shot, but the paper does not define standardized multi-turn interaction protocols, time budgets, or interaction limits, all of which critically affect outcomes and comparability.
- Compute and resource controls: There are no controls or reports for inference-time compute, tool access (code execution, CAS), or model-specific capabilities, making results difficult to reproduce or compare fairly across systems.
- Model versioning and reproducibility: The paper lacks a pre-registered evaluation plan, version pinning, environment specification, and a protocol for repeated runs to quantify variance across queries and sessions.
- Automatic vs human grading: While acknowledging the need for human grading, there is no pathway outlined for partial automation (e.g., proof-checkers, formal verification, unit tests for subclaims), nor criteria for when automation is reliable.
- Post-release contamination management: After solutions are released, the paper does not propose mechanisms to maintain benchmark utility (e.g., new blinded sets, staggered releases, provenance checks) or to track data contamination over time.
- Participant reporting standards: The request for public transcripts is not accompanied by a standardized logging schema (e.g., prompt, system settings, tool usage, timestamps, iterations) to enable comparable and analyzable submissions.
- Baselines and human performance: No human baseline (e.g., time-to-solution for experts vs non-experts) or classical algorithmic baselines are provided to contextualize model performance or to calibrate “research-level” difficulty.
- Coverage and representativeness: The selection (one problem per author, five-page proof cap) may bias toward certain subfields and techniques; there is no analysis of how well the set reflects the “true distribution” of contemporary research tasks.
- Evaluation dimensions beyond correctness: The paper does not propose metrics for solution quality dimensions (e.g., elegance, generality, robustness, error analysis, explanatory clarity) that matter to mathematicians and research workflows.
- Partial progress measurement: There is no scheme to award credit for meaningful partial progress (e.g., reduction to known lemmas, correct intermediate claims, plausible conjectures with evidence) while penalizing gaps and errors.
- Time-to-solution and efficiency metrics: The protocol does not collect or report time-to-solution, tool-call counts, or interaction length, missing critical indicators of practical utility and cost-performance trade-offs.
- Security and privacy assurances: While noting retention policies, the paper does not implement or verify cryptographic commitments, leak detection, or systematic audits to ensure zero exposure of solutions prior to release.
- Licensing and data governance: The paper does not specify licensing for the problem set, expected use constraints, or governance for derivative benchmarks and public leaderboards.
- Machine-readable problem formats: No standardized, machine-readable packaging (e.g., JSON with fields for definitions, assumptions, allowed references) is provided to facilitate automated ingestion, prompting, and grading pipelines (a format sketch follows this list).
- Error taxonomy and failure analysis: There is no plan to categorize model errors (e.g., logical gaps, hallucinated citations, misuse of definitions, invalid inference steps) or to publish systematic failure analyses to guide model improvement.
- Generalization and transfer: The paper does not propose methods to assess whether solving one problem generalizes to related problems or whether models can transfer techniques across domains.
- Higher-level research tasks: The paper explicitly excludes question formulation and framework development; however, no roadmap or protocol is proposed for evaluating these capabilities in subsequent phases.
- Ethical considerations for public experiments: There is no guidance on responsible disclosure (e.g., avoiding toxic prompting, plagiarism checks, citation practices) or on credit assignment for AI-assisted solutions.
- Post-hoc training risks: The paper does not address how public sharing of transcripts and solutions might be used for post-training or RLHF by vendors, nor how to prevent overfitting to the benchmark.
- Reference integrity and comparability: The Related Work section cites private benchmarks but does not define criteria to compare results across datasets, nor standardize evaluation dimensions for fair cross-benchmark comparisons.
- Second-set design commitments: The proposed “second set of questions” is mentioned without a pre-registered plan for sampling, blinding, grading, interaction modes, contamination controls, or long-term maintenance.
- Domain-specific constraints: Some problems rely on highly specialized contexts (e.g., Whittaker models, Lagrangian smoothing), yet the paper does not define allowable background references or provide canonical definitions to minimize misinterpretation by models.
- Proof length and complexity constraints: Imposing a five-page limit introduces selection bias; the paper does not analyze how proof-length constraints affect task realism or propose variants to capture longer, multi-lemma research workflows.
- Leaderboards and reporting: No framework is proposed for public leaderboards, standardized reporting (e.g., per-problem scores, error types), or mechanisms to prevent cherry-picking and ensure fair comparison.
These gaps suggest concrete next steps: pre-register evaluation protocols; define grading rubrics and partial-credit schemes; implement controlled online/offline conditions; specify machine-readable problem formats and logging standards; measure inter-rater reliability; calibrate difficulty; collect time and resource metrics; and design contamination-resilient, renewable test sets with transparent governance.
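To make the inter-rater reliability point concrete, here is a minimal sketch of computing Cohen's kappa between two graders; the three-way grade scale and the grades themselves are invented for illustration and are not results from the paper.

```python
# Minimal sketch: Cohen's kappa between two graders on a 3-way grade scale.
# The grade labels and the example data are invented for illustration.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical grades to the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return 1.0 if p_chance == 1 else (p_observed - p_chance) / (1 - p_chance)

# Hypothetical grades from two expert graders on ten submitted proofs.
grader_a = ["incorrect", "partial", "incorrect", "correct", "incorrect",
            "incorrect", "partial", "incorrect", "incorrect", "partial"]
grader_b = ["incorrect", "partial", "partial", "correct", "incorrect",
            "incorrect", "incorrect", "incorrect", "incorrect", "partial"]
print(f"kappa = {cohens_kappa(grader_a, grader_b):.3f}")
```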
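For the transcript-logging and machine-readable-format gaps, the sketch below shows what a problem record and a transcript entry could look like; every field name here is a hypothetical illustration, not a schema proposed by the paper.

```python
# Minimal sketch of a machine-readable problem record and a transcript log entry.
# All field names are hypothetical illustrations, not a format defined by the paper.
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class ProblemRecord:
    problem_id: str                                    # e.g. "firstproof-06" (hypothetical id)
    statement: str                                     # LaTeX source of the question
    definitions: list = field(default_factory=list)    # canonical definitions the solver may rely on
    assumptions: list = field(default_factory=list)    # standing hypotheses
    allowed_references: list = field(default_factory=list)  # citable background results
    topic_tags: list = field(default_factory=list)     # e.g. ["spectral graph theory"]
    answer_commitment: Optional[str] = None            # hash committing to the encrypted solution

@dataclass
class TranscriptEntry:
    problem_id: str
    model: str              # model name with pinned version
    timestamp: str          # ISO 8601
    prompt: str
    response: str
    tools_used: list = field(default_factory=list)     # e.g. ["web_search", "code_execution"]
    interaction_round: int = 1

if __name__ == "__main__":
    record = ProblemRecord(
        problem_id="firstproof-06",
        statement=r"Does every graph $G$ admit an $\epsilon$-light vertex subset $S$ such that ...?",
        topic_tags=["spectral graph theory"],
    )
    print(json.dumps(asdict(record), indent=2))
```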
Practical Applications
Immediate Applications
Below are deployable ways to use the paper’s methodology and artifacts today, organized by stakeholder and sector. Each item notes tools/workflows that could emerge and key assumptions/dependencies.
- AI evaluation of research-level reasoning (software, academia, policy)
  - Use case: Run head-to-head evaluations of AI systems on the paper’s 10 public, unpublished-at-training-time research questions to measure “first-proof” capability under realistic conditions (browsing allowed, human grading).
  - Tools/workflows: “FirstProofEval” kit combining standardized prompts, transcript logging, contamination audits (pre/post web search traces), and human grading templates.
  - Assumptions/dependencies: Availability of qualified human graders; adherence to the paper’s one-shot vs interactive protocols; careful handling of browsing to avoid leakage from early community posts; legal/ethical compliance for internet use in evaluation.
- Procurement tests for R&D AI assistants (government, pharma, finance, aerospace)
  - Use case: Adopt the paper’s evaluation protocol as a vendor-neutral test suite when procuring AI tools meant to assist scientific and mathematical research teams.
  - Tools/workflows: RFP annex that specifies “unknown-problem” evaluation, transcript disclosure, and contamination reporting; leaderboards with audit trails.
  - Assumptions/dependencies: NDAs protecting yet-to-be-released problems; panel of external graders; comparable browsing configurations across vendors.
- Classroom modules for AI-augmented proof practice (education)
  - Use case: Graduate courses in math/CS integrate the 10 questions to teach prompting strategies, proof verification, and responsible use of AI.
  - Tools/workflows: Assignment templates requiring full interaction transcripts, instructor rubrics mirroring the paper’s grading philosophy, and reflection prompts on data contamination.
  - Assumptions/dependencies: Instructor time; student access to comparable AI systems; departmental policy on AI usage and disclosure.
- Stress-testing automated theorem provers and CAS integrations (software)
  - Use case: Evaluate and improve integrations between LLMs and formal systems (Lean/Isabelle/Coq) or CAS tools by attempting formalizations/proofs of the questions’ lemmas.
  - Tools/workflows: Harness to translate natural-language statements into formal languages; auto-tactic benchmarking; error localization logs tied to human grading.
  - Assumptions/dependencies: Significant formalization effort; community curation of formal statements; permissive licenses for sharing partial formalizations.
- Data governance playbook for evaluation sets (policy/compliance)
  - Use case: Repurpose the paper’s “unknown-to-the-web” approach to build contamination-resistant evaluation sets in other domains (e.g., law, chemistry, biomedicine).
  - Tools/workflows: Pipelines for holdout creation, encryption, delayed release, and transcript-based auditability of external lookups.
  - Assumptions/dependencies: Domain experts able to generate/solve holdouts; secure storage and staged disclosure processes.
- PCG-based solver for kernelized CP with missing data (software/data science; healthcare, recommender systems, climate)
  - Use case: Implement an iterative preconditioned conjugate gradient (PCG) solver that exploits Khatri–Rao/Kronecker and selection-matrix structure for the mode-k RKHS factor update (Question 10’s setup); see the first sketch after this list.
  - Tools/workflows: Drop-in module for TensorLy, MATLAB Tensor Toolbox, PyTorch/JAX backends; GPU-friendly matvecs using (Z ⊗ K), selection S, and kernel-apply primitives; block-Jacobi/diagonal-plus-Kronecker preconditioners.
  - Assumptions/dependencies: SPD kernel K, q ≪ N sparsity of observations, n,r < q, numerically stable kernel application; convergence/regularization tuning.
- Heuristic “ε-light” subgraph finders (networking/cloud)
  - Use case: Even without a proof (Question 6), deploy spectral/SDP heuristics to identify “light” vertex subsets for load shedding, throttling, or sampling in large networks; see the second sketch after this list.
  - Tools/workflows: Laplacian-based filters that search for S maximizing the PSD margin of εL − L_S; integration into graph analytics pipelines.
  - Assumptions/dependencies: No worst-case guarantees yet; rely on empirical validation and fallback safeguards.
- Open challenge platform for community benchmarking (community/industry)
  - Use case: Host a public challenge with the paper’s protocol (complete transcript sharing, browsing allowed), facilitating transparent comparisons and prompt engineering research.
  - Tools/workflows: Kaggle/CF-style platform with encrypted answer vault, submission auditing, and human-grading coordination.
  - Assumptions/dependencies: Funding for moderation and grading; anti-contamination safeguards; clear terms for sharing interactions.
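The first sketch below illustrates the kind of matrix-free PCG solve referred to above: it solves a Kronecker-structured ridge system (A ⊗ B + λI)x = b without ever forming A ⊗ B, using the identity (A ⊗ B)vec(X) = vec(B X Aᵀ) and a diagonal (Jacobi) preconditioner. The particular system and the random stand-in factors are assumptions chosen for illustration; the exact mode-k normal equations of Question 10 are not reproduced here.

```python
# Minimal sketch: matrix-free preconditioned conjugate gradients (PCG) for a
# Kronecker-structured SPD system (A kron B + lam*I) x = b, never forming the
# Kronecker product. A and B are random SPD stand-ins for the structured blocks
# (e.g. Z^T Z and a kernel matrix K); the exact Question 10 system is not reproduced.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, lam = 40, 30, 1e-2

A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)          # SPD stand-in
B = rng.standard_normal((m, m))
B = B @ B.T + m * np.eye(m)          # SPD stand-in

def matvec(x):
    # Use (A kron B) vec(X) = vec(B X A^T); vec stacks columns, hence Fortran order.
    X = x.reshape(m, n, order="F")
    return (B @ X @ A.T).reshape(-1, order="F") + lam * x

op = LinearOperator((n * m, n * m), matvec=matvec)

# Diagonal (Jacobi) preconditioner: the diagonal of A kron B is A[i, i] * B[k, k].
diag = np.outer(np.diag(B), np.diag(A)).reshape(-1, order="F") + lam
precond = LinearOperator((n * m, n * m), matvec=lambda x: x / diag)

b = rng.standard_normal(n * m)
x, info = cg(op, b, M=precond, maxiter=500)
print("converged:", info == 0, "| residual:", np.linalg.norm(matvec(x) - b))
```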
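The second sketch is a naive spectral heuristic for the “ε-light” subset idea: greedily grow a vertex set S while the smallest eigenvalue of εL − L_S stays nonnegative. Reading L_S as the Laplacian of the subgraph induced by S, and the low-degree-first greedy rule, are assumptions made only for illustration; no worst-case guarantee is implied.

```python
# Minimal sketch: greedy spectral heuristic for an "epsilon-light" vertex subset.
# L_S is taken to be the Laplacian of the subgraph induced by S, embedded in the
# full vertex set; that reading and the greedy rule are illustrative assumptions.
import numpy as np
import networkx as nx

def psd_margin(G, S, eps):
    """Smallest eigenvalue of eps*L - L_S (nonnegative means the matrix is PSD)."""
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    L = nx.laplacian_matrix(G, nodelist=nodes).toarray().astype(float)
    L_S = np.zeros_like(L)
    for u, v in G.subgraph(S).edges():
        i, j = idx[u], idx[v]
        L_S[i, i] += 1.0
        L_S[j, j] += 1.0
        L_S[i, j] -= 1.0
        L_S[j, i] -= 1.0
    return float(np.linalg.eigvalsh(eps * L - L_S).min())

def greedy_light_subset(G, eps, tol=1e-9):
    """Greedily add low-degree vertices while eps*L - L_S stays PSD (up to tol)."""
    S = []
    for v in sorted(G.nodes(), key=G.degree):
        if psd_margin(G, S + [v], eps) >= -tol:
            S.append(v)
    return S

if __name__ == "__main__":
    G = nx.erdos_renyi_graph(30, 0.2, seed=1)
    S = greedy_light_subset(G, eps=0.5)
    print(f"kept {len(S)} of {G.number_of_nodes()} vertices, "
          f"margin = {psd_margin(G, S, 0.5):.4f}")
```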
Long-Term Applications
These depend on the planned second release, broader adoption, or on the eventual resolution of the specific mathematics problems listed in the paper.
- Formalized benchmark and standards for scientific reasoning (AI industry, academia, standards bodies)
  - Use case: The planned “second set” evolves into a maintained benchmark with interactivity protocols, rubricized scoring, and contamination audits—usable for capability tracking, model selection, and scientific QA.
  - Tools/workflows: NIST-like benchmark specifications; reproducibility harnesses; standardized browse tools with logging; periodic refresh cycles.
  - Assumptions/dependencies: Sustainable pipeline of curated, solved-but-unpublished problems; multi-stakeholder governance; funding for graders.
- Certification for AI tools used in science policy and procurement (policy/standards)
  - Use case: Regulators and public funders require “unknown-problem, human-graded” certifications for AI used in grant reviewing, regulatory science, or critical R&D decisions.
  - Tools/workflows: Certification schemes with external proctoring, tamper-proof transcript storage, and post-hoc publishing of problem sets.
  - Assumptions/dependencies: Broad consensus on metrics; alignment with privacy/IP rules; interoperability across AI vendors.
- Safer training regimes for proof-capable models (AI R&D)
  - Use case: Use similar holdout structures to train and evaluate RLHF/RLAIF for proof search without leakage, improving reliability of STEM co-pilots.
  - Tools/workflows: Curriculum with synthetic subgoals, decontamination tools, “don’t-guess” penalties, and formal-verification feedback loops.
  - Assumptions/dependencies: Strong guardrails to prevent train–test contamination; scalable human feedback; compute budgets.
- Generalization to other high-stakes domains (healthcare, law, materials science)
  - Use case: Adopt the “first proof” methodology to assess AI on unpublished clinical protocol questions, novel case-law hypotheticals, or unseen synthesis targets in materials.
  - Tools/workflows: Domain-specific graders, secure repositories, and structured answer formats that allow partial credit and human verification.
  - Assumptions/dependencies: Availability of expert solvers and graders; ethical oversight; domain-adjusted grading schemas.
- If specific questions are resolved, downstream sectoral impacts could include:
  - Graph ε-light subset theorem (Question 6) — software/energy/telecom
    - Applications: Constructive sparsification and throttling algorithms for scalable graph processing, congestion control, and resilient grid/network design.
    - Tools: Linear-time or near-linear-time routines integrated into graph libraries; guarantees for PSD-preserving sampling.
    - Dependencies: Existence of constructive proofs and algorithmic reductions.
  - Additive convolution inequality for roots (Question 4) — control/signal processing/numerics
    - Applications: Tighter stability/conditioning bounds for polynomial filtering, robust control design, and root-finding under perturbations.
    - Tools: Certified “stability margin” calculators; improved preconditioners for polynomial eigenproblems.
    - Dependencies: Proof of the inequality and numerically stable estimators of Φ_n.
  - Kernelized CP with missing data theory (Question 10, beyond PCG engineering) — healthcare/climate/media
    - Applications: Higher-fidelity spatiotemporal imputation (EHRs, climate grids, recommender logs) with RKHS priors; uncertainty-aware forecasting.
    - Tools: End-to-end alternating solvers with kernel learning and convergence guarantees; GPU/TPU acceleration.
    - Dependencies: Theoretical convergence/identifiability under missingness and kernel choices.
  - Determinantal tensor relations and separability test (Question 9) — computer vision/ML
    - Applications: Fast tests for multi-view consistency and factor separability; diagnostics for latent-variable models; “Tensor Integrity Checker” SaaS.
    - Tools: Polynomial map F packaged as batched kernels; symbolic–numeric hybrid solvers.
    - Dependencies: Explicit construction of F with degree independent of n; conditioning analysis.
  - ASEP/Macdonald stationary Markov chain (Question 3) — logistics/queueing
    - Applications: New queueing-network models with analytically tractable stationary laws; performance tuning in supply chains and data centers.
    - Tools: Simulators and parameter estimators leveraging the structure of the interpolation polynomials.
    - Dependencies: Existence of constructive transition kernels whose definition does not rely on the interpolation polynomials themselves.
  - Shift-equivalence for the $\Phi^4_3$ measure (Question 1) — computational physics
    - Applications: Correctness of MCMC/samplers for renormalized SPDEs under field shifts; variance-reduced proposals in physical simulations.
    - Tools: Sampler libraries with invariance checks; diagnostics for measure equivalence.
    - Dependencies: Rigorous equivalence criteria and numerically testable surrogates.
  - Equivariant slice filtration characterization (Question 5) — formal methods/materials with symmetries
    - Applications: Computable invariants for symmetry-preserving systems; analysis tools for topological phases and symmetric controllers.
    - Tools: Software for geometric fixed points computations and slice connectivity bounds.
    - Dependencies: Effective algorithms derived from the characterization.
  - Polyhedral Lagrangian smoothing (Question 8) — robotics/CAD
    - Applications: Smooth, physically consistent Lagrangian surface design for manipulation and haptics; path planning honoring symplectic constraints.
    - Tools: CAD plugins for Lagrangian smoothing; motion planners with symplectic feasibility checks.
    - Dependencies: Constructive smoothing procedure and numerical robustness.
  - Uniform lattices with 2-torsion and rationally acyclic manifolds (Question 7) — geometry/topology
    - Applications: Constraints for manifold models in geometric group theory and numerical topology; potential implications for crystal and lattice modeling.
    - Tools: Topology toolkits encoding (im)possibility results to prune search spaces.
    - Dependencies: Clear translation of the classification result to computational criteria.
  - Rankin–Selberg local integral witness functions (Question 2) — number theory/cryptography
    - Applications: Improved computation of local factors/L-functions; potential tuning of cryptographic assumptions relying on number-theoretic heuristics.
    - Tools: Libraries for explicit local integrals with guaranteed nonvanishing regions.
    - Dependencies: Proofs that yield constructive Whittaker functions and stable numerics.
Across both categories, feasibility hinges on: sustained access to expert graders; robust anti-contamination practices; transparent interaction logs; and, for math-derived impacts, constructive proofs translating into algorithms with performance and stability guarantees.
Glossary
- Acyclic (over the rational numbers): A space whose reduced homology groups with coefficients in $\mathbb{Q}$ all vanish. "whose universal cover is acyclic over the rational numbers"
- Additive character: A continuous homomorphism from the additive group of a field to $\mathbb{C}^\times$, often used in number theory. "Let … be a nontrivial additive character of conductor …"
- Admissible representation: For $p$-adic groups, a smooth representation in which the subspace of vectors fixed by any compact open subgroup is finite-dimensional. "Let … be a generic irreducible admissible representation of …"
- ASEP polynomial (interpolation ASEP polynomial): A family of polynomials associated with the asymmetric simple exclusion process, here in an interpolation form. "interpolation ASEP polynomial and interpolation Macdonald polynomial"
- CP decomposition: The CANDECOMP/PARAFAC factorization of a tensor as a sum of rank-1 components. "computing a CP decomposition of rank "
- Conductor ideal: An ideal measuring the level of ramification or minimal level (newform) of a representation. "Let denote the conductor ideal of "
- Equivalence of measures: Two measures are equivalent if they have the same null sets (mutual absolute continuity). "Here, equivalence of measures is in the sense of having the same null sets"
- Equivariant stable category: The stable homotopy category of spectra equipped with an action of a group.
- Non-archimedean local field: A field complete with respect to a discrete valuation and with finite residue field (e.g., $\mathbb{Q}_p$). "Let \(F\) be a non-archimedean local field"
- Positive semidefinite (PSD): A matrix $M$ whose eigenvalues are all nonnegative, implying $x^\top M x \ge 0$ for every vector $x$.
- Uniform lattice: A discrete subgroup of a Lie group whose quotient is compact (cocompact lattice). "Suppose that … is a uniform lattice in a real semi-simple group G"
- Unipotent: A matrix whose eigenvalues are all 1; in algebraic groups, elements $u$ such that $u - 1$ is nilpotent. "upper-triangular unipotent elements."
- Universal cover: A simply connected covering space of a topological space. "whose universal cover is acyclic over the rational numbers"
- Vec operator (vectorization): The linear operation that stacks the columns of a matrix into a single vector. "The \vecop operations creates a vector from a matrix by stacking its columns,"
- Whittaker model: A realization of a representation via Whittaker functions relative to a nontrivial character on a unipotent subgroup. "realized in its -Whittaker model ."
- Zariski-generic: Holding on a Zariski-open dense subset; generic in algebraic geometry. "be Zariski-generic."
- $\Phi^4_3$ measure: The (renormalized) Gibbs measure of the three-dimensional $\Phi^4$ quantum field theory, i.e., the invariant measure of the corresponding SPDE. "the … measure on the space of distributions"