Robust research-level mathematical reasoning by LLMs

Establish robust research-level mathematical reasoning in large language models by substantially improving performance on benchmarks of unpublished, research-level problems such as FrontierMath, thereby closing the gap indicated by the currently low accuracy on that benchmark.

Background

The paper surveys progress in natural-language mathematical reasoning and reports strong performance of recent LLMs on elementary and olympiad-level tasks. However, when models are evaluated on research-level benchmarks, performance drops substantially. In particular, the FrontierMath benchmark comprises unpublished problems designed to assess advanced reasoning beyond standardized exams.

In this context, the authors explicitly note that robust research-level reasoning remains an open problem, highlighting a key capability gap between current LLM competence and the demands of open-ended, graduate- and research-level mathematics.

References

Similarly, on the FrontierMath Benchmark, which consists of unpublished research-level problems, Gemini 3 Pro scores 18.75% on the research-level split, indicating that robust research-level reasoning remains an open problem.

AI for Mathematics: Progress, Challenges, and Prospects (2601.13209 - Ju et al., 19 Jan 2026), Section 3.1 (Natural Language Reasoning), paragraph discussing FrontierMath