Robust research-level mathematical reasoning by LLMs
Establish robust research-level mathematical reasoning in large language models, demonstrated by significantly improved performance on unpublished benchmarks such as FrontierMath, thereby closing the gap indicated by current low accuracy on that benchmark.
References
Similarly, on the FrontierMath Benchmark, which consists of unpublished research-level problems, Gemini 3 Pro scores 18.75% on the research-level split, indicating that robust research-level reasoning remains an open problem.
— AI for Mathematics: Progress, Challenges, and Prospects
(2601.13209 - Ju et al., 19 Jan 2026) in Section 3.1 (Natural Language Reasoning), paragraph discussing FrontierMath