Humanity’s Last Exam Benchmark

Updated 5 February 2026

Humanity’s Last Exam (HLE) is a comprehensive benchmark featuring 2,500 expert-crafted, multimodal questions that test advanced AI reasoning beyond memorization.
The exam integrates automatic difficulty screening and dual-stage expert curation to ensure high-quality, non-trivial assessments across over 100 subjects.
HLE also models civilization-scale survival risks by quantifying existential threats such as AI, nuclear war, and climate change, guiding strategic policy measures.

Humanity’s Last Exam (HLE) denotes a rigorous, frontier-level benchmark for evaluating both artificial intelligence reasoning and humanity’s survival against existential risk. The term is used in two interrelated domains: (1) as a multi-disciplinary, closed-ended test of AI/LLM capabilities at the limits of formalized human knowledge, and (2) as a metaphor and quantitative model for civilization-scale survival threats constituting the “Great Filter.” Both usages emphasize the necessity of surpassing saturated metrics and addressing challenges that have historically resisted automation—either in the form of robust academic reasoning or world-scale resilience.

1. Genesis and Purpose of Humanity’s Last Exam as an AI Benchmark

HLE was introduced in response to the rapid saturation of existing benchmarks such as MMLU, on which frontier LLMs now consistently exceed 90% accuracy, which diminishes their discriminatory power at the high end of capability (Phan et al., 24 Jan 2025). HLE is conceived as the “final closed-ended academic benchmark,” featuring 2,500 expert-authored questions that span over 100 subjects and integrate multimodal content (text, images, diagrams). Questions are original or non-trivial syntheses, each with exactly one unambiguous, easily verifiable solution, and are explicitly designed so that retrieval and memorization are ineffective: high performance requires a genuine combination of domain concepts and novel reasoning.

Construction involved a global network of nearly 1,000 subject-matter experts from 500 institutions in 50 countries, with a multi-stage review pipeline: (a) automatic difficulty screening against cutting-edge LLMs (only questions on which the models scored at or below chance were accepted), and (b) two rounds of expert curation to ensure clarity, correctness, and domain coverage. The test set is split into public and private (held-out) subsets to detect benchmark memorization and ensure longitudinal validity.
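The screening rule in step (a) can be illustrated with a minimal sketch. The function name, the per-model correctness flags, and the reading of "chance" as 1/(number of options) for multiple-choice items and 0 for exact-match items are assumptions made for illustration, not details taken from the HLE pipeline.

```python
from typing import Optional, Sequence


def passes_difficulty_screen(
    model_correct: Sequence[bool],
    n_options: Optional[int] = None,
) -> bool:
    """Return True if a candidate question is hard enough for expert review.

    model_correct: one flag per screening model (hypothetical input format).
    n_options: number of choices for a multiple-choice item, None for exact-match.
    """
    if not model_correct:
        raise ValueError("need at least one screening model result")
    # Assumed chance level: uniform guessing for multiple choice, zero otherwise.
    chance = 1.0 / n_options if n_options else 0.0
    accuracy = sum(model_correct) / len(model_correct)
    return accuracy <= chance
```

Under this reading, an exact-match question survives screening only if every screening model fails it, while a five-option question survives if the models collectively do no better than 20%.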

2. Formal Properties, Evaluation Setup, and Domain Coverage

Dataset Structure and Protocol

HLE contains approximately 2,500 questions, of which ∼80% are exact-match, short-answer items; the remainder are complex multiple-choice questions with at least five alternatives (Phan et al., 24 Jan 2025, Chen et al., 28 Oct 2025). Around 10% include images or diagrams requiring multimodal interpretation. The subject pool encompasses mathematics, physics, chemistry, computer science, engineering, biology/medicine, humanities, social sciences, law, and specialized trivia, targeting advanced undergraduate to postdoctoral-level proficiency.

Judging is automated: candidate answers are parsed for a structured rationale, a final answer, and a confidence score (0–100%). Grading uses an LLM-as-judge system (e.g., GPT-4o or o3-mini), which compares the extracted answer to the ground truth using strict matching or tolerance-based numerical acceptance. Performance metrics include:

Accuracy: $\mathrm{Acc} = \frac{\#\,\mathrm{correct}}{\#\,\mathrm{total}} \times 100\%$.
Calibration: Root-mean-square calibration error (RMS-CE),

$\mathrm{RMSCE} = \sqrt{\sum_{k=1}^K \frac{|B_k|}{N} \left[\mathrm{acc}(B_k) - \mathrm{conf}(B_k)\right]^2},$

where $B_k$ denotes the set of items in the $k$th of $K$ confidence bins, $N$ is the total number of items, and $\mathrm{acc}(B_k)$ and $\mathrm{conf}(B_k)$ are the mean accuracy and mean stated confidence within that bin.
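The two metrics above can be computed as in the following sketch. It assumes confidences rescaled to the interval [0, 1] and equal-width bins; the binning scheme and the function names are illustrative choices, not the benchmark's reference implementation.

```python
import math
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Percentage of items answered correctly."""
    return 100.0 * sum(correct) / len(correct)


def rms_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """RMS calibration error with equal-width confidence bins (assumed scheme).

    confidences are in [0, 1]; the result is on the same 0-1 scale.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        k = min(int(conf * n_bins), n_bins - 1)
        bins[k].append((conf, ok))
    total = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing to the sum
        mean_conf = sum(c for c, _ in items) / len(items)
        mean_acc = sum(1 for _, ok in items if ok) / len(items)
        total += (len(items) / n) * (mean_acc - mean_conf) ** 2
    return math.sqrt(total)
```

For instance, a model that reports 90% confidence on four items and gets one right has a single occupied bin with a gap of 0.65, so its RMS-CE is 0.65.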

Test protocols disallow open Internet access or external retrieval, except in tool-augmented agentic settings.
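The strict-matching and tolerance-based comparison mentioned in the judging description can be sketched as follows. In the benchmark itself this comparison is delegated to an LLM judge; the function below is only a rule-based stand-in, and its name, normalization steps, and default tolerance are assumptions.

```python
import math


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case for string comparison."""
    return " ".join(text.split()).casefold()


def answers_match(predicted: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Compare an extracted final answer to the ground truth.

    Numeric answers are accepted within a relative tolerance; everything
    else falls back to a case- and whitespace-insensitive string match.
    """
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except ValueError:
        pass  # at least one side is not a plain number
    return _normalize(predicted) == _normalize(gold)
```

A rule of this kind cannot recognize semantically equivalent answers written in different forms, which is one reason an LLM judge is used in practice.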

Metric	Definition	Purpose
Accuracy	Fraction of exact matches between system and gold answers	Baseline comparison
RMS-CE	Square root of the bin-weighted mean of $\left[\mathrm{acc}(B_k)-\mathrm{conf}(B_k)\right]^2$ across confidence bins	Calibration
Pass@k	Probability that at least 1 of $k$ samples is correct, for code/agentic tasks	Robustness/exploration
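Pass@k is commonly estimated from $n$ sampled attempts per problem, of which $c$ are correct, using the combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ familiar from the code-generation literature. The source does not state which estimator HLE-style evaluations use, so the sketch below rests on that assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: samples drawn for the problem, c: samples judged correct, k <= n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With $k = 1$ the estimator reduces to the plain fraction of correct samples, $c/n$.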

3. Empirical Results and Trends in Advanced AI Reasoning

HLE is a demanding stress test for current and next-generation agents. On the standard public split, leading LLMs without specialized agentic orchestration or tool use (e.g., GPT-4o, DeepSeek-R1) achieve accuracy ranging from 3.3% to 9.4%, with RMS-CE values above 80%, indicating extreme overconfidence and a lack of grounding (Phan et al., 24 Jan 2025, Vanhoyweghen et al., 19 Aug 2025). Tool-augmented agentic systems narrow this gap:

X-Masters (scattered-and-stacked, tool-augmented architecture): 32.1% (Chai et al., 7 Jul 2025)
ToolOrchestra (RL-trained orchestrator, modular tool composition): 37.1% (Su et al., 26 Nov 2025)
AgentFrontier-30B-A3B, when fine-tuned with ZPD-guided synthetic data: 25.7–28.6% (Chen et al., 28 Oct 2025)
ReThinker (confidence-gated, multi-phase solver-critic-selector): 52.18% (Pass@1) with Gemini-3-Pro (Tang et al., 4 Feb 2026)

Domain-specific breakdowns highlight consistent trends: mathematics, natural sciences, and computer science remain the hardest domains; engineering and humanities, when tackled by strong agentic workflows, show substantial relative improvement (28–32%, e.g., with X-Masters).

In code generation, HLCE (Humanity’s Last Code Exam) extends the HLE paradigm to algorithmic programming, using the most difficult ICPC/IOI problems. State-of-the-art models achieve only 11–16% pass@1, far below expert human levels (70–90%) (Li et al., 15 Jun 2025).

4. Methodological Innovations and Agentic Architectures

Recent progress in HLE has been driven by developments in agentic orchestration, dynamic tool selection, and multi-stage inference. Notable features include:

Code as Interaction Language: Agents (e.g., X-Master) emit Python code blocks to access built-in scientific libraries (NumPy, SciPy), retrieval tools, and custom web parsing, integrating external computation natively within their reasoning loop (Chai et al., 7 Jul 2025).
Scattered-and-Stacked Agentic Workflows: Forward solution “scatter” via multiple independent solvers, followed by iterative critical assessment, re-writing, and final selection stages. This architecture systematically increases the probability of discovering correct reasoning chains while excising compounding errors (Chai et al., 7 Jul 2025).
Confidence-Gated Computation: Workflows such as ReThinker dynamically allocate compute effort according to uncertainty, invoking multiple candidate solutions, critic-guided reflection, and position-robust aggregation through Latin square permutations and perplexity-weighted confidence scores (Tang et al., 4 Feb 2026).
Efficiency-Driven Orchestration: ToolOrchestra leverages lightweight 8B orchestrator models to optimize a multi-criteria reward (solved/not, cost, latency, tool preferences), yielding state-of-the-art HLE accuracy at a fraction of the compute cost of monolithic transformer inference (Su et al., 26 Nov 2025).
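The idea of confidence-gated computation can be illustrated with a deliberately simplified sketch: sample a few candidate answers, and spend more samples only when agreement among them is low. This is not ReThinker's algorithm, which the text above describes as using critic-guided reflection, Latin square permutations, and perplexity-weighted scores; the agreement-based confidence proxy, the `solver` callable, and all thresholds are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def gated_answer(
    solver: Callable[[str], str],
    question: str,
    initial_samples: int = 3,
    max_samples: int = 9,
    threshold: float = 0.7,
) -> Tuple[str, float, int]:
    """Return (answer, agreement, samples_used).

    Agreement is the share of samples that voted for the leading answer and
    serves here as a crude stand-in for model confidence.
    """
    if initial_samples < 1 or max_samples < 1:
        raise ValueError("sample counts must be positive")
    votes: Counter = Counter()
    used = 0
    while used < max_samples:
        # First round draws `initial_samples`; later rounds add one at a time.
        batch = initial_samples if used == 0 else 1
        for _ in range(min(batch, max_samples - used)):
            votes[solver(question)] += 1
            used += 1
        answer, count = votes.most_common(1)[0]
        if count / used >= threshold:
            break  # confident enough, stop spending compute
    return answer, count / used, used
```

Easy questions on which the solver is consistent exit after the initial batch, while contested questions consume up to `max_samples` calls, which is the compute-allocation behavior the gating is meant to capture.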

5. Calibration, Error Patterns, and Model Limitations

Despite structural advances, HLE exposes extreme miscalibration in model confidence, as well as persistent error modes:

Low predictive value of chain-of-thought (CoT) length: On HLE, output verbosity does not correlate with answer correctness, contrasting sharply with mid-difficulty benchmarks (Vanhoyweghen et al., 19 Aug 2025).
Lexical markers of uncertainty dominate error prediction: The presence of hedging words (“guess,” “stuck,” “hard”) in model rationales robustly signals incorrect answers. A simple filter on these tokens outperforms self-reported confidence as a calibration measure (MCC = 0.215 vs. 0.085) (Vanhoyweghen et al., 19 Aug 2025).
Tool-based workflows correct factual look-up and numerical computation errors, but reasoning failures due to subtle logic or domain ambiguity remain the main bottleneck (Chai et al., 7 Jul 2025).

This persistent gap underlines the need for calibration-aware, uncertainty-sensitive architectures, and motivates continued research into methodologically robust post-hoc calibration filters—for example, deferring to fallback processes whenever hedging or uncertainty appears in reasoning (Vanhoyweghen et al., 19 Aug 2025).
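A post-hoc filter of the kind described above can be sketched in a few lines, together with the Matthews correlation coefficient (MCC) used to score it. The marker list contains only the three words quoted in the text, and the function names are illustrative; the cited study's actual lexicon and evaluation code are not reproduced here.

```python
import math
import re
from typing import Sequence

# Marker words quoted in the text above; a real filter would likely use more.
HEDGE_PATTERN = re.compile(r"\b(guess|stuck|hard)\b", re.IGNORECASE)


def predicts_correct(rationale: str) -> bool:
    """Predict 'correct' only when no hedging marker appears in the rationale."""
    return HEDGE_PATTERN.search(rationale) is None


def mcc(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Matthews correlation coefficient for two boolean label sequences."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a deferral setup, answers for which `predicts_correct` returns False would be routed to a fallback process instead of being returned directly.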

6. The Existential “Humanity’s Last Exam”: Survival Analysis and Policy Meta-Benchmark

In a separate but interconnected thread, “Humanity’s Last Exam” has been adopted as a framework for quantifying humanity's survival against existential risks—the so-called “Great Filter” (Jiang et al., 2022, Jiang et al., 2022). Here, the “exam” is formalized via survival analysis:

$T = \min\{T_\text{nuclear}, T_\text{climate}, T_\text{asteroid}, T_\text{AI}, T_\text{pandemic}\}$

Because the threats are treated as independent, $P(T > t) = \prod_i P(T_i > t)$, so the corresponding survival function is $S(t) = \prod_i S_i(t)$, where $S_i(t)$ is the survival curve for the $i$th existential threat. Simulated expected survival times:

Threat	Mean Time to Failure $E[T_i]$
Pandemic	16 years
Artificial Intelligence	40 years
Nuclear War	60 years
Climate Change	193 years
Asteroid Impact	1,754 years

This mathematical model prioritizes policy and response efforts according to the imminence of each “question.” The framing produces a strategic roadmap to extend $T$ (effective civilization lifespan) towards the high-value tail, integrating risk mitigation protocols for each domain.
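To make the composition of risks concrete, the sketch below assumes each threat's time to failure is exponentially distributed with the mean listed in the table. That distributional choice is an assumption for illustration (the source reports only simulated means); under it, the hazard rates add and the combined mean is the reciprocal of their sum.

```python
import math

# Mean times to failure in years, taken from the table above.
MEAN_YEARS = {
    "pandemic": 16.0,
    "artificial_intelligence": 40.0,
    "nuclear_war": 60.0,
    "climate_change": 193.0,
    "asteroid_impact": 1754.0,
}


def combined_survival(t: float, means=MEAN_YEARS) -> float:
    """S(t) = prod_i S_i(t) with S_i(t) = exp(-t / E[T_i]) (assumed exponential)."""
    return math.prod(math.exp(-t / m) for m in means.values())


def combined_mean(means=MEAN_YEARS) -> float:
    """E[T] for T = min_i T_i: the reciprocal of the summed hazard rates."""
    return 1.0 / sum(1.0 / m for m in means.values())
```

Under the exponential assumption, `combined_mean()` comes out at roughly nine years, noticeably shorter than the 16-year mean of the most imminent single threat, because every additional risk adds to the total hazard.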

7. Operationalization as an HLAI Test and Theoretical Foundations

Finally, the “Humanity’s Last Exam” is formalized as a reproducible test for Human-Level Artificial Intelligence (HLAI): the Language Acquisition Test (LAT) (Park et al., 2020). In this definition, an agent is HLAI if, for any direct experience $e$ in a Markov Decision Process (MDP), there exists a language description $d$ such that updating the agent’s policy on $d$ brings it within an arbitrarily small KL-divergence $\epsilon$ of the update from $e$. Writing $\pi_e$ and $\pi_d$ for the policy after updating on $e$ and on $d$, respectively, the condition reads:

$D_{\mathrm{KL}}\left(\pi_e \,\|\, \pi_d\right) < \epsilon$

The LAT is then specified as a staged curriculum in a simulated “infant-like” environment: agents must update their action-value function in response to symbolic, reward-free descriptions, directly mirroring human language-based policy refinement. Passing the “exam” requires demonstrable, language-only acquisition across a sequence of unseen tasks, operationalizing the benchmark for true HLAI.
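The KL-divergence criterion in the definition above can be checked numerically for discrete action distributions. The sketch below is illustrative only: the function names, the representation of a policy at a single state as a probability vector, and the tolerance value are assumptions, not part of the LAT specification.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in nats for two discrete distributions over the same actions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def updates_agree(
    policy_from_experience: Sequence[float],
    policy_from_description: Sequence[float],
    epsilon: float = 1e-3,
) -> bool:
    """True if the two updated policies are within epsilon in KL divergence."""
    return kl_divergence(policy_from_experience, policy_from_description) < epsilon
```

A full test would repeat this comparison over many states and tasks; the single-state check only shows what "within an arbitrarily small KL-divergence" means operationally.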

References

(Phan et al., 24 Jan 2025) Humanity's Last Exam
(Vanhoyweghen et al., 19 Aug 2025) Lexical Hints of Accuracy in LLM Reasoning Chains
(Chai et al., 7 Jul 2025) SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation
(Tang et al., 4 Feb 2026) ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control
(Chen et al., 28 Oct 2025) AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
(Su et al., 26 Nov 2025) ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
(Li et al., 15 Jun 2025) Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
(Park et al., 2020) A Definition and a Test for Human-Level Artificial Intelligence
(Jiang et al., 2022) Avoiding the Great Filter: A Simulation of Important Factors for Human Survival
(Jiang et al., 2022) Avoiding the "Great Filter": Extraterrestrial Life and Humanity's Future in the Universe