
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Published 20 Feb 2025 in cs.CL | (2502.14739v4)

Abstract: LLMs have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

Summary

  • The paper introduces SuperGPQA, a new benchmark featuring over 26,000 expert questions across 285 graduate fields to rigorously evaluate LLM performance, showing top models like DeepSeek-R1 (max 61.82%) underperform in specialized domains.
  • Experiments reveal current LLMs struggle with diverse, nuanced questions across fields, highlighting a gap in cross-disciplinary reasoning skills needed for real-world scenarios.
  • SuperGPQA sets a higher standard for LLM evaluation, pushing the field towards developing more versatile, adaptable intelligent systems capable of true cross-disciplinary understanding.

An Analytical Overview of "SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines"

The proliferation and increasing sophistication of LLMs necessitate rigorous benchmarks to adequately measure their capabilities. The paper "SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines" addresses this need by expanding the evaluation landscape that LLMs are typically tested against, encompassing not only the well-trodden domains of mathematics and computer science but also the more specialized fields like light industry and agriculture.

Key Contributions

"SuperGPQA" notably introduces a comprehensive evaluation dataset that spans 285 disciplines, providing over 26,000 questions crafted with the engagement of human experts and sophisticated LLM systems. Traditional benchmarks have predominantly centered around mainstream academic areas, potentially leaving gaps in the LLMs’ understanding of specialized domains. By employing a Human-LLM collaborative filtering mechanism, SuperGPQA aims to purge trivial or ambiguous questions, thereby ensuring evaluative depth and relevance.
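The paper does not publish the filter's implementation details, but the core idea of Human-LLM collaborative filtering can be sketched as follows. This is an illustrative approximation under assumed data shapes (the `id`/`answer`/`prediction` field names and the triviality/ambiguity heuristics are this sketch's assumptions, not the paper's specification): questions that every reference model answers correctly are flagged as potentially trivial, and questions that every model misses are flagged as potentially ambiguous, with both routed to expert review for revision or removal.

```python
# Illustrative sketch (not the paper's actual implementation) of one
# Human-LLM collaborative filtering pass over candidate questions.

def filter_round(questions, model_answers):
    """questions: list of dicts with 'id' and gold 'answer'.
    model_answers: {model_name: {question_id: chosen_option}}.
    Returns (kept, flagged_for_expert_review)."""
    kept, flagged = [], []
    for q in questions:
        correct = [model_answers[m].get(q["id"]) == q["answer"]
                   for m in model_answers]
        if all(correct):        # every model right -> possibly trivial
            flagged.append(q)
        elif not any(correct):  # every model wrong -> possibly ambiguous
            flagged.append(q)
        else:                   # mixed results -> discriminative, keep
            kept.append(q)
    return kept, flagged

# Toy usage with two mock "models":
questions = [
    {"id": 1, "answer": "A"},  # both models correct -> flagged
    {"id": 2, "answer": "B"},  # mixed results -> kept
]
answers = {"model_x": {1: "A", 2: "B"}, "model_y": {1: "A", 2: "C"}}
kept, flagged = filter_round(questions, answers)
```

In the paper's pipeline, flagged questions would then be iteratively revised by expert annotators rather than simply dropped, so the loop runs until the surviving set is neither trivial nor ambiguous.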

Not only does this benchmark strive to fill the void of uncharted knowledge areas, but it also emphasizes graduate-level nuance by involving a vast pool of expert annotators in its creation. This methodological approach adds to the robustness and applicability of SuperGPQA, offering a more stringent measure of LLMs such as DeepSeek-R1, which notably achieved a maximum accuracy of 61.82% in experiments, suggesting significant room for advancement towards artificial general intelligence (AGI).

Experimental Insights

The experiments underscore existing deficiencies in LLMs when they must comprehend and interpret nuanced questions spanning diverse fields. Even state-of-the-art models such as DeepSeek-R1 and o1-2024-12-17 underperform on the rich and varied question set provided by SuperGPQA, exposing clear gaps in cross-disciplinary reasoning. Consistent with this, the study finds that reasoning-focused models outperform general-purpose models on the harder, more specialized problems, yet still stumble on elementary facts in long-established domains.
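The headline numbers, such as DeepSeek-R1's 61.82% accuracy, are overall figures; the per-field gaps discussed above come from breaking accuracy down by discipline. A minimal sketch of that aggregation, with illustrative field names and toy data (not taken from the paper's results):

```python
# Sketch of per-discipline accuracy aggregation for a multiple-choice
# benchmark like SuperGPQA. Record fields and values are illustrative.
from collections import defaultdict

def accuracy_by_discipline(records):
    """records: iterable of dicts with 'discipline', 'prediction', 'answer'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["discipline"]] += 1
        hits[r["discipline"]] += r["prediction"] == r["answer"]
    return {d: hits[d] / totals[d] for d in totals}

# Toy records for two (real SuperGPQA-style) disciplines:
records = [
    {"discipline": "Agronomy", "prediction": "C", "answer": "C"},
    {"discipline": "Agronomy", "prediction": "A", "answer": "D"},
    {"discipline": "Textile Engineering", "prediction": "B", "answer": "B"},
]
acc = accuracy_by_discipline(records)
```

Reporting accuracy this way, rather than as a single pooled score, is what lets a 285-discipline benchmark localize weaknesses to specific fields such as light industry or agriculture.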

Practical and Theoretical Implications

Practically, deploying SuperGPQA can challenge mainstream LLMs to develop capabilities beyond narrow domains, pushing them toward the holistic understanding that real-world scenarios demand. Theoretically, this work advances the discourse and research on AGI, setting a distinctive standard for the logical sophistication and domain adaptability expected of LLMs.

Future Research Directions

Future inquiries could apply the SuperGPQA dataset to a broader array of LLM architectures, probe discipline-specific weaknesses, and build a clearer understanding of how domain adaptability develops. There is also an opportunity to look beyond current model structures and consider how such broad disciplinary coverage might inform the development of more specialized or personalized models.

Concluding Remarks

SuperGPQA affords the LLM community an expansive and sophisticated evaluation tool that more faithfully reflects the breadth of human knowledge domains. The gap revealed by current top-performing models not only elucidates their limitations but also recalibrates expectations and standards for what AGI represents. Importantly, the benchmark is positioned as both a solid foundation and a call to action for LLM developers to rise to the challenge of creating truly versatile, adaptable intelligent systems.
