SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Published 25 Aug 2023 in cs.CL | (2308.13149v2)

Abstract: Recently, there has been growing interest in using LLMs for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from data leakage problem and lacks the evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to prevent evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research ability evaluation of LLMs. Comprehensive experiments on most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. The codes and data are publicly available on https://github.com/OpenDFM/SciEval.

Summary

The paper presents a novel benchmark, SciEval, that evaluates LLMs using Bloom's taxonomy across disciplines like chemistry, physics, and biology.
It introduces a multi-level evaluation system that assesses knowledge, application, scientific calculation, and research ability through static, dynamic, and experimental data.
Experimental results reveal gaps in knowledge application and higher-order cognitive tasks, emphasizing the need for refined data generation and evaluation strategies.

Overview of "SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research"

The paper "SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research" introduces a benchmark for evaluating LLMs in scientific research contexts. It targets two limitations of existing benchmarks: they rely mostly on pre-collected objective questions, which makes them vulnerable to data leakage, and they do not test subjective question answering. SciEval is multi-disciplinary, covering chemistry, physics, and biology, and grounds its assessment of LLMs' cognitive capabilities in Bloom's taxonomy.

Evaluation System and Cognitive Dimensions

The SciEval benchmark builds its evaluation system on Bloom's taxonomy. The system spans four dimensions: Basic Knowledge, Knowledge Application, Scientific Calculation, and Research Ability. These dimensions correspond to six cognitive levels, so the benchmark assesses an LLM's ability to recall, understand, apply, analyze, evaluate, and create based on scientific principles.

Figure 1: Illustration of the evaluation system. SciEval covers three disciplines with many sub-topics and investigates four abilities, corresponding to six cognitive levels.

Data Collection Process

Data for SciEval falls into three types: Static Data, Dynamic Data, and Experimental Data. Static Data is curated from existing sources such as Socratic Q&A, MedQA, and PubMedQA to ensure a diverse range of questions. Dynamic Data is generated from scientific principles rather than collected, which mitigates the risk of data leakage and allows the subset to be updated periodically. Experimental Data consists of subjective questions on experimental design and analysis, which test the more nuanced research abilities of LLMs.

Figure 2: Data Collection steps of Static Data.
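The idea behind Dynamic Data can be illustrated with a minimal sketch. It is not the generator used by the SciEval authors: the formula (Newton's second law), the function name make_dynamic_question, the value ranges, and the distractor scheme are all assumptions chosen for illustration. The point is only that a question built from a principle and a random seed can be regenerated with new values, so a memorized answer does not help.

```python
import random


def make_dynamic_question(seed):
    """Build one multiple-choice question from F = m * a (hypothetical example)."""
    rng = random.Random(seed)      # seeded, so a given release is reproducible
    mass = rng.randint(1, 20)      # kilograms (assumed range)
    accel = rng.randint(1, 10)     # metres per second squared (assumed range)
    answer = mass * accel          # newtons

    # Distractors mimic plausible mistakes; the last three are fallbacks that
    # guarantee four distinct options even when the mistakes coincide.
    candidates = [mass + accel, mass * (accel + 1), (mass + 1) * accel,
                  answer + 1, answer + 2, answer + 3]
    options = [answer]
    for value in candidates:
        if len(options) == 4:
            break
        if value not in options:
            options.append(value)
    rng.shuffle(options)

    return {
        "question": (f"A net force accelerates a {mass} kg object at "
                     f"{accel} m/s^2. What is the force in newtons?"),
        "choices": options,
        "answer": "ABCD"[options.index(answer)],   # letter of the correct option
        "value": answer,
    }


if __name__ == "__main__":
    print(make_dynamic_question(seed=2023))
```

Changing the seed for each benchmark release yields a fresh question set while keeping every release reproducible.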

Evaluation Methodology

SciEval evaluates models in three settings: Answer-Only (AO), Chain-of-Thought (CoT), and 3-Shot. Comparing the settings shows how a model's accuracy changes when it is asked to reason step by step or is given worked examples, which says more about its reasoning and application capabilities than a single accuracy figure.

Figure 3: An example of the prompt used for the AO setting. The red text is the response from the model, while the black text is the input prompt.

Figure 4: An example of the prompt used for the CoT setting. The red text is the response from the model, while the blue and black text are the input prompt.
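The difference between the three settings can be sketched as a prompt builder. The wording of the instructions below is an assumption, not the prompt text from the paper's figures, and build_prompt and format_question are hypothetical helper names.

```python
def format_question(question, choices):
    """Render a question followed by lettered options."""
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    return "\n".join(lines)


def build_prompt(question, choices, setting="AO", examples=()):
    """Return a prompt for the AO, CoT, or 3-Shot setting (assumed wording)."""
    body = format_question(question, choices)
    if setting == "AO":
        # Answer-Only: ask for the option letter with no reasoning.
        return f"{body}\nAnswer with the option letter only.\nAnswer:"
    if setting == "CoT":
        # Chain-of-Thought: ask the model to reason before answering.
        return f"{body}\nLet's think step by step, then give the option letter.\nAnswer:"
    if setting == "3-Shot":
        # 3-Shot: prepend three solved examples as (question, choices, letter).
        if len(examples) != 3:
            raise ValueError("the 3-Shot setting needs exactly three examples")
        shots = [f"{format_question(q, c)}\nAnswer: {a}" for q, c, a in examples]
        return "\n\n".join(shots + [f"{body}\nAnswer:"])
    raise ValueError(f"unknown setting: {setting}")


if __name__ == "__main__":
    print(build_prompt("Which particle carries a negative charge?",
                       ["Proton", "Neutron", "Electron", "Photon"],
                       setting="CoT"))
```

Keeping the question formatting in one helper means the three settings differ only in the instruction and the prepended examples, so accuracy differences can be attributed to the setting itself.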

Experimental Results and Analysis

Static and Dynamic Data Results

On Static Data, GPT-4 achieves the highest average accuracy, although gaps remain in Knowledge Application and Scientific Calculation. On Dynamic Data, a few models perform notably well on counting and calculation problems, but most LLMs struggle with the chemistry and physics questions, indicating significant knowledge gaps.

Figure 5: Accuracy of each LLM on Static Data in the Answer-Only, Chain-of-Thought, and 3-Shot settings.
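Per-dimension accuracy on objective questions, the kind of breakdown discussed above, can be computed as in the following sketch. The record layout (the dimension, prediction, and answer keys) and the function name accuracy_by_dimension are assumptions, and the sample records are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: fraction of records whose prediction matches the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dimension = record["dimension"]
        total[dimension] += 1
        correct[dimension] += int(record["prediction"] == record["answer"])
    return {dimension: correct[dimension] / total[dimension] for dimension in total}


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"dimension": "Basic Knowledge", "prediction": "A", "answer": "A"},
        {"dimension": "Basic Knowledge", "prediction": "C", "answer": "C"},
        {"dimension": "Scientific Calculation", "prediction": "B", "answer": "D"},
        {"dimension": "Scientific Calculation", "prediction": "D", "answer": "D"},
    ]
    print(accuracy_by_dimension(sample))
```

Reporting accuracy per dimension rather than as one pooled number is what makes gaps in a specific ability, such as Scientific Calculation, visible.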

Experimental Data

On Experimental Data, GPT-4 and the Claude models were proficient in experimental principles and design but consistently underperformed when analyzing experimental results, revealing a prevalent weakness in higher-order cognitive tasks.

Conclusion and Implications

SciEval is a multi-faceted benchmark for assessing LLMs in scientific contexts. It addresses two gaps in existing evaluation paradigms: exposure to data leakage and the absence of subjective questions. Future work may focus on refining dynamic data generation methods, improving the evaluation of subjective questions, and expanding the benchmark to a broader range of scientific fields. By providing a comprehensive way to evaluate the scientific research capabilities of LLMs, SciEval supports further progress in applying AI to scientific inquiry.
