Papers
Topics
Authors
Recent
Search
2000 character limit reached

Benchmarking Reasoning Reliability in Artificial Intelligence Models for Energy-System Analysis

Published 16 Oct 2025 in cs.AI, cs.SY, and eess.SY | (2510.19836v1)

Abstract: Artificial intelligence and machine learning are increasingly used for forecasting, optimization, and policy design in the energy sector, yet no standardized framework exists to evaluate whether these systems reason correctly. Current validation practices focus on predictive accuracy or computational efficiency, leaving the logical integrity of analytical conclusions untested. This study introduces the Analytical Reliability Benchmark (ARB), a reproducible framework that quantifies reasoning reliability in LLMs applied to energy system analysis. The benchmark integrates five submetrics: accuracy, reasoning reliability, uncertainty discipline, policy consistency, and transparency, and evaluates model performance across deterministic, probabilistic, and epistemic scenarios using open technoeconomic datasets (NREL ATB 2024, DOE H2A/H2New, IEA WEO 2024). Four frontier models (GPT-4/5, Claude 4.5 Sonnet, Gemini 2.5 Pro, Llama 3 70B) were tested under identical factual and regulatory conditions. Results show that reasoning reliability can be objectively measured. GPT-4/5 and Claude 4.5 Sonnet achieved consistent and policy-compliant reasoning (Analytical Reliability Index greater than 90), Gemini 2.5 Pro demonstrated moderate stability, and Llama 3 70B remained below professional thresholds. Statistical validation confirmed that these differences are significant and reproducible. The ARB establishes the first quantitative method in the energy literature for verifying causal, probabilistic, and policy-driven reasoning in artificial intelligence systems, providing a reference framework for trustworthy and transparent analytical applications in the global energy transition.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (1)

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.