FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Published 12 Sep 2025 in cs.CL and cs.AI | (2509.19319v1)

Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.

Abstract PDF Upgrade to Chat

Summary

The paper introduces FHIR-AgentBench, a comprehensive benchmark for evaluating LLM agents in realistic, interoperable EHR question answering.
It employs a two-step evaluation process using metrics like retrieval precision and answer correctness to assess multi-turn and code-integrated approaches.
Despite the benefits of multi-turn interaction, retrieval precision remains a bottleneck, emphasizing the need for improved parsing of complex FHIR structures.

FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Introduction

LLMs in healthcare have been evolving to handle complex, real-world data in Electronic Health Records (EHRs). The introduction of Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) demands a new approach for evaluating these models, which prior benchmarks have not adequately addressed. FHIR-AgentBench fills this gap by providing a comprehensive benchmark that evaluates the capabilities of LLM agents in the context of FHIR-based question answering, grounded in real-world clinical scenarios.

Figure 1: Overview of the construction process for FHIR-AgentBench.

Benchmark Overview

FHIR-AgentBench evaluates LLM agents across dimensions critical for clinical applications: clinical authenticity, data complexity, and interoperability via FHIR. It comprises 2,931 real-world questions derived from the MIMIC-IV-FHIR dataset, a de-identified patient database known for its comprehensive coverage of clinical events. The benchmark challenges LLMs to perform various complex tasks, including multi-step reasoning, robust data retrieval, and synthesis across diverse FHIR resources.

The benchmark construction process involves selecting relevant questions, generating natural language questions (NLQs) and answers, and verifying these against FHIR resources (Figure 1). The focus on realism ensures that the benchmark tests agents on tasks reflective of actual clinical inquiries, thus providing a more rigorous evaluation of their real-world utility.

Agent Evaluation

In evaluating agent performance, the FHIR-AgentBench employs a two-step task: first, retrieving relevant data from FHIR resources, and second, generating correct answers. Metrics used include retrieval precision and recall, alongside answer correctness. Notably, retrieval precision is often low due to the extensive query noise introduced during the retrieval phase, underscoring the benchmark's challenging nature.

Our experiments show that multi-turn agents, equipped with code interpreters, significantly outperform single-turn approaches, achieving up to 50% answer correctness. These results highlight the importance of iterative reasoning and procedural logic when dealing with FHIR’s nested data structures. However, the challenge remains substantial, with no single LLM surpassing others convincingly across all metrics.

Key Findings

Multi-Turn Interaction: Agents capable of iterative reasoning exhibit higher recall and better answer correctness. This suggests that decomposing tasks into multiple interactions aids in navigating FHIR’s complexity.
Code Integration: The use of code generation improves agents' capabilities to parse and interpret data, crucial for understanding FHIR’s graph-like resource structures.
Retrieval Challenges: Despite retrieval attempts, the precision remains a bottleneck, impacting the agents' ability to synthesize accurate responses from retrieved data.
Figure 2: Relative gain in answer correctness across precision buckets for the o4-mini model under different agent frameworks. Gains are calculated after filtering to questions with perfect recall, isolating the effect of unnecessary resources on answer correctness.

Error Analysis

Understanding agent failure modes provides crucial insights for future improvements. Common errors include:

Mapping Clinical Concepts: Agents often misidentify resource types, leading to retrieval errors. Tools generating overly specific queries contribute to missed relevant data.
Parsing Failures: Even when resources are correctly retrieved, agents struggle to interpret the intricate FHIR structures, such as navigating references or filtering correctly on relevant criteria.
Figure 3: Illustration of FHIR resource ID translation process from SQL query.

Conclusion

FHIR-AgentBench presents a significant step forward in benchmarking LLMs for realistic EHR interoperability. By challenging agents with complex, clinically relevant tasks grounded in the FHIR framework, it offers a detailed view of current capabilities and areas needing improvement. Future research could expand the dataset scope and refine agent architectures to better handle FHIR’s intricacies, ultimately aiming for robust, deployable clinical AI applications. By openly releasing this benchmark, we provide a platform for advancing the field toward truly interoperable AI solutions in healthcare.

Markdown Report Issue