Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters

Published 26 Mar 2025 in cs.CL | (2503.21004v2)

Abstract: Pulmonary embolism (PE) registries accelerate practice improving research but rely on labor intensive manual abstraction of radiology reports. We examined whether openly available LLMs can automate concept extraction from computed tomography PE (CTPE) reports without loss of data quality. Four Llama 3 variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and one reviewer model, Phi 4 14B, were tested on 250 dual annotated CTPE reports from each of MIMIC IV and Duke University. Accuracy, positive predictive value (PPV) and negative predictive value (NPV) versus a human gold standard were measured across model size, temperature and shot count. Mean accuracy rose with scale: 0.83 (3.0 8B), 0.91 (3.1 8B) and 0.96 for both 70B variants; Phi 4 14B reached 0.98. Accuracy differed by less than 0.03 between datasets, indicating external robustness. In dual model concordance (L3 70B plus Phi 4 14B) PPV for PE presence was at least 0.95 and NPV at least 0.98, while location, thrombus burden, right heart strain and image quality artifacts each achieved PPV of at least 0.90 and NPV of at least 0.95. Fewer than four percent of individual concept annotations were discordant, and full agreement occurred in more than seventy five percent of reports. LLMs therefore provide a scalable, accurate solution for PE registry abstraction, and a dual model review workflow can safeguard data quality with minimal human oversight.