Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
Abstract: Pulmonary embolism (PE) registries accelerate practice-improving research but rely on labor-intensive manual abstraction of radiology reports. We examined whether openly available large language models (LLMs) can automate concept extraction from computed tomography PE (CTPE) reports without loss of data quality. Four Llama 3 variants (3.0 8B, 3.1 8B, 3.1 70B, 3.3 70B) and one reviewer model, Phi 4 14B, were tested on 250 dual-annotated CTPE reports from each of MIMIC-IV and Duke University. Accuracy, positive predictive value (PPV) and negative predictive value (NPV) versus a human gold standard were measured across model size, temperature and shot count. Mean accuracy rose with scale: 0.83 (3.0 8B), 0.91 (3.1 8B) and 0.96 for both 70B variants; Phi 4 14B reached 0.98. Accuracy differed by less than 0.03 between datasets, indicating external robustness. Under dual-model concordance (Llama 3 70B plus Phi 4 14B), PPV for PE presence was at least 0.95 and NPV at least 0.98, while location, thrombus burden, right heart strain and image quality artifacts each achieved PPV of at least 0.90 and NPV of at least 0.95. Fewer than 4% of individual concept annotations were discordant, and full agreement occurred in more than 75% of reports. LLMs therefore provide a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can safeguard data quality with minimal human oversight.
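As an illustration of the metrics and the dual-model concordance idea described in the abstract, the following is a minimal Python sketch, not the authors' pipeline. It assumes each concept (for example, PE presence) has been reduced to a boolean label per report; the function names `binary_metrics` and `concordant_labels` are hypothetical, and treating a disagreement between the two models as a flag for human review is an assumption about how such a workflow might be wired.

```python
from typing import List, Optional, Sequence


def binary_metrics(pred: Sequence[bool], gold: Sequence[bool]) -> dict:
    """Accuracy, PPV and NPV of predicted labels against a gold standard."""
    pairs = list(zip(pred, gold))
    tp = sum(1 for p, g in pairs if p and g)          # true positives
    tn = sum(1 for p, g in pairs if not p and not g)  # true negatives
    fp = sum(1 for p, g in pairs if p and not g)      # false positives
    fn = sum(1 for p, g in pairs if not p and g)      # false negatives
    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "npv": tn / (tn + fn) if (tn + fn) else nan,
    }


def concordant_labels(
    model_a: Sequence[bool], model_b: Sequence[bool]
) -> List[Optional[bool]]:
    """Keep a label only where both models agree.

    Discordant reports yield None, which in this sketch stands for
    "route to a human reviewer" (an assumption, not the paper's code).
    """
    return [a if a == b else None for a, b in zip(model_a, model_b)]


if __name__ == "__main__":
    # Toy, made-up labels purely for demonstration.
    gold = [True, False, False, True, False]
    extractor = [True, False, True, True, False]
    reviewer = [True, False, False, True, False]

    agreed = concordant_labels(extractor, reviewer)
    # Score only the reports where both models concur.
    kept = [(a, g) for a, g in zip(agreed, gold) if a is not None]
    print(binary_metrics([a for a, _ in kept], [g for _, g in kept]))
    print("flagged for review:", [i for i, a in enumerate(agreed) if a is None])
```

Scoring only the concordant subset mirrors the intuition behind a dual-model review: agreement between two independent models tends to raise PPV and NPV on the retained labels, at the cost of sending the discordant remainder to manual abstraction.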