Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding

Published 7 Jan 2025 in cs.CL and cs.LG | (2501.05479v1)

Abstract: Background: Healthcare has many manual processes that can benefit from automation and augmentation with Generative AI, such as the medical billing and coding process. However, current foundational LLMs perform poorly when tasked with generating accurate International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) and Current Procedural Terminology (CPT) codes. Additionally, there are many security and financial challenges in the application of generative AI to healthcare. We present a strategy for developing generative AI tools in healthcare, specifically for medical billing and coding, that balances accuracy, accessibility, and patient privacy. Methods: We fine-tune the Phi-3 Mini and Phi-3 Medium LLMs using institutional data and compare the results against the Phi-3 base model, a Phi-3 RAG application, and GPT-4o. We use the postoperative surgical report as input and the patient's billing claim, with the associated ICD-10, CPT, and Modifier codes, as the target result. Performance is measured by accuracy of code generation, proportion of invalid codes, and the fidelity of the billing claim format. Results: Both fine-tuned models performed as well as or better than GPT-4o. The Phi-3 Medium fine-tuned model showed the best performance (ICD-10 Recall and Precision: 72%, 72%; CPT Recall and Precision: 77%, 79%; Modifier Recall and Precision: 63%, 64%). The Phi-3 Medium fine-tuned model fabricated only 1% of the ICD-10 codes and 0.6% of the CPT codes it generated. Conclusions: Our study shows that a small model that is fine-tuned on domain-specific data for specific tasks, using a simple set of open-source tools and minimal technological and monetary requirements, performs as well as the larger contemporary consumer models.


Summary

The paper presents a practical study benchmarking fine-tuned Phi-3 models for automating surgical billing and coding tasks using patient data while maintaining security.
Fine-tuned with QLoRA on local surgical data, the Phi-3-Medium model reached 72% precision for ICD-10 coding and 79% for CPT coding, matching or exceeding larger models such as GPT-4o.
The results demonstrate that smaller, domain-specific fine-tuned LLMs like Phi-3 can match or exceed SOTA performance for specialized healthcare tasks with economical resource usage, aiding workflow efficiency.

The paper presents a study on the design, application, and benchmarking of Generative AI tools tailored for surgical billing and coding. Its primary focus is on adapting LLMs to generate International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes, Current Procedural Terminology (CPT) codes, and modifiers from postoperative surgical reports. The intent is to improve the accuracy and efficiency of the billing process while protecting patient privacy by running fine-tuned generative models locally.

Background and Motivation

The healthcare sector, characterized by complex processes and stringent regulatory requirements, stands to benefit significantly from the automation capabilities of Generative AI, particularly in administrative tasks such as billing and coding. Despite the promise shown by general-purpose LLMs such as GPT-4 in other domains, they perform poorly when asked to generate accurate ICD-10-CM and CPT codes. The study acknowledges that training a foundational model from scratch is out of reach for most institutions because of resource constraints, and proposes fine-tuning existing models for domain-specific applications instead.

Methods

The paper evaluates four configurations of Microsoft's Phi-3 models and compares them against GPT-4o:

The base Phi-3-Mini model.
A Retrieval-Augmented Generation (RAG) system combining the base model with a knowledge database.
A fine-tuned Phi-3-Mini model on local institutional data.
A fine-tuned Phi-3-Medium model on the same data.
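To make the RAG configuration more concrete, here is a minimal sketch of the retrieval step: candidate code descriptions are ranked against the operative report and the top matches are prepended to the prompt. It substitutes simple token overlap for whatever retriever the study actually used, which this summary does not describe. The knowledge-base entries, function names, and prompt layout are all illustrative assumptions.

```python
import re

# Illustrative knowledge-base entries (assumed sample, not the study's database).
KNOWLEDGE_BASE = {
    "K35.80": "Unspecified acute appendicitis",
    "K80.20": "Calculus of gallbladder without cholecystitis without obstruction",
    "44970": "Laparoscopy, surgical, appendectomy",
    "47562": "Laparoscopy, surgical; cholecystectomy",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(report, kb, k=2):
    """Return up to k codes whose descriptions share the most tokens with the report."""
    report_tokens = tokenize(report)
    scored = [(len(report_tokens & tokenize(desc)), code) for code, desc in kb.items()]
    # Highest overlap first; ties broken by code so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [code for overlap, code in scored[:k] if overlap > 0]


def build_prompt(report, kb, k=2):
    """Prepend the retrieved code descriptions to the report (hypothetical layout)."""
    context = "\n".join(f"{code}: {kb[code]}" for code in retrieve(report, kb, k))
    return f"Candidate codes:\n{context}\n\nOperative report:\n{report}\n\nBilling claim:"


if __name__ == "__main__":
    sample = "Laparoscopic appendectomy performed for acute appendicitis."
    print(build_prompt(sample, KNOWLEDGE_BASE))
```

A production system would more likely use dense embeddings and a vector index, but the control flow (retrieve, then augment the prompt) stays the same.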

The fine-tuning process employs Quantized Low-Rank Adapters (QLoRA) to make training feasible on limited technical infrastructure, consisting of four NVIDIA A5000 GPUs. QLoRA is a parameter-efficient fine-tuning (PEFT) technique: the base model's weights are quantized and frozen, and only small low-rank adapter matrices are updated during supervised fine-tuning (SFT), rather than all of the model's parameters.
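As a rough illustration of why adapter-based fine-tuning fits on modest hardware, the sketch below counts the trainable parameters of a single low-rank adapter and estimates weight memory at different precisions. The 3072 x 3072 layer shape, the rank of 16, and the function names are assumptions chosen for illustration; this summary does not report the study's adapter hyperparameters. The memory estimate covers weights only and ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in one low-rank adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_out, d_in):
    """Parameters in the frozen dense weight matrix the adapter sits on."""
    return d_out * d_in


def trainable_fraction(d_out, d_in, rank):
    """Share of the layer's parameters that the adapter actually trains."""
    return lora_trainable_params(d_out, d_in, rank) / full_params(d_out, d_in)


def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Assumed layer shape and rank, for illustration only.
    d, r = 3072, 16
    print("adapter params:", lora_trainable_params(d, d, r))
    print("trainable fraction:", trainable_fraction(d, d, r))
    # 3.8 billion is the published parameter count of Phi-3-mini.
    print("16-bit weights (GB):", weight_memory_gb(3.8e9, 16))
    print("4-bit weights (GB):", weight_memory_gb(3.8e9, 4))
```

Under these assumptions the adapter trains about 1% of the layer's parameters, and 4-bit storage cuts weight memory to a quarter of the 16-bit figure.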

Data and Security

The study uses data from 192,585 surgical encounters across Duke University's health system from 2017 to 2022. Operative reports serve as inputs, and the corresponding billing claims, comprising ICD-10 codes, CPT codes, and modifiers, serve as targets. The project complies with security standards by keeping all Protected Health Information (PHI) inside a Protected Analytics Computing Environment (PACE).
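The input/target pairing described above can be sketched as a prompt/completion record for supervised fine-tuning. The three-line claim layout, the field names, and the JSON keys below are hypothetical; this summary does not specify the study's actual claim format. The parser shows one way that format fidelity could be checked on model output.

```python
import json

# Hypothetical claim layout: one line per field, codes separated by commas.
FIELDS = ("ICD-10", "CPT", "Modifiers")


def format_claim(icd10, cpt, modifiers):
    """Serialize the three code lists into the assumed target text."""
    values = (icd10, cpt, modifiers)
    return "\n".join(
        f"{name}: {', '.join(codes) if codes else 'none'}"
        for name, codes in zip(FIELDS, values)
    )


def parse_claim(text):
    """Parse generated text back into code lists; return None if the layout is violated."""
    lines = text.strip().splitlines()
    if len(lines) != len(FIELDS):
        return None
    parsed = {}
    for name, line in zip(FIELDS, lines):
        prefix = name + ": "
        if not line.startswith(prefix):
            return None
        body = line[len(prefix):].strip()
        parsed[name] = [] if body == "none" else [code.strip() for code in body.split(",")]
    return parsed


def make_training_record(report, icd10, cpt, modifiers):
    """Build one JSON line pairing an operative report with its billing claim."""
    record = {"prompt": report.strip(), "completion": format_claim(icd10, cpt, modifiers)}
    return json.dumps(record)


if __name__ == "__main__":
    # Synthetic report text, not real patient data.
    print(make_training_record("Laparoscopic appendectomy.", ["K35.80"], ["44970"], []))
```

Any real record of this kind would contain PHI and would have to stay inside the protected environment.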

Results

The fine-tuned Phi-3-Medium model performed best, matching or outperforming the other configurations and GPT-4o. It reached 72% recall and 72% precision for ICD-10 codes, 77% recall and 79% precision for CPT codes, and 63% recall and 64% precision for modifiers. It also fabricated only a small proportion of codes (1% of ICD-10 codes and 0.6% of CPT codes), whereas GPT-4o fabricated more (3% of ICD-10 codes). The study also used metrics such as ROUGE-L and METEOR to further assess the consistency and accuracy of the generated outputs.
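For readers unfamiliar with these metrics, the following sketch computes set-based precision and recall for a single encounter, along with the share of generated codes that do not appear in a list of valid codes. How the study aggregates scores across encounters is not stated in this summary, so the per-encounter treatment here is an assumption, as are the function names and sample codes.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall for one encounter's codes."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


def fabricated_rate(predicted, valid_codes):
    """Fraction of generated codes that are absent from the valid code list."""
    if not predicted:
        return 0.0
    invalid = [code for code in predicted if code not in valid_codes]
    return len(invalid) / len(predicted)


if __name__ == "__main__":
    # Illustrative values only; "ZZZ.99" is a deliberately invalid string.
    valid = {"K35.80", "K66.0"}
    predicted = ["K35.80", "ZZZ.99"]
    reference = ["K35.80", "K66.0"]
    print(precision_recall(predicted, reference))
    print(fabricated_rate(predicted, valid))
```

Precision penalizes extra or wrong codes, recall penalizes missed codes, and the fabrication rate separates codes that are merely wrong for the encounter from codes that do not exist at all.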

Discussion

This work demonstrates the feasibility of using smaller, fine-tuned models for healthcare applications without significant resource investments. The Phi-3-Medium model's performance suggests that domain-specific fine-tuning of LLMs can match or exceed SOTA models in specialized tasks like billing and coding. Importantly, the configurations maintained economical resource usage, paving the way for broader institutional adoption without necessitating large-scale technological investments.

Conclusion

The results underscore the potential of integrating generative AI into healthcare administrative workflows. Positioned as a supplementary tool rather than a replacement for human coders, such AI solutions can streamline the coding process, reduce errors, and improve the efficiency of healthcare delivery. The study advocates further experimentation with different configurations and additional types of patient records, in particular more comprehensive data such as History and Physical (H&P) reports, to improve the accuracy of diagnostic code generation.
