- The paper presents a practical study benchmarking fine-tuned Phi-3 models for automating surgical billing and coding tasks using patient data while maintaining security.
- Fine-tuned with QLoRA on local surgical data, the Phi-3-Medium model achieved higher precision for ICD-10 (72%) and CPT (79%) coding than larger models such as GPT-4o.
- The results demonstrate that smaller, domain-specific fine-tuned LLMs like Phi-3 can match or exceed SOTA performance for specialized healthcare tasks with economical resource usage, aiding workflow efficiency.
The paper presents a study on the design, application, and benchmarking of Generative AI tools tailored for surgical billing and coding. Its primary focus is adapting LLMs to automate the generation of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes, Current Procedural Terminology (CPT) codes, and modifiers from postoperative surgical reports. The intent is to improve the accuracy and efficiency of the billing process while protecting patient privacy by running fine-tuned generative models locally.
Background and Motivation
The healthcare sector, characterized by complex processes and stringent regulatory requirements, stands to benefit significantly from the automation capabilities of Generative AI, particularly in administrative tasks such as billing and coding. Despite the promise shown by general-purpose LLMs such as GPT-4 in other domains, their performance on healthcare-specific tasks remains suboptimal. The study acknowledges that resource constraints make training foundation models impractical and proposes fine-tuning existing models for domain-specific applications instead.
Methods
The paper explores four configurations of the Phi-3 models from Microsoft:
- The base Phi-3-Mini model.
- A Retrieval-Augmented Generation (RAG) system combining the base model with a knowledge database.
- A Phi-3-Mini model fine-tuned on local institutional data.
- A Phi-3-Medium model fine-tuned on the same data.
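The second configuration, retrieval-augmented generation, can be illustrated with a minimal sketch. The summary does not describe the retriever, the knowledge database, or the prompt format, so everything here is an assumption: the toy `KNOWLEDGE_BASE`, the bag-of-words cosine similarity, and the `build_prompt` template are hypothetical stand-ins for whatever the authors used.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base mapping a code to a short description.
# The entries are illustrative only, not the database used in the paper.
KNOWLEDGE_BASE = {
    "47562": "laparoscopic cholecystectomy",
    "44970": "laparoscopic appendectomy",
    "K80.20": "calculus of gallbladder without cholecystitis without obstruction",
    "K35.80": "unspecified acute appendicitis",
}


def tokenize(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(report, k=2):
    """Return the k knowledge-base entries most similar to the report."""
    query = tokenize(report)
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: cosine(query, tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(report, k=2):
    """Prepend the retrieved candidates to the report (hypothetical template)."""
    context = "\n".join(f"{code}: {desc}" for code, desc in retrieve(report, k))
    return (
        f"Candidate codes:\n{context}\n\n"
        f"Operative report:\n{report}\n\n"
        "Assign billing codes:"
    )


report = "Laparoscopic cholecystectomy performed for calculus of gallbladder."
print(build_prompt(report))
```

The prompt would then be passed to the base model. A production retriever would more plausibly use dense embeddings over the full code sets, but the control flow is the same: retrieve, assemble context, generate.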
The fine-tuning process employs Quantized Low-Rank Adaptation (QLoRA) to make training feasible on limited technical infrastructure, here four NVIDIA A5000 GPUs. QLoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique: the base model weights are quantized and frozen, and only small low-rank adapter matrices are trained during supervised fine-tuning (SFT), in contrast to full-parameter SFT, which updates every weight.
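A rough back-of-the-envelope sketch shows why adapter training on quantized weights suits this hardware. It assumes the publicly stated sizes of about 3.8 billion parameters for Phi-3-Mini and 14 billion for Phi-3-Medium; the adapter rank and the layer shape are illustrative values, not settings reported in this summary. The estimate counts weights only and ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one low-rank adapter pair.

    The adapter factors an update to a (d_out x d_in) weight matrix as
    B @ A, with A of shape (rank x d_in) and B of shape (d_out x rank).
    """
    return rank * (d_in + d_out)


# Assumed parameter counts (publicly stated model sizes).
MODELS = {"Phi-3-Mini": 3.8e9, "Phi-3-Medium": 14e9}

for name, n in MODELS.items():
    fp16 = weight_memory_gb(n, 16)
    four_bit = weight_memory_gb(n, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {four_bit:.1f} GB at 4-bit")

# Illustrative layer: a 4096 x 4096 projection with a rank-16 adapter.
full = 4096 * 4096
adapter = lora_params(4096, 4096, 16)
print(f"adapter trains {adapter:,} of {full:,} weights ({adapter / full:.2%})")
```

Under these assumptions, the 16-bit weights of Phi-3-Medium alone come to about 28 GB, while the 4-bit weights come to about 7 GB, and a rank-16 adapter trains under 1% of the weights in the example layer.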
Data and Security
The study uses data from 192,585 surgical encounters across Duke University's health system from 2017 to 2022. Operative reports serve as inputs, and the corresponding billing claims, comprising ICD-10 codes, CPT codes, and modifiers, serve as targets. The project complies with security standards by using a Protected Analytics Computing Environment (PACE) for handling Protected Health Information (PHI).
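The pairing of operative reports with billing claims might be serialized for supervised fine-tuning roughly as follows. The field names, the prompt wording, and the JSON target layout are all hypothetical, since this summary does not specify the format the authors used, and the example text is synthetic rather than patient data.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Encounter:
    """One surgical encounter: report text plus billed codes (hypothetical schema)."""
    operative_report: str
    icd10: List[str] = field(default_factory=list)
    cpt: List[str] = field(default_factory=list)
    modifiers: List[str] = field(default_factory=list)


def to_sft_example(enc):
    """Turn an encounter into a prompt/completion pair for fine-tuning."""
    prompt = (
        "Assign ICD-10-CM codes, CPT codes, and modifiers "
        "for this operative report.\n\n" + enc.operative_report.strip()
    )
    # Sort the codes so the target text is deterministic for a given claim.
    completion = json.dumps(
        {
            "icd10": sorted(enc.icd10),
            "cpt": sorted(enc.cpt),
            "modifiers": sorted(enc.modifiers),
        }
    )
    return {"prompt": prompt, "completion": completion}


# Synthetic example, not drawn from any real record.
enc = Encounter(
    operative_report="Laparoscopic cholecystectomy for symptomatic cholelithiasis.",
    icd10=["K80.20"],
    cpt=["47562"],
)
example = to_sft_example(enc)
print(example["prompt"])
print(example["completion"])
```

A structured target such as JSON makes the generated codes easy to parse and validate downstream, though a plain delimited list would serve equally well.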
Results
The fine-tuned Phi-3-Medium model outperformed the other configurations, as well as GPT-4o, in generating valid code sets, with considerably higher recall and precision for ICD-10 (72% precision), CPT (79% precision), and modifiers. It also generated a minimal proportion of fabricated codes (1% for ICD-10 and 0.6% for CPT), whereas the larger GPT-4o fabricated more (3% for ICD-10). The study additionally used ROUGE-L and METEOR to assess how closely the generated outputs matched the reference claims.
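These metrics can be made concrete with a short sketch. It assumes set-based, micro-averaged precision and recall over encounters and treats a code as fabricated when it does not appear in a reference list of valid codes. The summary does not state exactly how the authors computed these figures, and `VALID_CODES` and the sample data are tiny hypothetical stand-ins.

```python
def micro_precision_recall(predictions, references):
    """Micro-averaged precision and recall over per-encounter code sets."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # codes generated and actually billed
        fp += len(pred - ref)   # codes generated but not billed
        fn += len(ref - pred)   # billed codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fabrication_rate(predictions, valid_codes):
    """Share of generated codes that do not exist in the valid code list."""
    generated = [code for pred in predictions for code in pred]
    if not generated:
        return 0.0
    return sum(code not in valid_codes for code in generated) / len(generated)


# Hypothetical data: "ZZZZZ" stands in for a code the model invented.
VALID_CODES = {"47562", "44970", "49505"}
predictions = [["47562"], ["44970", "ZZZZZ"], ["49505"]]
references = [["47562"], ["44970"], ["49505", "44970"]]

precision, recall = micro_precision_recall(predictions, references)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"fabricated={fabrication_rate(predictions, VALID_CODES):.2f}")
```

Note that a fabricated code always counts against precision as well, but the reverse does not hold: a valid code assigned to the wrong encounter lowers precision without being a fabrication.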
Discussion
This work demonstrates the feasibility of using smaller, fine-tuned models for healthcare applications without significant resource investments. The Phi-3-Medium model's performance suggests that domain-specific fine-tuning of LLMs can match or exceed SOTA models in specialized tasks like billing and coding. Because the configurations kept resource usage economical, the approach lowers the barrier to broader institutional adoption.
Conclusion
The results underscore the potential of integrating generative AI into healthcare administrative workflows. Although intended as a supplementary tool rather than a replacement for coders, such AI solutions can streamline the coding process, reduce errors, and improve the efficiency of healthcare delivery. The study advocates further experimentation with different configurations and additional types of patient records, particularly more comprehensive data such as History and Physical (H&P) reports, to improve the accuracy of diagnostic code generation.