RuCCoD: Towards Automated ICD Coding in Russian

Published 28 Feb 2025 in cs.CL, cs.AI, and cs.DB | (2502.21263v1)

Abstract: This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts.


Summary

The paper introduces RuCCoD, a dataset of diagnosis fields from Russian EHRs annotated with over 10,000 entities and more than 1,500 unique ICD codes, to support automated coding.
Benchmarking models like BERT and LLaMA showed that targeted fine-tuning significantly outperforms domain-transfer methods for Russian ICD coding.
Training on automatically predicted ICD codes improved diagnosis prediction accuracy over training on physicians' manual annotations, indicating the potential to improve clinical efficiency in resource-limited languages like Russian.

The paper "RuCCoD: Towards Automated ICD Coding in Russian" introduces a dataset and a set of benchmarks for automated clinical coding of Russian EHRs.

It presents RuCCoD, a dataset of EHR diagnosis fields annotated with over 10,000 entities and more than 1,500 unique ICD codes.
It benchmarks models including BERT, LLaMA with LoRA, and RAG, showing that targeted fine-tuning outperforms domain-transfer methods.
It demonstrates that training on automatically predicted ICD codes yields higher diagnosis prediction accuracy than training on physicians' manual annotations, highlighting the potential to boost clinical efficiency in resource-limited languages.
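
To make the coding task concrete, the following is a minimal sketch of the retrieval idea that underlies RAG-style ICD coding: a diagnosis mention is matched against code descriptions, and the closest codes are returned as candidates. It is not the paper's method. The four-entry English lexicon, the character-trigram Jaccard similarity, and the names `ICD_LEXICON`, `char_ngrams`, `jaccard`, and `retrieve_codes` are all assumptions made for illustration; the actual dataset is in Russian, and the benchmarked models use learned representations.

```python
from typing import Dict, List, Set, Tuple

# Toy ICD-10 lexicon (illustrative subset in English; an assumption,
# not the RuCCoD vocabulary, which is Russian and far larger).
ICD_LEXICON: Dict[str, str] = {
    "I10": "essential (primary) hypertension",
    "E11": "type 2 diabetes mellitus",
    "J45": "asthma",
    "I25": "chronic ischaemic heart disease",
}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_codes(
    mention: str, lexicon: Dict[str, str], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank ICD codes by n-gram overlap between the mention and each description."""
    mention_grams = char_ngrams(mention)
    scored = [
        (code, jaccard(mention_grams, char_ngrams(description)))
        for code, description in lexicon.items()
    ]
    # Highest similarity first; keep only the top_k candidates.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for mention in ["diabetes mellitus type 2", "arterial hypertension"]:
        print(mention, "->", retrieve_codes(mention, ICD_LEXICON))
```

In a retrieval-augmented setup, candidates like these would typically be handed to a language model that selects the final code. Surface overlap alone is a weak baseline, which is one reason fine-tuned models are benchmarked on the task.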