Extracting Social Support and Social Isolation Information from Clinical Psychiatry Notes: Comparing a Rule-based NLP System and a Large Language Model

Published 25 Mar 2024 in cs.CL | (2403.17199v1)

Abstract: Background: Social support (SS) and social isolation (SI) are social determinants of health (SDOH) associated with psychiatric outcomes. In electronic health records (EHRs), individual-level SS/SI is typically documented in narrative clinical notes rather than as structured coded data. Natural language processing (NLP) algorithms can automate the otherwise labor-intensive process of data extraction. Data and Methods: Psychiatric encounter notes from Mount Sinai Health System (MSHS, n=300) and Weill Cornell Medicine (WCM, n=225) were annotated to establish a gold-standard corpus. A rule-based system (RBS) involving lexicons and an LLM using FLAN-T5-XL were developed to identify mentions of SS and SI and their subcategories (e.g., social network, instrumental support, and loneliness). Results: For extracting SS/SI, the RBS obtained higher macro-averaged f-scores than the LLM at both MSHS (0.89 vs. 0.65) and WCM (0.85 vs. 0.82). For extracting subcategories, the RBS also outperformed the LLM at both MSHS (0.90 vs. 0.62) and WCM (0.82 vs. 0.81). Discussion and Conclusion: Unexpectedly, the RBS outperformed the LLM across all metrics. Intensive review demonstrates that this finding is due to the divergent approaches taken by the RBS and the LLM. The RBS was designed and refined to follow the same specific rules as the gold-standard annotations. Conversely, the LLM was more inclusive in its categorization and conformed to common English-language understanding. Both approaches offer advantages and are made available open-source for future testing.


Summary

The paper demonstrates that a rule-based NLP system outperforms a large language model in extracting social support and isolation information from clinical notes.
It uses annotated psychiatric notes from two major NYC academic medical centers, with the rule-based system reaching macro-averaged f-scores of up to 0.90 for subcategory classification.
The study highlights the need for enhanced LLM fine-tuning and hybrid approaches to improve automated extraction of social determinants in healthcare.

Introduction

Social Determinants of Health (SDOH) such as Social Support (SS) and Social Isolation (SI) are associated with psychiatric outcomes. In Electronic Health Records (EHRs), these determinants are typically documented in narrative clinical notes rather than as structured coded data, which makes systematic analysis difficult. NLP techniques offer a way to automate the extraction of SS/SI data from EHRs. This paper develops and compares a rule-based system (RBS) and an LLM, specifically FLAN-T5-XL, for identifying mentions of SS and SI in psychiatric encounter notes from two large New York City academic medical centers.

Data and Methods

The study used psychiatric encounter notes from the Mount Sinai Health System (MSHS, n=300) and Weill Cornell Medicine (WCM, n=225), creating a gold-standard corpus annotated for SS and SI mentions and their subcategories (e.g., social network, instrumental support, and loneliness). Development of the lexicons and manual annotations adhered to rigorous methods, with annotators trained and disagreements adjudicated by a designated expert. Both the RBS and the LLM were built against these annotations, with the RBS relying on specific lexicons and exclusion keywords, and the LLM employing FLAN-T5-XL with instruction tuning and minimal fine-tuning.
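To make the lexicon-plus-exclusion idea concrete, here is a minimal Python sketch of how such a rule-based matcher might work. It is not the authors' system: the lexicon entries, the exclusion keywords, the label names, and the function names classify_sentence and classify_note are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicons; the study's actual lexicons are not reproduced here.
LEXICON = {
    "social_support": ["supportive family", "lives with", "close friends"],
    "social_isolation": ["lonely", "lives alone", "no friends", "socially isolated"],
}

# Hypothetical exclusion keywords: a sentence containing one of them is skipped.
EXCLUSIONS = ["family history"]


def classify_sentence(sentence):
    """Return the set of coarse labels whose lexicon terms occur in the sentence."""
    text = sentence.lower()
    if any(term in text for term in EXCLUSIONS):
        return set()
    labels = set()
    for label, terms in LEXICON.items():
        for term in terms:
            # Word boundaries keep "lives with" from matching "lives without".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                labels.add(label)
                break
    return labels


def classify_note(note):
    """Split a note into sentences naively and merge the sentence-level labels."""
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    labels = set()
    for sentence in sentences:
        labels |= classify_sentence(sentence)
    return labels


if __name__ == "__main__":
    print(classify_note("Patient lives alone and feels lonely. Family history of depression."))
```

In this sketch the exclusion check runs per sentence, so an excluded phrase in one sentence does not suppress matches elsewhere in the note. Whether the original system applies exclusions at the sentence, mention, or note level is not stated in this summary.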

Results

Across both the MSHS and WCM datasets, the RBS outperformed the LLM in extracting SS/SI information, with higher macro-averaged f-scores for both fine-grained (subcategory) and coarse-grained (SS/SI) classification. At MSHS, the RBS obtained an f-score of 0.90 for fine-grained and 0.89 for coarse-grained categories, while the LLM scored 0.62 and 0.65, respectively. The gap was much narrower at WCM: 0.82 vs. 0.81 for fine-grained and 0.85 vs. 0.82 for coarse-grained categories.
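For readers unfamiliar with the metric, a macro-averaged f-score computes an f-score per category and then takes the unweighted mean, so rare categories count as much as common ones. The Python sketch below illustrates this on toy data. It assumes note-level label sets and hypothetical label names; the unit of evaluation used in the paper is not specified in this summary.

```python
def f1_for_label(gold, pred, label):
    """F1 for one label, given parallel lists of gold and predicted label sets."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    if tp == 0:
        return 0.0  # also avoids division by zero when a label is never matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy data only; these are not results from the study.
    gold = [{"SS"}, {"SI"}, {"SS", "SI"}, set()]
    pred = [{"SS"}, {"SS"}, {"SI"}, set()]
    print(round(macro_f1(gold, pred, ["SS", "SI"]), 3))
```

On the toy data, SS has an F1 of 0.5 and SI an F1 of about 0.667, so the macro average is about 0.583.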

Discussion

The stronger performance of the RBS in identifying SS/SI can be attributed to the system's careful alignment with the manual annotation guidelines and its use of specific lexicons. Contrary to the general expectation that LLMs would outperform traditional rule-based approaches, the results indicated otherwise. The authors' review found that the RBS was refined to follow the same rules as the gold-standard annotations, whereas the LLM was more inclusive in its categorization and followed common English-language understanding. The comparison shows how hard automatic SDOH extraction is: LLMs have considerable potential, but their performance depends heavily on context and on how they are fine-tuned.

Implications and Future Directions

The findings underscore the importance of continuing to refine NLP methods for SDOH extraction, balancing rule-based specificity against the LLM's broader linguistic understanding. While the RBS matched the gold-standard annotations closely, the scalability and adaptability of LLM approaches should not be overlooked. Future research should focus on improving LLM training protocols, consider hybrid models that combine the strengths of both approaches, and test the portability of the developed systems across varied healthcare settings.

This study adds to the growing body of research on using NLP to extract SDOH information from clinical narratives, laying groundwork for future work on how social determinants influence health outcomes.
