Mining Unstructured Medical Texts With Conformal Active Learning

Published 5 Feb 2025 in cs.CL, cs.LG, and stat.ML | (2502.04372v1)

Abstract: The extraction of relevant data from Electronic Health Records (EHRs) is crucial to identifying symptoms and automating epidemiological surveillance processes. By harnessing the vast amount of unstructured text in EHRs, we can detect patterns that indicate the onset of disease outbreaks, enabling faster, more targeted public health responses. Our proposed framework provides a flexible and efficient solution for mining data from unstructured texts, significantly reducing the need for extensive manual labeling by specialists. Experiments show that our framework achieving strong performance with as few as 200 manually labeled texts, even for complex classification problems. Additionally, our approach can function with simple lightweight models, achieving competitive and occasionally even better results compared to more resource-intensive deep learning models. This capability not only accelerates processing times but also preserves patient privacy, as the data can be processed on weaker on-site hardware rather than being transferred to external systems. Our methodology, therefore, offers a practical, scalable, and privacy-conscious approach to real-time epidemiological monitoring, equipping health institutions to respond rapidly and effectively to emerging health threats.

Abstract PDF Upgrade to Chat

Summary

The paper presents a framework combining active learning with label-conditional conformal prediction to minimize manual labeling and ensure reliable predictions.
It utilizes model-agnostic methods like TF-IDF + XGBClassifier to achieve robust performance in resource-constrained environments.
The active selection of high-uncertainty texts via k-means clustering enhances label diversity and supports real-time deployment in medical text mining.

Conformal Active Learning for Medical Text Mining

The paper "Mining Unstructured Medical Texts With Conformal Active Learning" (2502.04372) introduces a novel framework for extracting relevant information from unstructured medical texts, specifically Electronic Health Records (EHRs), using a combination of active learning and label-conditional conformal prediction. This approach aims to reduce the manual labeling effort required for training classification models while providing reliability through uncertainty quantification. The framework is designed to be model-agnostic, allowing for deployment in resource-constrained environments while addressing privacy concerns.

Methodology: Conformal Active Learning Framework

The core of the proposed methodology is a Conformal Active Learning framework designed to iteratively refine a classification model with minimal manual labeling (Figure 1).

Figure 1: Diagram of the active leaning cycle.

The process starts with an initial set of labels, either guessed or derived from a small subset of manually labeled data. The data is split into training and validation sets. A classification model is trained on the training data, and a label-conditional conformal model is calibrated using the validation set. This conformal model assigns a score to each prediction, reflecting the model's uncertainty about a given label $y$ for a sample text $X$ . The conformity score is calculated as:

$s(x, y) = 1 - \hat{p}(y \ |\ x)$

where $\hat{p}(y | x)$ is the predicted probability of $x$ having label $y$ according to the classification model. Thresholds $t_{y,\alpha}$ for each label $y$ are computed using the validation split to ensure a specified confidence level $\alpha$ . Calibrated set predictions $C_\alpha (X^\mathrm{new})$ are then made for new data points $X^\mathrm{new}$ , consisting of a subset of labels:

$C_\alpha (X^\mathrm{new}) = \{ y : s(X^\mathrm{new}, y) \leq t_{y,\alpha} \}$

This ensures that the predicted labels fall within the specified confidence level $1 - \alpha$ , providing a transparent, uncertainty-aware guarantee of correctness for each classification. The mean conformity score $S_X$ over all predicted labels for a sample text $X$ is used to rank data points by uncertainty:

$S_{X} = \frac{1}{|C_\alpha(X)|} \sum_{y \in C_\alpha(X)} s(X, y)$

The top $k_{top}$ data points with the highest uncertainty scores are selected and clustered using $k$ -means into $k_{cluster}$ groups ( $k_{cluster} < k_{top}$ ). From each cluster, the point closest to the centroid is chosen for manual labeling, ensuring a diverse sample representing a broad range of high-uncertainty predictions. The framework can also incorporate a fraction of low-uncertainty predictions to validate the model's performance on straightforward cases.

Implementation and Deployability

The framework's classification model agnosticism allows training and selection based on available hardware, facilitating on-premise deployment in healthcare institutions. The authors are developing a web interface, OLIM (Open Labeller for Iterative Machine learning), to streamline the iterative labeling process. The code is released as open source under the GPL-v3.0 license, with the entire setup packaged in Docker containers for easy installation.

Experimental Evaluation

The framework was evaluated using a database of Amazon product reviews as a proxy for medical records, due to the sensitive nature of patient information. Four labels were defined for classification: Pet product, Drinkable product, Low quality product, and Damaged product. Experiments were conducted with 100 or 200 manually labeled texts, using $k_{top} = 500$ and $k_{cluster} = 6$ . A 20% subset of labeled data was randomly assigned to the validation dataset in each cycle. The models tested included TF-IDF + XGBClassifier and DeBERTaV3. Additional experiments explored the impact of balancing high and low uncertainty texts and starting with pre-labeled texts.

Results and Discussion

The framework achieved strong performance with the TF-IDF + XGBClassifier model, even with a limited number of manually labeled texts. With 200 manual labels, combining high and low uncertainty in $k_{top}$ selection and adding a small number of pre-labeled texts, the framework demonstrated effectiveness in optimizing labeling efforts while maintaining robust performance. The deep learning model DeBERTaV3 did not perform as well as expected, suggesting that simpler models can deliver competitive results in resource-constrained environments. Incorporating a mix of high and low uncertainty texts during selection significantly improved performance, with the AUC-ROC score increasing substantially when using a 50/50 or 70/30 split. For less frequent labels, such as Damaged product, providing pre-labeled texts to initialize the framework improved performance substantially. Randomly selecting data for more common labels achieved results comparable to the active learning framework, but performance degraded significantly for rare labels.

Figure 2: Convergence of AUC-ROC for the Pet product label, using the XGBBoost model, with only high uncertainty and a 70/30 mix of high and low uncertainty.

Conclusion

The Conformal Active Learning framework effectively reduces manual labeling efforts while ensuring reliable predictions. The framework's model agnosticism and open-source implementation facilitate its deployment in healthcare institutions for privacy-preserving, real-time epidemiological surveillance. Future work will focus on testing with real medical data and refining the framework for practical deployment.

Markdown Report Issue