
Validity of Feature Importance in Low-Performing Machine Learning for Tabular Biomedical Data

Published 20 Sep 2024 in stat.ML and cs.LG (arXiv:2409.13342v1)

Abstract: In tabular biomedical data analysis, tuning models to high accuracy is considered a prerequisite for discussing feature importance, as medical practitioners expect the validity of feature importance to correlate with performance. In this work, we challenge the prevailing belief, showing that low-performing models may also be used for feature importance. We propose experiments to observe changes in feature rank as performance degrades sequentially. Using three synthetic datasets and six real biomedical datasets, we compare the rank of features from full datasets to those with reduced sample sizes (data cutting) or fewer features (feature cutting). In synthetic datasets, feature cutting does not change feature rank, while data cutting shows higher discrepancies with lower performance. In real datasets, feature cutting shows similar or smaller changes than data cutting, though some datasets exhibit the opposite. When feature interactions are controlled by removing correlations, feature cutting consistently shows better stability. By analyzing the distribution of feature importance values and theoretically examining the probability that the model cannot distinguish feature importance between features, we reveal that models can still distinguish feature importance despite performance degradation through feature cutting, but not through data cutting. We conclude that the validity of feature importance can be maintained even at low performance levels if the data size is adequate, which is a significant factor contributing to suboptimal performance in tabular medical data analysis. This paper demonstrates the potential for utilizing feature importance analysis alongside statistical analysis to compare features relatively, even when classifier performance is not satisfactory.

Summary

  • The paper challenges the belief that high model performance is a prerequisite for valid feature importance analysis, demonstrating that reliable rankings can be obtained as long as the data size is adequate.
  • It employs synthetic and real biomedical datasets using Random Forests and stability indexes to quantify the impact of data and feature reduction on ranking reliability.
  • The findings guide researchers in discerning performance issues caused by insufficient data versus limited features, enhancing practical applications in biomedical ML.


Overview

In the paper titled "Validity of Feature Importance in Low-Performing Machine Learning for Tabular Biomedical Data," the authors, Youngro Lee, Giacomo Baruzzo, Jeonghwan Kim, Jongmo Seo, and Barbara Di Camillo, present a critical examination of the common assumption that high model performance is a prerequisite for valid feature importance analysis in the context of biomedical data. The paper's central thesis challenges this belief, suggesting that even low-performing models can yield valid feature importance rankings if the data size is sufficient.

Methodology

The authors employ both synthetic and real-world biomedical datasets to investigate feature importance under conditions of varying model performance. Their experiments involve reducing either the number of samples (data cutting) or the number of features (feature cutting) and observing the resulting changes in feature ranking. Specifically, the study utilizes:

  • Three Synthetic Datasets: These datasets are designed to provide clear feature rankings and controlled experimental settings.
  • Six Real Biomedical Datasets: These datasets span a range of complexities, sample sizes, and feature counts, making the findings more generalizable to real-world applications.

Random Forest models serve as the predictive framework, and the Area Under the ROC Curve (AUC) is used as the performance metric. Stability indices such as rank difference, Spearman's rank correlation coefficient (SRCC), Canberra distance (CD), and Bray–Curtis distance are used to measure changes in feature importance rankings.
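The data-cutting versus feature-cutting comparison described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dataset sizes, the choice of which features to drop, and the use of SRCC as the sole stability index are all simplifying assumptions.

```python
# Illustrative sketch of the data-cutting vs. feature-cutting experiment
# (not the authors' code): compare feature-importance stability when the
# sample size shrinks vs. when informative features are removed.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# shuffle=False places the 10 informative features first (columns 0-9).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           shuffle=False, random_state=0)

def importances(X, y):
    """Impurity-based feature importances from a Random Forest."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return rf.feature_importances_

imp_full = importances(X, y)

# Data cutting: same features, far fewer samples.
idx = rng.choice(len(X), size=200, replace=False)
imp_data_cut = importances(X[idx], y[idx])
srcc_data, _ = spearmanr(imp_full, imp_data_cut)

# Feature cutting: same samples, drop half the informative features (0-4),
# then compare the surviving features' relative order to the full model.
kept = np.arange(5, 20)
imp_feat_cut = importances(X[:, kept], y)
srcc_feat, _ = spearmanr(imp_full[kept], imp_feat_cut)
```

In the paper's setting, the SRCC of the feature-cut model against the full model stays high, while the SRCC of the data-cut model degrades along with its AUC.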

Key Findings

  1. Stability of Feature Importance in Synthetic Data: The study finds that feature cutting does not significantly alter feature rankings, whereas data cutting leads to higher discrepancies as performance degrades. This result suggests that models can still effectively distinguish feature importance when the number of features is reduced, but not when the amount of data is inadequate.
  2. Real-World Data Observations: In real biomedical datasets, feature cutting shows similar or better stability compared to data cutting in most cases. However, some datasets exhibit the opposite trend, indicating that the specifics of the dataset can significantly impact the outcome.
  3. Controlled Feature Interactions: When feature interactions are minimized by removing correlations, feature cutting consistently demonstrates better stability. This finding underscores the importance of accounting for feature interactions in validity assessments.
  4. Feature Importance Distribution: Analyzing the distribution of feature importance values reveals that lower performance due to data cutting leads to a flattening of these distributions, while feature cutting maintains distinct importance values even at lower performance levels.

Theoretical Implications

The authors provide a probabilistic framework to explain why low performance due to a lack of data impacts feature importance validity more than a lack of features. They calculate the probability that a model can distinguish between the importance of features based on the sample size and feature differences. This theoretical analysis supports the empirical findings, showing that insufficient data leads to higher variability and less reliable feature importance rankings.
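The intuition behind this probabilistic argument can be illustrated with a simple Normal approximation. This is a hedged sketch of the general idea, not the paper's exact derivation: assume two features' true importances differ by a gap delta and each estimate carries sampling noise of variance sigma^2 / n; the probability that the estimated ordering matches the true one then shrinks toward a coin flip as n falls.

```python
# Hedged illustration (not the paper's exact formula): under a Normal
# approximation, the difference of two independent importance estimates
# has variance 2 * sigma^2 / n, so the probability the true ordering
# survives estimation is Phi(delta / (sigma * sqrt(2 / n))).
from math import sqrt
from scipy.stats import norm

def p_correct_order(delta, sigma, n):
    """Probability that the feature with the larger true importance
    also receives the larger estimated importance."""
    return norm.cdf(delta / (sigma * sqrt(2.0 / n)))

# Shrinking the data (smaller n) erodes distinguishability for a fixed gap.
for n in (1000, 100, 10):
    print(n, round(p_correct_order(delta=0.02, sigma=0.1, n=n), 3))
```

Cutting features leaves n untouched, so the surviving features' estimates remain distinguishable; cutting data shrinks n directly, which mirrors the empirical flattening of the importance distributions.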

Practical Implications

The findings have significant implications for the application of machine learning in biomedical research:

  • Independent Feature Importance Analysis: The results suggest that feature importance can be meaningfully analyzed even in low-performing models, provided the data size is adequate.
  • Data Sufficiency Assessment: Researchers can employ the proposed methods to assess whether performance issues stem from insufficient data or a lack of features, guiding more effective model improvements.
  • Feature Importance Validation: Statistical tests to compare feature importance values can be used alongside traditional feature ranking methods to ensure the robustness of findings, particularly in low-performing models.
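A minimal sketch of such a validation step, assuming repeated permutation importance paired with a Mann–Whitney U test; both are illustrative choices on my part, not methods prescribed by the paper:

```python
# Hedged sketch: pair a feature ranking with a statistical test on repeated
# permutation-importance draws to check whether two features' importances
# differ significantly, even when the classifier itself performs poorly.
from scipy.stats import mannwhitneyu
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# flip_y adds label noise to mimic a low-performing setting;
# shuffle=False places the 3 informative features first.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           shuffle=False, flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# 30 repeats yield a distribution of importance values per feature.
res = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)

# Test whether an informative feature (0) outranks a noise feature (7).
stat, p_value = mannwhitneyu(res.importances[0], res.importances[7],
                             alternative="greater")
```

A small p-value supports treating the relative ordering of the two features as meaningful despite the model's weak AUC, which is the kind of relative comparison the paper argues remains valid.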

Future Directions

The study opens several avenues for future research:

  1. Broader Dataset Analysis: Further studies could include a more extensive range of datasets to validate the generalizability of these findings.
  2. Alternative Models and Metrics: Exploring different machine learning models and performance metrics could provide additional insights into the stability of feature importance across various contexts.
  3. Feature Interaction and Non-Linearity: Deeper investigations into the effects of feature interactions and non-linearities in the data could refine the understanding of feature importance validity.

Conclusion

This study challenges the entrenched notion that high model accuracy is a prerequisite for valid feature importance analysis in biomedical machine learning applications. The findings underscore that while low-performing models may still provide valid feature importance rankings, the adequacy of the data size is a critical factor. This nuanced understanding paves the way for more robust and meaningful applications of machine learning in biomedical research, even in scenarios where model performance is suboptimal.
