RFECV-Selected Feature Subset in Ransomware Detection
- RFECV-selected Feature Subset is derived through iterative elimination using cross-validated accuracy, ensuring only the most informative features are retained.
- The method is applied separately to API-call frequency features and network-traffic metrics, resulting in reduced computational cost and improved model efficiency.
- Despite achieving 40–50% dimensionality reduction, RFECV may exclude critical domain-specific features, necessitating complementary interpretability techniques like SHAP.
A Recursive Feature Elimination with Cross-Validation (RFECV)-Selected Feature Subset denotes the set of input variables retained for use by a machine learning classifier after iterative elimination of less informative features, guided by cross-validated estimator performance. In the context of ransomware classification, RFECV is applied separately to both API-call frequency features and network-traffic metrics across several supervised classification models. The resultant subsets optimize model parsimony, reduce computational cost, and can improve generalization, but may also risk excluding domain-significant variables. The following sections examine the RFECV methodology, its application, subset characteristics, and practical consequences in detail (Mowri et al., 2022).
1. RFECV Process and Mathematical Basis
RFECV operates by recursively removing candidate features and assessing the effect on classifier performance through cross-validation. At each iteration, a base estimator is trained on all surviving features, and the feature with the lowest contribution (per the estimator's importance criterion) is removed. Evaluation relies on 5-fold cross-validation with accuracy as the scoring metric. Performance for each candidate subset $S$ is computed as

$$\mathrm{CV\text{-}Acc}(S) = \frac{1}{5} \sum_{k=1}^{5} \mathrm{Acc}\!\left(f_{-k}(S),\, D_k\right),$$

where $D_k$ represents the $k$-th fold, $f_{-k}(S)$ is the estimator trained on the remaining four folds using feature subset $S$, and $\mathrm{Acc}$ is classification accuracy. Elimination continues until a minimum number of features, half the original set, is reached (e.g., 34 for the API dataset, 9 for the network-traffic dataset). Hyperparameters (including those of the classifier) are tuned via nested RandomizedSearchCV within the training folds. The final feature subset is the one that maximizes cross-validated accuracy across all candidate dimensionalities.
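The procedure above maps directly onto scikit-learn's `RFECV`. A minimal sketch on synthetic data (the estimator, dataset sizes, and labels are illustrative stand-ins for the study's setup; the half-of-original floor mirrors the stopping rule described above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the API-call frequency matrix
# (the real study uses 68 API-call counts per sample).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

# 5-fold CV accuracy drives elimination; stop at half the original
# features, mirroring the floor described above (34 of 68, 9 of 18).
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              step=1, cv=5, scoring="accuracy",
              min_features_to_select=X.shape[1] // 2)
rfecv.fit(X, y)

print(rfecv.n_features_)     # size of the selected subset
print(rfecv.support_)        # boolean mask over the input columns
```

The nested RandomizedSearchCV tuning reported in the source would wrap the estimator before it is handed to `RFECV`; it is omitted here to keep the sketch short.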
2. Datasets and Feature Baseline
RFECV selection is evaluated on two distinct datasets:
- API-call dataset ("Data1"): 1,460 ransomware PE files from 15 families; each sample is described by the frequency of 68 distinct Windows API calls.
- Network-traffic dataset ("Data2"): 856 PCAP traces; each sample is represented by 18 TCP/HTTP/DNS features, e.g., packet byte counts, TCP FIN/RST counts, DNS query/response metrics.
Both datasets serve as typical high-dimensional domains for malware detection, with substantial feature redundancy and potential noisiness.
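To make the API-call representation concrete, the sketch below turns hypothetical sandbox call traces into frequency vectors over a fixed API vocabulary (the traces and the three-API vocabulary are invented for illustration; the study's vocabulary spans 68 calls):

```python
from collections import Counter

# Hypothetical traces: each sample is the sequence of Windows API
# calls observed during execution (names illustrative).
traces = [
    ["NtCreateFile", "NtAllocateVirtualMemory", "NtCreateFile"],
    ["NtDelayExecution", "NtCreateFile"],
]

# Fixed vocabulary of monitored APIs (68 in the study; 3 here).
vocab = ["NtAllocateVirtualMemory", "NtCreateFile", "NtDelayExecution"]

def to_frequency_vector(trace):
    """Count each monitored API in one trace, in vocabulary order."""
    counts = Counter(trace)
    return [counts.get(api, 0) for api in vocab]

X = [to_frequency_vector(t) for t in traces]
print(X)  # [[1, 2, 0], [0, 1, 1]]
```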
3. Composition of RFECV-Selected Feature Subsets
The cardinality and composition of the final RFECV-selected subset vary by classifier and data modality:
API-Call Features (Data1)
The number of features selected per classifier ranges from 37 (SVM) to 40 (KNeighbors). While there is a heavily overlapping core set—including frequently invoked memory, file, and process-related APIs (e.g., NtAllocateVirtualMemory, NtCreateFile, NtDelayExecution)—certain APIs are retained, dropped, or swapped depending on the estimator (e.g., KNeighbors uniquely retains FindWindowExW, NB selects NtDeleteValueKey, RF drops and adds specific query APIs).
Network-Traffic Features (Data2)
Subset sizes range from 10 (SGD, KNeighbors) to 16 (GaussianNB). Multiclass Logistic Regression retains a broad suite (14 features), capturing bidirectional TCP states as well as HTTP/HTTPS and DNS characteristics; tree-based and SVM models prune slightly more aggressively, with minor differences in the inclusion of FIN/RST counts or the omission of server-side byte metrics.
A table summarizing subset cardinalities is shown below:
| Classifier | Data1: # Features | Data2: # Features |
|---|---|---|
| LogisticRegression | 38 | 14 |
| SGDClassifier | 38 | 10 |
| KNeighbors | 40 | 10 |
| GaussianNB | 38 | 16 |
| RandomForest | 38 | 13 |
| SVM | 37 | 13 |
The features consistently selected suggest the centrality of certain API invocation patterns and network behaviors across families and classifier types.
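Per-classifier subsets like those tabulated above can be compared through the boolean `support_` masks that `RFECV` exposes. A sketch on synthetic data with two illustrative estimators (sizes and data are stand-ins, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the 18-feature network-traffic matrix.
X, y = make_classification(n_samples=300, n_features=18, n_informative=6,
                           random_state=1)

masks = {}
for name, est in [("LR", LogisticRegression(max_iter=1000)),
                  ("RF", RandomForestClassifier(n_estimators=50,
                                                random_state=1))]:
    sel = RFECV(est, cv=5, scoring="accuracy",
                min_features_to_select=9).fit(X, y)
    masks[name] = sel.support_  # boolean column mask per classifier

# Intersect masks to find the shared "core" subset across estimators.
core = masks["LR"] & masks["RF"]
print({n: int(m.sum()) for n, m in masks.items()}, int(core.sum()))
```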
4. Impact on Model Performance and Efficiency
RFECV yields a reduction in feature dimensionality by approximately 40–50%, with practical ramifications for model accuracy and computational efficiency.
- Accuracy/F1-Score: For both datasets, all tested classifiers exhibit a consistent, albeit small, reduction in classification accuracy on the held-out (20%) test split. The drop ranges from 0.29 to 2.26 percentage points (see table below), with the greatest losses in SGD on the API data (–2.02 points) and KNeighbors on the network data (–2.26 points). F₁-scores show similar decrements.
| Classifier | API Acc w/ FS | API Acc w/o FS | ΔAcc (API) | Network Acc w/ FS | Network Acc w/o FS | ΔAcc (Network) |
|---|---|---|---|---|---|---|
| LR | 98.20% | 99.30% | –1.10% | 92.25% | 94.04% | –1.79% |
| SGD | 90.43% | 92.45% | –2.02% | 81.69% | 82.76% | –1.07% |
| KNN | 89.62% | 90.52% | –0.90% | 80.99% | 83.25% | –2.26% |
| NB | 97.17% | 97.46% | –0.29% | 97.89% | 98.95% | –1.06% |
| RF | 91.51% | 92.78% | –1.27% | 78.87% | 79.96% | –1.09% |
| SVM | 94.34% | 95.58% | –1.24% | 92.25% | 93.90% | –1.65% |
- Processing Time: Training and inference duration are reduced by approximately 27% (API calls) and 35% (network data) post-RFECV.
This suggests an effective trade-off between efficiency and moderate loss of predictive power, characteristic of aggressive dimensionality reduction in high-noise scenarios.
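A rough way to observe the efficiency side of this trade-off is to time the same classifier with and without a column mask standing in for an RFECV result (classifier, data, and the half-column mask are all illustrative; real savings depend on the model and data):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=60,
                           n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# Hypothetical mask standing in for an RFECV result: keep half
# the columns (~50% dimensionality reduction, as in the study).
keep = np.arange(X.shape[1]) < 30

def fit_and_score(Xa, Xb):
    """Train on Xa, score on Xb, and report wall-clock fit+score time."""
    t0 = time.perf_counter()
    clf = SVC().fit(Xa, y_tr)
    acc = clf.score(Xb, y_te)
    return time.perf_counter() - t0, acc

t_full, acc_full = fit_and_score(X_tr, X_te)
t_half, acc_half = fit_and_score(X_tr[:, keep], X_te[:, keep])
print(f"full: acc={acc_full:.3f} t={t_full:.3f}s; "
      f"half: acc={acc_half:.3f} t={t_half:.3f}s")
```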
5. Characteristics and Limitations of RFECV-Selected Subsets
RFECV-Selected Feature Subsets, while smaller and more computationally tractable, have notable properties and caveats:
- Dimensionality reduction is reliable at the 40–50% feature elimination mark for both data modalities.
- Loss of accuracy and increased false positives are observed uniformly across classifiers. A key cause is that some deselected features are identified (via SHAP) as high-impact for classification yet are not retained by the wrapper-based RFECV protocol, indicating the risk of overaggressive parsimony.
- Lack of internal ranking among retained features: Within the final subset, all features are coded as "rank 1" without further importance discrimination (a limitation of the scikit-learn implementation).
- Potential exclusion of behaviorally critical features: Since wrapper methods optimize for global accuracy rather than domain-informed criteria, RFECV may prune features vital for explainability or behavioral interpretability.
A plausible implication is that RFECV, if applied without supplementary interpretability constraints, can inadvertently degrade model utility for security use cases demanding forensic traceability or defense against evasion.
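The "rank 1" limitation can be seen directly in scikit-learn, and one workaround, running plain `RFE` down to a single feature on the retained columns, recovers an internal elimination order (sketch on synthetic data; the workaround is an illustration, not part of the source's protocol):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=5, random_state=2)

rfecv = RFECV(LogisticRegression(max_iter=1000),
              cv=5, scoring="accuracy").fit(X, y)
# Every retained feature shares rank 1 -- no ordering within the subset.
print(set(rfecv.ranking_[rfecv.support_]))  # {1}

# Workaround: run plain RFE down to one feature on the retained
# columns to recover a full elimination order among them.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=1).fit(X[:, rfecv.support_], y)
order = np.argsort(rfe.ranking_)  # most to least persistent columns
print(rfe.ranking_)
```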
6. Recommendations and Extensions
The integration of RFECV with model-agnostic importance scorers such as SHAP is recommended to guard against loss of high-value explanatory features. A hybrid approach may reconcile the computational and overfitting benefits of RFECV with the interpretability and robustness requirements of operational security systems.
For the classification of ransomware, off-the-shelf RFECV should not be exclusively relied upon for final feature selection. Results support consideration of additional post-selection validation, using domain insights or impact analyses, to prioritize features not only for statistical performance, but also for relevance to behavioral threat modeling and explainability (Mowri et al., 2022).
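One way to sketch such a post-selection check without a SHAP dependency is to use permutation importance as the model-agnostic scorer: flag features that RFECV discarded but that nonetheless score above average importance on held-out data. Everything below (data, estimator, threshold) is illustrative, and SHAP values could be substituted for the permutation scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=6, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=3)

rfecv = RFECV(RandomForestClassifier(n_estimators=30, random_state=3),
              cv=5, scoring="accuracy",
              min_features_to_select=10).fit(X_tr, y_tr)

# Model-agnostic importance on the FULL feature set (permutation
# importance as a lightweight stand-in for SHAP).
full_model = RandomForestClassifier(n_estimators=30,
                                    random_state=3).fit(X_tr, y_tr)
imp = permutation_importance(full_model, X_te, y_te, n_repeats=10,
                             random_state=3).importances_mean

# Features RFECV dropped despite above-average held-out impact:
# candidates for domain review before final discarding.
dropped_but_important = np.where(~rfecv.support_ & (imp > imp.mean()))[0]
print("review before discarding:", dropped_but_important)
```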