D_TreeEVO: Hybrid Decision Tree Evolution
- D_TreeEVO is a hybrid metaheuristic that combines population-based feature selection via EVO with pruned decision trees for efficient classification in high-dimensional IDS tasks.
- It employs dynamic decay rules and energy barrier dynamics to optimize binary feature masks, significantly reducing feature sets while enhancing accuracy.
- The unified pipeline ensures streamlined preprocessing, rapid model training, and interpretable results, outperforming baseline methods on benchmark intrusion detection datasets.
D_TreeEVO is a hybrid decision tree metaheuristic that integrates the Energy Valley Optimizer (EVO) for population-based feature selection with a conventional (typically pruned) decision tree classifier, targeting high-performance learning, especially for high-dimensional tabular tasks such as intrusion detection in cloud computing environments. D_TreeEVO combines wrapper-based feature selection, evolutionary search, and supervised classification to address key issues of dimensionality, complexity, runtime, and predictive performance in modern data-driven security applications (Al-Husseini, 24 Jun 2025).
1. Model Architecture and Workflow
D_TreeEVO consists of two principal components: the Energy Valley Optimizer (EVO) for wrapper-based feature selection and a decision tree (DT) classifier. The overall process follows these steps:
- Data Preprocessing: Remove identifiers (e.g., IPs, timestamps), handle missing values, encode categorical variables, balance classes via downsampling, and scale numeric features using Min–Max normalization.
- Feature Selection via EVO: A population of candidate binary feature-selection masks is maintained; each mask encodes a subset of the total features.
- Wrapper Evaluation: For each mask, a decision tree is trained (typically with cross-validation), and performance metrics (e.g., accuracy, detection rate, false positive/negative rates) are computed.
- Population Update: EVO moves candidate masks according to energy-barrier dynamics and neighborhood relationships, balancing exploration and exploitation.
- Final Selection: After convergence, the best mask is used to train a final decision tree classifier on the selected features.
- Model Evaluation: Predictive performance is measured on a held-out test set using accuracy, F1, detection/recall, and false alarm rates.
This modular workflow enables fast training, interpretable models, and robust handling of large-scale, high-dimensional data.
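The wrapper-evaluation step at the heart of this workflow can be sketched as follows. This is a minimal illustration using scikit-learn conventions on synthetic data; the EVO search itself is elided, with a random mask standing in for a candidate produced by the optimizer:

```python
# Minimal sketch of wrapper evaluation: score one binary feature mask
# by the cross-validated accuracy of a depth-limited decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def evaluate_mask(mask: np.ndarray) -> float:
    """Cross-validated DT accuracy on the features selected by `mask`."""
    if not mask.any():                 # empty feature subsets are invalid
        return 0.0
    tree = DecisionTreeClassifier(max_depth=10, random_state=0)
    return cross_val_score(tree, X[:, mask.astype(bool)], y, cv=5).mean()

mask = rng.integers(0, 2, size=X.shape[1])   # stand-in for an EVO candidate
print(f"{mask.sum()} features -> CV accuracy {evaluate_mask(mask):.3f}")
```

In the full pipeline, EVO would call `evaluate_mask` for every candidate in the population at each iteration, which is why the cheap-to-train decision tree is a natural wrapper model.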
2. Energy Valley Optimizer (EVO): Algorithmic Formulation
EVO operates as a population-based metaheuristic, where each population member (particle) is a binary selection vector indicating the active features. The objective is to minimize a cost function; for IDS this is typically a weighted sum of classification error, false positive rate (FPR), and false negative rate (FNR):

$f(\mathbf{x}) = w_1\,(1 - \mathrm{Acc}(\mathbf{x})) + w_2\,\mathrm{FPR}(\mathbf{x}) + w_3\,\mathrm{FNR}(\mathbf{x})$
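A minimal sketch of such a weighted cost, computed from a binary confusion matrix; the specific weight values here are illustrative, not the ones used in the cited work:

```python
# Illustrative wrapper fitness: weighted sum of error rate, FPR, and FNR.
# Lower is better; labels are binary with 1 = attack, 0 = benign.
import numpy as np

def fitness(y_true: np.ndarray, y_pred: np.ndarray,
            w1: float = 0.6, w2: float = 0.2, w3: float = 0.2) -> float:
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # false alarms on benign
    fnr = fn / (fn + tp) if (fn + tp) else 0.0   # missed attacks
    return w1 * (1 - acc) + w2 * fpr + w3 * fnr

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(round(fitness(y_true, y_pred), 4))   # -> 0.3333
```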
Particles update according to four decay-mimetic rules (alpha, gamma, and two beta variants), combining exploitation (movement toward the global best mask $\mathbf{x}_{best}$), exploration (perturbation toward a random neighbor $\mathbf{x}_{ng}$), and neighborhood/centroid terms. Update steps are stochastically weighted and thresholded back to binary.
Update equations (schematically, with $\mathbf{x}_i^{t}$ the current mask of particle $i$ at iteration $t$):
- Alpha decay: $\mathbf{x}_i^{t+1} = \mathbf{x}_i^{t} + r_1\,(\mathbf{x}_{best} - \mathbf{x}_i^{t})$
- Gamma decay: $\mathbf{x}_i^{t+1} = \mathbf{x}_i^{t} + r_2\,(\mathbf{x}_{ng} - \mathbf{x}_i^{t})$
- Beta decay, centroid: $\mathbf{x}_i^{t+1} = \mathbf{x}_i^{t} + \frac{r_3\,(\mathbf{x}_{best} - \mathbf{x}_i^{t}) + r_4\,(\mathbf{x}_{CP} - \mathbf{x}_i^{t})}{SL_i}$
- Beta decay, neighbor: $\mathbf{x}_i^{t+1} = \mathbf{x}_i^{t} + \frac{r_5\,(\mathbf{x}_{best} - \mathbf{x}_i^{t}) + r_6\,(\mathbf{x}_{ng} - \mathbf{x}_i^{t})}{SL_i}$
where $r_1, \dots, r_6$ are random reals in $[0, 1]$, $\mathbf{x}_{best}$ is the current best solution, $\mathbf{x}_{CP}$ is the population centroid, and $SL_i$ is a stability factor.
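One decay-style move with thresholding back to binary can be sketched as follows. This is schematic: the full EVO bookkeeping (neutron enrichment levels, stability bounds, rule selection) is omitted, and the exact step form is illustrative rather than the published one:

```python
# Schematic decay-style EVO step for a binary feature mask: take a
# real-valued move toward best/centroid/neighbor, scale by a stability
# factor, then threshold back to {0, 1} via a sigmoid transfer function.
import numpy as np

rng = np.random.default_rng(1)

def decay_step(x: np.ndarray, x_best: np.ndarray, x_cp: np.ndarray,
               x_ng: np.ndarray, sl: float) -> np.ndarray:
    r = rng.random(3)
    step = (r[0] * (x_best - x)        # exploitation toward global best
            + r[1] * (x_cp - x)        # centroid (neighborhood) term
            + r[2] * (x_ng - x)        # exploration via random neighbor
            ) / max(sl, 1e-9)          # stability factor scaling
    prob = 1.0 / (1.0 + np.exp(-(x + step)))     # sigmoid transfer
    return (rng.random(x.size) < prob).astype(int)

d = 10
pop = rng.integers(0, 2, size=(5, d)).astype(float)
x_best, x_cp, x_ng = pop[0], pop.mean(axis=0), pop[2]
new_mask = decay_step(pop[1], x_best, x_cp, x_ng, sl=1.5)
print(new_mask)
```

The sigmoid transfer is one common way to binarize continuous metaheuristic moves for feature selection; any monotone transfer function with stochastic rounding would serve the same purpose.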
Empirically, EVO reduces feature sets (e.g., CIC-DDoS2019: 88 → 38 features, CSE-CIC-IDS2018: 80 → 43), improving both computational efficiency and classification rates (Alhusseini et al., 3 Jan 2026).
3. Decision Tree Component: Splitting and Hyperparameters
The DT classifier uses either Gini impurity,
$\mathrm{Gini}(t) = 1 - \sum_{k} p_k^2,$
or entropy/information gain, $H(t) = -\sum_{k} p_k \log_2 p_k$, where $p_k$ is the proportion of class $k$ at node $t$. Splits are greedy, maximizing the impurity reduction (Gini decrease or IG) over the features selected by EVO. The recommended hyperparameters, based on evaluation on cloud-IDS streams, are:
- max_depth = 10 (to prevent overfitting),
- min_samples_split = 0.05 × N,
- min_samples_leaf = 0.02 × N (with N the number of training samples),
- no post-pruning (early stopping via above constraints).
This configuration balances tree expressivity with statistical robustness, particularly in high-dimensional settings after feature selection.
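Expressed with scikit-learn, this configuration is a one-liner; note that float values of `min_samples_split` and `min_samples_leaf` are interpreted as fractions of the training-set size, which matches the 0.05 × N and 0.02 × N rules (synthetic data used here for illustration):

```python
# Decision tree with the recommended hyperparameters: depth-limited,
# fraction-based split/leaf minimums, no post-pruning.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",          # or "entropy" for information gain
    max_depth=10,              # early stopping instead of post-pruning
    min_samples_split=0.05,    # float => 0.05 * n_samples
    min_samples_leaf=0.02,     # float => 0.02 * n_samples
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())
```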
4. End-to-End D_TreeEVO Pipeline: Practical Realization
Pipeline steps are as follows:
- Data loading: Ingestion from public IDS datasets (CIC-DDoS2019, CSE-CIC-IDS2018, NSL-KDD), typically downsampling to manage class imbalance.
- Preprocessing: Removal of redundant information, imputation, encoding, balance correction, and scaling.
- Feature selection: EVO run with population size and iteration count set according to dataset scale (e.g., 32-run ensembles in benchmarking).
- Model training: Stratified 80/20 train/test split; DT fitted on features selected by EVO.
- Evaluation protocol: metrics computed per run, with mean and standard deviation reported across 24–32 repeats (random seeds).
All steps are implemented in a unified workflow, achieving high computational efficiency and interpretability.
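The repeated-runs evaluation protocol above can be sketched as stratified 80/20 splits over multiple seeds (fewer repeats here than the 24–32 used in benchmarking, and synthetic data in place of the IDS datasets):

```python
# Sketch of the evaluation protocol: repeated stratified 80/20 splits,
# reporting mean and standard deviation of test accuracy across seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=25, random_state=0)

scores = []
for seed in range(8):                      # papers report 24-32 repeats
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(max_depth=10, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"accuracy = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```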
5. Empirical Benchmarking and Comparative Performance
D_TreeEVO demonstrates state-of-the-art performance on benchmark IDS datasets when compared with baseline ML and metaheuristic combinations.
| Dataset | Model | Accuracy | F1-score | # Features |
|---|---|---|---|---|
| CIC-DDoS2019 | D_TreeEVO | 99.13% | 98.94% | 38 |
| CIC-DDoS2019 | SVMEVO | 95.60% | 94.99% | 38 |
| CIC-DDoS2019 | RFEVO | 95.86% | 95.34% | 38 |
| CSE-CIC-IDS2018 | D_TreeEVO | 99.78% | 99.70% | 43 |
| CSE-CIC-IDS2018 | SVMEVO | 98.50% | 98.51% | 43 |
Confusion matrices show >99% correct classification for most classes, with rare misclassification (<0.3%). D_TreeEVO outperforms both deep learning and other hybrid methods on these benchmarks (Al-Husseini, 24 Jun 2025, Alhusseini et al., 3 Jan 2026).
6. Analysis of Trade-offs, Limitations, and Observed Behavior
- EVO vs GWO: EVO achieves faster and more reliable convergence than Grey Wolf Optimizer (GWO) in IDS feature selection, due to its dynamic energy/barrier landscape and combined centroid/global/neighborhood updates (Al-Husseini, 24 Jun 2025).
- Detection Rate vs False Alarm: D_TreeEVO yields slightly lower detection rates but significantly reduced false alarm rates compared to GWO-based approaches, a trade-off suitable for operational IDS contexts.
- Computational overhead: EVO incurs additional metaheuristic search cost but leads to substantially reduced dimensionality, improving training/inference time post-selection.
- Robustness: No formal significance tests are reported, but the consistency of gains across runs and low error rates are suggestive of true improvement.
- Scalability: D_TreeEVO's ability to operate on large-scale, imbalanced datasets with high-dimensional feature spaces demonstrates practical viability.
7. Limitations, Open Questions, and Prospects
- Hyperparameter tuning: EVO-specific meta-parameters (population size, decay coefficients, iteration count) must be tuned per dataset.
- Generalization: The approach is sensitive to very low-frequency class instances; extending to ensembles or deep learners is posed as future work (Al-Husseini, 24 Jun 2025, Alhusseini et al., 3 Jan 2026).
- Online adaptation: The current workflow does not support online/streaming feature re-selection; incremental model adaptation under concept drift is an open avenue.
- Significance: While empirical differences are substantial, formal statistical validation (e.g., McNemar's test, paired t-test) is not reported.
A plausible implication is that D_TreeEVO, by integrating a highly adaptive evolutionary feature selector with statistically robust tree learners, provides a flexible, interpretable, and performant solution for high-dimensional classification domains, especially in security-sensitive cloud environments. The demonstrated rapid convergence and consistent gains across multiple datasets position D_TreeEVO as a leading framework for hybrid metaheuristic-ML pipelines (Al-Husseini, 24 Jun 2025, Alhusseini et al., 3 Jan 2026).