Random Forests in Supervised Machine Learning
- Random Forests are ensemble methods that combine multiple decision trees using bootstrapping and randomized feature selection to reduce variance and improve predictive accuracy.
- They build trees using random subsets of data and features, with splits based on impurity minimization (e.g., Gini impurity) for both classification and regression tasks.
- Their robust performance and versatility have led to applications in fields like economics, astronomy, genomics, and finance, with recent enhancements in online learning and uncertainty quantification.
Random Forests are an ensemble learning method for supervised classification and regression that builds upon the strengths of individual decision trees while mitigating their high variance and instability. In supervised machine learning tasks, Random Forests have achieved broad adoption due to their empirical performance, theoretical properties, interpretability tools, and ability to accommodate high-dimensional, heterogeneous, or incomplete data. The method operates by generating a collection of randomized decision trees trained on bootstrapped samples of the data and aggregating their predictions, typically via majority vote (classification) or averaging (regression). By exploiting randomized feature selection at each split, Random Forests introduce decorrelation among trees and produce robust, high-accuracy predictors across diverse applications, including economics, astronomy, genomics, and time series analysis (Hu et al., 2020, Capitaine et al., 2019, Lauretto et al., 2013).
1. Core Random Forest Algorithm and Model Structure
Random Forests, introduced by Breiman (2001), are ensembles of decision trees {h₁(x),…,h_T(x)} built using two sources of randomness: bootstrap sampling of the dataset for each tree and random feature selection at each node split. Formalizing:
- For classification, the forest prediction is the majority vote:
  ŷ(x) = argmax_c Σ_{t=1}^{T} 1{h_t(x) = c}.
- For regression, it is the average:
  ŷ(x) = (1/T) Σ_{t=1}^{T} h_t(x).
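The two aggregation rules can be illustrated with a toy NumPy sketch (the per-tree outputs below are hypothetical, not from a trained forest):

```python
import numpy as np

# Hypothetical outputs of T = 5 trees for a single input x.
class_votes = np.array([0, 1, 1, 1, 0])            # classification: predicted labels
reg_outputs = np.array([2.0, 2.4, 1.8, 2.2, 2.6])  # regression: predicted values

# Classification: majority vote across trees.
labels, counts = np.unique(class_votes, return_counts=True)
y_class = labels[np.argmax(counts)]

# Regression: average across trees.
y_reg = reg_outputs.mean()

print(y_class, y_reg)
```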
Key aspects of tree construction:
- At each node, instead of scanning all features, a random subset of m features is selected. Typically, m = √p for classification and m = p/3 for regression, where p is the total number of features (Hu et al., 2020).
- Splitting is based on impurity minimization. For classification, the common criterion is Gini impurity:
  G = Σ_c p_c(1 − p_c) = 1 − Σ_c p_c²,
  where p_c is the fraction of class c at the node (Kim, 2021, Hu et al., 2020).
- Each tree is grown to full depth or stopped according to user-specified parameters such as minimum leaf size or impurity decrease.
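The Gini criterion can be computed directly from the class counts at a node; a minimal sketch:

```python
import numpy as np

def gini_impurity(labels):
    # G = 1 - sum_c p_c^2, where p_c is the class-c fraction at the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # pure node: 0.0
print(gini_impurity([0, 0, 1, 1]))  # maximally mixed binary node: 0.5
```

A split is chosen to maximize the decrease in this quantity, weighted by child-node sizes.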
Pseudocode for standard Random Forest training is provided explicitly in (Hu et al., 2020), and error estimation leverages the “out-of-bag” (OOB) approach, whereby each tree is evaluated on the data not included in its own bootstrap sample.
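OOB error estimation is built into common implementations; a sketch assuming scikit-learn and synthetic data (`oob_score_` is scikit-learn's internal OOB accuracy estimate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (sizes are illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True evaluates each tree on the samples left out of its
# bootstrap, yielding an internal estimate of generalization accuracy.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)
```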
2. Theoretical Foundations and Extensions
Random Forests benefit from aggregation (bagging) as variance reduction, with low correlation among trees fostered by random feature selection, and deep individual trees to minimize bias. Their universal consistency for regression and certain rates for classification, under mild conditions, have been established (Hu et al., 2020). Regularization arises implicitly from the randomization procedures.
Extensions to the classical framework include:
- High-dimensional outputs: Application of random projections allows for computational scaling when target space dimension is large (Joly, 2017).
- Covariate shift adaptation: Locally Optimized Random Forests use nonparametric density-ratio estimates to reweight splits and predictions so as to minimize loss under a different test distribution, improving performance under domain shift (1908.09967).
- Online learning: Aggregated Mondrian Forests (AMF) provide online, parameter-free updating of partition trees with sublinear regret to optimal prunings, exploiting the Mondrian process and Context Tree Weighting for aggregation (Mourtada et al., 2019).
- High-dimensional longitudinal data modeling: Random Forests embedded in semi-parametric mixed effects models (e.g., REEMforest) with covariance-aware modifications yield lower bias and higher accuracy for repeated-measures genomic and clinical data (Capitaine et al., 2019).
- Oblique splits and projection pursuit: Projection-pursuit Forests use linear combinations of features to split nodes, improving performance when class boundaries are not axis-aligned (Silva et al., 2018).
- Sparse data and model compression: Efficient feature selection, random projections, and ℓ₁-based pruning make Random Forests scalable to input/output matrices with high dimensionality and sparsity (Joly, 2017, Mao et al., 2022).
- Local variable importance: Dimension Reduction Forests (DRF) provide local, directional variable importances by combining random forests with local sufficient dimension reduction, offering pointwise insights not possible with standard global importance metrics (Loyal et al., 2021).
3. Model Tuning, Complexity, and Practical Implementation
Principal hyperparameters impacting generalization performance are:
- Number of trees T (100–1000 typical; performance plateaus as T increases).
- m: number of candidate features at each split, with the standard heuristics m = √p for classification and m = p/3 for regression.
- Maximum tree depth or minimum samples per leaf (controls complexity and overfitting risk).
- For imbalanced or multiclass problems: class weights or sample rebalancing may be needed (Kim, 2021, Pennock et al., 14 Jan 2025).
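These hyperparameters map directly onto common implementations; a scikit-learn sketch on a synthetic imbalanced problem (parameter values are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic problem (roughly 90% / 10% classes).
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=300,         # performance typically plateaus as T grows
    max_features="sqrt",      # m = sqrt(p), the classification heuristic
    min_samples_leaf=5,       # caps tree complexity / overfitting risk
    class_weight="balanced",  # reweights classes inversely to frequency
    random_state=0,
).fit(X, y)
print(clf.score(X, y))
```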
Computational complexity per tree is O(m·n·d) (n = number of samples, d = tree depth, m = candidate features per split), for a total of O(T·m·n·d). Prediction is O(T·d) per point (Hu et al., 2020).
Random Forests are highly parallelizable both during training (independent trees) and evaluation. Storage and inference costs can be mitigated via ensemble compression (monotone Lasso selection of indicator features) or sparse data representations (Joly, 2017).
4. Interpretability and Feature Importance
Random Forests provide several built-in interpretability mechanisms:
- Mean Decrease in Gini: Measures the total reduction in impurity afforded by each feature, averaged over the ensemble (Kim, 2021).
- Permutation Importance: Assesses the impact on OOB error when a feature is permuted (destroying its association with the label) (Hu et al., 2020).
- Partial Dependence Plots (PDPs): Visualize marginal predicted effects of individual features, averaging out the influence of others.
- Local Variable Importance: Recent methods compute pointwise importances via forest-induced kernels and local covariance matrices (Loyal et al., 2021).
- Concept Lattices: Translate tree-based rules into symbolic, closed-premise rules, encoding all learned conjunctions within a lattice structure for increased interpretability (Dudyrev et al., 2021).
Bias-correction for importance scores, e.g., via shadow features, and hybrid forward-selection with nonparametric two-sample testing, further improve feature relevance ranking and parsimony (Mao et al., 2022).
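Permutation importance can be sketched with scikit-learn's implementation (synthetic regression data in which only a few features are informative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# 8 features, only 3 informative; permuting an informative feature
# should degrade accuracy far more than permuting a noise feature.
X, y = make_regression(n_samples=400, n_features=8, n_informative=3,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.round(3))
```

Computing importances on held-out rather than training data avoids inflating scores for features the forest has overfit.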
5. Uncertainty Quantification and Out-of-Distribution Adaptation
Random Forests can be used to distinguish and quantify uncertainty components in supervised prediction:
- Aleatoric uncertainty (irreducible, statistical): Captured by within-leaf entropy or variance.
- Epistemic uncertainty (reducible, model): Quantified by the spread of predictions across trees, mutual information measures, or the variance of tree means (Shaker et al., 2020).
Explicit formulas for entropy-based and likelihood-based decompositions are given, enabling per-instance uncertainty diagnostics. Random forests support abstention strategies under uncertainty, and outperform or match deep neural network ensembles in uncertainty calibration on standard tabular datasets (Shaker et al., 2020).
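The entropy-based decomposition can be sketched from per-tree class probabilities (a simplified illustration assuming scikit-learn, not the exact estimators of Shaker et al.):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def entropy(p, eps=1e-12):
    # Shannon entropy along the class axis.
    return -np.sum(p * np.log(p + eps), axis=-1)

X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-tree class probabilities: shape (T, n_samples, n_classes).
P = np.stack([tree.predict_proba(X) for tree in rf.estimators_])

total = entropy(P.mean(axis=0))      # entropy of the averaged distribution
aleatoric = entropy(P).mean(axis=0)  # mean per-tree entropy (within-tree noise)
epistemic = total - aleatoric        # disagreement across trees
print(float(total.mean()), float(epistemic.mean()))
```

The epistemic term is the Jensen gap between the ensemble entropy and the mean per-tree entropy, so it is nonnegative and vanishes when all trees agree.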
For covariate shift, weighted adaptations (LORF) minimize risk under target distributions via importance weighting at both tree-growing and prediction aggregation phases. OOB error rate under importance weights remains a consistent estimator of shifted risk (1908.09967).
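A minimal proxy for this importance weighting (toy Gaussian shift with a closed-form density ratio; LORF itself estimates the ratio nonparametrically and reweights split selection, whereas this sketch only reweights samples):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Training inputs x ~ N(0, 1); imagine test inputs shifted to x ~ N(1, 1).
X_tr = rng.normal(0.0, 1.0, size=(500, 1))
y_tr = (X_tr[:, 0] + 0.3 * rng.normal(size=500) > 0.5).astype(int)

def density_ratio(x):
    # w(x) = p_test(x) / p_train(x) for the two unit-variance Gaussians.
    return np.exp(-(x - 1.0) ** 2 / 2.0) / np.exp(-(x ** 2) / 2.0)

w = density_ratio(X_tr[:, 0])

# Importance weights enter tree growing through sample_weight, biasing
# splits toward regions that matter under the target distribution.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr, sample_weight=w)
print(rf.score(X_tr, y_tr, sample_weight=w))
```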
6. Applications and Empirical Performance
Random Forests have been employed across diverse domains:
- Socioeconomic prediction: Multiclass poverty status in Costa Rica households achieved an F1 of 64.9%, with education variables and dependency ratios emerging as dominant features (Kim, 2021).
- Astronomy: Probabilistic Random Forests classified 130 million extragalactic and stellar sources (VMC, SMC/LMC) with test accuracies up to 98% at high confidence, leveraging missing data and feature uncertainty (Pennock et al., 14 Jan 2025).
- Finance: Stock market operation recommendation models, using technical features, delivered >80% success on advised trades in rolling-window evaluation (Lauretto et al., 2013).
- Genomics and biomedicine: High-dimensional longitudinal analysis over tens of thousands of gene transcripts (Capitaine et al., 2019).
- Online and streaming scenarios: Mondrian forest architectures provide online-adaptive, minimax-optimal learning with strong theoretical and practical performance (Mourtada et al., 2019).
- High-dimensional, multi-label tasks: Random projection and sparsity techniques yield accurate, scalable models for very large output spaces (Joly, 2017).
- Feature selection: Bias-corrected Random Forest importances and deep-kernel forward selection outperform Boruta and minimum-depth methods on both simulation and real-world regression tasks (Mao et al., 2022).
7. Limitations, Variants, and Future Directions
Limitations include:
- Lower interpretability compared to single trees.
- Sensitivity to class imbalance and missing data if not handled explicitly.
- No built-in shrinkage or boosting-style sequential regularization.
Proposed and implemented variants address:
- Oblique splitting for improved class separation in correlated feature spaces (Silva et al., 2018).
- Structured aggregation for knowledge discovery (constrained lattices) (Dudyrev et al., 2021).
- Online forests for incremental and streaming learning (Mourtada et al., 2019).
- Local variable importance via adaptive kernel machinery (Loyal et al., 2021).
- Compression and fast inference with large ensembles (Joly, 2017).
Open research areas include sharper theoretical analysis of generalization and variable selection properties, further computational scaling on truly massive or ultra-sparse data, and integration with probabilistic graphical models and causal inference frameworks.
Key references: (Hu et al., 2020, Kim, 2021, Silva et al., 2018, Shaker et al., 2020, Dudyrev et al., 2021, Capitaine et al., 2019, Loyal et al., 2021, 1908.09967, Pennock et al., 14 Jan 2025, Lauretto et al., 2013, Mourtada et al., 2019, Joly, 2017, Mao et al., 2022).