A Comparative Analysis of XGBoost

Published 5 Nov 2019 in cs.LG and stat.ML | (1911.01914v1)

Abstract: XGBoost is a scalable ensemble technique based on gradient boosting that has proven to be a reliable and efficient machine learning challenge solver. This work proposes a practical analysis of how this novel technique works in terms of training speed, generalization performance and parameter setup. In addition, a comprehensive comparison between XGBoost, random forests and gradient boosting has been performed using carefully tuned models as well as the default settings. The results of this comparison indicate that XGBoost is not necessarily the best choice under all circumstances. Finally, an extensive analysis of the XGBoost parameter tuning process is carried out.

Citations (1,112)

Summary

  • The paper demonstrates that XGBoost significantly reduces training time while maintaining competitive accuracy compared to gradient boosting and random forest.
  • The study details parameter tuning by identifying balanced default values for the learning rate, gamma, and subsampling rate that preserve performance while reducing computational cost.
  • The findings highlight that despite marginal accuracy gains from tuned gradient boosting, XGBoost's efficient structure and robust defaults position it as a competitive ML tool.

The paper provides an empirical examination of XGBoost, comparing its training speed and accuracy against other ensemble methods, namely gradient boosting and random forest, across a variety of datasets. It also explores the essential question of parameter tuning, aiming to understand how efficiently and effectively XGBoost can be configured.

Methodology

Random Forest and Gradient Boosting

Random Forest (RF) is an ensemble method built upon decision trees with inherent randomization mechanisms. Each tree is trained on a bootstrap sample, and only a random subset of features, sized by a sublinear rule such as the square root or logarithm of the total feature count, is considered at each split. This generally renders RF parameter-efficient, as its generalization performance is robust across a vast range of defaults with minimal tuning required.
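The randomization described above can be sketched with scikit-learn's `RandomForestClassifier`; this is an illustrative setup on synthetic data, not the authors' exact configuration (`max_features="log2"` selects the log-sized per-split feature subset).

```python
# Sketch: Random Forest with bootstrap samples and a log2-sized
# per-split feature subset (illustrative, not the paper's setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree sees a bootstrap sample; each split considers only
# log2(n_features) candidate features.
rf = RandomForestClassifier(n_estimators=200, max_features="log2",
                            random_state=0)
scores = cross_val_score(rf, X, y, cv=10)
print(round(scores.mean(), 3))
```

With defaults left otherwise untouched, accuracy is typically already strong, which is the parameter robustness the paper highlights.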

Gradient Boosting (GB), in contrast, relies on sequentially adding predictors to minimize a loss function. It is highly effective but susceptible to overfitting if not adequately regularized. The tuning process for GB involves adjusting parameters like learning rate, maximum depth, subsampling rate, and others to achieve generalization without overfitting.
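The regularization knobs listed above can be sketched with scikit-learn's `GradientBoostingClassifier`; the particular values here are illustrative defaults, not the paper's tuned settings.

```python
# Sketch: the main gradient boosting regularization parameters
# (values are illustrative, not the paper's tuned configuration).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,  # shrinkage: smaller values need more trees
    max_depth=3,        # limits per-tree complexity
    subsample=0.8,      # stochastic boosting: row sampling per iteration
    random_state=0,
)
gb.fit(X_tr, y_tr)
print(round(gb.score(X_te, y_te), 3))
```

Lowering the learning rate or depth trades fitting power for regularization, which is why these parameters dominate the tuning process described above.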

XGBoost

XGBoost modifies traditional gradient boosting by incorporating a regularized objective to control model complexity. It also implements several computational enhancements, such as column-block storage and parallel split finding, that significantly speed up training. Its parameter space includes the learning rate, gamma (the minimum loss reduction required to split a node), tree depth, feature sampling, and more, making tuning crucial for optimal performance.
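The regularized objective that distinguishes XGBoost from plain gradient boosting can be written (following the notation of the original XGBoost paper by Chen and Guestrin, not this paper's text) as:

```latex
\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
```

where $l$ is the loss, $T$ is the number of leaves in a tree $f$, and $w$ are its leaf weights. The $\gamma T$ term is what makes gamma act as a minimum loss reduction for splits, and $\lambda$ is the L2 penalty on leaf weights.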

Experimental Results

The experiments used 28 diverse datasets from the UCI repository to test XGBoost against RF and GB. Parameter tuning was performed via grid search with 10-fold cross-validation, using 200 trees per ensemble. Results showed that tuned gradient boosting often achieved the highest accuracy, but its advantage over XGBoost and over random forest with default settings was rarely statistically significant, underscoring the robustness of RF's default parameters.
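The evaluation protocol above can be sketched as a grid search with 10-fold cross-validation and 200 trees per ensemble; the tiny grid and synthetic data below are illustrative stand-ins for the paper's full grids and UCI datasets.

```python
# Sketch of the protocol: grid search + 10-fold CV, 200 trees per
# ensemble. The grid values are illustrative, not the paper's.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=200, random_state=0),
    grid, cv=10, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Even this 4-point grid multiplies training cost by 40 (4 configurations times 10 folds), which illustrates why the parameter search, not a single model fit, dominated the computational cost reported in the paper.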

Notably, XGBoost demonstrated a significant reduction in training time compared to GB, attributed to its cache-aware column-block data structure and parallel split finding. However, the extensive parameter search constituted the bulk of the computational cost, especially for the GB and XGBoost setups.

Analysis of XGBoost Parameterization

Further analysis explored XGBoost's parameter grid, attempting to refine the default configuration and the grid itself to enhance performance while reducing unnecessary tuning. It suggested intermediate values for parameters such as the learning rate, gamma, and subsampling rate to provide stronger defaults without sacrificing generalization or efficiency. Importantly, tuning the randomization-related parameters did not yield significant performance improvements while incurring computational cost, so they could be fixed without adverse effects.

Conclusion

The study delineates the balance between the necessity of parameter tuning and computational efficiency. While gradient boosting offered slight accuracy advantages, XGBoost, with its scalable execution and well-rounded parameter defaults, remains compellingly competitive. The study also showed that random forests excel with minimal tuning, delivering reliable accuracy out of the box.

The computational trade-offs associated with extensive parameter searches highlight XGBoost's efficient resource use while it remains competitive with, or surpasses, the other techniques in performance. This positions XGBoost as a formidable tool for ML tasks across various domains and helps explain its prevalence in competitive machine learning. The findings underscore the strategic advantage of parameter optimization, especially in competitive settings, while cautioning against excessive computational expenditure for minimal gain.