
Out-of-Core GPU Gradient Boosting

Published 19 May 2020 in cs.LG, cs.DC, and stat.ML | arXiv:2005.09148v1

Abstract: GPU-based algorithms have greatly accelerated many machine learning methods; however, GPU memory is typically smaller than main memory, limiting the size of training data. In this paper, we describe an out-of-core GPU gradient boosting algorithm implemented in the XGBoost library. We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time. To the best of our knowledge, this is the first out-of-core GPU implementation of gradient boosting. Similar approaches can be applied to other machine learning algorithms.


Summary

  • The paper introduces an out-of-core GPU gradient boosting technique integrated into XGBoost, addressing GPU memory limits effectively.
  • It employs incremental quantile generation and external ELLPACK matrix construction to efficiently handle large datasets.
  • Empirical results reveal comparable AUC scores and training speeds to in-core methods, even at reduced sampling rates.

Overview of "Out-of-Core GPU Gradient Boosting"

This paper introduces an out-of-core GPU implementation of gradient boosting, integrated into the XGBoost library. It addresses GPU memory limitations by enabling training on datasets larger than device memory without sacrificing model accuracy or significantly increasing training time. The implementation focuses in particular on mitigating the PCIe bottleneck that arises from repeatedly swapping data between main memory and the GPU.

Implementation Details

The core contribution of this research is the integration of out-of-core operations into XGBoost's GPU version. The implementation comprises several key algorithmic techniques tailored to manage memory efficiently and maintain computational speed. These include:

  1. Incremental Quantile Generation: The transformation of input features into quantile representations is adapted to an out-of-core context. Data is read in smaller batches, and the quantile sketch algorithm is applied incrementally.

def out_of_core_quantile_sketch(data):
    for page in data:
        for batch in page:
            # Transfer data to GPU memory
            transfer_to_gpu(batch)
            # Process feature columns in batch
            for column in batch.columns:
                cuts = find_column_cuts(batch, column)
                store_cuts(cuts)

  2. External ELLPACK Matrix Construction: Large datasets, typically not fitting into GPU memory, are managed by partitioning data into manageable ELLPACK pages, which facilitates efficient feature binning and compression.

def construct_external_ellpack(data, cuts):
    ellpack_pages = []
    for page in data:
        # Allocate and convert CSR to ELLPACK format
        ellpack_page = convert_to_ellpack(page, cuts)
        # Store compressed ELLPACK pages to external storage
        ellpack_pages.append(ellpack_page)
        save_page_to_disk(ellpack_page)
    return ellpack_pages

  3. Incremental Tree Construction with Gradient Sampling: The implementation employs a sampling strategy (Minimal Variance Sampling, MVS) that selects gradients based on their magnitudes, mitigating memory and computational overhead. The MVS approach allows aggressive sampling while maintaining model accuracy.
  4. Out-of-Core Tree Construction: With the data managed in ELLPACK pages, the tree construction involves selectively loading and processing these pages from storage, ensuring GPU memory is utilized effectively.
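The gradient sampling step (item 3) can be sketched as follows, in the same pseudocode-style Python as the earlier snippets. This is a minimal illustration of the MVS idea, not the XGBoost implementation: the `mvs_sample` name, the bisection search for the keep-probability threshold, and the `lam` regularizer default are all assumptions made for this sketch.

```python
import numpy as np

def mvs_sample(grad, hess, sample_rate, lam=0.1, rng=None):
    """Minimal Variance Sampling sketch: keep row i with probability
    proportional to its regularized gradient magnitude, capped at 1,
    and reweight kept rows by 1/p to keep the estimator unbiased."""
    rng = np.random.default_rng(0) if rng is None else rng
    score = np.sqrt(grad**2 + lam * hess**2)
    n = len(score)
    # Bisect for the threshold mu so that the expected number of
    # kept rows, sum(min(score/mu, 1)), equals sample_rate * n.
    lo, hi = 0.0, score.max()
    for _ in range(50):
        mu = 0.5 * (lo + hi)
        expected = np.minimum(score / mu, 1.0).sum()
        if expected > sample_rate * n:
            lo = mu  # too many rows kept: raise the threshold
        else:
            hi = mu  # too few rows kept: lower the threshold
    p = np.minimum(score / mu, 1.0)
    keep = rng.random(n) < p
    return keep, grad[keep] / p[keep], hess[keep] / p[keep]
```

Rows with large gradients are always kept (probability capped at 1), while low-gradient rows are subsampled and reweighted, which is what allows the aggressive sampling rates reported in the paper.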
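The page-at-a-time tree construction (item 4) can likewise be sketched as a histogram accumulation over pages. This is a simplified illustration under stated assumptions: each page is modeled as a `(row_offset, bin_indices)` pair of pre-binned features, and the function name and layout are hypothetical rather than taken from XGBoost.

```python
import numpy as np

def build_histograms_over_pages(pages, n_features, n_bins, grad):
    """Accumulate per-feature gradient histograms one page at a time,
    so only a single page of binned features needs to be resident in
    (GPU) memory at once."""
    hist = np.zeros((n_features, n_bins))
    feat = np.arange(n_features)
    for offset, bin_idx in pages:  # bin_idx: (rows_in_page, n_features)
        for r, row_bins in enumerate(bin_idx):
            # Add this row's gradient to its bin in every feature.
            hist[feat, row_bins] += grad[offset + r]
    return hist
```

Because the histograms are additive across rows, the result is identical to an in-core single pass over the full matrix; only the data movement pattern changes.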

Empirical Results

The implementation demonstrated that the new out-of-core GPU algorithm can train on datasets significantly larger than in-core methods permit. Evaluation on datasets such as Higgs indicates that model accuracy remains comparable across a range of sampling rates: sampling rates as low as f = 0.1 resulted in only a slight decrease in model performance while providing substantial data-scaling headroom.

Here are the results summarized based on the case study:

  • Training Time: The out-of-core GPU algorithm was comparable in speed to its in-core counterpart and offered substantial speedups over CPU-based implementations.
  • Accuracy: The AUC score stayed close to that of the in-core GPU version even at reduced sampling ratios, suggesting the accuracy impact is minimal.

Discussion and Future Work

The implementation highlights a promising area for optimizing machine learning workflows, particularly where data size outpaces available hardware resources. Out-of-core processing on a single GPU, as demonstrated, is a viable alternative to distributed computing for large-scale datasets. Future research might explore broader applications to other ML algorithms and enhance support for distributed GPU clusters integrating similar out-of-core computational strategies.

In summary, the paper successfully presents a strategy to enable the training of larger models on GPUs by overcoming memory constraints while maintaining efficiency and scalability. This advancement is integrated into the XGBoost library, ensuring wide accessibility for practitioners and researchers.
