How to Accelerate XGBoost Training with Tree Methods, Cloud Computing, and Ray
The article explains why XGBoost training can be slow despite its speed focus and presents three acceleration techniques—choosing an optimal tree_method, leveraging cloud resources for larger memory, and using Ray for distributed training—complete with code examples and benchmark results.
Gradient boosting is widely used for supervised learning, and XGBoost is a popular open-source implementation optimized for speed; even so, training can still be slow on large datasets.
Changing the tree construction method
XGBoost's tree_method parameter selects the algorithm used to build trees (exact, approx, hist, gpu_hist, or auto). Choosing the method best suited to the dataset can significantly reduce training time: in the benchmark below, switching from hist to gpu_hist roughly halved the runtime.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import time

# Synthetic binary classification dataset: 100k rows, 1,000 features
X, y = make_classification(n_samples=100000, n_features=1000,
                           n_informative=50, n_redundant=0,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1)
evalset = [(X_train, y_train), (X_test, y_test)]

# Time model fitting with each tree construction method
results = []
methods = ['exact', 'approx', 'hist', 'gpu_hist', 'auto']
for method in methods:
    model = XGBClassifier(learning_rate=0.02,
                          n_estimators=50,
                          objective="binary:logistic",
                          use_label_encoder=False,
                          tree_method=method)
    start = time.time()
    model.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset)
    end = time.time()
    results.append(method + " Fit Time: " + str(end - start))
print(results)

If the operating system lacks native GPU support, the gpu_hist option should be omitted from the methods list.
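One way to handle this automatically is to probe for a usable GPU before building the methods list. The sketch below is one possible heuristic, not part of the original benchmark; it assumes the optional cupy package is a reasonable proxy for a working CUDA runtime, and silently falls back to CPU-only methods otherwise.

```python
def available_tree_methods():
    """Return tree_method values likely to work on this machine.

    GPU detection here is a heuristic: a successful cupy import plus at
    least one visible CUDA device is taken as evidence that gpu_hist
    will run. Any failure falls through to the CPU-only list.
    """
    methods = ['exact', 'approx', 'hist', 'auto']
    try:
        import cupy
        if cupy.cuda.runtime.getDeviceCount() > 0:
            methods.insert(3, 'gpu_hist')
    except Exception:
        pass  # no cupy or no CUDA runtime: stay CPU-only
    return methods

print(available_tree_methods())
```

Using this, the benchmark loop can iterate over available_tree_methods() instead of a hard-coded list.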
Leveraging cloud computing
Running XGBoost on cloud instances provides access to larger memory pools, which can accommodate bigger datasets and reduce paging overhead.
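Before choosing an instance type, it helps to estimate the in-memory footprint of the training matrix. The sketch below is a rough back-of-envelope calculation, not from the original article; the 3x overhead factor for XGBoost's internal copies and histogram structures is an assumption, not a measured figure.

```python
def estimate_memory_gb(n_samples, n_features, bytes_per_value=4,
                       overhead_factor=3.0):
    """Rough RAM estimate (GB) for training on a dense float32 matrix.

    overhead_factor covers the raw data plus working copies and
    histogram structures (an assumed multiplier, not a measured one).
    """
    raw_bytes = n_samples * n_features * bytes_per_value
    return raw_bytes * overhead_factor / 1024**3

# The benchmark dataset above: 100,000 rows x 1,000 float32 features
print(f"{estimate_memory_gb(100_000, 1_000):.1f} GB")  # about 1.1 GB
```

If the estimate exceeds the RAM of the local machine, a memory-optimized cloud instance avoids swapping to disk during training.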
Distributed XGBoost on Ray
Ray is a distributed framework that also offers a machine‑learning library. XGBoost‑Ray extends the native XGBoost API, allowing a single‑node script to scale to hundreds of nodes with multiple GPUs. The gradient exchange uses NCCL2, while inter‑node coordination relies on Rabit.
from xgboost_ray import RayXGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

seed = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=seed)

# n_jobs sets the number of distributed actors Ray will launch
clf = RayXGBClassifier(n_jobs=4, random_state=seed)
clf.fit(X_train, y_train)

pred_ray = clf.predict(X_test)
print(pred_ray)

pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)

The example demonstrates that only minimal code changes are required to switch from local XGBoost to distributed training.
In summary, the article presents three ways to speed up XGBoost training: selecting an optimal tree_method, using cloud resources for larger memory, and employing Ray for distributed execution.
This article has been distilled and summarized from source material, then republished for learning and reference.
Code DAO
