How to Scale XGBoost with Ray for Distributed Multi‑GPU Training

XGBoost‑Ray provides a fault‑tolerant, multi‑node, multi‑GPU backend for XGBoost that integrates seamlessly with Ray Tune, supports distributed data loading, and requires only three code changes to enable, making training and inference scale to large clusters.

XGBoost‑Ray is a distributed backend for XGBoost that supports multi‑node and multi‑GPU training, advanced fault‑tolerance, distributed data loading, and seamless integration with the hyper‑parameter optimization library Ray Tune, while remaining fully compatible with the core XGBoost API.

Multi‑node and multi‑GPU training

On a Ray cluster, XGBoost‑Ray instantiates a configurable number of training actors; each actor trains on an independent data partition (data‑parallel training). Gradients are aggregated with a tree‑based all‑reduce; multi‑GPU training uses NCCL2 for efficient cross‑device communication, while CPU‑based communication goes through Rabit.
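
As a minimal sketch of a multi‑GPU setup (the actor count, GPU allocation, and dataset are illustrative, and a Ray cluster exposing at least four GPUs is assumed), GPU training is requested through RayParams; NCCL2 is used under the hood once each actor holds a GPU:

from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

# Illustrative: 4 actors, each pinned to one GPU and holding one data shard.
ray_params = RayParams(num_actors=4, gpus_per_actor=1, cpus_per_actor=4)

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

bst = train({
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # standard XGBoost GPU tree method
}, train_set, ray_params=ray_params)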

Distributed hyper‑parameter search

XGBoost‑Ray integrates with Ray Tune by automatically creating callbacks that report training status, checkpoint models, and allocate appropriate resources for each trial based on the distributed training configuration. This allows multiple distributed training jobs to run concurrently, each launching distributed training actors across the Ray cluster.
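
As a sketch of this pattern (the search space and trial count below are illustrative), the training call is wrapped in a function, and RayParams.get_tune_resources() tells Tune how many CPUs each trial's actors need:

from ray import tune
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

ray_params = RayParams(num_actors=2, cpus_per_actor=1)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)
    # Inside a Tune session, XGBoost-Ray reports metrics and checkpoints
    # automatically through its built-in callbacks.
    train({
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        **config,
    }, train_set, evals=[(train_set, "train")], verbose_eval=False,
        ray_params=ray_params)

analysis = tune.run(
    train_model,
    config={"eta": tune.loguniform(1e-4, 1e-1),
            "max_depth": tune.randint(2, 8)},
    metric="train-logloss",
    mode="min",
    num_samples=4,
    # Reserve the CPUs each trial's distributed actors will use.
    resources_per_trial=ray_params.get_tune_resources())
print(analysis.best_config)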

Fault tolerance and elastic training

XGBoost‑Ray offers two fault‑handling modes. In non‑elastic mode, if a training actor dies (e.g., node failure), the system waits for sufficient resources, restarts the actor, reloads its shared data, and resumes from the last checkpoint while unaffected actors keep their state, avoiding redundant data loading. In elastic mode, training continues with fewer actors and less data; when the failed actor’s resources become available, its data is reloaded and the actor is reintegrated. Accuracy may drop slightly, but with many actors and large datasets the impact is minimal, and total training time is often reduced.
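
Both modes are configured through RayParams; a sketch with illustrative values:

from xgboost_ray import RayParams

# Non-elastic (default): a failed actor is restarted and training resumes
# from the last checkpoint; surviving actors keep their loaded data.
ray_params = RayParams(
    num_actors=4,
    max_actor_restarts=2,    # how many times a dead actor may be restarted
)

# Elastic: training continues on the surviving actors, and the failed
# actor is reintegrated once its resources become available again.
elastic_ray_params = RayParams(
    num_actors=4,
    elastic_training=True,
    max_failed_actors=1,     # tolerate up to one dead actor at a time
    max_actor_restarts=2,
)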

Distributed data frames and loading

XGBoost‑Ray integrates with Modin and Ray‑native MLDataset abstractions. Location‑aware actor scheduling and data sharding minimize network traffic and keep data evenly distributed across actors. It also supports loading distributed data from Parquet (e.g., via Petastorm) or CSV files stored on local disks or cloud storage providers.
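
For example, a RayDMatrix can point directly at sharded files, and each actor loads only the shards assigned to it (the paths and label column name below are illustrative):

from xgboost_ray import RayDMatrix

# Illustrative: sharded Parquet on cloud storage; shards are assigned to
# actors with locality awareness to minimize network traffic.
dtrain = RayDMatrix(
    ["s3://my-bucket/data/part-000.parquet",
     "s3://my-bucket/data/part-001.parquet"],
    label="target",  # name of the label column inside the files
)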

Code example: from single‑node to distributed training

pip install scikit-learn xgboost_ray

Single‑node training with core XGBoost:

from xgboost import DMatrix, train
from sklearn.datasets import load_breast_cancer

# Load the dataset and wrap it in XGBoost's DMatrix.
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = DMatrix(train_x, train_y)

# Train a binary classifier, tracking log-loss and error on the training set.
evals_result = {}
bst = train({
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}, train_set, evals_result=evals_result, evals=[(train_set, "train")], verbose_eval=False)
bst.save_model("model.xgb")

Convert to distributed training by changing three lines (import, matrix, RayParams):

from xgboost_ray import RayDMatrix, RayParams, train  # change 1: import
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)  # change 2: RayDMatrix instead of DMatrix

evals_result = {}
bst = train({
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}, train_set, evals_result=evals_result, evals=[(train_set, "train")], verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1))  # change 3: Ray resources
bst.save_model("model.xgb")

Distributed inference

Prediction with the saved model can also be distributed by changing three lines:

from xgboost_ray import RayDMatrix, RayParams, predict  # change 1: import
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

data, labels = load_breast_cancer(return_X_y=True)
dpred = RayDMatrix(data, labels)  # change 2: RayDMatrix instead of DMatrix
bst = xgb.Booster(model_file="model.xgb")
predictions = predict(bst, dpred, ray_params=RayParams(num_actors=2))  # change 3: distributed predict
print(predictions)

Scikit‑learn API usage

from xgboost_ray import RayXGBClassifier, RayParams
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = RayXGBClassifier(n_jobs=4)  # number of distributed actors
clf.fit(X, y)
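
Here n_jobs maps to the number of distributed actors, and prediction through the scikit‑learn API is distributed as well:

pred = clf.predict(X)  # distributed prediction
print(pred[:10])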

Conclusion

XGBoost‑Ray delivers distributed XGBoost performance comparable to XGBoost‑Spark and XGBoost‑Dask, with added advantages such as full GPU support, advanced fault‑tolerance, and out‑of‑the‑box integration with Ray Tune for hyper‑parameter optimization.
