How to Scale XGBoost with Ray for Distributed Multi‑GPU Training
XGBoost‑Ray provides a fault‑tolerant, multi‑node, multi‑GPU backend for XGBoost. It integrates with Ray Tune, supports distributed data loading, and needs only three code changes to move from single‑node training to scalable training and inference on large clusters.
XGBoost‑Ray is a distributed backend for XGBoost that supports multi‑node and multi‑GPU training, advanced fault‑tolerance, distributed data loading, and seamless integration with the hyper‑parameter optimization library Ray Tune, while remaining fully compatible with the core XGBoost API.
Multi‑node and multi‑GPU training
On a Ray cluster, XGBoost‑Ray instantiates a configurable number of trainer actors; each actor trains on its own data partition (data‑parallel training). Gradient aggregation uses a tree‑based all‑reduce. Multi‑GPU training uses NCCL2 for cross‑device communication, and inter‑node communication uses Rabit.
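To make the tree‑based all‑reduce concrete, here is a minimal pure‑Python sketch. It is not Rabit's or NCCL's implementation; the function name `tree_allreduce` and the binary‑tree layout (worker i has children 2i+1 and 2i+2) are assumptions chosen for illustration. Each worker's list stands in for the gradient statistics it computed on its own partition.

```python
from typing import List


def tree_allreduce(worker_values: List[List[float]]) -> List[List[float]]:
    """Sum equal-length vectors across workers using a binary tree.

    Reduce phase: each worker adds its children's partial sums (bottom-up).
    Broadcast phase: the root's total is copied back to every worker.
    """
    n = len(worker_values)
    partial = [list(v) for v in worker_values]  # copy so inputs stay untouched

    # Reduce: visit workers from last to first, so every child already holds
    # the sum of its own subtree before its parent reads it.
    for i in range(n - 1, -1, -1):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                partial[i] = [a + b for a, b in zip(partial[i], partial[child])]

    # Broadcast: worker 0 (the root) now holds the global sum.
    total = partial[0]
    return [list(total) for _ in range(n)]


# Four hypothetical workers, each holding two gradient statistics.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(tree_allreduce(grads)[0])  # [16.0, 20.0]
```

After the call, every worker holds the same global sum, which is what lets each actor build an identical tree from its local partition.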
Distributed hyper‑parameter search
XGBoost‑Ray integrates with Ray Tune by automatically creating callbacks that report training status, checkpoint models, and allocate appropriate resources for each trial based on the distributed training configuration. This allows multiple distributed training jobs to run concurrently, each launching distributed training actors across the Ray cluster.
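The resource allocation per trial comes down to simple arithmetic, sketched below. This is not Ray Tune's scheduler: the helper `max_concurrent_trials` and the `driver_cpus` parameter are hypothetical, and the calculation ignores how resources are spread across individual nodes.

```python
def max_concurrent_trials(
    cluster_cpus: int,
    cluster_gpus: int,
    num_actors: int,
    cpus_per_actor: int,
    gpus_per_actor: int = 0,
    driver_cpus: int = 0,  # assumption: extra CPUs reserved per trial driver
) -> int:
    """Estimate how many distributed trials fit on the cluster at once."""
    cpus_per_trial = num_actors * cpus_per_actor + driver_cpus
    gpus_per_trial = num_actors * gpus_per_actor
    if cpus_per_trial <= 0:
        raise ValueError("each trial must request at least one CPU")

    limit = cluster_cpus // cpus_per_trial
    if gpus_per_trial > 0:
        # GPUs become the bottleneck when they run out before CPUs do.
        limit = min(limit, cluster_gpus // gpus_per_trial)
    return limit


# 32 CPUs and 4 GPUs; each trial launches 4 actors with 2 CPUs each.
print(max_concurrent_trials(32, 4, num_actors=4, cpus_per_actor=2))  # 4
print(max_concurrent_trials(32, 4, num_actors=4, cpus_per_actor=2,
                            gpus_per_actor=1))  # 1
```

The second call shows why GPU trials often run one at a time: four actors with one GPU each consume the whole cluster's GPUs.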
Fault tolerance and elastic training
XGBoost‑Ray offers two fault‑handling modes. In non‑elastic mode, if a training actor dies (e.g., node failure), the system waits for sufficient resources, restarts the actor, reloads its shared data, and resumes from the last checkpoint while unaffected actors keep their state, avoiding redundant data loading. In elastic mode, training continues with fewer actors and less data; when the failed actor’s resources become available, its data is reloaded and the actor is reintegrated. Accuracy may drop slightly, but with many actors and large datasets the impact is minimal, and total training time is often reduced.
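The difference between the two modes can be sketched as a small decision function. This models only the behaviour described above and is not the library's code: `RecoveryPlan`, `plan_recovery`, and all parameter names are hypothetical, and rewinding to the last checkpoint in elastic mode is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    waits_for_restart: bool  # True if training pauses until actors return
    active_actors: int       # actors training right after the failure
    rows_in_use: int         # rows visible to those actors
    resume_round: int        # boosting round at which training continues


def plan_recovery(num_actors: int, rows_per_actor: int, failed_actors: int,
                  failure_round: int, checkpoint_every: int,
                  elastic: bool) -> RecoveryPlan:
    # Last boosting round for which a checkpoint was written.
    last_checkpoint = (failure_round // checkpoint_every) * checkpoint_every
    if elastic:
        # Elastic: keep going with the survivors and their data only.
        # Assumption: this sketch also rewinds to the last checkpoint here.
        alive = num_actors - failed_actors
        return RecoveryPlan(False, alive, alive * rows_per_actor, last_checkpoint)
    # Non-elastic: wait for resources, restart the failed actors, reload
    # their data, and resume with the full actor set.
    return RecoveryPlan(True, num_actors, num_actors * rows_per_actor,
                        last_checkpoint)


# 8 actors with 1,000 rows each; one actor dies at round 47, checkpoints every 5.
print(plan_recovery(8, 1000, 1, 47, 5, elastic=False))
print(plan_recovery(8, 1000, 1, 47, 5, elastic=True))
```

In this example the elastic run continues on 7,000 of 8,000 rows, which illustrates why the accuracy impact shrinks as the number of actors grows. In the library, these modes appear to be selected through `RayParams` options such as `elastic_training`, `max_failed_actors`, and `max_actor_restarts`; verify the names against the installed version.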
Distributed data frames and loading
XGBoost‑Ray integrates with Modin and Ray‑native MLDataset abstractions. Location‑aware actor scheduling and data sharding minimize network traffic and keep data evenly distributed across actors. It also supports loading distributed data from Parquet (e.g., via Petastorm) or CSV files stored on local disks or cloud storage providers.
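As an illustration of even data distribution, the following sketch shows two common ways to assign row indices to actors. It is not XGBoost‑Ray's sharding code, and the function names are hypothetical. Both strategies keep shard sizes within one row of each other.

```python
from typing import List


def shard_interleaved(num_rows: int, num_actors: int) -> List[List[int]]:
    """Round-robin: actor k gets rows k, k + num_actors, k + 2*num_actors, ..."""
    return [list(range(rank, num_rows, num_actors)) for rank in range(num_actors)]


def shard_contiguous(num_rows: int, num_actors: int) -> List[List[int]]:
    """Contiguous blocks: the first `extra` actors receive one additional row."""
    base, extra = divmod(num_rows, num_actors)
    shards, start = [], 0
    for rank in range(num_actors):
        size = base + (1 if rank < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


print(shard_interleaved(7, 3))  # [[0, 3, 6], [1, 4], [2, 5]]
print(shard_contiguous(7, 3))   # [[0, 1, 2], [3, 4], [5, 6]]
```

Contiguous blocks tend to suit data that is already partitioned on disk, since each actor can read whole files or partitions. Interleaving spreads rows evenly when the input is sorted, for example by label or time.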
Code example: from single‑node to distributed training
pip install scikit-learn xgboost_ray
Single‑node training with core XGBoost:
from xgboost import DMatrix, train
from sklearn.datasets import load_breast_cancer
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = DMatrix(train_x, train_y)
evals_result = {}
bst = train({
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}, train_set, evals_result=evals_result, evals=[(train_set, "train")], verbose_eval=False)
bst.save_model("model.xgb")
Convert to distributed training by changing three lines (import, matrix, RayParams):
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)
evals_result = {}
bst = train({
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}, train_set, evals_result=evals_result, evals=[(train_set, "train")], verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1))
bst.save_model("model.xgb")
Distributed inference
Prediction with the saved model can also be distributed by changing three lines:
from xgboost_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import xgboost as xgb
data, labels = load_breast_cancer(return_X_y=True)
dpred = RayDMatrix(data, labels)
bst = xgb.Booster(model_file="model.xgb")
predictions = predict(bst, dpred, ray_params=RayParams(num_actors=2))
print(predictions)
Scikit‑learn API usage
from xgboost_ray import RayXGBClassifier, RayParams
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
clf = RayXGBClassifier(n_jobs=4) # number of distributed actors
clf.fit(X, y)
Conclusion
XGBoost‑Ray delivers distributed XGBoost performance comparable to XGBoost‑Spark and XGBoost‑Dask, with added advantages such as full GPU support, advanced fault‑tolerance, and out‑of‑the‑box integration with Ray Tune for hyper‑parameter optimization.
Code DAO
We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
