Ray-native XGBoost Training Platform: Architecture, Performance, and Technical Challenges
Didi’s new Ray‑native XGBoost training platform replaces the fault‑prone Spark solution with a fully Pythonic, fault‑tolerant architecture that leverages Ray’s autoscaling and gang‑scheduling, delivering 2–6× speedups, reduced failure rates, efficient sparse‑vector handling, scalable hyper‑parameter search, and improved resource utilization for large‑scale machine‑learning workloads.
As one of the workhorse model families in machine learning, XGBoost plays a crucial role in many of Didi's strategy-algorithm scenarios. Ensuring and continuously improving the stability of XGBoost offline training and online inference has been a key focus of the machine learning platform. However, the existing XGBoost-On-Spark solution suffers from limited fault tolerance, low training efficiency on large-scale data, and poor integration with the Python-centric AI ecosystem.
The main issues of the Spark‑based solution are:
Missing task‑level fault tolerance; failures are frequent under high cluster load.
Performance tuning at billion-sample scale involves many interacting parameters, making troubleshooting difficult.
Implementation in the JVM ecosystem makes integration with mainstream AI libraries (TensorFlow, PyTorch) cumbersome; most algorithm engineers work in Python.
Stuck on XGBoost 0.82 while the community has moved to 2.0, preventing use of new features and optimizations.
Insufficient hyper‑parameter search capabilities; existing Spark‑based pipelines cannot easily incorporate advanced AutoML methods.
After evaluating various MLOps solutions, the team chose a Ray‑native approach. Ray provides native support for XGBoost, LightGBM, TensorFlow, PyTorch, and offers built‑in fault tolerance, a rich AI ecosystem, and a fully Pythonic interface, lowering the entry barrier for algorithm engineers.
Technical architecture (see the original diagram): the server parses training task parameters, creates a RayJob instance, and watches its status. Each training task launches an isolated Ray cluster with custom containers, achieving pod‑level isolation. Resource scheduling uses Volcano with a gang‑scheduling policy combined with Ray’s autoscaler to request resources dynamically while avoiding deadlocks. Training logs are written to a host‑mounted volume and streamed via a Log Server Daemon.
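The per-task flow above can be sketched as the custom resource the server might submit. This is a hypothetical sketch: the field names follow the KubeRay RayJob CRD, but the entrypoint, image, queue name, and mount paths are illustrative stand-ins, not Didi's actual configuration.

```python
def build_rayjob(task_id: str, image: str, workers: int) -> dict:
    """Build a RayJob spec that gives each training task its own
    isolated Ray cluster (pod-level isolation via custom containers)."""
    pod_template = {
        "metadata": {
            # Hand the whole pod group to Volcano for gang scheduling
            # (annotation name is illustrative).
            "annotations": {"scheduling.volcano.sh/queue-name": "training"},
        },
        "spec": {
            "containers": [{
                "name": "ray-node",
                "image": image,  # custom container per task
                "volumeMounts": [
                    # Logs land on a host-mounted volume and are streamed
                    # by the Log Server Daemon on each node.
                    {"name": "train-logs", "mountPath": "/var/log/train"},
                ],
            }],
        },
    }
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": f"xgb-train-{task_id}"},
        "spec": {
            "entrypoint": "python train_xgboost.py",  # illustrative
            "rayClusterSpec": {
                "headGroupSpec": {"template": pod_template},
                "workerGroupSpecs": [{
                    "replicas": workers,
                    "template": pod_template,
                }],
            },
        },
    }

spec = build_rayjob("job-001", "xgb-train:latest", workers=4)
```

The server would create this object via the Kubernetes API and then watch its status to track the training task's lifecycle.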
Performance gains: compared with the original XGBoost-On-Spark, Ray-based training achieves a 2–6× speedup across data sizes ranging from 1 M to 242 M samples, and reduces the failure rate from 4.6% to 1.6% under stress tests.
Fault-tolerance mechanisms:
Non‑elastic training: failed actors are rescheduled and training resumes from the latest checkpoint.
Elastic training: remaining actors continue training while failed actors are restarted; once recovered, they re‑join the job using the latest checkpoint.
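The non-elastic path can be illustrated with a minimal, framework-free sketch: a driver loop that resumes from the latest checkpoint after a failure instead of restarting from round 0. The in-memory checkpoint store and simulated failure are stand-ins for Ray Train's actual checkpointing mechanics.

```python
class CheckpointStore:
    """In-memory stand-in for persistent checkpoint storage."""
    def __init__(self):
        self.latest = None  # (round_index, model_state)

    def save(self, round_idx, state):
        self.latest = (round_idx, list(state))

def train_with_recovery(num_rounds, store, fail_at=None):
    """Run boosting rounds; after a simulated actor failure, a re-run
    resumes from the latest checkpoint rather than from scratch."""
    start, state = 0, []
    if store.latest is not None:
        start = store.latest[0] + 1
        state = list(store.latest[1])
    for r in range(start, num_rounds):
        if r == fail_at:
            raise RuntimeError(f"actor failed at round {r}")
        state.append(f"tree-{r}")   # one boosting round adds one tree
        store.save(r, state)
    return state

store = CheckpointStore()
try:
    train_with_recovery(10, store, fail_at=6)   # fails mid-training
except RuntimeError:
    pass
model = train_with_recovery(10, store)          # resumes at round 6
```

In the elastic case, the same checkpoint is what a restarted actor uses to re-join its still-running peers.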
Key code snippets illustrating the implementation (the custom column types used to read Arrow data directly into a DMatrix):
struct CompositeSubColumns {
  // Arrow sub-arrays of the Spark sparse-vector struct
  // (type, size, indices, values).
  std::shared_ptr<arrow::Array> type;
  std::shared_ptr<arrow::Array> size;
  std::shared_ptr<arrow::Array> indices;
  std::shared_ptr<arrow::Array> values;
};

template <typename T>
class ListColumn : public Column {
  static constexpr float kNaN = std::numeric_limits<float>::quiet_NaN();

 public:
  ListColumn(size_t idx, size_t length, size_t null_count,
             const uint8_t* bitmap, const T* data, float missing,
             const uint32_t* offset)
      : Column{idx, length, null_count, bitmap},
        data_{data}, missing_{missing}, offset_{offset} {}

  // Emit one row's elements as COO tuples straight from the Arrow buffers.
  std::vector<COOTuple> GetElements(size_t row_idx) const override {
    if (IsValidElement(row_idx)) {
      auto slot_idx = offset_[row_idx];
      auto slot_length = offset_[row_idx + 1] - slot_idx;
      std::vector<COOTuple> fv;
      fv.reserve(slot_length);
      for (size_t i = 0; i < slot_length; ++i) {
        float val = static_cast<float>(data_[slot_idx + i]);
        fv.emplace_back(COOTuple{row_idx, i, val});
      }
      return fv;
    }
    return {};
  }

 private:
  const T* data_;
  float missing_;
  const uint32_t* offset_;
};

Hyper-parameter search is handled by Ray Tune, which integrates seamlessly with the Ray cluster. It supports Bayesian optimization and genetic algorithms, and is compatible with external frameworks such as Optuna and FLAML, enabling efficient large-scale AutoML.
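To illustrate the kind of trial scheduling Ray Tune automates, here is a minimal, framework-free random-search sketch. The search space and objective are toy stand-ins, not the platform's actual tuning code; Ray Tune adds distributed execution, early stopping, and pluggable search algorithms on top of this basic loop.

```python
import random

# Toy search space: each hyper-parameter maps to a sampler.
search_space = {
    "eta": lambda: random.uniform(0.01, 0.3),
    "max_depth": lambda: random.randint(3, 10),
    "subsample": lambda: random.uniform(0.5, 1.0),
}

def objective(config):
    """Stand-in for a full training run returning a validation loss."""
    return (config["eta"] - 0.1) ** 2 + (config["max_depth"] - 6) ** 2 * 0.01

def random_search(num_trials, seed=0):
    """Sample configs, evaluate each, keep the best."""
    random.seed(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        config = {name: sample() for name, sample in search_space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

best, loss = random_search(50)
```

Bayesian optimization replaces the blind sampling with a surrogate model of the objective, which is where external libraries like Optuna and FLAML plug in.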
Technical challenges and solutions:
Spark sparse vector support: To avoid costly conversions (PyArrow → Pandas → scipy.sparse → DMatrix), the team added direct Arrow-to-DMatrix handling for high-dimensional sparse features, defining custom column types (CompositeSubColumns, ListColumn, FixedListColumn).
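The ListColumn logic can be mirrored in plain Python: given an Arrow-style list column (an offsets buffer plus a flat values buffer), emit (row, position, value) COO triples without detouring through Pandas or scipy. The flat-list layout here is a simplified stand-in for the real Arrow buffers.

```python
def list_column_to_coo(offsets, values):
    """Convert an Arrow-style list column (offsets + flat values buffer)
    into COO triples (row_idx, slot_idx, value), mirroring
    ListColumn::GetElements in the C++ snippet."""
    triples = []
    for row_idx in range(len(offsets) - 1):
        start, end = offsets[row_idx], offsets[row_idx + 1]
        for i, val in enumerate(values[start:end]):
            triples.append((row_idx, i, float(val)))
    return triples

# Two rows: [1.5, 2.0] and [3.25]
coo = list_column_to_coo([0, 2, 3], [1.5, 2.0, 3.25])
# → [(0, 0, 1.5), (0, 1, 2.0), (1, 0, 3.25)]
```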
CPU multi-core utilization: The original DMatrix constructor ignored the nthread parameter, preventing multi-core usage. The fix sets ctx_.nthread = nthread and adds an OpenMP pragma for parallel import.
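The effect of the fix can be shown schematically in Python: a column-import routine that actually honors its nthread argument, using a thread pool as a stand-in for the OpenMP pragma added in the C++ fix (the parsing work itself is a toy placeholder).

```python
from concurrent.futures import ThreadPoolExecutor

def import_columns(columns, nthread):
    """Parse each column; honoring nthread lets the import run on
    multiple cores instead of serially (the original bug was that
    nthread was accepted but never applied)."""
    def parse(col):
        return [float(v) for v in col]  # stand-in for real parsing work
    with ThreadPoolExecutor(max_workers=nthread) as pool:
        return list(pool.map(parse, columns))

parsed = import_columns([[1, 2], [3, 4], [5, 6]], nthread=4)
# → [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```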
OOM due to duplicate DMatrix loading: The evaluation dataset previously re-loaded the training set, causing memory blow-up. The fix checks for duplication and re-uses the existing DMatrix.
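The dedup check can be sketched as a small cache keyed on dataset identity, so that when the eval set is the training set, the already-built DMatrix is reused instead of loaded a second time. The build function and identity key are illustrative stand-ins for the real loading path.

```python
_dmatrix_cache = {}
build_count = 0

def build_dmatrix(dataset_id, data):
    """Stand-in for the expensive, memory-hungry DMatrix construction."""
    global build_count
    build_count += 1
    return {"id": dataset_id, "rows": len(data)}

def get_dmatrix(dataset_id, data):
    """Reuse an existing DMatrix when the same dataset (e.g. the training
    set passed again as an eval set) is requested twice."""
    if dataset_id not in _dmatrix_cache:
        _dmatrix_cache[dataset_id] = build_dmatrix(dataset_id, data)
    return _dmatrix_cache[dataset_id]

train = get_dmatrix("train", [1, 2, 3])
evals = [("train", get_dmatrix("train", [1, 2, 3]))]  # no second load
```

With the cache in place the training set is materialized once, halving peak memory for the common "evaluate on the training set" configuration.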
Resource contention and cluster deadlock: Ray's autoscaler requests resources one-by-one, violating Volcano's gang-scheduling "all-or-nothing" rule. The team modified the autoscaler to request resources in batches, aligning with pod-group requirements and preventing hangs.
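The mismatch and its fix can be illustrated with a toy scheduler model: under all-or-nothing semantics, one-by-one requests can leave a job holding a partial, useless allocation, while a single batched request for the whole worker group is either fully granted or cleanly rejected. This simplified model is an illustration, not the actual autoscaler patch.

```python
class GangScheduler:
    """Toy model of Volcano-style gang scheduling: a request is granted
    only if the entire requested group fits; otherwise nothing is given."""
    def __init__(self, capacity):
        self.capacity = capacity

    def request(self, pods):
        if pods <= self.capacity:
            self.capacity -= pods
            return True
        return False

def request_one_by_one(scheduler, needed):
    """Original autoscaler behavior: each pod is its own request, so a
    job can end up with a partial allocation that blocks everyone."""
    granted = 0
    for _ in range(needed):
        if scheduler.request(1):
            granted += 1
    return granted

def request_batched(scheduler, needed):
    """The fix: ask for the whole worker group at once, matching the
    pod group's all-or-nothing semantics."""
    return needed if scheduler.request(needed) else 0

partial = request_one_by_one(GangScheduler(capacity=8), needed=10)  # 8: stuck partial
clean = request_batched(GangScheduler(capacity=8), needed=10)       # 0: cleanly rejected
```

The batched request frees the scheduler to run some other job to completion instead of deadlocking on two half-allocated ones.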
Calibration efficiency: Replacing an expensive groupby.mean over the Ray Dataset with a sort-plus-unique approach implemented in Cython reduced calibration time on a 100 M record dataset from 19 minutes to 2 minutes.
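The optimization can be demonstrated in miniature: per-key mean labels (the core of calibration) computed by sorting once and scanning runs of equal keys, rather than hashing into groups. The real version runs as Cython over dataset blocks; this is a plain-Python illustration of the same sort-plus-unique idea.

```python
def groupwise_mean_sorted(pairs):
    """Per-key mean via one sort plus a single scan over runs of equal
    keys. pairs: iterable of (key, value)."""
    ordered = sorted(pairs)          # one O(n log n) sort
    result = {}
    i, n = 0, len(ordered)
    while i < n:                     # walk each run of identical keys
        key = ordered[i][0]
        total = count = 0
        while i < n and ordered[i][0] == key:
            total += ordered[i][1]
            count += 1
            i += 1
        result[key] = total / count
    return result

# (score, label) pairs; calibration needs the mean label per score bucket.
scores = [(0.9, 1), (0.1, 0), (0.9, 0), (0.1, 0), (0.5, 1)]
means = groupwise_mean_sorted(scores)
# → {0.1: 0.0, 0.5: 1.0, 0.9: 0.5}
```

The sorted scan is cache-friendly and trivially vectorizable, which is what makes the Cython version so much faster than a hash-based groupby at the 100 M row scale.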
Conclusion and outlook: While the Ray ecosystem offers rich capabilities, thorough testing and iterative optimization were required to meet Didi's production requirements. The Ray-based XGBoost platform has been successfully deployed, delivering significant performance and stability improvements. Future work will focus on further stabilizing the system, expanding Ray-native components, and enhancing usability for algorithm engineers.
Didi Tech
Official Didi technology account