Ray-native XGBoost Training Platform: Architecture, Performance, and Technical Challenges
Didi’s new Ray‑native XGBoost training platform replaces the fault‑prone Spark solution with a fully Pythonic, fault‑tolerant architecture that leverages Ray’s autoscaling and gang‑scheduling, delivering 2–6× speedups, reduced failure rates, efficient sparse‑vector handling, scalable hyper‑parameter search, and improved resource utilization for large‑scale machine‑learning workloads.
As one of the workhorse model families in machine learning, XGBoost plays a crucial role in many of Didi's strategy-algorithm scenarios. Ensuring and continuously improving the stability of XGBoost offline training and online inference has been a key focus of the machine learning platform. However, the existing XGBoost-On-Spark solution suffers from limited fault tolerance, low training efficiency on large-scale data, and poor integration with the Python-centric AI ecosystem.
The main issues of the Spark‑based solution are:
Missing task‑level fault tolerance; failures are frequent under high cluster load.
Performance tuning at billion-sample scale involves many interacting parameters, making troubleshooting difficult.
Implementation in the JVM ecosystem makes integration with mainstream AI libraries (TensorFlow, PyTorch) cumbersome; most algorithm engineers work in Python.
Stuck on XGBoost 0.82 while the community has moved to 2.0, preventing use of new features and optimizations.
Insufficient hyper‑parameter search capabilities; existing Spark‑based pipelines cannot easily incorporate advanced AutoML methods.
After evaluating various MLOps solutions, the team chose a Ray‑native approach. Ray provides native support for XGBoost, LightGBM, TensorFlow, PyTorch, and offers built‑in fault tolerance, a rich AI ecosystem, and a fully Pythonic interface, lowering the entry barrier for algorithm engineers.
Technical architecture (see the original diagram): the server parses training task parameters, creates a RayJob instance, and watches its status. Each training task launches an isolated Ray cluster with custom containers, achieving pod‑level isolation. Resource scheduling uses Volcano with a gang‑scheduling policy combined with Ray’s autoscaler to request resources dynamically while avoiding deadlocks. Training logs are written to a host‑mounted volume and streamed via a Log Server Daemon.
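The per-task flow above can be sketched as the custom resource the server might submit. This is a hypothetical sketch: the field names follow the KubeRay RayJob CRD, but the entrypoint, image, queue name, and mount paths are illustrative stand-ins, not Didi's actual configuration.

```python
def build_rayjob(task_id: str, image: str, workers: int) -> dict:
    """Build a RayJob spec that gives each training task its own
    isolated Ray cluster (pod-level isolation via custom containers)."""
    pod_template = {
        "metadata": {
            # Hand the whole pod group to Volcano for gang scheduling
            # (annotation name is illustrative).
            "annotations": {"scheduling.volcano.sh/queue-name": "training"},
        },
        "spec": {
            "containers": [{
                "name": "ray-node",
                "image": image,  # custom container per task
                "volumeMounts": [
                    # Logs land on a host-mounted volume and are streamed
                    # by the Log Server Daemon on each node.
                    {"name": "train-logs", "mountPath": "/var/log/train"},
                ],
            }],
        },
    }
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": f"xgb-train-{task_id}"},
        "spec": {
            "entrypoint": "python train_xgboost.py",  # illustrative
            "rayClusterSpec": {
                "headGroupSpec": {"template": pod_template},
                "workerGroupSpecs": [{
                    "replicas": workers,
                    "template": pod_template,
                }],
            },
        },
    }

spec = build_rayjob("job-001", "xgb-train:latest", workers=4)
```

The server would create this object via the Kubernetes API and then watch its status to track the training task's lifecycle.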
Performance gains: compared with the original XGBoost-On-Spark, Ray-based training achieves a 2–6× speedup across data sizes ranging from 1 M to 242 M samples, and reduces the failure rate from 4.6% to 1.6% under stress tests.
Fault-tolerance mechanisms:
Non‑elastic training: failed actors are rescheduled and training resumes from the latest checkpoint.
Elastic training: remaining actors continue training while failed actors are restarted; once recovered, they re‑join the job using the latest checkpoint.
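The non-elastic path can be illustrated with a minimal, framework-free sketch: a driver loop that resumes from the latest checkpoint after a failure instead of restarting from round 0. The in-memory checkpoint store and simulated failure are stand-ins for Ray Train's actual checkpointing mechanics.

```python
class CheckpointStore:
    """In-memory stand-in for persistent checkpoint storage."""
    def __init__(self):
        self.latest = None  # (round_index, model_state)

    def save(self, round_idx, state):
        self.latest = (round_idx, list(state))

def train_with_recovery(num_rounds, store, fail_at=None):
    """Run boosting rounds; after a simulated actor failure, a re-run
    resumes from the latest checkpoint rather than from scratch."""
    start, state = 0, []
    if store.latest is not None:
        start = store.latest[0] + 1
        state = list(store.latest[1])
    for r in range(start, num_rounds):
        if r == fail_at:
            raise RuntimeError(f"actor failed at round {r}")
        state.append(f"tree-{r}")   # one boosting round adds one tree
        store.save(r, state)
    return state

store = CheckpointStore()
try:
    train_with_recovery(10, store, fail_at=6)   # fails mid-training
except RuntimeError:
    pass
model = train_with_recovery(10, store)          # resumes at round 6
```

In the elastic case, the same checkpoint is what a restarted actor uses to re-join its still-running peers.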
Key code snippets illustrating the implementation (the custom column types used to read Arrow data directly into a DMatrix):
struct CompositeSubColumns {
  // Arrow sub-arrays of the Spark sparse-vector struct
  // (type, size, indices, values).
  std::shared_ptr<arrow::Array> type;
  std::shared_ptr<arrow::Array> size;
  std::shared_ptr<arrow::Array> indices;
  std::shared_ptr<arrow::Array> values;
};

template <typename T>
class ListColumn : public Column {
  static constexpr float kNaN = std::numeric_limits<float>::quiet_NaN();

 public:
  ListColumn(size_t idx, size_t length, size_t null_count,
             const uint8_t* bitmap, const T* data, float missing,
             const uint32_t* offset)
      : Column{idx, length, null_count, bitmap},
        data_{data}, missing_{missing}, offset_{offset} {}

  // Emit one row's elements as COO tuples straight from the Arrow buffers.
  std::vector<COOTuple> GetElements(size_t row_idx) const override {
    if (IsValidElement(row_idx)) {
      auto slot_idx = offset_[row_idx];
      auto slot_length = offset_[row_idx + 1] - slot_idx;
      std::vector<COOTuple> fv;
      fv.reserve(slot_length);
      for (size_t i = 0; i < slot_length; ++i) {
        float val = static_cast<float>(data_[slot_idx + i]);
        fv.emplace_back(COOTuple{row_idx, i, val});
      }
      return fv;
    }
    return {};
  }

 private:
  const T* data_;
  float missing_;
  const uint32_t* offset_;
};

Hyper-parameter search is handled by Ray Tune, which integrates seamlessly with the Ray cluster. It supports Bayesian optimization and genetic algorithms, and is compatible with external frameworks such as Optuna and FLAML, enabling efficient large-scale AutoML.
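To illustrate the kind of trial scheduling Ray Tune automates, here is a minimal, framework-free random-search sketch. The search space and objective are toy stand-ins, not the platform's actual tuning code; Ray Tune adds distributed execution, early stopping, and pluggable search algorithms on top of this basic loop.

```python
import random

# Toy search space: each hyper-parameter maps to a sampler.
search_space = {
    "eta": lambda: random.uniform(0.01, 0.3),
    "max_depth": lambda: random.randint(3, 10),
    "subsample": lambda: random.uniform(0.5, 1.0),
}

def objective(config):
    """Stand-in for a full training run returning a validation loss."""
    return (config["eta"] - 0.1) ** 2 + (config["max_depth"] - 6) ** 2 * 0.01

def random_search(num_trials, seed=0):
    """Sample configs, evaluate each, keep the best."""
    random.seed(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        config = {name: sample() for name, sample in search_space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

best, loss = random_search(50)
```

Bayesian optimization replaces the blind sampling with a surrogate model of the objective, which is where external libraries like Optuna and FLAML plug in.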
Technical challenges and solutions:
Spark sparse vector support: To avoid costly conversions (PyArrow → Pandas → scipy.sparse → DMatrix), the team added direct Arrow-to-DMatrix handling for high-dimensional sparse features, defining custom column types (CompositeSubColumns, ListColumn, FixedListColumn).
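The ListColumn logic can be mirrored in plain Python: given an Arrow-style list column (an offsets buffer plus a flat values buffer), emit (row, position, value) COO triples without detouring through Pandas or scipy. The flat-list layout here is a simplified stand-in for the real Arrow buffers.

```python
def list_column_to_coo(offsets, values):
    """Convert an Arrow-style list column (offsets + flat values buffer)
    into COO triples (row_idx, slot_idx, value), mirroring
    ListColumn::GetElements in the C++ snippet."""
    triples = []
    for row_idx in range(len(offsets) - 1):
        start, end = offsets[row_idx], offsets[row_idx + 1]
        for i, val in enumerate(values[start:end]):
            triples.append((row_idx, i, float(val)))
    return triples

# Two rows: [1.5, 2.0] and [3.25]
coo = list_column_to_coo([0, 2, 3], [1.5, 2.0, 3.25])
# → [(0, 0, 1.5), (0, 1, 2.0), (1, 0, 3.25)]
```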
CPU multi-core utilization: The original DMatrix constructor ignored the nthread parameter, preventing multi-core usage. The fix sets ctx_.nthread = nthread and adds an OpenMP pragma for parallel import.
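The effect of the fix can be shown schematically in Python: a column-import routine that actually honors its nthread argument, using a thread pool as a stand-in for the OpenMP pragma added in the C++ fix (the parsing work itself is a toy placeholder).

```python
from concurrent.futures import ThreadPoolExecutor

def import_columns(columns, nthread):
    """Parse each column; honoring nthread lets the import run on
    multiple cores instead of serially (the original bug was that
    nthread was accepted but never applied)."""
    def parse(col):
        return [float(v) for v in col]  # stand-in for real parsing work
    with ThreadPoolExecutor(max_workers=nthread) as pool:
        return list(pool.map(parse, columns))

parsed = import_columns([[1, 2], [3, 4], [5, 6]], nthread=4)
# → [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```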
OOM due to duplicate DMatrix loading: The evaluation dataset previously re-loaded the training set, causing memory blow-up. The fix checks for duplication and re-uses the existing DMatrix.
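The dedup check can be sketched as a small cache keyed on dataset identity, so that when the eval set is the training set, the already-built DMatrix is reused instead of loaded a second time. The build function and identity key are illustrative stand-ins for the real loading path.

```python
_dmatrix_cache = {}
build_count = 0

def build_dmatrix(dataset_id, data):
    """Stand-in for the expensive, memory-hungry DMatrix construction."""
    global build_count
    build_count += 1
    return {"id": dataset_id, "rows": len(data)}

def get_dmatrix(dataset_id, data):
    """Reuse an existing DMatrix when the same dataset (e.g. the training
    set passed again as an eval set) is requested twice."""
    if dataset_id not in _dmatrix_cache:
        _dmatrix_cache[dataset_id] = build_dmatrix(dataset_id, data)
    return _dmatrix_cache[dataset_id]

train = get_dmatrix("train", [1, 2, 3])
evals = [("train", get_dmatrix("train", [1, 2, 3]))]  # no second load
```

With the cache in place the training set is materialized once, halving peak memory for the common "evaluate on the training set" configuration.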
Resource contention and cluster deadlock: Ray's autoscaler requests resources one-by-one, violating Volcano's gang-scheduling "all-or-nothing" rule. The team modified the autoscaler to request resources in batches, aligning with pod-group requirements and preventing hangs.
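The mismatch and its fix can be illustrated with a toy scheduler model: under all-or-nothing semantics, one-by-one requests can leave a job holding a partial, useless allocation, while a single batched request for the whole worker group is either fully granted or cleanly rejected. This simplified model is an illustration, not the actual autoscaler patch.

```python
class GangScheduler:
    """Toy model of Volcano-style gang scheduling: a request is granted
    only if the entire requested group fits; otherwise nothing is given."""
    def __init__(self, capacity):
        self.capacity = capacity

    def request(self, pods):
        if pods <= self.capacity:
            self.capacity -= pods
            return True
        return False

def request_one_by_one(scheduler, needed):
    """Original autoscaler behavior: each pod is its own request, so a
    job can end up with a partial allocation that blocks everyone."""
    granted = 0
    for _ in range(needed):
        if scheduler.request(1):
            granted += 1
    return granted

def request_batched(scheduler, needed):
    """The fix: ask for the whole worker group at once, matching the
    pod group's all-or-nothing semantics."""
    return needed if scheduler.request(needed) else 0

partial = request_one_by_one(GangScheduler(capacity=8), needed=10)  # 8: stuck partial
clean = request_batched(GangScheduler(capacity=8), needed=10)       # 0: cleanly rejected
```

The batched request frees the scheduler to run some other job to completion instead of deadlocking on two half-allocated ones.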
Calibration efficiency: Replacing an expensive groupby.mean over the Ray Dataset with a sort-plus-unique approach implemented in Cython reduced calibration time on a 100 M record dataset from 19 minutes to 2 minutes.
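The optimization can be demonstrated in miniature: per-key mean labels (the core of calibration) computed by sorting once and scanning runs of equal keys, rather than hashing into groups. The real version runs as Cython over dataset blocks; this is a plain-Python illustration of the same sort-plus-unique idea.

```python
def groupwise_mean_sorted(pairs):
    """Per-key mean via one sort plus a single scan over runs of equal
    keys. pairs: iterable of (key, value)."""
    ordered = sorted(pairs)          # one O(n log n) sort
    result = {}
    i, n = 0, len(ordered)
    while i < n:                     # walk each run of identical keys
        key = ordered[i][0]
        total = count = 0
        while i < n and ordered[i][0] == key:
            total += ordered[i][1]
            count += 1
            i += 1
        result[key] = total / count
    return result

# (score, label) pairs; calibration needs the mean label per score bucket.
scores = [(0.9, 1), (0.1, 0), (0.9, 0), (0.1, 0), (0.5, 1)]
means = groupwise_mean_sorted(scores)
# → {0.1: 0.0, 0.5: 1.0, 0.9: 0.5}
```

The sorted scan is cache-friendly and trivially vectorizable, which is what makes the Cython version so much faster than a hash-based groupby at the 100 M row scale.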
Conclusion and outlook: While the Ray ecosystem offers rich capabilities, thorough testing and iterative optimization were required to meet Didi's production requirements. The Ray-based XGBoost platform has been successfully deployed, delivering significant performance and stability improvements. Future work will focus on further stabilizing the system, expanding Ray-native components, and enhancing usability for algorithm engineers.
Didi Tech
Official Didi technology account