Inconsistent Predictions in XGBoost on Spark Due to Different Missing Value Handling

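The same XGBoost model produced different predictions under the local Java engine and under Spark. The cause was a difference in missing‑value handling: XGBoost4j's DMatrix treats 0.0f as the default missing value, while XGBoost on Spark defaults to Float.NaN and additionally treats the zeros omitted by Spark's SparseVector as missing. The fix was to pass Float.NaN explicitly as the missing value on the Java side and to convert sparse vectors to dense on the Spark side, so that both engines handle zeros uniformly.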

Meituan Technology Team

Aug 15, 2019

Background: XGBoost is a widely used "killer" algorithm in machine learning competitions and in production. XGBoost on Spark provides a distributed training implementation, but the official version can produce inconsistent predictions because of how its missing‑value handling interacts with Spark's sparse vector representation.

Problem: In Meituan’s internal ML platform, the same XGBoost model and test data produced different results when invoked locally with the Java engine versus the Spark engine. The local Java engine returned 333.67892, while the Spark platform returned 328.1694030761719.

Investigation steps:

Checked whether input field types or precision differed – they were identical.

Verified that hyper‑parameters of XGBoostClassifier/XGBoostRegressor were the same – no special handling was found.

Discovered that XGBoost4j treats 0.0f as the default missing value when constructing a DMatrix, while XGBoost on Spark uses Float.NaN as the default missing value.

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;

double[] input = new double[]{1, 2, 5, 0, 0, 6.666666666666667, 31.14, 29.28, 0, 1.303333, 2.8555, 2.37, 701, 463, 3.989, 3.85, 14400.5, 15.79, 11.45, 0.915, 7.05, 5.5, 0.023333, 0.0365, 0.0275, 0.123333, 0.4645, 0.12, 15.082, 14.48, 0, 31.8425, 29.1, 7.7325, 3, 5.88, 1.08, 0, 0, 0, 32};
float[] testInput = new float[input.length];
for (int i = 0, total = input.length; i < total; i++) {
  testInput[i] = (float) input[i]; // narrowing cast; avoids the deprecated new Double(...)
}
Booster booster = XGBoost.loadModel("${model}");
// One row, 41 columns; with no fourth argument, missing defaults to 0.0f
DMatrix testMat = new DMatrix(testInput, 1, 41);
float[][] predicts = booster.predict(testMat);
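
To make the effect of that default concrete, here is a minimal sketch in plain Java with no XGBoost dependency. It only imitates the outcome of the missing argument by filtering a dense row against a sentinel; the class and method names are hypothetical, and this is not how XGBoost4j is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: shows which entries of a dense row survive a missing-value sentinel. */
final class MissingSentinelDemo {

    /** Returns the indices of the entries that are not equal to the sentinel. */
    static List<Integer> presentIndices(float[] row, float missing) {
        List<Integer> present = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            // NaN never equals itself, so a NaN sentinel needs an explicit isNaN check.
            boolean isMissing = Float.isNaN(missing)
                    ? Float.isNaN(row[i])
                    : row[i] == missing;
            if (!isMissing) {
                present.add(i);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        float[] row = {1f, 0f, 5f, 0f};
        // Sentinel 0.0f: the two zeros are dropped as missing.
        System.out.println(presentIndices(row, 0.0f));      // [0, 2]
        // Sentinel NaN: the zeros are kept as ordinary feature values.
        System.out.println(presentIndices(row, Float.NaN)); // [0, 1, 2, 3]
    }
}
```

The same row therefore reaches the trees with a different set of observed features depending on the sentinel, which is enough to change the path taken at any split on a zero‑valued feature.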

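Spark ML stores feature vectors as either DenseVector or SparseVector, and the choice is made per row. The assemble method of Spark's VectorAssembler, shown next, collects only the non‑zero entries and then calls compressed, which keeps whichever representation needs less storage. SparseVector omits zero entries, and XGBoost on Spark treats those omitted zeros as missing values. Consequently, rows stored as dense use only NaN as missing, while rows stored as sparse treat both NaN and 0 as missing, so the same feature value can be handled differently from one row to the next, leading to inconsistent predictions.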

private[feature] def assemble(vv: Any*): Vector = {
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  var cur = 0
  vv.foreach {
    case v: Double =>
      // 0 is not saved
      if (v != 0.0) { indices += cur; values += v }
      cur += 1
    case vec: Vector =>
      vec.foreachActive { case (i, v) =>
        // 0 is not saved
        if (v != 0.0) { indices += cur + i; values += v }
      }
      cur += vec.size
    case null => throw new SparkException("Values to assemble cannot be null.")
    case o => throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }
  Vectors.sparse(cur, indices.result(), values.result()).compressed
}
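
The compressed call at the end of assemble decides the representation row by row. The following sketch, in plain Java with hypothetical names, shows how that choice can flip with the number of zeros in a row. It assumes the storage heuristic is "sparse when 1.5 * (nnz + 1) < size", which matches my reading of Spark 2.x but should be checked against the Spark version in use.

```java
/** Hypothetical helper mirroring the storage heuristic assumed for Vector.compressed. */
final class CompressedChoiceDemo {

    /** Counts the entries that are not exactly 0.0. */
    static int numNonzeros(double[] row) {
        int nnz = 0;
        for (double v : row) {
            if (v != 0.0) {
                nnz++;
            }
        }
        return nnz;
    }

    /** True when the assumed heuristic would keep the sparse representation. */
    static boolean storedAsSparse(double[] row) {
        int nnz = numNonzeros(row);
        // Assumption: sparse is chosen when 1.5 * (nnz + 1) < size, otherwise dense.
        return 1.5 * (nnz + 1.0) < row.length;
    }

    public static void main(String[] args) {
        double[] mostlyZeros = {3.0, 0, 0, 0, 0, 0, 0, 0, 0, 7.5};
        double[] mostlyValues = {3.0, 0, 1.2, 7.5};
        System.out.println(storedAsSparse(mostlyZeros));  // zeros would be omitted
        System.out.println(storedAsSparse(mostlyValues)); // zeros would be kept explicitly
    }
}
```

Under this assumption, two rows from the same dataset can take different paths: the first loses its zeros and has them read as missing, while the second keeps them as ordinary values.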

Solution:

Explicitly set the missing value to Float.NaN when constructing the Java DMatrix so that both engines treat zeros the same way.

Modify the Spark pipeline to convert SparseVector to dense before feeding it to XGBoost, ensuring a uniform missing‑value definition.

// Java: pass the missing value explicitly when creating the DMatrix
DMatrix testMat = new DMatrix(testInput, 1, 41, Float.NaN);

// Scala (XGBoost on Spark): expand sparse vectors so zeros are explicit values
val values = features match {
  case v: SparseVector => v.toArray.map(_.toFloat)
  case v: DenseVector  => v.values.map(_.toFloat)
}
XGBLabeledPoint(label, null, values, baseMargin = baseMargin, weight = weight)
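
For readers working on the Java side, the densifying step can be sketched as follows. This is an illustration in plain Java of what expanding a sparse vector to an array amounts to; the class name, method name, and the (size, indices, values) parameter layout are hypothetical and are not taken from Spark or XGBoost4j.

```java
/** Hypothetical helper: expands a sparse (size, indices, values) triple to a dense float array. */
final class DensifyDemo {

    static float[] toDenseFloats(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        // Java zero-initializes the array, so every omitted position becomes an explicit 0.0f.
        float[] dense = new float[size];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = (float) values[k];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Sparse form of the row [1.5, 0, 0, 2.0, 0]
        float[] dense = toDenseFloats(5, new int[]{0, 3}, new double[]{1.5, 2.0});
        System.out.println(java.util.Arrays.toString(dense));
    }
}
```

Once every row is dense, zeros are always present as values, and only NaN is read as missing regardless of how Spark chose to store the row.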

After applying these changes, the predictions from the Java engine and Spark engine became identical (328.1694), and the model’s evaluation metrics even improved slightly.

The article shares the debugging process and the fix, hoping to help engineers encountering missing‑value issues with XGBoost on Spark.
