Optimizing Spark‑ML Linear Models with Project Matrix: Background, Progress, and Future Plans
This article introduces Project Matrix, an initiative that re‑examines and restructures Spark‑ML linear models. It covers the background of Spark‑ML usage at JD, the performance‑focused optimizations such as blockification and virtual centering, and the upcoming work to further improve scalability and accuracy.
Sharing guest: Zheng Ruifeng, Senior Algorithm Expert at JD.
Edited by: Joy, 51job. Platform: DataFunTalk.
Background Introduction
With the rapid rise of machine learning and deep learning, Spark‑ML has become popular for its simplicity and is widely adopted in offline batch‑training scenarios. To make it more efficient for production and to serve the community, Project Matrix was created to revisit and reconstruct Spark‑ML linear models.
1. Spark vs Hadoop
Spark can be hundreds of times faster than Hadoop for iterative computations such as model training, because intermediate results stay in memory rather than being written to disk between MapReduce stages.
2. Main JD Applications of Spark‑ML
Feature engineering using SparkSQL combined with Spark‑ML.
Digital marketing: model marketplace and KA audience targeting, predicting potential users for stores based on 30‑day behavior, using multiple GBDT models with linear weighting, achieving billions of predictions daily.
Offline labeling: durable goods segmentation, fraud detection.
Unsupervised: target group mining with k‑Means clustering.
3. Optimization Focus
Four most used linear models were selected for optimization: Logistic Regression, Linear SVM, Linear Regression, and Survival Regression.
Project Matrix Origin
Spark‑ML has seen little change since 2015, and practical usage revealed performance and accuracy issues. To address these, a new open‑source effort, Project Matrix, was launched to reconstruct and optimize linear‑model training; it has gained broad community support.
Optimization Progress
The goal is summed up as "fast and accurate". The work divides into two main techniques.
1. Blockification
Vectorization transforms per‑record processing into batch processing. For Spark 3.1, the training pipeline (initialization, optimizer selection, distributed computation) was re‑engineered: input data are scaled, instance vectors are blockified into roughly 1 MB blocks, and dense linear algebra is handed off to BLAS libraries.
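A minimal NumPy sketch of the idea (illustrative names, not Spark's internal Scala): instead of one dot product per training instance, rows are stacked into a block matrix so that each block's margins come from a single matrix–vector product.

```python
import numpy as np

def margins_per_instance(rows, coef):
    # Baseline: one level-1 BLAS dot product per training instance.
    return [float(np.dot(r, coef)) for r in rows]

def margins_blockified(rows, coef, block_rows=4):
    # Blockified: stack instances into a block and compute all of the
    # block's margins with one level-2 BLAS matrix-vector product (gemv).
    out = []
    for start in range(0, len(rows), block_rows):
        block = np.vstack(rows[start:start + block_rows])
        out.extend(float(m) for m in block @ coef)
    return out

rng = np.random.default_rng(0)
rows = [rng.standard_normal(8) for _ in range(10)]
coef = rng.standard_normal(8)
assert np.allclose(margins_per_instance(rows, coef),
                   margins_blockified(rows, coef))
```

Spark sizes blocks by bytes (about 1 MB); `block_rows=4` here is only to keep the example small.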
BLAS levels: Level 1 vector ops, Level 2 matrix‑vector, Level 3 matrix‑matrix. Spark‑ML now uses native BLAS (OpenBLAS or Intel MKL) for dense data and a Scala implementation for sparse data.
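To make the three BLAS levels concrete, here is a small NumPy illustration; NumPy delegates these operations to whatever native BLAS it is linked against (e.g. OpenBLAS or MKL):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.arange(9.0).reshape(3, 3)
I = np.eye(3)

# Level 1: vector-vector, O(n) work (dot, axpy, nrm2, ...).
dot = x @ y                  # ddot: 1*4 + 2*5 + 3*6 = 32

# Level 2: matrix-vector, O(n^2) work (gemv, ger, ...).
mv = A @ x                   # dgemv

# Level 3: matrix-matrix, O(n^3) work (gemm, ...).
mm = A @ I                   # dgemm; multiplying by I returns A

assert dot == 32.0
assert np.allclose(mv, [8.0, 26.0, 44.0])
assert np.allclose(mm, A)
```

The higher the level, the more work is done per call, which is why blockification (batching rows into matrices) lets the native library amortize overhead and use cache-friendly kernels.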
Performance gains: up to 18× speedup on dense data, 3–6× on sparse data (1 % density).
2. Virtual Centering
Addressing a complaint that Spark‑ML coefficients were less accurate than GLMNET's, the team introduced virtual centering: the input data are scaled without destroying sparsity, and a pre‑computation in the forward pass combined with a gradient adjustment in the backward pass emulates training on standardized data.
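A minimal NumPy sketch of the trick for a least‑squares gradient (function names are illustrative, not Spark's internals): the mean offset `mu @ w` is precomputed once per pass and folded into the margins, and the gradient gets a rank‑one correction afterwards, so the original (possibly sparse) matrix is never centered in place.

```python
import numpy as np

def centered_gradient_explicit(X, y, w, mu):
    # Reference implementation: materialize the centered matrix.
    Xc = X - mu                      # would destroy sparsity in practice
    r = Xc @ w - y
    return Xc.T @ r / len(y)

def centered_gradient_virtual(X, y, w, mu):
    # Virtual centering: X is read as-is, so sparsity is preserved.
    offset = mu @ w                  # precomputed once per pass
    r = X @ w - offset - y           # forward pass with offset folded in
    grad = X.T @ r                   # gradient on the uncentered data
    grad -= mu * r.sum()             # backward-pass correction term
    return grad / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
w = rng.standard_normal(5)
mu = X.mean(axis=0)
assert np.allclose(centered_gradient_explicit(X, y, w, mu),
                   centered_gradient_virtual(X, y, w, mu))
```

The identity behind the correction is (X − 1μᵀ)ᵀ r = Xᵀ r − μ · Σᵢ rᵢ, which is why only the small dense mean vector is ever needed alongside the sparse data.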
Results show model coefficients closer to GLMNET's, reduced divergence, and 20–50% fewer iterations to convergence.
These optimizations have been released in Spark 3.1 and 3.2.
Future Planning Summary
BLAS optimization in collaboration with the OpenJDK team, using JDK 16 Vector API and targeting sparse data.
Adaptive block size based on data sparsity and dimensionality.
Blockify KMeans.
Further model training improvements such as better initialization and early stopping.
Refactor tree models.
Thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.