Optimizing Spark‑ML Linear Models with Project Matrix: Background, Progress, and Future Plans
This article introduces Project Matrix, an initiative that re‑examines and restructures Spark‑ML linear models. It covers the background of Spark‑ML usage at JD, the performance‑focused optimizations such as blockification and virtual centering, and the upcoming work to further improve scalability and accuracy.
Sharing guest: Zheng Ruifeng, Senior Algorithm Expert at JD.
Edited by: Joy, 51job. Platform: DataFunTalk.
Background Introduction
With the rapid rise of machine learning and deep learning, Spark‑ML has become popular for its simplicity and is widely adopted in offline batch‑training scenarios. To make it more efficient for production and to serve the community, Project Matrix was created to revisit and reconstruct Spark‑ML linear models.
1. Spark vs Hadoop
Spark can be hundreds of times faster than Hadoop for iterative computations such as model training, because intermediate results stay in memory rather than being written to disk between MapReduce stages.
2. Main JD Applications of Spark‑ML
Feature engineering using SparkSQL combined with Spark‑ML.
Digital marketing: model marketplace and KA audience targeting, predicting potential users for stores based on 30‑day behavior, using multiple GBDT models with linear weighting, achieving billions of predictions daily.
Offline labeling: durable goods segmentation, fraud detection.
Unsupervised: target group mining with k‑Means clustering.
3. Optimization Focus
Four most used linear models were selected for optimization: Logistic Regression, Linear SVM, Linear Regression, and Survival Regression.
Project Matrix Origin
Spark‑ML has seen little change since 2015, and practical usage revealed performance and accuracy issues. To address these, a new open‑source effort, Project Matrix, was launched to reconstruct and optimize linear‑model training; it has gained broad community support.
Optimization Progress
The goal is summed up as "fast and accurate". The work divides into two main techniques.
1. Blockification
Vectorization transforms per‑record processing into batch processing. For Spark 3.1, the training pipeline (initialization, optimizer selection, distributed computation) was re‑engineered: input data are scaled, instance vectors are blockified into roughly 1 MB blocks, and dense linear algebra is handed off to BLAS libraries.
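A minimal NumPy sketch of the idea (illustrative names, not Spark's internal Scala): instead of one dot product per training instance, rows are stacked into a block matrix so that each block's margins come from a single matrix–vector product.

```python
import numpy as np

def margins_per_instance(rows, coef):
    # Baseline: one level-1 BLAS dot product per training instance.
    return [float(np.dot(r, coef)) for r in rows]

def margins_blockified(rows, coef, block_rows=4):
    # Blockified: stack instances into a block and compute all of the
    # block's margins with one level-2 BLAS matrix-vector product (gemv).
    out = []
    for start in range(0, len(rows), block_rows):
        block = np.vstack(rows[start:start + block_rows])
        out.extend(float(m) for m in block @ coef)
    return out

rng = np.random.default_rng(0)
rows = [rng.standard_normal(8) for _ in range(10)]
coef = rng.standard_normal(8)
assert np.allclose(margins_per_instance(rows, coef),
                   margins_blockified(rows, coef))
```

Spark sizes blocks by bytes (about 1 MB); `block_rows=4` here is only to keep the example small.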
BLAS levels: Level 1 vector ops, Level 2 matrix‑vector, Level 3 matrix‑matrix. Spark‑ML now uses native BLAS (OpenBLAS or Intel MKL) for dense data and a Scala implementation for sparse data.
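To make the three BLAS levels concrete, here is a small NumPy illustration; NumPy delegates these operations to whatever native BLAS it is linked against (e.g. OpenBLAS or MKL):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.arange(9.0).reshape(3, 3)
I = np.eye(3)

# Level 1: vector-vector, O(n) work (dot, axpy, nrm2, ...).
dot = x @ y                  # ddot: 1*4 + 2*5 + 3*6 = 32

# Level 2: matrix-vector, O(n^2) work (gemv, ger, ...).
mv = A @ x                   # dgemv

# Level 3: matrix-matrix, O(n^3) work (gemm, ...).
mm = A @ I                   # dgemm; multiplying by I returns A

assert dot == 32.0
assert np.allclose(mv, [8.0, 26.0, 44.0])
assert np.allclose(mm, A)
```

The higher the level, the more work is done per call, which is why blockification (batching rows into matrices) lets the native library amortize overhead and use cache-friendly kernels.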
Performance gains: up to 18× speedup on dense data, 3–6× on sparse data (1 % density).
2. Virtual Centering
Addressing a complaint that Spark‑ML coefficients were less accurate than GLMNET's, the team introduced virtual centering: the input data are scaled without destroying sparsity, and a pre‑computation in the forward pass combined with a gradient adjustment in the backward pass emulates training on standardized data.
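A minimal NumPy sketch of the trick for a least‑squares gradient (function names are illustrative, not Spark's internals): the mean offset `mu @ w` is precomputed once per pass and folded into the margins, and the gradient gets a rank‑one correction afterwards, so the original (possibly sparse) matrix is never centered in place.

```python
import numpy as np

def centered_gradient_explicit(X, y, w, mu):
    # Reference implementation: materialize the centered matrix.
    Xc = X - mu                      # would destroy sparsity in practice
    r = Xc @ w - y
    return Xc.T @ r / len(y)

def centered_gradient_virtual(X, y, w, mu):
    # Virtual centering: X is read as-is, so sparsity is preserved.
    offset = mu @ w                  # precomputed once per pass
    r = X @ w - offset - y           # forward pass with offset folded in
    grad = X.T @ r                   # gradient on the uncentered data
    grad -= mu * r.sum()             # backward-pass correction term
    return grad / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
w = rng.standard_normal(5)
mu = X.mean(axis=0)
assert np.allclose(centered_gradient_explicit(X, y, w, mu),
                   centered_gradient_virtual(X, y, w, mu))
```

The identity behind the correction is (X − 1μᵀ)ᵀ r = Xᵀ r − μ · Σᵢ rᵢ, which is why only the small dense mean vector is ever needed alongside the sparse data.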
Results show model coefficients closer to GLMNET's, reduced divergence, and 20–50% fewer iterations to convergence.
These optimizations have been released in Spark 3.1 and 3.2.
Future Planning Summary
BLAS optimization in collaboration with the OpenJDK team, using JDK 16 Vector API and targeting sparse data.
Adaptive block size based on data sparsity and dimensionality.
Blockify KMeans.
Further model training improvements such as better initialization and early stopping.
Refactor tree models.
Thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.