How Alibaba’s eXtreme Parameter Server Powers Billion‑Scale Machine Learning

Alibaba’s eXtreme Parameter Server (XPS) platform tackles the challenges of training models on billions of samples and trillions of features by employing streaming learning, feature hashing, dynamic sparsity, asynchronous checkpointing, and high‑performance communication, enabling efficient, fault‑tolerant distributed AI at massive scale.

Alibaba Cloud Developer

System Architecture and Data Flow

In 2017 Alibaba’s recommendation team and the PAI computing platform jointly built the eXtreme Parameter Server (XPS) platform, aiming for extreme performance and effectiveness. XPS runs full‑traffic workloads on mobile Taobao’s "Guess You Like", Life Research Institute, Fliggy travel, and Tmall recommendation scenarios.

During Double 11 (Singles' Day) 2017, hourly‑updated XNN models were deployed in the Guess You Like and Tmall recommendation scenarios, using real‑time user behavior to significantly boost revenue and user value. The platform now processes 100 billion samples and 1 trillion features daily with high speed, fault tolerance, and resource efficiency.

System architecture diagram

The data pipeline integrates OSS files, MaxCompute offline storage, Streaming DataHub, and Kafka. Users write SQL on MaxCompute to invoke XPS algorithms, while the Feitian cluster scheduler allocates resources efficiently. XPS supports algorithms such as XNN, XFTRL, XSVD, XGBoost, and FM for recommendation, advertising, and search.

Data flow diagram

Distributed Optimizations

To support trillion‑scale parameters and hundred‑billion‑scale features, XPS introduces several optimizations:

Feature Hashing : Directly use hashed feature IDs, eliminating costly feature indexing and reducing memory usage by half while improving performance.

Dynamic Feature Scaling : Features are inserted and deleted at high frequency during training, so parameters live in an ArrayHashMap whose memory is managed manually with realloc/mremap. This layout supports zero‑copy communication and outperforms standard hash maps.

Global Checkpoint & Exactly‑Once Failover : Asynchronous checkpointing and failover allow seamless recovery from node failures without data loss or duplicate computation.

High‑Concurrency Communication : A custom high‑performance communication layer achieves 1–2× speedup over traditional MPI, supporting sparse matrix zero‑copy transfers and reducing latency.

Representation Learning Optimizations : Embedding vectors are stored in contiguous buffers, enabling O(1) access and avoiding dense matrix conversion, which accelerates minibatch embedding calculations.
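
As a rough illustration of how hashed feature IDs and contiguous embedding storage fit together, here is a minimal Python sketch. It is not XPS code: the choice of BLAKE2b as the hash function, the flat float array, and the names feature_id and EmbeddingTable are all assumptions made for the example.

```python
import hashlib
from array import array


def feature_id(name):
    """Hash a raw feature string straight to a 64-bit ID (no index dictionary)."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class EmbeddingTable:
    """Hypothetical table: all embeddings packed back to back in one buffer."""

    def __init__(self, dim):
        self.dim = dim
        self.slots = {}            # hashed feature ID -> row number
        self.buffer = array("f")   # contiguous storage, dim floats per row

    def _offset(self, fid):
        """Return the buffer offset of fid, appending a zero row on first sight."""
        slot = self.slots.get(fid)
        if slot is None:
            slot = len(self.slots)
            self.slots[fid] = slot
            self.buffer.extend([0.0] * self.dim)
        return slot * self.dim

    def lookup(self, fid):
        """Read one embedding: a hash lookup plus one contiguous slice."""
        start = self._offset(fid)
        return self.buffer[start:start + self.dim].tolist()

    def update(self, fid, grad, lr):
        """Apply a plain SGD step in place on the row of fid."""
        start = self._offset(fid)
        for i, g in enumerate(grad):
            self.buffer[start + i] -= lr * g


table = EmbeddingTable(dim=4)
user = feature_id("user:42")
table.update(user, [1.0, 1.0, 1.0, 1.0], lr=0.5)
print(table.lookup(user))
```

Hashing trades the feature index for a small chance of collisions, which is why the sketch uses a 64-bit ID space. New rows are appended to the same buffer, so a minibatch lookup never has to build a dense matrix first.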

Core Algorithms

XPS focuses on streaming learning to handle massive data streams efficiently. The main algorithms are:

XFTRL : An enhanced FTRL algorithm with regularization to prevent weight explosion, version‑controlled weight updates, gradient averaging, and parameter decay for better stability on streaming data.

XSVD : A scalable adaptation of SVD++ for e‑commerce, incorporating session features and SLIM‑style product‑pair weighting to remain learnable on trillion‑sample datasets.

XNN : A deep learning model comprising InputLayer, TransformLayer, MultiActiveLayer, and OutputLayer. It processes one‑hot, combined, and continuous features, applies short/long encodings, and uses a mix of Eigen and CBlas for efficient matrix operations.
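
For readers unfamiliar with the FTRL family that XFTRL extends, the sketch below shows the standard per‑coordinate FTRL‑Proximal update for logistic regression. It is the textbook baseline only, under assumed hyper‑parameters; it does not include the XFTRL additions described above (version control, gradient averaging, parameter decay), and the class name FTRLProximal is hypothetical.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Baseline FTRL-Proximal for sparse logistic regression (not XFTRL itself)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated adjusted gradients per feature
        self.n = defaultdict(float)  # accumulated squared gradients per feature

    def weight(self, fid):
        """Weights are derived lazily from z and n; L1 clips small ones to zero."""
        z = self.z.get(fid, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(fid, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x maps feature ID -> value; returns the predicted click probability."""
        s = sum(self.weight(f) * v for f, v in x.items())
        s = max(min(s, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """Consume one streaming sample (x, y) and return the pre-update prediction."""
        p = self.predict(x)
        for f, v in x.items():
            g = (p - y) * v                      # gradient of log loss
            w = self.weight(f)                   # weight before this update
            sigma = (math.sqrt(self.n[f] + g * g) - math.sqrt(self.n[f])) / self.alpha
            self.z[f] += g - sigma * w
            self.n[f] += g * g
        return p


model = FTRLProximal()
for _ in range(20):
    model.update({1: 1.0}, 1)
print(model.weight(1), model.predict({1: 1.0}))
```

Because the state is only z and n per feature, the update touches just the features present in each sample, which is what makes this family a natural fit for sparse streaming data.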

Additional classic algorithms such as XGBoost, Factorization Machines, OWL‑QN, and Word2Vec are also supported.

Operator System

XPS abstracts common algorithmic functions into reusable operators, enabling rapid algorithm development:

Streaming Evaluation Operator : Periodically scores upcoming batches with the current model before they are used for training, reporting AUC, MAE, RMSE, and weight statistics to accelerate hyper‑parameter tuning.

Automatic Session Feature Operator : Dynamically aggregates short‑, medium‑, and long‑term session features, enriching the feature set with decay coefficients.

Gradient Averaging Operator : Adjusts gradient updates based on feature frequency to mitigate over‑fitting of rare or overly frequent features.

Asynchronous Update Control Operator : Discounts gradients according to version differences between pulled weights and pushed gradients, improving convergence under asynchronous training.
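
The last two operators can be sketched in a few lines of Python. The source does not give the exact formulas, so both rules here are assumptions: gradients are averaged over the samples in a minibatch that contain each feature, and a pushed gradient is discounted by 1 / (1 + version gap). The names average_gradients and VersionedParam are hypothetical.

```python
from collections import defaultdict


def average_gradients(per_sample_grads):
    """Average each feature's gradient over the samples that contain it.

    Assumed rule: a feature seen in many samples of the batch does not get a
    proportionally larger step than a feature seen once.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for grads in per_sample_grads:
        for fid, g in grads.items():
            total[fid] += g
            count[fid] += 1
    return {fid: total[fid] / count[fid] for fid in total}


class VersionedParam:
    """One server-side parameter with a version counter."""

    def __init__(self, value=0.0):
        self.value = value
        self.version = 0

    def pull(self):
        """Workers receive the weight together with its current version."""
        return self.value, self.version

    def push(self, grad, pulled_version, lr=0.1):
        """Apply a gradient, discounted by how stale the pulled weight was."""
        staleness = self.version - pulled_version
        discount = 1.0 / (1.0 + staleness)   # assumed discount function
        self.value -= lr * discount * grad
        self.version += 1
        return discount


param = VersionedParam()
_, version = param.pull()
print(param.push(1.0, version))  # fresh gradient, full weight
print(param.push(1.0, version))  # same pulled version, now one step stale
```

The second push reuses a version that is already one update behind, so its gradient counts for half as much. Any monotonically decreasing discount would serve the same purpose.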

Other operators include activation selection, regularization selection, variable decay, and safety checks.
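
The streaming evaluation idea can be illustrated with a small evaluate‑then‑train loop. This is a sketch under assumptions, not the XPS operator: it scores every batch rather than sampling periodically, it expects a hypothetical model object exposing predict(x) and update(x, y), and it computes AUC by brute‑force pair counting, which is only practical for small batches.

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when a batch has a single class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_batch(labels, scores):
    """Report AUC, MAE, and RMSE for one batch of predictions."""
    errors = [s - y for y, s in zip(labels, scores)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"auc": auc(labels, scores), "mae": mae, "rmse": rmse}


def stream_train(model, batches):
    """Score each incoming batch with the current model, then train on it."""
    reports = []
    for features, labels in batches:
        scores = [model.predict(x) for x in features]
        reports.append(evaluate_batch(labels, scores))
        for x, y in zip(features, labels):
            model.update(x, y)
    return reports


print(evaluate_batch([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
```

Since each batch is scored before the model has seen it, the metrics act as held‑out estimates without requiring a separate validation set.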

Developers use a nine‑step SDK workflow to implement new algorithms, focusing on gradient computation, feature engineering, operator selection, and safety validation, while XPS handles distributed scheduling, communication, and fault tolerance.

Since its launch, XPS has been widely adopted across Alibaba’s search, advertising, and recommendation services, delivering significant improvements in click‑through rate, revenue, and user value. Future plans include support for image and NLP models, TensorFlow integration, reinforcement learning frameworks, and the AliBasicFeatureServer.

