
PICASSO: An Industrial-Scale Sparse Training Engine for Wide-and-Deep Recommender Systems

PICASSO, Alibaba’s GPU‑centric sparse training engine for wide‑and‑deep recommender systems, merges identical embedding tables, interleaves data and kernel operations, and caches hot embeddings in GPU memory. By eliminating the parameter server, it delivers up to tenfold speedups over TensorFlow‑PS while maintaining model quality.


Recently, Alibaba's self‑developed sparse training engine paper "PICASSO: Unleashing the Potential of GPU‑centric Training for Wide‑and‑deep Recommender Systems" was accepted at the top‑tier data‑engineering conference ICDE 2022. PICASSO (Packing, Interleaving and Caching Augmented Software System Optimization) is the result of close collaboration between the XDL team of Alibaba's Intelligent Engine Business Unit and the PAI team of Alibaba Cloud. Internally, PICASSO powers three products—XDL2, PAI‑TensorFlow and PAI‑HybridBackend—serving search, recommendation and advertising workloads.

The article explains the motivation: as model complexity and data volume grow rapidly, sparse models for advertising and search suffer from low resource utilization on generic hardware, even with many dense‑ and sparse‑specific optimizations. PICASSO explores methods to raise hardware efficiency while meeting strict business‑quality requirements.

Key technical designs include:

Packing: embedding tables with identical attributes (dimension, initializer, feature group, etc.) are merged into a single larger table, and their lookup operators are fused into one large operator. This reduces kernel‑launch overhead and improves memory‑access patterns while keeping the original code semantics transparent to users.
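The packing idea can be sketched in a few lines. In this illustrative example (all names and shapes are hypothetical, not PICASSO's actual API), three embedding tables that share a dimension are stacked into one array, and per‑table IDs are shifted by a row offset so a single gather serves all lookups:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
tables = {                      # table name -> weight matrix (rows x dim)
    "user_id": rng.normal(size=(100, dim)),
    "item_id": rng.normal(size=(500, dim)),
    "shop_id": rng.normal(size=(50, dim)),
}

# Pack: concatenate rows and remember each table's starting row offset.
offsets, packed_rows, start = {}, [], 0
for name, w in tables.items():
    offsets[name] = start
    packed_rows.append(w)
    start += w.shape[0]
packed = np.concatenate(packed_rows, axis=0)   # one (650, 8) table

def packed_lookup(batch):
    """batch: dict of table name -> array of IDs. One gather serves all tables."""
    ids = np.concatenate([batch[n] + offsets[n] for n in batch])
    return packed[ids]

batch = {"user_id": np.array([3, 7]),
         "item_id": np.array([42]),
         "shop_id": np.array([1])}
out = packed_lookup(batch)
# The packed lookup matches the original per-table lookups exactly.
assert np.allclose(out[:2], tables["user_id"][[3, 7]])
```

Instead of one kernel launch per table, the merged table needs a single gather per batch, which is where the launch-overhead savings come from.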

Interleaving: two orthogonal techniques. (1) Data interleaving splits a training batch and pipelines operators that use different hardware resources, alleviating resource bottlenecks (e.g., reducing peak GPU memory usage in the MLP stage). (2) Kernel interleaving overlaps the communication‑intensive shuffle with the memory‑intensive gather inside the embedding layer, increasing overall resource utilization.
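A minimal sketch of the data‑interleaving pattern (stage names and shapes are illustrative, not from the paper): a batch is split into micro‑batches, and while the compute‑bound MLP stage processes micro‑batch i, the memory‑bound embedding gather for micro‑batch i+1 is already running, so the two resource profiles overlap:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(1)
table = rng.normal(size=(1000, 16))   # embedding table
w_mlp = rng.normal(size=(16, 4))      # dense layer weights

def embedding_stage(ids):             # memory-bound: row gather
    return table[ids]

def mlp_stage(emb):                   # compute-bound: matmul + ReLU
    return np.maximum(emb @ w_mlp, 0.0)

def interleaved_step(batch_ids, n_micro=4):
    micro = np.array_split(batch_ids, n_micro)
    outputs = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        emb_fut = pool.submit(embedding_stage, micro[0])
        for nxt in micro[1:]:
            emb = emb_fut.result()
            emb_fut = pool.submit(embedding_stage, nxt)  # prefetch next gather
            outputs.append(mlp_stage(emb))               # overlaps with gather
        outputs.append(mlp_stage(emb_fut.result()))
    return np.concatenate(outputs)

batch = rng.integers(0, 1000, size=64)
pipelined = interleaved_step(batch)
# Pipelining changes the schedule, not the result.
assert np.allclose(pipelined, mlp_stage(embedding_stage(batch)))
```

The same splitting also caps the peak activation footprint: only one micro‑batch's MLP activations are live at a time, matching the reduced‑peak‑memory effect described above.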

Caching: exploiting the skewed frequency distribution of feature IDs, hot embeddings are kept in GPU memory while cold embeddings reside in CPU memory. The two hash tables are kept synchronized and refreshed periodically, dramatically cutting redundant low‑speed memory accesses without sacrificing gradient accuracy.
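An illustrative sketch of frequency‑based caching (all names are hypothetical): a small "fast" store stands in for GPU memory, the full "slow" store stands in for CPU memory, and access counts drive a periodic refresh of the hot set:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
slow_store = {i: rng.normal(size=4) for i in range(1000)}  # full table ("CPU")
fast_store = {}                                            # hot subset ("GPU")
counts = Counter()

def lookup(ids):
    counts.update(ids)
    # Serve from the fast store when possible, fall back to the slow store.
    return np.stack([fast_store.get(i, slow_store[i]) for i in ids])

def refresh(capacity=8):
    """Periodically promote the most frequent IDs into the fast store."""
    hot = {i for i, _ in counts.most_common(capacity)}
    for i in list(fast_store):
        if i not in hot:
            fast_store.pop(i)          # demote cooled-down rows
    for i in hot:
        fast_store[i] = slow_store[i]  # keep both copies in sync

# Skewed traffic: ID 7 dominates, so it ends up cached.
for _ in range(100):
    lookup([7, int(rng.integers(0, 1000))])
refresh()
assert 7 in fast_store
```

Because the hot set is small relative to the full table but absorbs most of the traffic, the bulk of lookups never touch the slow path.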

System architecture: PICASSO removes the traditional parameter‑server role. Each worker reads a data shard, holds a partition of the embedding tables (partitioned by ID, dimension, or table), and maintains a full replica of the dense parameters. Gradients are aggregated across workers and applied globally, allowing efficient use of high‑speed interconnects (e.g., RDMA) and ensuring consistency of dense parameters.
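The ID‑based partitioning variant can be simulated in a few lines (the sharding rule and names here are illustrative assumptions, not PICASSO's actual scheme): each worker owns the rows whose ID maps to it, and a lookup routes each ID to its owning shard instead of a remote parameter server:

```python
import numpy as np

rng = np.random.default_rng(3)
n_workers, dim, vocab = 4, 8, 100
full_table = rng.normal(size=(vocab, dim))

# Partition by ID: worker w stores the rows where id % n_workers == w.
shards = [{i: full_table[i] for i in range(vocab) if i % n_workers == w}
          for w in range(n_workers)]

def sharded_lookup(ids):
    """Route each ID to its owning shard and gather the row locally."""
    out = np.empty((len(ids), dim))
    for pos, i in enumerate(ids):
        out[pos] = shards[i % n_workers][i]   # "shuffle" to the owning worker
    return out

ids = rng.integers(0, vocab, size=16)
# Sharded storage reproduces the monolithic table exactly.
assert np.allclose(sharded_lookup(ids), full_table[ids])
```

In the real system the routing step is the communication‑intensive shuffle mentioned above, and it is this shuffle that kernel interleaving overlaps with the local gather.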

Benchmark results: on the public Criteo dataset with four representative models (DLRM, DeepFM, DIN, DIEN), PICASSO achieves a 1.9×–10× speedup over the TensorFlow‑PS baseline and at least a 2× improvement over PyTorch model‑parallel training. Internal Alibaba experiments on the CAN model further show substantial reductions in training time and improvements in other metrics.

Future outlook: the PICASSO team is working on automating parameter tuning, extending the optimizations to a broader range of sparse scenarios, and inviting researchers and engineers to collaborate on advancing sparse‑training efficiency for the entire machine‑learning community.

Tags: Alibaba, machine learning, recommender systems, distributed training, GPU optimization, sparse training
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
