How RecIS Revolutionizes Large‑Scale Sparse‑Dense Recommendation Training
RecIS is an open‑source, PyTorch‑based unified framework for ultra‑large‑scale sparse‑dense computation in recommendation systems. It provides a complete solution for training models with massive samples, multimodal inputs, and large embedding tables, and has demonstrated significant performance gains over TensorFlow and TorchRec in production deployments.
Summary
RecIS is a unified, open‑source deep‑learning framework built on the PyTorch ecosystem for ultra‑large‑scale sparse‑dense computation in recommendation systems. Developed jointly by iOrange Technology, Taobao Group, and Alimama, it provides a complete solution for training recommendation models and multimodal or large‑model workloads, and has been widely adopted across Alibaba's advertising, recommendation, and search scenarios.
1. Background and Goals
Modern recommendation systems are shifting from traditional MLPs to Transformer‑based architectures, driven by the need to handle longer user behavior sequences and larger training datasets. This shift creates a sparse‑dense mixed paradigm where sparse parts handle massive feature and embedding calculations, while dense parts process the resulting embeddings with deep neural networks.
Existing industrial frameworks (TensorFlow for sparse, PyTorch for dense) each have limitations: TensorFlow offers mature sparse optimizations but poor usability, while PyTorch lacks robust large‑scale sparse support. RecIS aims to bridge this gap with a unified architecture.
2. System Design
2.1 Algorithm Framework
RecIS migrates training components from XDL, Alibaba's earlier TensorFlow‑based framework, to maintain compatibility with existing models and hyper‑parameter configurations.
ColumnIO
Handles distributed data reading and preprocessing, supporting basic types (Double, Float, Bigint, String) and nested structures (List). It reads column‑oriented samples efficiently from distributed file systems and assembles them into PyTorch tensors.
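As an illustration of the columnar idea (not the actual ColumnIO API), a batch can be assembled by slicing every feature column over the same row range, with no row‑by‑row record parsing:

```python
def assemble_batch(columns, row_range):
    """Columnar reading sketch: each feature lives in its own column,
    so a batch is a per-column slice over one shared row range."""
    return {name: values[row_range] for name, values in columns.items()}

# Two feature columns, batch of the first two rows.
batch = assemble_batch(
    {"price": [9.9, 5.0, 3.2], "item_id": [101, 102, 103]},
    slice(0, 2),
)
```

In a real reader the sliced columns would be converted to PyTorch tensors; the function name here is hypothetical.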
Feature Engine
Feature conversion: hash string features to IDs.
Feature discretization: bucketize numeric features.
Sequence processing: truncate sequences.
Feature crossing: combine ID features from different columns.
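The four transformations can be sketched in plain Python (function names are illustrative, not the RecIS API):

```python
import hashlib
from bisect import bisect_right

def hash_feature(value: str, num_buckets: int) -> int:
    """Feature conversion: map a string feature to a stable integer ID."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def bucketize(value: float, boundaries: list) -> int:
    """Feature discretization: map a numeric value to a bucket index."""
    return bisect_right(boundaries, value)

def truncate(seq: list, max_len: int) -> list:
    """Sequence processing: keep only the most recent max_len items."""
    return seq[-max_len:]

def cross(id_a: int, id_b: int, num_buckets: int) -> int:
    """Feature crossing: combine two ID features into one crossed ID.
    (A production system would use a deterministic hash here.)"""
    return hash((id_a, id_b)) % num_buckets
```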
Embedding Engine
Manages embedding tables using a conflict‑free, scalable KV store (Hashtable) that supports dynamic eviction and efficient updates.
Optimizer
Implements SparseAdam and SparseAdamW, compatible with TensorFlow optimizers.
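The sparse‑update idea behind SparseAdam can be sketched as follows (scalar rows for brevity; the real optimizer operates on embedding rows): only rows that received gradients this step are touched.

```python
def sparse_adam_step(params, m, v, grad_rows, lr=1e-3,
                     b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Adam update applied only to rows present in grad_rows;
    rows without gradients keep their stale moment estimates,
    which is what makes the optimizer sparse."""
    for row, g in grad_rows.items():
        m[row] = b1 * m[row] + (1 - b1) * g
        v[row] = b2 * v[row] + (1 - b2) * g * g
        m_hat = m[row] / (1 - b1 ** t)       # bias correction
        v_hat = v[row] / (1 - b2 ** t)
        params[row] -= lr * m_hat / (v_hat ** 0.5 + eps)

# Row 1 received no gradient, so it is left untouched.
params, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
sparse_adam_step(params, m, v, grad_rows={0: 0.5})
```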
Saver
Stores sparse model parameters in SafeTensors format with parallel sharding to improve read/write performance.
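Parallel sharding can be pictured as splitting the embedding rows into contiguous ranges that are written and read independently (illustrative only; the actual on‑disk layout is SafeTensors):

```python
def shard_rows(num_rows, num_shards):
    """Split num_rows into contiguous per-shard (start, end) ranges
    so each shard can be serialized in parallel with the others."""
    base, extra = divmod(num_rows, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)  # spread remainder
        ranges.append((start, end))
        start = end
    return ranges

shards = shard_rows(10, 3)
```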
Pipelines
Orchestrates the above components, supporting multi‑stage training, windowed online learning, and multi‑objective workflows.
3. System Optimization
3.1 Three Performance Walls
IO Wall: Massive daily sample volumes (hundreds of billions) require high‑throughput data loading.
Memory Wall: Sparse embedding lookups are memory‑bound; performance is measured by Model Bandwidth Utilization (MBU).
Computation Wall: Dense parts are compute‑bound; performance is measured by Model FLOPS Utilization (MFU).
3.2 IO Optimizations
Columnar storage layout reduces copy overhead and improves compression.
High‑concurrency C++ multithreaded readers and asynchronous pipelines hide IO latency.
CSR memory layout replaces TensorFlow’s COO for efficient long‑sequence handling.
GPU‑side batch assembly leverages GPU memory bandwidth.
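The CSR‑versus‑COO difference is easy to see on variable‑length sequences: COO stores a (row, col) index pair per element, while CSR stores one flat value array plus N+1 row offsets for a batch of N rows. A minimal sketch:

```python
def to_csr(sequences):
    """Pack ragged ID sequences into CSR form: a flat values array
    plus row offsets. Offsets cost N+1 integers regardless of how
    long the sequences are, unlike COO's per-element row index."""
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))
    return values, offsets

def csr_row(values, offsets, i):
    """Recover row i as a contiguous slice."""
    return values[offsets[i]:offsets[i + 1]]

values, offsets = to_csr([[1, 2, 3], [4], [5, 6]])
```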
3.3 Memory Optimizations
RecIS adopts a two‑level storage architecture:
IDMap: Feature ID → offset mapping.
Blocks: Contiguous memory blocks storing embedding vectors and optimizer states, with support for dynamic sharding.
Both levels can reside in GPU memory for high‑bandwidth access, or fall back to CPU memory when GPU memory is limited.
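A toy sketch of the two‑level layout (class and method names are hypothetical): the IDMap assigns each first‑seen feature ID a flat offset, and fixed‑size Blocks are allocated on demand as offsets grow.

```python
class TwoLevelStore:
    """Two-level storage sketch: level 1 (idmap) maps a feature ID
    to a flat offset; level 2 (blocks) holds fixed-size row arrays,
    so capacity grows block by block instead of reallocating."""
    BLOCK_ROWS = 4

    def __init__(self, dim):
        self.dim = dim
        self.idmap = {}    # level 1: feature ID -> offset
        self.blocks = []   # level 2: list of BLOCK_ROWS-row blocks

    def lookup(self, fid):
        # First-seen IDs get the next free offset (conflict-free).
        off = self.idmap.setdefault(fid, len(self.idmap))
        block_idx, row_idx = divmod(off, self.BLOCK_ROWS)
        while block_idx >= len(self.blocks):  # grow on demand
            self.blocks.append(
                [[0.0] * self.dim for _ in range(self.BLOCK_ROWS)])
        return self.blocks[block_idx][row_idx]

store = TwoLevelStore(dim=8)
for fid in [42, 7, 7, 99, 3, 11]:   # 5 distinct IDs -> 2 blocks
    store.lookup(fid)
```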
3.4 Load Balancing
Features are aggregated and fully sharded across GPUs using All‑to‑All communication, ensuring uniform distribution of sparse parameters.
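The sharding step can be sketched as assigning each feature ID to an owning rank; a real training step would then follow this with an All‑to‑All exchange of the per‑rank buckets (illustrative, not the RecIS communication code):

```python
def shard_ids(ids, world_size):
    """Assign each feature ID to the rank that owns its parameters.
    Simple mod-sharding spreads IDs uniformly, so hot features from
    any one input batch land on different GPUs."""
    buckets = [[] for _ in range(world_size)]
    for fid in ids:
        buckets[fid % world_size].append(fid)
    return buckets

buckets = shard_ids(range(10), world_size=4)
```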
3.5 Maximizing MBU
Sparse computation concurrency: kernel fusion reduces Python call overhead and improves GPU scheduling.
Embedding merging: automatically combines same‑dimensional embeddings to reduce memory accesses.
Vectorized memory accesses and warp‑level reductions lower atomic‑operation collisions.
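Embedding merging can be pictured as rebasing per‑table IDs onto one concatenated table so many small gathers become a single one (a toy sketch, not the fused kernel itself):

```python
def merged_lookup(tables, queries):
    """Embedding-merging sketch: tables with the same dimension are
    concatenated row-wise, per-table IDs are rebased onto the merged
    table, and all lookups happen in one pass over merged memory."""
    merged, base, bases = [], 0, {}
    for name, rows in tables.items():
        bases[name] = base          # starting offset of this table
        merged.extend(rows)
        base += len(rows)
    return [merged[bases[name] + fid] for name, fid in queries]

out = merged_lookup(
    {"user": [[1.0], [2.0]], "item": [[3.0]]},   # two 1-dim tables
    [("item", 0), ("user", 1)],
)
```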
4. Large‑Model Optimizations
Mixed‑precision training (FP32 for sparse, FP16/BF16 for dense attention).
Fused kernels such as FlashAttention and fused softmax‑cross‑entropy.
ZeRO partitioning of dense parameters, gradients, and optimizer states to reduce GPU memory usage.
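The fused softmax‑cross‑entropy idea, shown in plain Python: computing log‑sum‑exp directly avoids materializing the full softmax and is numerically stable (a pedagogical sketch; the real kernel is fused on the GPU):

```python
import math

def fused_softmax_xent(logits, label):
    """One-pass softmax + cross-entropy via log-sum-exp: the softmax
    probabilities are never written out, saving a memory round-trip
    over a logits-sized buffer."""
    m = max(logits)                                   # for stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]                        # -log p[label]
```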
5. Performance Evaluation
Operator‑level benchmarks on H20 GPUs show RecIS achieving higher MBU percentages than TensorFlow and PyTorch for bucketize, mod, ids‑partition, sequence‑tile, reduce‑hard/easy, gather, and scatter operations.
End‑to‑end experiments on two models:
MSE search model: RecIS reduces overall training time to 33% of TensorFlow's and 80% of PyTorch's, and the sparse part to 30% of TensorFlow's.
LMA advertising model: RecIS outperforms TensorFlow, which fails outright at 100k sequence length, with overall latency improvements of 27% (16k) and 27% (100k) compared to the TensorFlow baseline.
6. Production Applications
6.1 Scaling Dense Parameters
RecIS enables generative ranking models with 50 M dense parameters, achieving 200% larger batch sizes and 70% shorter training cycles compared to TorchRec.
6.2 Scaling User Sequences
Supports user behavior sequences up to 1 M length, overcoming IO bandwidth and embedding lookup challenges via ORC compression, GPU‑unique, and GPU‑hashtable techniques, yielding a 4.8% CTR lift and 50% cost reduction.
6.3 Scaling Modality
Introduces modality scaling by combining a universal base model with lightweight task‑specific experts, using target‑aware embeddings to fuse sparse IDs, images, and text, delivering measurable online gains.
RecIS is available on GitHub (https://github.com/alibaba/recis) and described in a technical report (http://arxiv.org/abs/2509.20883).
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.
