How RecIS Revolutionizes Large‑Scale Sparse‑Dense Recommendation Training
RecIS is an open‑source, PyTorch‑based unified framework for ultra‑large‑scale sparse‑dense computation in recommendation systems. It provides a complete solution for training models with massive samples, multimodal inputs, and large embedding tables, and has demonstrated significant performance gains over TensorFlow and TorchRec in production deployments.
Summary
RecIS is a unified, open‑source deep‑learning framework built on the PyTorch ecosystem for ultra‑large‑scale sparse‑dense computation in recommendation systems. Developed jointly by iOrange Technology, Taobao Group, and Alimama, it provides a complete solution for training recommendation models and multimodal or large‑model workloads, and has been widely adopted across Alibaba's advertising, recommendation, and search scenarios.
1. Background and Goals
Modern recommendation systems are shifting from traditional MLPs to Transformer‑based architectures, driven by the need to handle longer user behavior sequences and larger training datasets. This shift creates a sparse‑dense mixed paradigm where sparse parts handle massive feature and embedding calculations, while dense parts process the resulting embeddings with deep neural networks.
Existing industrial frameworks (TensorFlow for sparse, PyTorch for dense) each have limitations: TensorFlow offers mature sparse optimizations but poor usability, while PyTorch lacks robust large‑scale sparse support. RecIS aims to bridge this gap with a unified architecture.
2. System Design
2.1 Algorithm Framework
RecIS migrates training components from XDL, Alibaba's earlier TensorFlow‑based framework, to maintain compatibility with existing models and hyper‑parameter configurations.
ColumnIO
Handles distributed data reading and preprocessing, supporting basic types (Double, Float, Bigint, String) and nested structures (List). It reads column‑oriented samples efficiently from distributed file systems and assembles them into PyTorch tensors.
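As an illustration of the columnar idea (not the actual ColumnIO API), a batch can be assembled by slicing every feature column over the same row range, with no row‑by‑row record parsing:

```python
def assemble_batch(columns, row_range):
    """Columnar reading sketch: each feature lives in its own column,
    so a batch is a per-column slice over one shared row range."""
    return {name: values[row_range] for name, values in columns.items()}

# Two feature columns, batch of the first two rows.
batch = assemble_batch(
    {"price": [9.9, 5.0, 3.2], "item_id": [101, 102, 103]},
    slice(0, 2),
)
```

In a real reader the sliced columns would be converted to PyTorch tensors; the function name here is hypothetical.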
Feature Engine
Feature conversion: hash string features to IDs.
Feature discretization: bucketize numeric features.
Sequence processing: truncate sequences.
Feature crossing: combine ID features from different columns.
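The four transformations can be sketched in plain Python (function names are illustrative, not the RecIS API):

```python
import hashlib
from bisect import bisect_right

def hash_feature(value: str, num_buckets: int) -> int:
    """Feature conversion: map a string feature to a stable integer ID."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def bucketize(value: float, boundaries: list) -> int:
    """Feature discretization: map a numeric value to a bucket index."""
    return bisect_right(boundaries, value)

def truncate(seq: list, max_len: int) -> list:
    """Sequence processing: keep only the most recent max_len items."""
    return seq[-max_len:]

def cross(id_a: int, id_b: int, num_buckets: int) -> int:
    """Feature crossing: combine two ID features into one crossed ID.
    (A production system would use a deterministic hash here.)"""
    return hash((id_a, id_b)) % num_buckets
```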
Embedding Engine
Manages embedding tables using a conflict‑free, scalable KV store (Hashtable) that supports dynamic eviction and efficient updates.
Optimizer
Implements SparseAdam and SparseAdamW, compatible with TensorFlow optimizers.
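The sparse‑update idea behind SparseAdam can be sketched as follows (scalar rows for brevity; the real optimizer operates on embedding rows): only rows that received gradients this step are touched.

```python
def sparse_adam_step(params, m, v, grad_rows, lr=1e-3,
                     b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Adam update applied only to rows present in grad_rows;
    rows without gradients keep their stale moment estimates,
    which is what makes the optimizer sparse."""
    for row, g in grad_rows.items():
        m[row] = b1 * m[row] + (1 - b1) * g
        v[row] = b2 * v[row] + (1 - b2) * g * g
        m_hat = m[row] / (1 - b1 ** t)       # bias correction
        v_hat = v[row] / (1 - b2 ** t)
        params[row] -= lr * m_hat / (v_hat ** 0.5 + eps)

# Row 1 received no gradient, so it is left untouched.
params, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
sparse_adam_step(params, m, v, grad_rows={0: 0.5})
```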
Saver
Stores sparse model parameters in SafeTensors format with parallel sharding to improve read/write performance.
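Parallel sharding can be pictured as splitting the embedding rows into contiguous ranges that are written and read independently (illustrative only; the actual on‑disk layout is SafeTensors):

```python
def shard_rows(num_rows, num_shards):
    """Split num_rows into contiguous per-shard (start, end) ranges
    so each shard can be serialized in parallel with the others."""
    base, extra = divmod(num_rows, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)  # spread remainder
        ranges.append((start, end))
        start = end
    return ranges

shards = shard_rows(10, 3)
```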
Pipelines
Orchestrates the above components, supporting multi‑stage training, windowed online learning, and multi‑objective workflows.
3. System Optimization
3.1 Three Performance Walls
IO Wall: Massive daily sample volumes (hundreds of billions) require high‑throughput data loading.
Memory Wall: Sparse embedding lookups are memory‑bound; performance is measured by Model Bandwidth Utilization (MBU).
Computation Wall: Dense parts are compute‑bound; performance is measured by Model FLOPS Utilization (MFU).
3.2 IO Optimizations
Columnar storage layout reduces copy overhead and improves compression.
High‑concurrency C++ multithreaded readers and asynchronous pipelines hide IO latency.
CSR memory layout replaces TensorFlow’s COO for efficient long‑sequence handling.
GPU‑side batch assembly leverages GPU memory bandwidth.
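The CSR‑versus‑COO difference is easy to see on variable‑length sequences: COO stores a (row, col) index pair per element, while CSR stores one flat value array plus N+1 row offsets for a batch of N rows. A minimal sketch:

```python
def to_csr(sequences):
    """Pack ragged ID sequences into CSR form: a flat values array
    plus row offsets. Offsets cost N+1 integers regardless of how
    long the sequences are, unlike COO's per-element row index."""
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))
    return values, offsets

def csr_row(values, offsets, i):
    """Recover row i as a contiguous slice."""
    return values[offsets[i]:offsets[i + 1]]

values, offsets = to_csr([[1, 2, 3], [4], [5, 6]])
```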
3.3 Memory Optimizations
RecIS adopts a two‑level storage architecture:
IDMap: Feature ID → offset mapping.
Blocks: Contiguous memory blocks storing embedding vectors and optimizer states, with support for dynamic sharding.
Both levels can reside in GPU memory for high‑bandwidth access, or fall back to CPU memory when GPU memory is limited.
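A toy sketch of the two‑level layout (class and method names are hypothetical): the IDMap assigns each first‑seen feature ID a flat offset, and fixed‑size Blocks are allocated on demand as offsets grow.

```python
class TwoLevelStore:
    """Two-level storage sketch: level 1 (idmap) maps a feature ID
    to a flat offset; level 2 (blocks) holds fixed-size row arrays,
    so capacity grows block by block instead of reallocating."""
    BLOCK_ROWS = 4

    def __init__(self, dim):
        self.dim = dim
        self.idmap = {}    # level 1: feature ID -> offset
        self.blocks = []   # level 2: list of BLOCK_ROWS-row blocks

    def lookup(self, fid):
        # First-seen IDs get the next free offset (conflict-free).
        off = self.idmap.setdefault(fid, len(self.idmap))
        block_idx, row_idx = divmod(off, self.BLOCK_ROWS)
        while block_idx >= len(self.blocks):  # grow on demand
            self.blocks.append(
                [[0.0] * self.dim for _ in range(self.BLOCK_ROWS)])
        return self.blocks[block_idx][row_idx]

store = TwoLevelStore(dim=8)
for fid in [42, 7, 7, 99, 3, 11]:   # 5 distinct IDs -> 2 blocks
    store.lookup(fid)
```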
3.4 Load Balancing
Features are aggregated and fully sharded across GPUs using All‑to‑All communication, ensuring uniform distribution of sparse parameters.
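The sharding step can be sketched as assigning each feature ID to an owning rank; a real training step would then follow this with an All‑to‑All exchange of the per‑rank buckets (illustrative, not the RecIS communication code):

```python
def shard_ids(ids, world_size):
    """Assign each feature ID to the rank that owns its parameters.
    Simple mod-sharding spreads IDs uniformly, so hot features from
    any one input batch land on different GPUs."""
    buckets = [[] for _ in range(world_size)]
    for fid in ids:
        buckets[fid % world_size].append(fid)
    return buckets

buckets = shard_ids(range(10), world_size=4)
```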
3.5 Maximizing MBU
Sparse computation concurrency: kernel fusion reduces Python call overhead and improves GPU scheduling.
Embedding merging: automatically combines same‑dimensional embeddings to reduce memory accesses.
Vectorized memory accesses and warp‑level reductions lower atomic‑operation collisions.
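Embedding merging can be pictured as rebasing per‑table IDs onto one concatenated table so many small gathers become a single one (a toy sketch, not the fused kernel itself):

```python
def merged_lookup(tables, queries):
    """Embedding-merging sketch: tables with the same dimension are
    concatenated row-wise, per-table IDs are rebased onto the merged
    table, and all lookups happen in one pass over merged memory."""
    merged, base, bases = [], 0, {}
    for name, rows in tables.items():
        bases[name] = base          # starting offset of this table
        merged.extend(rows)
        base += len(rows)
    return [merged[bases[name] + fid] for name, fid in queries]

out = merged_lookup(
    {"user": [[1.0], [2.0]], "item": [[3.0]]},   # two 1-dim tables
    [("item", 0), ("user", 1)],
)
```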
4. Large‑Model Optimizations
Mixed‑precision training (FP32 for sparse, FP16/BF16 for dense attention).
Fused kernels such as FlashAttention and fused softmax‑cross‑entropy.
ZeRO partitioning of dense parameters, gradients, and optimizer states to reduce GPU memory usage.
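The fused softmax‑cross‑entropy idea, shown in plain Python: computing log‑sum‑exp directly avoids materializing the full softmax and is numerically stable (a pedagogical sketch; the real kernel is fused on the GPU):

```python
import math

def fused_softmax_xent(logits, label):
    """One-pass softmax + cross-entropy via log-sum-exp: the softmax
    probabilities are never written out, saving a memory round-trip
    over a logits-sized buffer."""
    m = max(logits)                                   # for stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]                        # -log p[label]
```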
5. Performance Evaluation
Operator‑level benchmarks on H20 GPUs show RecIS achieving higher MBU percentages than TensorFlow and PyTorch for bucketize, mod, ids‑partition, sequence‑tile, reduce‑hard/easy, gather, and scatter operations.
End‑to‑end experiments on two models:
MSE search model: RecIS reduces overall training time to 33% of TensorFlow's and 80% of PyTorch's, and the sparse part to 30% of TensorFlow's.
LMA advertising model: RecIS outperforms TensorFlow, which fails outright at 100k sequence length, with overall latency improvements of 27% (16k) and 27% (100k) compared to the TensorFlow baseline.
6. Production Applications
6.1 Scaling Dense Parameters
RecIS enables generative ranking models with 50 M dense parameters, achieving 200% larger batch sizes and 70% shorter training cycles compared to TorchRec.
6.2 Scaling User Sequences
Supports user behavior sequences up to 1 M length, overcoming IO bandwidth and embedding lookup challenges via ORC compression, GPU‑unique, and GPU‑hashtable techniques, yielding a 4.8% CTR lift and 50% cost reduction.
6.3 Scaling Modality
Introduces modality scaling by combining a universal base model with lightweight task‑specific experts, using target‑aware embeddings to fuse sparse IDs, images, and text, delivering measurable online gains.
RecIS is available on GitHub (https://github.com/alibaba/recis) and described in a technical report (http://arxiv.org/abs/2509.20883).
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.
