How RECom Accelerates Recommendation Model Inference on GPUs
The RECom compiler combines subgraph‑parallel operator fusion with symbolic shape handling to speed up GPU inference of deep recommendation models that contain massive numbers of embedding columns. It also eliminates redundant computation, and it achieves up to 6.61× lower latency and 1.91× higher throughput than TensorFlow baselines.
Background
Deep learning‑based recommendation models are increasingly critical in large‑scale services. They consist of an embedding layer, often made up of thousands of embedding columns that map features such as user IDs to low‑dimensional vectors, followed by a deep neural network (DNN). In production, embedding columns dominate GPU inference latency, accounting for over 99% of end‑to‑end delay on Alibaba’s workloads.
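To make the term concrete, here is a minimal Python sketch of a single embedding column. It assumes a hash‑bucketed ID feature with sum pooling; the function name, the bucketing step, and the tiny table are illustrative assumptions, not code from RECom or TensorFlow.

```python
def embedding_column(ids_per_sample, table):
    """Look up each sample's feature IDs in `table` and sum-pool the rows."""
    num_buckets = len(table)
    dim = len(table[0])
    pooled = []
    for ids in ids_per_sample:
        acc = [0.0] * dim
        for raw_id in ids:
            row = table[raw_id % num_buckets]  # map the raw ID to a table row
            acc = [a + r for a, r in zip(acc, row)]
        pooled.append(acc)  # a sample with no IDs pools to all zeros
    return pooled


# Hypothetical 3-row table with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
batch = [[0, 4], [5], []]  # variable number of IDs per sample
result = embedding_column(batch, table)
```

A production model repeats this pattern for thousands of features, each with its own table and its own chain of small operators, which is why the embedding layer dominates latency.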
Challenges
Existing hand‑written operator libraries cannot cover the combinatorial explosion of possible embedding operators, and they require source‑code access, which conflicts with privacy constraints that mandate optimization from the IR (e.g., TensorFlow GraphDef). Moreover, current ML compilers such as XLA focus on DNN kernels and fail to efficiently handle the massive number of embedding operators, leading to excessive kernel launch overhead and poor GPU utilization.
Dynamic shapes further complicate matters: recommendation models often have tensors whose shapes are unknown at compile time, causing redundant shape‑related computations and preventing many optimization passes.
Finally, a large portion of embedding‑layer computation is redundant (e.g., unnecessary boundary checks), consuming up to 80% of GPU time in some cases.
Breakthrough: The RECom Compiler
RECom is the first end‑to‑end compiler that targets recommendation models. It introduces a subgraph‑parallelism‑driven operator fusion method that merges thousands of embedding operators into a single GPU kernel, eliminating kernel‑launch overhead and exploiting intra‑ and inter‑column parallelism.
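The effect of fusing independent columns on launch count can be sketched in plain Python. This is only a CPU‑side simulation under stated assumptions: `Device`, `lookup_sum`, `unfused`, and `fused` are hypothetical names, and a real fused kernel would assign each column's subgraph to its own group of GPU thread blocks rather than loop over columns.

```python
class Device:
    """Stand-in for a GPU that only counts kernel launches."""

    def __init__(self):
        self.launches = 0

    def launch(self, kernel, *args):
        self.launches += 1
        return kernel(*args)


def lookup_sum(ids, table):
    """One embedding column for one sample: gather rows and sum them."""
    out = [0.0] * len(table[0])
    for i in ids:
        out = [o + t for o, t in zip(out, table[i])]
    return out


def unfused(device, columns):
    # One launch per column: launch count grows with the number of columns.
    return [device.launch(lookup_sum, ids, table) for ids, table in columns]


def fused_kernel(columns):
    # All columns handled inside one kernel; on a GPU the independent
    # column subgraphs would run concurrently instead of in this loop.
    return [lookup_sum(ids, table) for ids, table in columns]


def fused(device, columns):
    return device.launch(fused_kernel, columns)


# Three hypothetical columns, each with its own IDs and 1-dimensional table.
columns = [
    ([0, 1], [[1.0], [2.0]]),
    ([1], [[5.0], [7.0]]),
    ([0, 0], [[3.0], [4.0]]),
]
dev_a, dev_b = Device(), Device()
a = unfused(dev_a, columns)
b = fused(dev_b, columns)
```

Both paths compute the same result, but the unfused path issues one launch per column while the fused path issues one launch in total. With thousands of columns, that difference is the kernel‑launch overhead the fusion is meant to remove.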
To address dynamic shapes, RECom builds symbolic shape expressions for embedding columns, similar to BladeDISC, and reconstructs all shape‑calculation subgraphs into a unified ShapeConstruct operator that depends only on symbolic inputs. This decouples shape computation from tensor computation, allowing redundant shape ops (e.g., unnecessary SparseReshape) to be eliminated.
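As a rough illustration of the idea, the sketch below represents each dimension as an integer, a symbol name, or a small expression, and computes all concrete shapes in one step from the symbolic inputs. The expression encoding and the functions `evaluate`, `shape_construct`, and `reshape_is_redundant` are assumptions made for this example; they are not RECom's actual data structures or its ShapeConstruct operator.

```python
def evaluate(expr, bindings):
    """Evaluate a symbolic dimension: int, symbol name, or (op, lhs, rhs)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return bindings[expr]
    op, lhs, rhs = expr
    left, right = evaluate(lhs, bindings), evaluate(rhs, bindings)
    if op == "mul":
        return left * right
    if op == "add":
        return left + right
    raise ValueError(f"unsupported op: {op}")


def shape_construct(symbolic_shapes, bindings):
    """Compute every concrete shape at once, using only the symbolic inputs."""
    return {
        name: tuple(evaluate(dim, bindings) for dim in dims)
        for name, dims in symbolic_shapes.items()
    }


def reshape_is_redundant(in_shape, out_shape):
    """A reshape is a no-op when both shapes are symbolically identical."""
    return in_shape == out_shape


# Hypothetical tensors in one embedding column; "batch" and "max_ids"
# are only known at run time.
shapes = {
    "ids": ("batch", "max_ids"),
    "ids_after_reshape": ("batch", "max_ids"),
    "flat_ids": (("mul", "batch", "max_ids"),),
    "pooled": ("batch", 8),
}
redundant = reshape_is_redundant(shapes["ids"], shapes["ids_after_reshape"])
concrete = shape_construct(shapes, {"batch": 4, "max_ids": 3})
```

Because the comparison happens on symbols, the redundant reshape can be detected at compile time even though `batch` and `max_ids` are unknown until the request arrives.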
RECom also includes an embedding‑column subgraph optimizer that removes common redundant computations observed in production models.
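One kind of redundancy mentioned earlier, an unnecessary boundary check, can be removed with a simple value‑range argument. The sketch below is a toy pass over a hypothetical list of ops; the op names (`mod`, `clip`, `gather`) and the pass itself are assumptions for illustration and do not describe RECom's actual optimizer.

```python
def remove_redundant_bounds_checks(ops):
    """Drop a ('clip', lo, hi) op when the known ID range already fits [lo, hi]."""
    known = None  # (lo, hi) range of the IDs, or None when nothing is known
    kept = []
    for op in ops:
        if op[0] == "mod":
            # Floor modulo by n > 0 always lands in [0, n - 1].
            known = (0, op[1] - 1)
        elif op[0] == "clip":
            lo, hi = op[1], op[2]
            if known is not None and lo <= known[0] and known[1] <= hi:
                continue  # the check can never change a value: skip it
            known = (lo, hi)  # after clipping, values lie in [lo, hi]
        kept.append(op)
    return kept


# IDs are bucketized first, so the following bounds check is provably dead.
column = [("mod", 1000), ("clip", 0, 999), ("gather", "table")]
optimized = remove_redundant_bounds_checks(column)

# Without range information the check must stay.
unknown_ids = [("clip", 0, 999), ("gather", "table")]
unchanged = remove_redundant_bounds_checks(unknown_ids)
```

The check is only dropped when the preceding ops prove it can never fire, so the rewrite preserves the column's output.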
Evaluation
On four internal Alibaba recommendation models and two generative models, RECom achieves up to 6.61× lower end‑to‑end latency and 1.91× higher throughput compared with a TensorFlow baseline.
Related Work
In parallel, the Alibaba‑PAI team and the University of Sydney presented MonoNN at OSDI 2024, a monolithic optimizer for static neural networks on GPUs, achieving up to 7.3× speedup over TVM and 5.9× over TensorRT.
Both RECom and MonoNN are open‑source; RECom’s paper and code are available at https://dl.acm.org/doi/10.1145/3623278.3624761 and MonoNN at https://github.com/AlibabaResearch/mononn.
