How RECom Accelerates Recommendation Model Inference on GPUs

The RECom compiler combines subgraph‑parallel operator fusion with symbolic shape handling to speed up GPU inference of deep recommendation models that contain very large numbers of embedding columns. It achieves up to 6.61× lower latency and 1.91× higher throughput than a TensorFlow baseline, and it eliminates redundant computation along the way.

Alibaba Cloud Big Data AI Platform

Background

Deep learning‑based recommendation models are increasingly critical in large‑scale services. Each model consists of an embedding layer followed by a deep neural network (DNN). The embedding layer often holds thousands of embedding columns, each of which maps a feature such as a user ID to a low‑dimensional vector. In production, embedding columns dominate GPU inference latency, accounting for over 99% of end‑to‑end latency on Alibaba's workloads.
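
To make the term concrete, the sketch below models a single embedding column in plain Python: it looks up one vector per feature ID and combines them into one output. The sum pooling, the function name embedding_column, and the tiny user_table are assumptions made for illustration; the source only says that columns map features to low‑dimensional vectors.

```python
from typing import Dict, List


def embedding_column(ids: List[int], table: Dict[int, List[float]], dim: int) -> List[float]:
    """Look up each feature ID and sum-pool the vectors (pooling choice is an assumption)."""
    pooled = [0.0] * dim
    for feature_id in ids:
        vector = table[feature_id]
        for i in range(dim):
            pooled[i] += vector[i]
    return pooled


# Hypothetical 2-dimensional embedding table for a "user_id" feature.
user_table = {7: [0.5, 1.0], 9: [1.5, -1.0]}
print(embedding_column([7, 9], user_table, dim=2))
```

A production model repeats this pattern across thousands of columns, each with its own table and preprocessing operators, which is why the embedding layer ends up as a very large number of small operators.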

Challenges

Hand‑written operator libraries cannot cover the combinatorial explosion of possible embedding operators. They also require access to model source code, which conflicts with privacy constraints that mandate optimization from the IR (e.g., a TensorFlow GraphDef). Existing ML compilers such as XLA focus on DNN kernels and do not handle the very large number of embedding operators efficiently, which leads to excessive kernel‑launch overhead and poor GPU utilization.

Dynamic shapes further complicate matters: recommendation models often have tensors whose shapes are unknown at compile time, causing redundant shape‑related computations and preventing many optimization passes.

Finally, a large portion of embedding‑layer computation is redundant (e.g., unnecessary boundary checks), consuming up to 80% of GPU time in some cases.
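
As one illustration of what a redundant boundary check can look like, the sketch below assumes a column whose IDs are produced by a modulo into a fixed number of buckets, so a later range check against the same bound can never fail. The op names ("hash", "mod", "check_range", "lookup") and the pass itself are hypothetical and are not taken from RECom or TensorFlow.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, int]  # (hypothetical op name, integer attribute)


def drop_redundant_bounds_checks(ops: List[Op]) -> List[Op]:
    """Drop ("check_range", n) when an earlier ("mod", m) with 0 < m <= n
    already guarantees 0 <= id < n (assumes a non-negative modulo result)."""
    result: List[Op] = []
    known_upper: Optional[int] = None  # exclusive upper bound proven so far
    for name, attr in ops:
        if name == "mod" and attr > 0:
            known_upper = attr
        elif name == "check_range":
            if known_upper is not None and known_upper <= attr:
                continue  # the check can never fail, so skip it
            known_upper = attr if known_upper is None else min(known_upper, attr)
        else:
            known_upper = None  # unknown op: forget what was proven
        result.append((name, attr))
    return result


column_ops = [("hash", 0), ("mod", 1000), ("check_range", 1000), ("lookup", 0)]
print(drop_redundant_bounds_checks(column_ops))
```

The check is kept whenever the proven bound is looser than the checked one, for example a modulo by 2000 followed by a range check against 1000.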

Breakthrough: The RECom Compiler

RECom is the first end‑to‑end compiler that targets recommendation models. It introduces a subgraph‑parallelism‑driven operator fusion method that merges thousands of embedding operators into a single GPU kernel, eliminating kernel‑launch overhead and exploiting intra‑ and inter‑column parallelism.
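
The idea of replacing one launch per column with a single launch can be simulated on the CPU. The sketch below is a simplified model, not RECom's code generator: each column is represented as a list of rows plus a per‑row function standing in for its operator subgraph, and the names run_unfused and run_fused are hypothetical. The fused version gives every work item a global index and uses prefix sums to find which column and row it belongs to, which is one common way to pack independent subgraphs into a single kernel.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Callable, List, Sequence, Tuple

# (rows, per-row function standing in for the column's operator subgraph)
Column = Tuple[Sequence[int], Callable[[int], int]]


def run_unfused(columns: List[Column]):
    """Baseline: one simulated kernel launch per column."""
    outputs, launches = [], 0
    for rows, fn in columns:
        launches += 1
        outputs.append([fn(r) for r in rows])
    return outputs, launches


def run_fused(columns: List[Column]):
    """One simulated launch covers every column."""
    sizes = [len(rows) for rows, _ in columns]
    starts = [0] + list(accumulate(sizes))  # starts[c] = first global index of column c
    outputs = [[None] * n for n in sizes]
    launches = 1
    # On a GPU these iterations would be parallel threads or blocks.
    for gid in range(starts[-1]):
        col = bisect_right(starts, gid) - 1  # column that owns this work item
        local = gid - starts[col]            # row index inside that column
        rows, fn = columns[col]
        outputs[col][local] = fn(rows[local])
    return outputs, launches


columns = [([1, 2, 3], lambda x: x * 2), ([10, 20], lambda x: x + 1)]
print(run_unfused(columns))
print(run_fused(columns))
```

Both versions produce the same per‑column outputs; only the number of launches differs. Work items from different columns run side by side (inter‑column parallelism), and rows within one column also map to separate work items (intra‑column parallelism).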

To address dynamic shapes, RECom builds symbolic shape expressions for embedding columns, similar to BladeDISC, and reconstructs all shape‑calculation subgraphs into a unified ShapeConstruct operator that depends only on symbolic inputs. This decouples shape computation from tensor computation, allowing redundant shape ops (e.g., unnecessary SparseReshape) to be eliminated.
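
A small example shows why symbolic shapes help even when concrete sizes are unknown at compile time. In the sketch below, which is an illustration under stated assumptions rather than RECom's ShapeConstruct implementation, each dimension is a product of an integer coefficient and named symbols. The class SymDim and the helpers sym, num_elements, and is_noop_reshape_chain are hypothetical names. If a chain of reshapes ends at a shape that is symbolically equal to the input shape, the whole chain can be removed without knowing the batch size.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SymDim:
    """A dimension of the form coeff * s1 * s2 * ..., with symbols kept sorted."""
    coeff: int = 1
    symbols: Tuple[str, ...] = ()

    def __mul__(self, other: "SymDim") -> "SymDim":
        return SymDim(self.coeff * other.coeff,
                      tuple(sorted(self.symbols + other.symbols)))


def sym(name: str) -> SymDim:
    return SymDim(1, (name,))


def num_elements(shape: Sequence[SymDim]) -> SymDim:
    total = SymDim()
    for dim in shape:
        total = total * dim
    return total


def is_noop_reshape_chain(input_shape, targets) -> bool:
    """True if every reshape preserves the element count and the last target
    equals the input shape, so the chain has no effect."""
    count = num_elements(input_shape)
    if not targets or any(num_elements(t) != count for t in targets):
        return False
    return tuple(targets[-1]) == tuple(input_shape)


B, N = sym("batch"), sym("nnz")  # sizes unknown until run time
x_shape = (B, N)
chain = [(B * N,), (B, N)]       # flatten, then restore the original shape
print(is_noop_reshape_chain(x_shape, chain))
```

Because the symbols are kept in sorted order, B * N and N * B compare equal, so the check does not depend on the order in which the graph multiplied the dimensions.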

RECom also includes an embedding‑column subgraph optimizer that removes common redundant computations observed in production models.

Evaluation

On four internal Alibaba recommendation models and two generative models, RECom achieves up to 6.61× lower end‑to‑end latency and 1.91× higher throughput compared with a TensorFlow baseline.

Related Work

In parallel, the Alibaba‑PAI team and the University of Sydney presented MonoNN at OSDI 2024, a monolithic optimizer for static neural networks on GPUs, achieving up to 7.3× speedup over TVM and 5.9× over TensorRT.

Both RECom and MonoNN are open‑source. The RECom paper is available at https://dl.acm.org/doi/10.1145/3623278.3624761, and the MonoNN code is available at https://github.com/AlibabaResearch/mononn.
