How EasyRec Boosts Recommendation Training and Inference Performance
This article explains the training and inference architecture of the EasyRec recommendation system. It covers optimization techniques such as embedding parallelism, CPU/GPU placement, XLA and TRT fusion, online learning pipelines, and network compression, along with real‑world deployment results showing improved throughput and latency.
EasyRec Training and Inference Architecture
EasyRec provides a modular, configurable recommendation pipeline that consists of a data layer, an embedding layer, a dense layer and an output layer. The framework runs on various platforms including MaxCompute, EMR and the Alibaba Cloud DLC container platform, and supports Keras components, distributed training, online learning (ODL), automatic hyper‑parameter tuning via NNI, multi‑optimizer settings, feature hot‑start, large‑scale negative sampling, and checkpoint‑based recovery.
Training Optimizations
To cope with growing feature counts and large embedding tables, EasyRec applies several optimizations. SequenceFeature deduplication reduces the number of items processed in a batch to 5‑10% of the original, increasing throughput by about 20%. EmbeddingParallel shards sparse parameters across workers and synchronizes dense parameters with All‑Reduce, which removes the parameter‑server (PS) communication bottleneck. Lock‑free hash tables on the CPU side (DeepRec) and cache‑based embeddings on the GPU side (HugeCTR) further accelerate look‑ups. Intel AMX BF16 acceleration boosts matrix‑multiply performance by roughly 16×. XLA and TensorRT (TRT) provide operator fusion and quantization for additional speed‑ups, and batch‑mode processing groups small batches into larger ones for GPU execution.
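The idea behind sequence-feature deduplication can be illustrated with a minimal sketch. It is not EasyRec's implementation: the function names `dedup_sequence_ids` and `lookup_with_dedup` and the `embed_fn` callback are hypothetical, and plain Python lists stand in for tensors. Each distinct item ID in the batch is looked up once, and the per-position results are rebuilt from an index.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def dedup_sequence_ids(
    batch: Sequence[Sequence[int]],
) -> Tuple[List[int], List[List[int]]]:
    """Collect the distinct item IDs in a batch of behavior sequences.

    Returns (unique_ids, inverse), where inverse[i][j] is the position in
    unique_ids of the j-th item of the i-th sequence.
    """
    position: Dict[int, int] = {}
    unique_ids: List[int] = []
    inverse: List[List[int]] = []
    for seq in batch:
        row: List[int] = []
        for item_id in seq:
            if item_id not in position:
                position[item_id] = len(unique_ids)
                unique_ids.append(item_id)
            row.append(position[item_id])
        inverse.append(row)
    return unique_ids, inverse


def lookup_with_dedup(
    batch: Sequence[Sequence[int]],
    embed_fn: Callable[[int], List[float]],
) -> List[List[List[float]]]:
    """Look up each distinct ID once, then gather results per position."""
    unique_ids, inverse = dedup_sequence_ids(batch)
    unique_vecs = [embed_fn(item_id) for item_id in unique_ids]
    return [[unique_vecs[k] for k in row] for row in inverse]


if __name__ == "__main__":
    sequences = [[1, 2, 3], [2, 3, 4], [1, 1, 4]]
    ids, inv = dedup_sequence_ids(sequences)
    total = sum(len(s) for s in sequences)
    print(f"{len(ids)} unique IDs out of {total} positions")
```

The saving depends on how much the sequences in a batch overlap. The 5‑10% figure quoted above is the article's reported result, not something this toy example reproduces.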
Inference Optimizations
The PAI‑REC engine, written in Go, orchestrates the end‑to‑end recommendation flow (recall, ranking, re‑ranking, shuffling) and offers a user‑friendly UI for A/B testing and feature‑consistency diagnostics. EasyRecProcessor handles online inference for recall and ranking models, employing item feature caches, a feature generator, and a TensorFlow model. CPU/GPU placement strategies keep lightweight embedding ops on CPU and dense computations on GPU, while Min‑Cut graph partitioning minimizes H2D memory copies. XLA fusion, TRT acceleration, and the upcoming Blade‑DISC compiler address dynamic‑shape challenges. Network‑direct connections and request‑compression (Snappy/ZSTD) reduce latency and bandwidth consumption.
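The item feature cache mentioned above can be pictured as a bounded in-memory map placed in front of a slower feature store. The following is a minimal sketch under assumptions: the class name `ItemFeatureCache`, the `fetch_fn` callback, and the LRU eviction policy are all illustrative choices, since the article does not describe how EasyRecProcessor implements or evicts its cache.

```python
from collections import OrderedDict
from typing import Callable, Dict, Hashable


class ItemFeatureCache:
    """Hypothetical LRU cache for item features (not the EasyRecProcessor API)."""

    def __init__(
        self,
        fetch_fn: Callable[[Hashable], Dict[str, object]],
        capacity: int = 10000,
    ) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._fetch = fetch_fn          # assumed slow path, e.g. a remote store
        self._capacity = capacity
        self._data: "OrderedDict[Hashable, Dict[str, object]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, item_id: Hashable) -> Dict[str, object]:
        if item_id in self._data:
            # Mark as most recently used and serve from memory.
            self._data.move_to_end(item_id)
            self.hits += 1
            return self._data[item_id]
        self.misses += 1
        features = self._fetch(item_id)
        self._data[item_id] = features
        if len(self._data) > self._capacity:
            # Evict the least recently used entry.
            self._data.popitem(last=False)
        return features


if __name__ == "__main__":
    cache = ItemFeatureCache(lambda i: {"item_id": i, "category": "demo"}, capacity=2)
    for item in [1, 2, 1, 3, 2]:
        cache.get(item)
    print(f"hits={cache.hits} misses={cache.misses}")
```

In a ranking request, user-side features typically arrive with the request while item-side features are shared across many requests, which is why caching them locally can avoid repeated remote reads.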
Online Learning
EasyRec supports real‑time model updates through an online‑learning pipeline. Logs and features are streamed from PAI‑REC to SLS, then to DataHub via Flink, where samples are aggregated, labeled, and stored for incremental training. Incrementally trained parameters are periodically uploaded to OSS and synchronized to the EasyRecProcessor. Feature consistency is ensured by logging the features actually used at serving time (tracking‑point logging), and LZ4 compression speeds up feature joins. The system also filters abnormal or duplicate data and corrects delayed positive samples, delivering significant gains in cold‑start and hot‑item scenarios.
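One way to picture the filtering and delayed-positive correction is the sketch below. It is an assumption-laden illustration, not the pipeline's actual Flink job: the `Event` record, the `build_samples` function, the 600-second attribution window, and the strategy of emitting a negative at exposure time followed by a later positive correction are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    request_id: str
    item_id: str
    kind: str   # "expose" or "click"
    ts: float   # event time in seconds


def build_samples(
    events: Iterable[Event], window: float = 600.0
) -> List[Tuple[str, str, int]]:
    """Turn an event stream into (request_id, item_id, label) samples.

    Each first exposure yields a negative sample immediately. A click that
    arrives within `window` seconds of its exposure yields an additional
    positive sample that corrects the earlier negative.
    """
    expose_ts: Dict[Tuple[str, str], float] = {}
    clicked: Set[Tuple[str, str]] = set()
    samples: List[Tuple[str, str, int]] = []
    for ev in sorted(events, key=lambda e: e.ts):
        key = (ev.request_id, ev.item_id)
        if ev.kind == "expose":
            if key in expose_ts:
                continue                      # duplicate exposure
            expose_ts[key] = ev.ts
            samples.append((ev.request_id, ev.item_id, 0))
        elif ev.kind == "click":
            if key not in expose_ts:
                continue                      # abnormal: click without exposure
            if key in clicked:
                continue                      # duplicate click
            if ev.ts - expose_ts[key] > window:
                continue                      # outside the attribution window
            clicked.add(key)
            samples.append((ev.request_id, ev.item_id, 1))
    return samples


if __name__ == "__main__":
    stream = [
        Event("r1", "a", "expose", 0.0),
        Event("r1", "a", "expose", 1.0),
        Event("r1", "a", "click", 30.0),
    ]
    print(build_samples(stream))
```

How the trainer consumes the correction (for example, by reweighting or by replacing the earlier negative) is a separate design decision that the article does not specify.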
Overall, the combination of architecture‑level configurability, low‑level operator optimizations, and end‑to‑end online learning enables EasyRec to serve hundreds of Alibaba Cloud customers across e‑commerce, live streaming, content sharing, advertising and community domains with markedly lower cost and higher recommendation quality.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
