
EasyRec Deep Dive: Training & Inference Architecture, Optimizations, and Online Learning

This article explains EasyRec's end‑to‑end recommendation system, covering its training‑inference architecture, a series of CPU/GPU and distributed optimizations, and a real‑time online‑learning pipeline that together improve throughput, latency, and model freshness.

DataFunSummit

01 EasyRec Training & Inference Architecture

Recommendation models now handle thousands of features, large embeddings, and deep dense layers, creating severe compute and latency challenges. EasyRec addresses these by providing a configurable, component‑based architecture consisting of a data layer, embedding layer, dense layer, and output layer. The framework runs on MaxCompute, EMR, and the DLC container platform, supports Keras components, distributed training, online‑learning (ODL), and NNI‑driven hyper‑parameter search. It also offers multi‑optimizer settings, feature hot‑start, large‑scale negative sampling, and a work‑queue mechanism for fault‑tolerant training resume.
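As a rough, hedged illustration of that four-layer flow (data → embedding → dense → output), here is a self-contained pure-Python sketch; every name, size, and weight below is invented for illustration and is not part of the EasyRec API:

```python
import math
import random

random.seed(0)
VOCAB, EMB_DIM = 1000, 4

# Data layer: raw categorical feature ids for one example.
ids = [17, 42, 993]

# Embedding layer: look up each id's row and mean-pool them.
table = [[random.gauss(0, 0.1) for _ in range(EMB_DIM)] for _ in range(VOCAB)]
pooled = [sum(table[i][d] for i in ids) / len(ids) for d in range(EMB_DIM)]

# Dense layer: one ReLU hidden layer over the pooled embedding.
w1 = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(EMB_DIM)]
hidden = [max(0.0, sum(pooled[d] * w1[d][h] for d in range(EMB_DIM)))
          for h in range(8)]

# Output layer: sigmoid score (e.g. a CTR estimate).
w2 = [random.gauss(0, 0.1) for _ in range(8)]
score = 1.0 / (1.0 + math.exp(-sum(hidden[h] * w2[h] for h in range(8))))
print(0.0 < score < 1.0)  # True: sigmoid output is always in (0, 1)
```

In the real framework each of these layers is a configurable component and the embedding table is sharded and trained distributedly; the sketch only shows how the stages hand data to one another.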

EasyRec architecture diagram

02 EasyRec Training Optimization

Key optimizations include SequenceFeature deduplication (shrinking the effective batch of sequence data to 5–10% of its original size), embedding sharding (EmbeddingParallel), which moves sparse parameters onto the workers while keeping dense parameters synchronized via All-Reduce, and lock-free hash tables on CPU that outperform Google's dense hash table. On GPU, HugeCTR's SOK embedding caches hot embeddings to cut host-to-device (H2D) transfer. Intel AMX BF16 acceleration boosts matrix-multiply performance by roughly 16×, and further gains are expected from a C++ implementation of the deduplication logic.
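The sequence-deduplication idea can be sketched as follows, assuming that within a batch the same user behavior sequence repeats across many candidate items (the function and field names here are illustrative, not EasyRec's):

```python
def dedup_sequences(batch):
    """Return (unique_sequences, index_per_row).

    Expensive work (e.g. sequence embedding) then runs only on the
    unique sequences; per-row results are recovered via the index.
    """
    uniq, index, seen = [], [], {}
    for seq in batch:
        key = tuple(seq)
        if key not in seen:
            seen[key] = len(uniq)
            uniq.append(seq)
        index.append(seen[key])
    return uniq, index

# One user's behavior sequence repeated across several candidate items:
batch = [[1, 2, 3], [1, 2, 3], [4, 5], [1, 2, 3], [4, 5]]
uniq, index = dedup_sequences(batch)
print(len(uniq), index)  # 2 [0, 0, 1, 0, 1]
```

Embedding 2 sequences instead of 5 is exactly the 5–10% effect described above when sequences are highly repetitive; the gather step `[embedded[i] for i in index]` restores the original batch order.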

Training optimization diagram

Feature‑layer improvements use AVX‑accelerated StringSplit, replace MurmurHash with CrcHash/XorHash, and introduce a compact storage format for SequenceFeature that cuts memory usage by over 80%. TensorFlow ops are wrapped to enable parallel execution and to overlap feature generation with embedding lookup, reducing runtime by ~20%.
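To make the hashing step concrete, here is a hedged sketch of bucketing feature strings with a CRC hash, using the stdlib `zlib.crc32` as a stand-in (EasyRec's actual CrcHash/XorHash implementations and bucket count may differ):

```python
import zlib

NUM_BUCKETS = 1 << 20  # illustrative embedding-table size

def feature_to_bucket(value: str) -> int:
    """Map a raw feature string to an embedding bucket via CRC32."""
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS

b1 = feature_to_bucket("user_id=12345")
b2 = feature_to_bucket("user_id=12345")
b3 = feature_to_bucket("user_id=12346")
print(b1 == b2, 0 <= b3 < NUM_BUCKETS)  # True True
```

The appeal of CRC-style hashes over MurmurHash in this setting is speed: CRC32 has dedicated CPU instructions on modern x86/ARM, so hashing billions of feature strings per day becomes cheaper while remaining deterministic across training and serving.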

03 EasyRec Inference Optimization

The PAI‑REC inference engine, written in Go, connects recall, ranking, re‑ranking, and shuffling stages and provides a user‑friendly UI for A/B testing and feature‑consistency diagnostics. EasyRecProcessor handles online inference through an item feature cache, a feature generator, and a TensorFlow model, applying CPU/GPU optimizations such as feature‑cache reduction, incremental model updates, and GPU‑side dense computation.

Inference optimization diagram

Inference speed is further improved by fusing small embedding‑related ops, using AVX for parallel execution, and applying BF16 quantization with negligible AUC impact. XLA and TensorRT (TRT) are combined to fuse dense‑layer ops, handle dynamic shapes, and enable BF16 quantization, yielding 10–30% QPS gains. Placement optimization moves embedding lookup to CPU and dense computation to GPU while minimizing H2D copies via a min‑cut graph partition.
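The BF16 quantization mentioned above can be illustrated in isolation: bfloat16 keeps float32's sign bit and 8‑bit exponent but only 7 mantissa bits, so zeroing the low 16 bits of a float32's representation emulates the (truncating) conversion. A minimal sketch, not EasyRec code:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate float32 -> bfloat16 -> float32 by truncating the
    low 16 bits of the IEEE-754 single-precision representation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 1.2345678
q = to_bf16(x)
print(abs(x - q) / x < 0.01)  # True: error below one bf16 ulp (~0.8%)
```

Because the exponent range is unchanged, BF16 avoids the overflow issues of FP16 while halving memory and bandwidth, which is why the AUC impact reported above is negligible for dense layers.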

04 Real‑Time Online Learning

Online learning is realized by streaming logs from PAI‑REC to SLS, then to Datahub, where Flink aggregates samples and labels. The pipeline supports configurable stream training, incremental parameter export to OSS, and automatic processor updates. Feature consistency is enhanced with LZ4‑compressed joins, and delayed or duplicate samples are filtered. The system has demonstrated significant effectiveness in new‑item and content‑driven scenarios.
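As one hedged sketch of the late/duplicate sample filtering step (the request-id field, window length, and id cache below are assumptions for illustration; the real pipeline runs these checks in Flink):

```python
import time

MAX_DELAY_S = 15 * 60  # illustrative: drop samples older than 15 minutes
seen_ids = set()       # in practice a TTL cache, not an unbounded set

def accept(sample, now=None):
    """Admit a streamed sample unless it is too delayed or a duplicate."""
    now = time.time() if now is None else now
    if now - sample["ts"] > MAX_DELAY_S:
        return False   # delayed beyond the training window
    if sample["request_id"] in seen_ids:
        return False   # duplicate delivery from the log stream
    seen_ids.add(sample["request_id"])
    return True

now = 1_700_000_000
print(accept({"request_id": "a", "ts": now - 60}, now))    # True
print(accept({"request_id": "a", "ts": now - 60}, now))    # False (dup)
print(accept({"request_id": "b", "ts": now - 3600}, now))  # False (late)
```

Filtering at this stage keeps stale labels from dragging the incrementally trained model toward outdated behavior, which matters most in the fast-moving new-item scenarios the article highlights.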

Online learning pipeline diagram

Additional engineering improvements include direct pod‑IP connections that eliminate an extra Nginx hop (reducing RTT by ~5 ms) and request compression (snappy, zstd) that cuts high‑throughput traffic by up to 5×.
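To illustrate the payload savings, here is a quick sketch using the stdlib `zlib` as a stand-in for the snappy/zstd codecs named above (the payload shape is invented, and actual ratios depend entirely on the data):

```python
import json
import zlib

# A made-up feature request with a large, repetitive candidate list --
# the kind of payload where compression pays off most.
payload = json.dumps({
    "user_id": 12345,
    "item_ids": list(range(500)),
    "features": {"age": 30, "city": "hz"},
}).encode("utf-8")

compressed = zlib.compress(payload, level=6)
print(len(payload), len(compressed))  # compressed is markedly smaller
assert zlib.decompress(compressed) == payload  # round-trips losslessly
```

In production one would pick snappy when CPU cost must stay minimal and zstd when ratio matters more; both beat zlib on speed, which is why the article names them rather than the stdlib codec used here.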

Tags: inference optimization, recommendation systems, online learning, training optimization, distributed computing, AI infrastructure
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
