How EasyRec Boosts Recommendation Performance: Training, Inference, and Online Learning Optimizations

This article explains the EasyRec recommendation system's training and inference architecture, details a series of optimizations for both CPU and GPU pipelines, and describes the online learning workflow that enables real‑time model updates across large‑scale e‑commerce scenarios.

DataFunSummit

EasyRec Training and Inference Architecture

Recent recommendation models have grown dramatically in feature count, embedding size, sequence length, and dense‑layer complexity, leading to severe compute and latency challenges. EasyRec addresses these issues with a configurable, component‑based architecture that runs on MaxCompute, EMR, and the DLC container platform.

Core Components

Data layer, Embedding layer, Dense layer, Output layer

Support for Keras‑style custom components and automatic hyper‑parameter tuning via NNI

Large‑scale distributed training, online‑learning (ODL), and work‑queue based checkpoint recovery

Extended TensorFlow distributed evaluator for massive data evaluation

PAI‑REC Inference Engine

The PAI‑REC engine, written in Go, links the recommendation pipeline stages (recall, ranking, re‑ranking, shuffling) and provides a modular, high‑performance interface for A/B testing and feature‑consistency diagnostics.

EasyRecProcessor

EasyRecProcessor handles online inference for recall and ranking models. It consists of an item feature cache, a feature generator, and a TensorFlow model, with extensive CPU/GPU optimizations such as feature caching, incremental model updates, and efficient embedding lookup.
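
As a rough illustration of the item feature cache idea, the sketch below keeps recently used item features in memory and only falls back to a slower store on a miss. The class name `ItemFeatureCache`, the `fetch` callback, and the LRU eviction policy are all assumptions made for this example; the source does not describe how the real cache is implemented.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class ItemFeatureCache:
    """Hypothetical LRU cache for item features (eviction policy is an assumption)."""

    def __init__(self, fetch: Callable[[Hashable], Any], capacity: int) -> None:
        self._fetch = fetch          # slow path, e.g. a remote feature store
        self._capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.misses = 0

    def get(self, item_id: Hashable) -> Any:
        if item_id in self._data:
            self._data.move_to_end(item_id)   # mark as most recently used
            return self._data[item_id]
        self.misses += 1
        features = self._fetch(item_id)
        self._data[item_id] = features
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)    # evict the least recently used item
        return features
```

Because the same items tend to be scored for many users, a cache of this kind can avoid repeating the feature fetch on every request.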

Training Optimizations

SequenceFeature deduplication reduces duplicate item sequences in a batch, cutting the effective batch size to 5‑10% of the original and improving throughput by ~20%.
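
The deduplication step can be pictured as "encode each distinct sequence once, then scatter the result back to the rows that share it". The following is a minimal sketch under that assumption; the function names and the `encoder` callback are hypothetical and do not come from EasyRec.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def dedup_sequences(
    batch: Sequence[Sequence[int]],
) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Return the unique sequences plus, for every row, the index of its unique sequence."""
    index_of: Dict[Tuple[int, ...], int] = {}
    unique: List[Tuple[int, ...]] = []
    inverse: List[int] = []
    for seq in batch:
        key = tuple(seq)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
        inverse.append(index_of[key])
    return unique, inverse


def encode_batch(
    batch: Sequence[Sequence[int]], encoder: Callable[[Tuple[int, ...]], float]
) -> List[float]:
    unique, inverse = dedup_sequences(batch)
    encoded = [encoder(seq) for seq in unique]  # expensive step runs once per unique sequence
    return [encoded[i] for i in inverse]        # scatter results back to the original rows


# Three of the four rows share one sequence, so the encoder runs only twice.
print(encode_batch([[1, 2, 3], [1, 2, 3], [4, 5], [1, 2, 3]], sum))  # prints [6, 6, 9, 6]
```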

EmbeddingParallel replaces the PS‑Worker pattern with a hybrid approach: dense parameters are synchronized via All‑Reduce, while sparse embeddings are sharded across workers, eliminating PS communication bottlenecks.
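
The two halves of this hybrid scheme can be simulated in a few lines. The sketch below assumes modulo sharding for sparse ids and a plain element-wise mean for the dense All-Reduce; both choices, and the function names, are illustrative assumptions rather than EasyRec's actual implementation.

```python
from typing import Dict, List


def shard_ids(ids: List[int], num_workers: int) -> Dict[int, List[int]]:
    """Route each sparse id to the worker assumed to own its embedding row (id % num_workers)."""
    shards: Dict[int, List[int]] = {w: [] for w in range(num_workers)}
    for i in ids:
        shards[i % num_workers].append(i)
    return shards


def all_reduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    """Average dense gradients element-wise, the value every worker holds after All-Reduce."""
    n = len(per_worker_grads)
    return [sum(column) / n for column in zip(*per_worker_grads)]


print(shard_ids([10, 11, 12, 13, 14], num_workers=2))  # prints {0: [10, 12, 14], 1: [11, 13]}
print(all_reduce_mean([[1.0, 2.0], [3.0, 4.0]]))       # prints [2.0, 3.0]
```

Each worker only stores and updates its own shard of the embedding table, so no central parameter server sits on the critical path.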

On CPUs, a lock‑free hash table from DeepRec outperforms Google’s dense hash table; on GPUs, the embedding from HugeCTR’s SOK (Sparse Operation Kit) caches hot embeddings to reduce host‑to‑device (H2D) transfers.

Matrix multiplication (MatMul) dominates CPU compute (>60%). By leveraging Intel AMX BF16 acceleration, MatMul speed increases ~16×, dramatically shortening training time.

Inference Optimizations

Embedding operators on CPU suffer from many small kernels (unique, SparseSegmentMean). Fusion and AVX parallelism collapse hundreds of tiny ops into a single kernel, cutting operator count by >50% and halving response time.
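
To show what fusion means here, the sketch below performs the embedding lookup and the per-segment mean pooling in one pass instead of as separate gather and segment-mean steps. It is plain Python for readability only; the function name and the dictionary-based table are assumptions, and the real kernels are vectorized native code.

```python
from typing import Dict, List, Sequence


def fused_lookup_mean(
    table: Dict[int, List[float]],
    ids: Sequence[int],
    segment_ids: Sequence[int],
    num_segments: int,
    dim: int,
) -> List[List[float]]:
    """Look up embeddings and mean-pool them per segment in a single loop."""
    sums = [[0.0] * dim for _ in range(num_segments)]
    counts = [0] * num_segments
    for item_id, seg in zip(ids, segment_ids):
        row = table[item_id]
        for k in range(dim):
            sums[seg][k] += row[k]   # accumulate directly, no intermediate gathered tensor
        counts[seg] += 1
    # Empty segments keep an all-zero vector.
    return [
        [x / counts[s] for x in sums[s]] if counts[s] else sums[s]
        for s in range(num_segments)
    ]
```

Avoiding the intermediate tensors is what removes most of the per-operator scheduling and memory traffic.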

BF16 quantization reduces memory usage with negligible AUC impact; AVX‑accelerated BF16‑to‑float conversion further improves QPS and latency.
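
BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of a float32, which is why it halves storage and why converting back to float is only a 16-bit shift. A minimal sketch of the conversion follows; the function names are hypothetical, round-to-nearest-even is an assumed rounding mode, and NaN inputs are not handled.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Round a float32 value to bfloat16 and return its 16-bit pattern (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


print(hex(float_to_bf16_bits(1.0)))                 # prints 0x3f80
print(bf16_bits_to_float(float_to_bf16_bits(0.1)))  # prints 0.10009765625
```

The second line illustrates the precision loss: the stored value differs from 0.1 by roughly 0.1%, small enough that the article reports a negligible AUC impact.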

Feature‑layer improvements replace MurmurHash with AVX‑based CrcHash/XorHash, lowering request latency by >5%.

SequenceFeature storage is compacted, shrinking memory footprint by >80% while preserving throughput.

GPU Placement and XLA/TF‑TRT Fusion

Embedding lookup is kept on CPU, dense computation on GPU. A Min‑Cut graph algorithm determines the optimal split point, reducing H2D memcpy overhead.
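
One way to frame the placement problem is as a minimum s-t cut: ops that must stay on CPU are tied to a source node, ops that must run on GPU are tied to a sink node, and each edge is weighted by the bytes that would cross the bus if it were cut. The sketch below uses Edmonds-Karp on a toy graph. The op names, byte counts, the `PINNED` constant, and the choice of Edmonds-Karp are all assumptions for illustration; the source only states that a Min-Cut algorithm picks the split point.

```python
from collections import deque
from typing import Dict, Set, Tuple

PINNED = 10**12  # stand-in for "infinite" capacity: this edge must never be cut


def cpu_side_of_min_cut(
    transfer_bytes: Dict[Tuple[str, str], int], source: str, sink: str
) -> Set[str]:
    """Return the nodes on the source (CPU) side of a minimum s-t cut."""
    residual: Dict[str, Dict[str, int]] = {}
    for (u, v), cap in transfer_bytes.items():
        residual.setdefault(u, {}).setdefault(v, 0)
        residual.setdefault(v, {}).setdefault(u, 0)  # reverse edge for flow cancellation
        residual[u][v] += cap

    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {source: source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual.get(u, {}).items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return set(parent)  # nodes still reachable from the source stay on CPU
        path = []
        v = sink
        while v != source:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck


graph = {
    ("cpu", "lookup"): PINNED,       # embedding lookup is pinned to CPU
    ("cpu", "raw_dense"): PINNED,
    ("lookup", "pool"): 8192,        # hypothetical bytes per request on each edge
    ("pool", "cast"): 256,
    ("cast", "dense_1"): 128,
    ("raw_dense", "dense_1"): 64,
    ("dense_1", "gpu"): PINNED,      # dense layers are pinned to GPU
}
print(sorted(cpu_side_of_min_cut(graph, "cpu", "gpu")))
# prints ['cast', 'cpu', 'lookup', 'pool', 'raw_dense']
```

In this toy graph the cheapest split copies 192 bytes (128 + 64) to the GPU, so pooling and casting stay on the CPU next to the lookup.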

For compute‑intensive kernels (MatMul, elementwise ops), XLA fuses operators to reduce kernel launch cost; dynamic‑shape padding mitigates recompilation overhead.
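
XLA compiles a program for specific input shapes, so every new batch size can trigger another compilation. Padding requests up to a small set of bucket sizes bounds the number of distinct shapes. The sketch below assumes power-of-two buckets and a caller that discards the padded rows afterwards; the function names and the bucketing rule are illustrative, not taken from EasyRec.

```python
from typing import List, Sequence


def bucket_size(n: int, min_size: int = 8) -> int:
    """Round a batch size up to the next bucket (min_size, 2*min_size, 4*min_size, ...)."""
    size = min_size
    while size < n:
        size *= 2
    return size


def pad_batch(rows: Sequence[List[float]], pad_row: List[float]) -> List[List[float]]:
    """Append copies of pad_row until the batch length matches its bucket size."""
    target = bucket_size(len(rows))
    return list(rows) + [list(pad_row) for _ in range(target - len(rows))]


print([bucket_size(n) for n in (5, 9, 16, 100)])  # prints [8, 16, 16, 128]
```

With this rule, batch sizes from 1 to 128 map to only five shapes (8, 16, 32, 64, 128), at the cost of some wasted compute on padded rows.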

TensorRT (TRT) additionally fuses BatchNorm, Add, and ReLU and supports BF16 quantization, delivering further QPS gains even though it is closed source.

Online Learning (Real‑Time Updates)

Online learning updates embeddings and dense parameters in response to new items or traffic spikes. Logs flow back through PAI‑REC to SLS, then to Datahub where Flink aggregates samples and labels for streaming training.

Trained increments are stored in OSS and synced to the EasyRec Processor. Feature consistency is ensured via feature‑embedding points and LZ4‑compressed joins.

Data cleaning removes delayed or duplicate callbacks; delayed positive samples are corrected before training, improving performance in new‑item and content scenarios.
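
The cleaning rules can be sketched as a small join over callback events. The event schema, the `max_delay` window, and the function name below are hypothetical; the source does not specify how delayed positives are corrected, so this is only one plausible reading.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, int]  # (request_id, event_type, timestamp_seconds), assumed schema


def build_labels(events: Iterable[Event], max_delay: int) -> Dict[str, int]:
    """Join exposures with clicks, dropping duplicate and overly delayed callbacks."""
    exposure_time: Dict[str, int] = {}
    labels: Dict[str, int] = {}
    for request_id, event_type, ts in sorted(events, key=lambda e: e[2]):
        if event_type == "expose":
            if request_id not in exposure_time:   # ignore duplicate exposure callbacks
                exposure_time[request_id] = ts
                labels[request_id] = 0            # negative until a click arrives
        elif event_type == "click" and request_id in exposure_time:
            if ts - exposure_time[request_id] <= max_delay:
                labels[request_id] = 1            # late positive overrides the negative label
    return labels


events = [
    ("r1", "expose", 0), ("r1", "expose", 1), ("r1", "click", 30),
    ("r2", "expose", 5), ("r2", "click", 5000),
    ("r3", "expose", 10), ("r4", "click", 12),
]
print(build_labels(events, max_delay=600))  # prints {'r1': 1, 'r2': 0, 'r3': 0}
```

Here `r2`'s click arrives outside the window and is dropped, and `r4`'s click has no matching exposure, so neither produces a positive sample.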

Network and Request Compression

Direct pod‑IP connections replace Nginx load‑balancing, shaving ~5 ms off latency.

High‑throughput links use Snappy/ZSTD compression, cutting 10 Gbps traffic by a factor of five while keeping latency low.
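
Request payloads in recommendation traffic are highly repetitive (the same field names and categories recur for every candidate item), which is why general-purpose compression pays off. The sketch below uses `zlib` from the Python standard library purely as a stand-in, since Snappy and ZSTD bindings are not part of the standard library; the JSON payload layout and function names are also assumptions.

```python
import json
import zlib


def compress_request(payload: dict, level: int = 1) -> bytes:
    """Serialize a request to compact JSON and compress it (low level favors speed)."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level)


def decompress_request(blob: bytes) -> dict:
    """Invert compress_request."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


payload = {
    "user_id": "u1",
    "items": [{"item_id": i, "category": "shoes", "price": 19.9} for i in range(200)],
}
raw_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
blob = compress_request(payload)
print(len(blob) < raw_size)                 # prints True
print(decompress_request(blob) == payload)  # prints True
```

The actual ratio depends on the payload and codec; the factor of five quoted above is the article's figure, not something this sketch reproduces.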

These optimizations collectively improve recommendation accuracy, reduce cost, and enable scalable real‑time learning across diverse Alibaba Cloud services.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
