How DeepRec Supercharges Weibo’s Hot Recommendation Engine

This article explains the architecture of Weibo's hot recommendation system, the role of the weidl online learning platform, and how DeepRec’s performance optimizations (oneDNN operator acceleration, cost-aware scheduling, and adaptive memory allocation) improve training speed, inference latency, and overall service throughput.

Alibaba Cloud Big Data AI Platform

Dec 22, 2022

1. Project Background

Popular Weibo is a core feature of Sina Weibo, covering hot streams, channel streams, short videos and other recommendation scenarios. The weidl machine‑learning framework provides model training and inference services for online learning, and its inference performance is a key optimization target.

DeepRec is Alibaba’s training and prediction engine for search, recommendation and advertising. It offers deep performance optimizations for sparse models and a rich set of embedding features.

2. Popular Weibo Recommendation System and weidl Online Learning Platform

2.1 Overall Architecture

The system consists of front‑end business interfaces and the weidl online learning platform. The platform integrates sample stitching, model training, parameter servers and model serving, enabling a complete recommendation pipeline for rapid deployment of new services.

2.2 weidl Online Learning Platform

The platform handles recall, coarse‑ranking and fine‑ranking. Recall gathers candidates from millions of items via multiple strategies; coarse‑ranking scores them using offline‑generated item features and real‑time user features; fine‑ranking applies complex multi‑task deep models and rule‑based re‑ranking to select the final posts.
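The three-stage funnel described above can be sketched as follows. This is a minimal illustration, not weidl's code: the function names, the toy `hot` and `fresh` features, and the scoring lambdas are all hypothetical, and the rule-based re-ranking step is omitted.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]
Strategy = Callable[[List[Item]], List[Item]]
Scorer = Callable[[Item], float]


def recall(pool: List[Item], strategies: List[Strategy]) -> List[Item]:
    """Union the candidates from several recall strategies, de-duplicated by id."""
    seen, merged = set(), []
    for strategy in strategies:
        for item in strategy(pool):
            if item["id"] not in seen:
                seen.add(item["id"])
                merged.append(item)
    return merged


def rank(candidates: List[Item], score: Scorer, keep: int) -> List[Item]:
    """Score every candidate and keep the top `keep` items."""
    return sorted(candidates, key=score, reverse=True)[:keep]


def recommend(pool, strategies, coarse_score, fine_score, coarse_keep, final_keep):
    candidates = recall(pool, strategies)                  # wide and cheap
    shortlist = rank(candidates, coarse_score, coarse_keep)  # light model
    return rank(shortlist, fine_score, final_keep)         # heavy model, few items


# Toy data: 50 items with two made-up features.
pool = [{"id": float(i), "hot": float(i % 7), "fresh": float(i % 5)} for i in range(50)]
by_hot = lambda items: sorted(items, key=lambda x: x["hot"], reverse=True)[:20]
by_fresh = lambda items: sorted(items, key=lambda x: x["fresh"], reverse=True)[:20]

final = recommend(
    pool,
    [by_hot, by_fresh],
    coarse_score=lambda x: x["hot"] + x["fresh"],
    fine_score=lambda x: 2 * x["hot"] + x["fresh"],
    coarse_keep=10,
    final_keep=3,
)
```

Each stage sees fewer items than the one before it, which is why the most expensive model can be reserved for the final shortlist.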

The inference layer uses a bridge pattern supporting backends such as DeepRec, TensorFlow, Torch and TensorRT on both CPU and GPU.
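A bridge pattern keeps the serving abstraction separate from the engine that executes the model, so a backend can be swapped without touching callers. The sketch below shows the shape of that idea only. The class names are hypothetical, and the two toy backends stand in for real engines such as DeepRec or TensorRT.

```python
from abc import ABC, abstractmethod
from typing import List


class InferenceBackend(ABC):
    """Implementor side of the bridge: one subclass per inference engine."""

    @abstractmethod
    def predict(self, features: List[float]) -> float:
        ...


class MeanBackend(InferenceBackend):
    """Toy stand-in for one engine: scores a sample by its mean feature value."""

    def predict(self, features: List[float]) -> float:
        return sum(features) / len(features)


class MaxBackend(InferenceBackend):
    """Toy stand-in for another engine: scores a sample by its largest feature."""

    def predict(self, features: List[float]) -> float:
        return max(features)


class ModelService:
    """Abstraction side of the bridge: callers depend on this class only."""

    def __init__(self, backend: InferenceBackend) -> None:
        self._backend = backend

    def set_backend(self, backend: InferenceBackend) -> None:
        # Swapping engines does not change the service's public interface.
        self._backend = backend

    def score(self, batch: List[List[float]]) -> List[float]:
        return [self._backend.predict(features) for features in batch]


service = ModelService(MeanBackend())
mean_scores = service.score([[1.0, 3.0]])
service.set_backend(MaxBackend())
max_scores = service.score([[1.0, 3.0]])
```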

2.2.1 Real‑time Requirements

Real-time performance here means how quickly a user action is reflected in a model update. Samples are stitched within a 30-minute window, Kafka streams are processed with millisecond-level latency, parameters are synchronized via RPC, and the online inference service pulls the latest parameters directly, achieving sub-minute end-to-end latency.
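One way to picture windowed sample stitching is a joiner that holds impressions in memory and pairs them with user actions that arrive in time. The sketch below is an assumption-heavy illustration: the `SampleStitcher` class, its method names, and the choice to emit unmatched impressions as negative samples once the window closes are hypothetical and not taken from weidl.

```python
from typing import Dict, List, Optional, Tuple

WINDOW_SECONDS = 30 * 60  # the 30-minute stitching window from the text


class SampleStitcher:
    def __init__(self, window_seconds: float = WINDOW_SECONDS) -> None:
        self.window = window_seconds
        # impression id -> (impression timestamp, features)
        self.pending: Dict[str, Tuple[float, dict]] = {}

    def on_impression(self, imp_id: str, ts: float, features: dict) -> None:
        self.pending[imp_id] = (ts, features)

    def on_action(self, imp_id: str, ts: float) -> Optional[dict]:
        """Return a positive sample if the action arrives inside the window."""
        entry = self.pending.get(imp_id)
        if entry is None:
            return None
        imp_ts, features = entry
        if ts - imp_ts > self.window:
            return None  # too late; the impression will be flushed as a negative
        del self.pending[imp_id]
        return {"features": features, "label": 1}

    def flush_expired(self, now: float) -> List[dict]:
        """Emit impressions whose window has closed as negatives (assumed policy)."""
        expired = [k for k, (ts, _) in self.pending.items() if now - ts > self.window]
        return [{"features": self.pending.pop(k)[1], "label": 0} for k in expired]


stitcher = SampleStitcher()
stitcher.on_impression("a", 0.0, {"x": 1})
stitcher.on_impression("b", 0.0, {"x": 2})
positive = stitcher.on_action("a", 60.0)    # inside the window
late = stitcher.on_action("b", 2000.0)      # outside the window
negatives = stitcher.flush_expired(2000.0)
```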

2.2.2 Large‑Scale Deep Models

The recommendation models have evolved from FM to dual-tower recall, cold-DNN coarse-ranking and multi-task fine-ranking. Each step increased the number of features, the number of prediction targets and the model complexity, which brings substantial gains in model effectiveness but also higher computational demands.

3. DeepRec Optimizations

3.1 oneDNN Operator Acceleration

DeepRec integrates the latest oneDNN library, unifying its thread pool with DeepRec’s Eigen pool to reduce thread‑switch overhead. Common sparse operators such as Select, DynamicStitch, Transpose, Tile, SparseSegmentMean, Unique and SparseSegmentSum receive substantial speedups.

3.1.1 Select Operator

Vectorized mask‑load instructions replace conditional branches, reducing branching overhead and improving data read/write efficiency.
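The branch-free idea can be shown in scalar form. The sketch below is not DeepRec's kernel and uses no SIMD. It replaces the per-element `if` with a bitmask blend on integers, which is a related technique chosen here for illustration. Python will not show the speedup; the point is only that both versions compute the same result without a data-dependent branch in the second one.

```python
from typing import List


def select_branching(cond: List[bool], a: List[int], b: List[int]) -> List[int]:
    """Reference version: one conditional branch per element."""
    out = []
    for c, x, y in zip(cond, a, b):
        if c:
            out.append(x)
        else:
            out.append(y)
    return out


def select_masked(cond: List[bool], a: List[int], b: List[int]) -> List[int]:
    """Branch-free version: blend the two inputs through a bitmask."""
    out = []
    for c, x, y in zip(cond, a, b):
        mask = -int(bool(c))  # -1 (all bits set) when true, 0 when false
        out.append((x & mask) | (y & ~mask))
    return out


cond = [True, False, True, False]
a = [10, 20, -30, 40]
b = [1, 2, 3, -4]
```

In a vectorized kernel the same mask would cover a whole register of lanes at once, so one instruction handles many elements.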

3.1.2 Transpose Operator

Vectorized unpack and shuffle instructions perform block‑wise matrix transposition, yielding significant latency reductions.
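Block-wise transposition walks the matrix in small tiles so that reads and writes stay within a compact region. The pure-Python sketch below shows only the tiling loop structure. The unpack and shuffle instructions that transpose each tile in registers are not modeled, and the function name and default block size are assumptions.

```python
from typing import List


def transpose_blocked(matrix: List[List[int]], block: int = 4) -> List[List[int]]:
    """Transpose a rectangular matrix one block x block tile at a time."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[0] * rows for _ in range(cols)]
    for r0 in range(0, rows, block):          # tile origin, row direction
        for c0 in range(0, cols, block):      # tile origin, column direction
            # A SIMD kernel would transpose this whole tile in registers;
            # here the elements are simply copied one by one.
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    out[c][r] = matrix[r][c]
    return out


m = [[r * 5 + c for c in range(5)] for r in range(3)]  # 3 x 5 example
t = transpose_blocked(m, block=2)
```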

3.2 Cost‑Aware Scheduling Engine

DeepRec redesigns the executor and thread pool to balance load and to minimize work stealing and lock contention. It also uses a CostModel that collects runtime metrics and prioritizes nodes on the critical path, reducing overall graph execution time.
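Critical-path prioritization can be illustrated with a small list scheduler. In the sketch below, each node's priority is its own cost plus the longest remaining path to a sink, and ready nodes are dispatched in priority order. This is a generic technique rather than DeepRec's executor: the function names, the example graph and the fixed costs are hypothetical, whereas a real cost model would measure costs at runtime.

```python
import heapq
from typing import Dict, List


def critical_path_priority(succ: Dict[str, List[str]], cost: Dict[str, float]) -> Dict[str, float]:
    """Priority of a node = its cost + the longest path below it."""
    memo: Dict[str, float] = {}

    def level(node: str) -> float:
        if node not in memo:
            below = max((level(s) for s in succ.get(node, [])), default=0.0)
            memo[node] = cost[node] + below
        return memo[node]

    return {node: level(node) for node in cost}


def schedule_order(succ: Dict[str, List[str]], cost: Dict[str, float]) -> List[str]:
    """Topological order that always picks the ready node with highest priority."""
    prio = critical_path_priority(succ, cost)
    indegree = {node: 0 for node in cost}
    for targets in succ.values():
        for t in targets:
            indegree[t] += 1
    ready = [(-prio[n], n) for n in cost if indegree[n] == 0]
    heapq.heapify(ready)  # max-heap via negated priority
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for t in succ.get(node, []):
            indegree[t] -= 1
            if indegree[t] == 0:
                heapq.heappush(ready, (-prio[t], t))
    return order


# Hypothetical graph: a long embed -> mlp -> out chain and a cheap side branch.
succ = {"embed": ["mlp"], "mlp": ["out"], "stats": ["out"]}
cost = {"embed": 5.0, "mlp": 8.0, "stats": 1.0, "out": 1.0}
order = schedule_order(succ, cost)
```

The cheap `stats` node is ready from the start, but it is deferred until the expensive chain that determines total latency has been dispatched.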

3.3 Adaptive Memory/VRAM Allocator

The allocator learns allocation patterns from previous mini‑batches, builds a tensor‑cache planner, and pre‑allocates memory blocks to improve reuse and reduce fragmentation, especially for sparse models with many small and large tensors.
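A much simplified version of this idea is a pool that observes the request sizes of a warm-up mini-batch, pre-allocates matching blocks, and then serves later mini-batches from the cache. The `AdaptivePool` class below is a hypothetical sketch: it matches sizes exactly and assumes all tensors of a mini-batch are live at once, whereas a real allocator would use size bins and tensor lifetimes.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List


class AdaptivePool:
    def __init__(self) -> None:
        self._current: Counter = Counter()  # requests seen in this mini-batch
        self._peak: Counter = Counter()     # largest per-size count seen so far
        self._free: DefaultDict[int, List[bytearray]] = defaultdict(list)
        self.hits = 0
        self.misses = 0

    def allocate(self, size: int) -> bytearray:
        self._current[size] += 1
        if self._free[size]:
            self.hits += 1
            return self._free[size].pop()   # reuse a cached block
        self.misses += 1
        return bytearray(size)              # fall back to a fresh allocation

    def release(self, block: bytearray) -> None:
        self._free[len(block)].append(block)

    def end_minibatch(self) -> None:
        """Fold this mini-batch's request counts into the learned peak."""
        for size, count in self._current.items():
            self._peak[size] = max(self._peak[size], count)
        self._current.clear()

    def plan(self) -> None:
        """Pre-allocate enough blocks to cover the learned peak per size."""
        for size, count in self._peak.items():
            while len(self._free[size]) < count:
                self._free[size].append(bytearray(size))


pool_alloc = AdaptivePool()
for size in (64, 64, 1024):      # warm-up mini-batch: every request misses
    pool_alloc.allocate(size)
pool_alloc.end_minibatch()
pool_alloc.plan()

blocks = [pool_alloc.allocate(size) for size in (64, 64, 1024)]  # served from cache
for block in blocks:
    pool_alloc.release(block)
pool_alloc.end_minibatch()
```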

4. Business Impact

4.1 Service Performance Gains

After the weidl backend was replaced with DeepRec in September, inference latency for the multi-task fine-ranking model dropped by 50%, overall fine-ranking latency fell by 20%, and single-node throughput rose by 30%. The dual-tower and cold-DNN models also saw latency reductions of 10% to 20% and throughput improvements of 20% to 30%.

4.2 Additional Benefits

Reduced latency and higher throughput lower compute resource consumption, cutting costs. The faster inference allows more candidates to be processed at each stage, expanding the candidate pool and improving overall recommendation metrics.
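The link between throughput and cost is simple arithmetic. If each node serves more requests per second, fewer nodes are needed for the same traffic. The sketch below works this out for the reported 30% single-node throughput gain. The 100-node baseline is a made-up figure, and the calculation assumes constant traffic and that throughput is the only limit on fleet size.

```python
import math


def nodes_needed(baseline_nodes: int, throughput_gain: float) -> int:
    """Nodes required for the same traffic after a per-node throughput gain."""
    return math.ceil(baseline_nodes / (1.0 + throughput_gain))


baseline = 100                           # hypothetical fleet size
after = nodes_needed(baseline, 0.30)     # 100 / 1.3 = 76.9..., rounded up
saving = 1 - after / baseline            # fraction of nodes freed
```

Under these assumptions a 30% throughput gain frees a little under a quarter of the fleet, since 1 - 1/1.3 is about 0.23.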
