
Wuliang: Tencent's Deep Learning Framework for Real‑Time Large‑Scale Recommendation

This presentation by Tencent expert researcher Yuan Yi details the Wuliang deep learning system for recommendation, covering its background, technical challenges such as massive data volumes and real‑time requirements, parameter‑server‑based solutions for training and inference, model compression techniques, and continuous online deployment strategies.

DataFunSummit

This talk, presented by Yuan Yi (Ph.D., Tencent expert researcher), introduces the Wuliang deep learning system built for large‑scale recommendation scenarios, where real‑time data, massive user/item volumes, and constantly changing modeling goals pose significant challenges.

1. AI Full‑Process and Wuliang's Role

The diagram shows two deep‑learning framework types: content‑understanding and recommendation. Recommendation differs in three main aspects:

Real‑time data: User behavior is generated continuously, requiring the system to ingest, sample, train, and serve recommendations with low latency.

Variable modeling objectives: As user contexts change, the model must capture evolving interests, unlike static content‑understanding tasks.

Scale: Billions of exposures, clicks, and daily active users demand high‑throughput training and serving.

2. Technical Bottlenecks in Recommendation Business

Huge data volume (hundreds of billions of exposures, billions of clicks, millions of DAU).

Strong, fast‑changing user interactions demand higher timeliness.

Need to train TB‑scale models and serve them online.

Online service must achieve high throughput with sub‑10 ms latency.

3. Wuliang's Solutions

3.1 Computation Framework

Wuliang adopts a parameter‑server architecture that fetches only the needed sparse keys per batch, reducing memory traffic. It bridges TensorFlow’s graph construction and automatic differentiation with the parameter server, enabling high‑performance distributed training and GPU deployment.
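The per‑batch sparse fetch can be illustrated with a minimal sketch. This is a hypothetical toy, not Wuliang's API: a single in‑memory shard where `pull` returns only the embedding rows whose feature keys appear in the current batch, and `push` applies sparse gradient updates to just those keys.

```python
# Hypothetical sketch of per-batch sparse parameter fetching in a
# parameter-server design: only the embedding rows whose feature keys
# appear in the current batch are pulled, never the full table.
import numpy as np

EMBED_DIM = 8

class ParameterServer:
    """Toy in-memory PS shard; real systems shard keys across many nodes."""
    def __init__(self):
        self.table = {}  # sparse key -> embedding vector

    def pull(self, keys):
        # Lazily initialize unseen keys, then return only the requested rows.
        return {k: self.table.setdefault(k, np.zeros(EMBED_DIM)) for k in keys}

    def push(self, grads, lr=0.1):
        # Apply sparse gradient updates to just the touched keys.
        for k, g in grads.items():
            self.table[k] -= lr * g

ps = ParameterServer()
batch_keys = {101, 205, 999}            # feature keys active in this batch
params = ps.pull(batch_keys)            # fetch only what the batch needs
grads = {k: np.ones(EMBED_DIM) for k in batch_keys}
ps.push(grads)                          # sparse update back to the server
```

Because only active keys cross the network, traffic scales with batch sparsity rather than with the TB‑scale table size.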

3.2 Inference Service

Serving must handle TB‑scale models where a single request may activate up to 150 k keys.

Multi‑replica, in‑memory serving with optional strong consistency via versioned access.

Distributed serving cluster uses multi‑replica parallel reads, lock‑free mechanisms, and L1 cache optimizations to meet high QPS and low latency.
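The versioned‑access idea above can be sketched as follows. This is an illustrative toy, not Wuliang's implementation: each published snapshot is immutable and tagged with a version, so a request pins one version once and all of its lookups stay consistent even if a newer snapshot lands mid‑request.

```python
# Hypothetical sketch of versioned parameter access for multi-replica
# serving: a request pins one immutable snapshot, so every read within
# that request sees a single consistent model version.
class VersionedStore:
    def __init__(self):
        self.snapshots = {}   # version -> {key: value}
        self.latest = None

    def publish(self, version, params):
        self.snapshots[version] = dict(params)  # immutable copy
        self.latest = version

    def reader(self, version=None):
        # Pin the version once; all reads via this closure use that snapshot.
        v = version if version is not None else self.latest
        snap = self.snapshots[v]
        return lambda key: snap.get(key)

store = VersionedStore()
store.publish(1, {"user:42": 0.3})
read = store.reader()                 # pinned to version 1
store.publish(2, {"user:42": 0.9})    # new version arrives mid-request
value = read("user:42")               # still reads the pinned snapshot
```

Since snapshots are never mutated in place, readers need no locks, matching the lock‑free read path described above.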

3.3 Continuous Online Deployment

Wuliang slices the model into DNN and sparse embedding parts, deploying them to different nodes. It supports three update granularities:

Full model: TB‑scale updates deployed during low‑traffic windows or on 24‑hour cycles.

Incremental model: GB‑scale updates every hour or every ten minutes.

Real‑time model: KB‑scale updates via message queues for millisecond‑level freshness.
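The real‑time path can be sketched in a few lines. This is a simplified stand‑in (a local queue rather than a real message broker): the trainer publishes tiny key/value deltas, and the serving side patches its in‑memory sparse table without reloading the full model.

```python
# Hypothetical sketch of KB-scale real-time updates: a consumer drains a
# message queue of (key, value) deltas and patches the in-memory sparse
# table; the dense DNN part only changes in full/incremental pushes.
from queue import Queue

sparse_table = {"item:7": 0.5}
update_queue = Queue()

# Trainer side: publish a tiny delta as soon as a key's value changes.
update_queue.put(("item:7", 0.62))
update_queue.put(("item:9", 0.10))

# Serving side: apply deltas without touching the TB-scale full model.
while not update_queue.empty():
    key, value = update_queue.get()
    sparse_table[key] = value
```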

3.4 Model Compression

Traditional CV/NLP compression (distillation, quantization) is unsuitable for recommendation due to real‑time constraints. Wuliang focuses on sparse‑layer optimizations:

Fewer values: Variable‑length embeddings allocate fewer values to low‑frequency features.

Fewer keys: Group Lasso (ℓ2,1 regularization) removes redundant sparse keys, achieving 64–94% sparsity.

Mixed precision and quantization: Float16/int8/int4, 1‑bit/2‑bit quantization, and hardware‑friendly kernels accelerate inference.
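The "fewer values" idea can be sketched with a frequency‑banded policy. The thresholds and dimensions below are illustrative assumptions, not Wuliang's actual settings: rare keys get short vectors, hot keys get full‑width ones, shrinking the sparse table's footprint.

```python
# Hypothetical sketch of frequency-aware variable-length embeddings:
# low-frequency keys are given short vectors, high-frequency keys get
# the full embedding width, cutting sparse-table memory.
import numpy as np

def embed_dim_for(count, full_dim=32):
    # Simple banded policy; real systems would tune these thresholds.
    if count < 10:
        return 4
    if count < 1000:
        return 16
    return full_dim

freq = {"rare_key": 3, "warm_key": 250, "hot_key": 50_000}
table = {k: np.zeros(embed_dim_for(c)) for k, c in freq.items()}
sizes = {k: v.shape[0] for k, v in table.items()}
# sizes == {"rare_key": 4, "warm_key": 16, "hot_key": 32}
```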

4. Evolution of Wuliang

The system progresses from pure scale‑out (adding nodes) to scale‑up (optimizing single‑node performance) and finally to multi‑level caching (GPU memory, host memory, SSD) to handle TB‑scale models while keeping latency low.
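The multi‑level caching idea can be sketched with a tiered lookup. This toy is not Wuliang's code: a small LRU dict stands in for GPU memory, a plain dict for host RAM, and another for SSD‑backed storage; hits are promoted upward and evictions spill downward.

```python
# Hypothetical sketch of multi-level parameter caching: probe the fastest
# tier first (standing in for GPU memory), fall back to host memory, then
# SSD-backed storage, promoting hot keys upward and spilling cold ones down.
from collections import OrderedDict

class TieredCache:
    def __init__(self, fast_capacity=2):
        self.cap = fast_capacity
        self.fast = OrderedDict()   # stands in for GPU memory (small, LRU)
        self.host = {}              # stands in for host RAM
        self.ssd = {}               # stands in for SSD-backed storage

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)          # refresh LRU position
            return self.fast[key]
        value = self.host.pop(key, None)        # promote out of host tier
        if value is None:
            value = self.ssd.get(key)           # slowest tier, copy stays
        if value is not None:
            self.fast[key] = value              # promote to the fast tier
            if len(self.fast) > self.cap:       # evict least-recently-used
                old_key, old_val = self.fast.popitem(last=False)
                self.host[old_key] = old_val    # spill down to host memory
        return value

cache = TieredCache()
cache.ssd.update({"a": 1, "b": 2, "c": 3})
cache.get("a"); cache.get("b"); cache.get("c")
# "a" was evicted from the fast tier into host memory by the later lookups
```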

Q&A Highlights

Serving replicas store parameters in memory; consistency can be enforced via versioned reads.

Three‑value quantization keeps the full‑precision model for training, quantizes for inference, and updates the full model with gradients.
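The three‑value scheme above can be sketched as ternary quantization. The threshold and scaling rule here are illustrative assumptions: weights map to {−1, 0, +1} with one per‑tensor scale for inference, while the full‑precision copy continues to receive gradient updates.

```python
# Hypothetical sketch of three-value (ternary) quantization: inference
# uses a {-1, 0, +1} copy with a per-tensor scale, while training keeps
# updating the full-precision weights.
import numpy as np

def ternarize(w, threshold=0.05):
    # Map each weight to -1/0/+1; keep one scale for dequantization.
    q = np.sign(w) * (np.abs(w) > threshold)
    scale = np.abs(w[q != 0]).mean() if np.any(q != 0) else 0.0
    return q.astype(np.int8), scale

full_precision = np.array([0.3, -0.02, -0.4, 0.01])
q, scale = ternarize(full_precision)        # inference uses q * scale
full_precision -= 0.1 * np.ones(4)          # training updates full weights
```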

SSD‑based parameter caching can meet training‑speed requirements when the next batch's parameters are prefetched.

Zero‑copy communication avoids extra data copies by pre‑arranging key ordering across servers.

If training data is polluted, the system rolls back to the last clean model checkpoint.

Feature engineering can be performed either upstream by the business or within Wuliang’s inference plugin.

Thank you for attending.

Tags: deep learning, model compression, recommendation systems, parameter server, real-time inference, large-scale training
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
