Booster GPU Training Architecture for Recommendation Systems at Meituan: Design, Optimization, and Deployment

Meituan's Booster architecture co-designs algorithm and system to run TensorFlow recommendation training on multi-GPU A100 servers. It optimizes data fetching, the embedding pipeline, custom kernels and communication fusion, and delivers 2–4× the cost-performance of CPU training, more than 3× the throughput of native TensorFlow on GPU, and deployment through a single-line API.

Meituan Technology Team

Meituan's Machine Learning Platform built a custom TensorFlow‑based GPU training architecture called Booster to accelerate recommendation system training. The architecture was designed with algorithm‑system co‑design, considering data, compute and communication characteristics of modern NVIDIA A100 servers.

Background : Traditional CPU-based Parameter Server training could not keep up with the growing sample volume (hundreds of billions of samples) and model complexity of Meituan's delivery recommendation models. 8-card A100 GPU servers provide high compute, but efficient training pipelines for recommendation workloads on them were lacking.

GPU Training Challenges : Large sample volume, massive sparse feature embeddings, relatively low per‑step compute, and limited GPU memory make it difficult to fully utilize GPU resources.

System Design and Implementation

Booster adopts an “algorithm + system” co-design. The first version is a single-node multi-GPU (8-card) implementation. Core modules:

Data module – multi‑NIC, NUMA‑aware data fetching, per‑GPU shared memory, zero‑copy feature parsing.

Compute module – per‑GPU TensorFlow processes, embedding pipeline (separating embedding graph and main graph), custom GPU operators, XLA integration.

Communication module – Horovod‑based AllToAll/AllReduce, HashTable and Variable fusion to reduce synchronization overhead.
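The embedding pipeline in the compute module separates the embedding graph from the main graph so that the two can run as pipeline stages. The sketch below shows only that control structure, using a bounded queue between two Python threads. The function names, the `depth` parameter and the toy "lookup" are hypothetical and are not taken from Booster, where the overlap happens on the GPU rather than between Python threads.

```python
import queue
import threading


def embedding_stage(batches, out_queue):
    """Producer: stands in for the embedding graph (hypothetical toy lookup)."""
    for batch in batches:
        activations = [float(feature_id) * 0.5 for feature_id in batch]
        out_queue.put(activations)  # blocks when the queue is full
    out_queue.put(None)  # sentinel: no more batches


def main_stage(in_queue, results):
    """Consumer: stands in for the main graph (here it just sums activations)."""
    while True:
        activations = in_queue.get()
        if activations is None:
            break
        results.append(sum(activations))


def run_pipeline(batches, depth=2):
    """Run both stages concurrently; `depth` bounds how far the producer runs ahead."""
    handoff = queue.Queue(maxsize=depth)
    results = []
    producer = threading.Thread(target=embedding_stage, args=(batches, handoff))
    producer.start()
    main_stage(handoff, results)
    producer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3], [4, 5, 6]]))
```

While the consumer works on one batch, the producer can already prepare the next one, which is the property the pipeline is after.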

Key Optimizations

Data layer : NUMA‑binding, multi‑NIC download, per‑GPU shared memory, SIMD‑accelerated Varint parsing, pinned‑memory H2D pipeline.
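To make the Varint parsing step concrete, here is a minimal scalar sketch. It assumes the Varint in question is the base-128 variable-length integer encoding used by Protocol Buffers, which the summary does not state explicitly. It decodes one byte at a time, which is the work a SIMD implementation would spread across many bytes per instruction; it is not the SIMD parser itself.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)  # last byte: continuation bit clear
            return bytes(out)


def decode_varints(buf: bytes) -> List[int]:
    """Decode a buffer of back-to-back varints, one byte at a time."""
    values = []
    current = 0
    shift = 0
    for byte in buf:
        current |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if byte & 0x80:
            shift += 7  # value continues in the next byte
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("buffer ends in the middle of a varint")
    return values


if __name__ == "__main__":
    packed = b"".join(encode_varint(v) for v in [1, 300, 70000])
    print(packed.hex(), decode_varints(packed))
```

The data-dependent branch on the continuation bit is what makes the scalar loop slow on long feature-ID lists, and is the part a vectorized parser has to work around.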

Compute layer : Embedding pipeline that overlaps the embedding graph (EG) with the main graph (MG); hand-written GPU kernels for Unique, DynamicPartition and Gather; XLA with local caching and const-memcpy elimination.
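The three operators named here typically appear together in a sharded embedding lookup: deduplicate the IDs, route each distinct ID to the shard that owns it, then gather the rows back into input order. The pure-Python sketch below illustrates that data flow on the CPU. The `id % num_shards` routing rule, the table layout and all function names are assumptions made for illustration, not Booster's kernels.

```python
def unique_with_inverse(ids):
    """Like Unique: distinct ids in first-seen order, plus each input's index into them."""
    position = {}
    unique_ids = []
    inverse = []
    for i in ids:
        if i not in position:
            position[i] = len(unique_ids)
            unique_ids.append(i)
        inverse.append(position[i])
    return unique_ids, inverse


def dynamic_partition(ids, num_shards):
    """Like DynamicPartition: route each id to shard `id % num_shards` (assumed rule)."""
    shards = [[] for _ in range(num_shards)]
    for i in ids:
        shards[i % num_shards].append(i)
    return shards


def make_tables(num_ids, num_shards, dim):
    """Hypothetical embedding tables: the row for id i is [float(i)] * dim."""
    tables = [{} for _ in range(num_shards)]
    for i in range(num_ids):
        tables[i % num_shards][i] = [float(i)] * dim
    return tables


def lookup(ids, tables):
    """Deduplicate, fetch each distinct row once from its shard, then gather."""
    unique_ids, inverse = unique_with_inverse(ids)
    fetched = {}
    for shard_idx, shard_ids in enumerate(dynamic_partition(unique_ids, len(tables))):
        for i in shard_ids:
            fetched[i] = tables[shard_idx][i]
    unique_rows = [fetched[i] for i in unique_ids]
    return [unique_rows[j] for j in inverse]  # Gather back to input order


if __name__ == "__main__":
    tables = make_tables(num_ids=6, num_shards=2, dim=2)
    print(lookup([3, 1, 3, 4, 1, 3], tables))
```

In this example six input IDs trigger only three row fetches, which is why deduplication matters when the same sparse feature appears many times in a batch.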

Communication layer : Fusion of HashTable and Variable ops, reducing kernel launches and synchronization calls by >90%.
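The general idea behind fusion can be sketched as follows: instead of issuing one collective call per tensor, pack the tensors into one flat buffer, communicate once, and split the result back. The example simulates an all-reduce (sum) across workers inside one process and counts the calls. `FakeAllReduce` is a hypothetical stand-in, not the Horovod API, and the sketch does not reproduce how Booster fuses HashTable and Variable ops.

```python
class FakeAllReduce:
    """In-process stand-in for a collective: sums across workers and counts calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, per_worker_buffers):
        self.calls += 1
        return [sum(values) for values in zip(*per_worker_buffers)]


def allreduce_unfused(per_worker_tensors, allreduce):
    """One collective call per tensor."""
    num_tensors = len(per_worker_tensors[0])
    return [
        allreduce([worker[t] for worker in per_worker_tensors])
        for t in range(num_tensors)
    ]


def allreduce_fused(per_worker_tensors, allreduce):
    """Pack all tensors into one flat buffer, reduce once, then split back."""
    sizes = [len(tensor) for tensor in per_worker_tensors[0]]
    flat = [[x for tensor in worker for x in tensor] for worker in per_worker_tensors]
    reduced = allreduce(flat)
    out, offset = [], 0
    for size in sizes:
        out.append(reduced[offset:offset + size])
        offset += size
    return out


if __name__ == "__main__":
    workers = [
        [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],
        [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]],
    ]
    unfused, fused = FakeAllReduce(), FakeAllReduce()
    assert allreduce_unfused(workers, unfused) == allreduce_fused(workers, fused)
    print(unfused.calls, fused.calls)
```

The results are identical, but the number of collective calls no longer grows with the number of tensors, which is where the savings in kernel launches and synchronization come from.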

Performance results show speedups of 40% from the data-layer optimizations, 60% from the compute layer and 85% from the communication layer, for an overall 2–4× cost-performance gain over the CPU Parameter Server. On Meituan's delivery recommendation models, Booster reaches more than 3× the throughput of native TensorFlow on GPU and more than 4× that of optimized CPU training.

Business Deployment : Booster supports the full TensorFlow Estimator workflow (Train/Evaluate/Predict) on GPU through a single-line API, tf.enable_gpu_booster(). This allows seamless migration from CPU to GPU, supports 1, 2, 4 or 8 GPU cards, maintains checkpoint compatibility, and provides dedicated GPU resources for evaluation and prediction.

Summary and Outlook : Booster demonstrates that algorithm‑aware system co‑design can unlock the potential of modern GPU hardware for large‑scale recommendation training. Future work includes further data compression, multi‑node scaling, advanced XLA support for dynamic shapes, and quantized communication.

Authors : Jia Heng, Guo Qing, Zheng Shao, Xiao Guang, Peng Peng, Yong Yu, Jun Wen, Zheng Yang, Rui Dong, Xiang Yu, Xiu Feng, Wang Qing, Feng Yu, Shi Feng, Huang Jun, etc., from Meituan’s Basic R&D Platform and Search Recommendation Team.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
