Deep Customization of TensorFlow for Large-Scale Sparse Training at Meituan

Meituan heavily customized TensorFlow 1.x for large-scale sparse training: it replaced Variable-based embeddings with hash tables, improved load balancing, adopted RDMA communication, pipelined embedding lookups, built a high-performance hash table, and merged operators. The result is more than ten-fold better scalability, speedups of up to 51% on individual operators, and the ability to train models with billions of parameters on CPU clusters, with GPU support planned.

Meituan Technology Team

Background

Meituan has heavily customized the open-source TensorFlow framework (based on TensorFlow 1.x), with changes spanning large-scale sparse parameter support, training modes, distributed communication, pipelining, and operator optimization. In recommendation-system scenarios the customized version achieves more than a 10× improvement in scalability and significant gains in performance per unit of compute.

Challenges of Large-Scale Training

Rapid growth in training data (from hundreds of millions to tens of billions), in sparse parameters (from millions to billions), and in model complexity stretched training time from hours to days. Further problems included the massive memory consumed by Variables, limited worker scalability, no support for dynamically adding or removing sparse parameters during online learning, and frequent slow or failed nodes in large clusters.

Optimization Practices

3.1 Large-Scale Sparse Parameter Support

The team replaced Variable-based embeddings with a HashTable implementation that expands automatically, reduces memory waste, and enables online learning, while staying compatible with the existing API.
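To illustrate the idea of a table that grows on demand, here is a minimal sketch in plain Python. The class name `HashEmbedding`, its methods, and the initialization range are hypothetical and chosen only for illustration; the source does not describe the internals or API of Meituan's HashTable.

```python
import random


class HashEmbedding:
    """Toy embedding table keyed by raw feature ID (hypothetical, not Meituan's API)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rows = {}                    # feature ID -> embedding vector
        self._rng = random.Random(seed)

    def lookup(self, ids):
        """Return one vector per ID, creating a row for any ID not seen before."""
        out = []
        for fid in ids:
            row = self._rows.get(fid)
            if row is None:
                # No pre-sized Variable: memory is spent only on IDs that occur.
                row = [self._rng.uniform(-0.05, 0.05) for _ in range(self.dim)]
                self._rows[fid] = row
            out.append(row)
        return out

    def __len__(self):
        return len(self._rows)


table = HashEmbedding(dim=4)
vectors = table.lookup([10**12 + 7, 42, 42])   # arbitrary, very sparse IDs
```

Because rows are created on first lookup, the table holds two rows after this call even though the ID space is enormous, which is the property that makes online learning with newly appearing features practical.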

3.2 Distributed Load-Balancing

The team addressed uneven load across parameter servers (PS), which was caused by simple round-robin slicing and heterogeneous hardware. They also optimized the Adam optimizer by replicating its β parameters on every PS, which eliminates hotspot contention and yields a performance gain of roughly 9%.
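The effect of replicating the β parameters can be sketched with a small simulation. It assumes that the shared copy of Adam's bias-correction terms (β1^t and β2^t) lives on a single PS, here PS 0, and that every PS reads those terms once per step; both the placement and the names `AdamPowers` and `simulate` are assumptions made for illustration, not details from the source.

```python
class AdamPowers:
    """Holds beta1^t and beta2^t, the bias-correction terms Adam reads every step."""

    def __init__(self, beta1=0.9, beta2=0.999):
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_power, self.beta2_power = beta1, beta2

    def advance(self):
        self.beta1_power *= self.beta1
        self.beta2_power *= self.beta2


def simulate(num_ps, steps, replicated):
    """Return (reads served per PS, list of power states) after `steps` updates."""
    owners = [AdamPowers() for _ in range(num_ps if replicated else 1)]
    reads_served = [0] * num_ps
    for _ in range(steps):
        for ps in range(num_ps):
            owner = ps if replicated else 0   # assumption: shared copy sits on PS 0
            reads_served[owner] += 1
        for state in owners:
            state.advance()
    return reads_served, owners


shared_reads, _ = simulate(num_ps=4, steps=100, replicated=False)
local_reads, replicas = simulate(num_ps=4, steps=100, replicated=True)
```

In the shared layout all reads land on one server, while in the replicated layout each server serves only its own. The replicas stay identical because every copy applies the same deterministic update.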

3.3 Communication Optimizations

The team adopted RDMA (RoCE v2) in place of TCP/IP, reducing latency and CPU overhead. On top of it they implemented Memory Registration (MR) optimizations, a static MR allocator, multiple RequestBuffers with completion queue (CQ) load-balancing, and a Send-Driven data-exchange mode that cuts rendezvous overhead.
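The general idea behind a static MR allocator is to register one large memory region once and then carve buffers out of it by offset, rather than paying the registration cost for each buffer. The sketch below models that idea only: a `bytearray` stands in for the registered region, no real RDMA calls are made, and the class `StaticRegion`, its 64-byte alignment, and its bump-pointer strategy are assumptions, since the source does not describe the allocator's design.

```python
class StaticRegion:
    """Models one pre-registered memory region that is sub-allocated by offset."""

    def __init__(self, capacity, alignment=64):
        self.buffer = bytearray(capacity)   # stand-in for a registered region
        self.alignment = alignment
        self.next_free = 0
        self.registrations = 1              # the single up-front registration

    def allocate(self, nbytes):
        """Reserve `nbytes` and return the aligned start offset."""
        start = -(-self.next_free // self.alignment) * self.alignment  # round up
        end = start + nbytes
        if end > len(self.buffer):
            raise MemoryError("registered region exhausted")
        self.next_free = end
        return start

    def view(self, start, nbytes):
        """Zero-copy window onto an allocated range."""
        return memoryview(self.buffer)[start:start + nbytes]


region = StaticRegion(capacity=1024)
first = region.allocate(100)
second = region.allocate(100)
region.view(second, 4)[:] = b"\x01\x02\x03\x04"   # write into the second buffer
```

However many buffers are handed out, `registrations` stays at one, which is the cost the static approach is meant to avoid repeating.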

3.4 Latency Optimizations

The team applied sparse-parameter aggregation and an embedding pipeline, which splits the graph into an Embedding Graph and a Main Graph so that communication overlaps with computation. A Pipeline Dataset abstraction keeps the pipelining transparent to users.
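The overlap between the two graphs can be sketched with a producer thread and a bounded queue: one stage fetches embeddings for upcoming batches while the other computes on the batch already fetched. The function names, the queue depth, and the stand-in arithmetic are all hypothetical; this is not the Pipeline Dataset API, which the source does not show.

```python
import queue
import threading


def embedding_stage(batches, out_q):
    """Stand-in for the Embedding Graph: 'fetch' vectors for each batch of IDs."""
    for ids in batches:
        out_q.put([fid * 10 for fid in ids])   # placeholder for a remote lookup
    out_q.put(None)                            # sentinel: no more batches


def main_stage(in_q):
    """Stand-in for the Main Graph: consume fetched embeddings and compute."""
    results = []
    while True:
        embeddings = in_q.get()
        if embeddings is None:
            break
        results.append(sum(embeddings))        # placeholder for dense compute
    return results


def run_pipeline(batches, depth=2):
    """Run both stages concurrently; `depth` bounds how far lookups run ahead."""
    q = queue.Queue(maxsize=depth)
    producer = threading.Thread(target=embedding_stage, args=(batches, q))
    producer.start()
    results = main_stage(q)
    producer.join()
    return results


outputs = run_pipeline([[1, 2], [3], [4, 5, 6]])
```

The bounded queue is the design choice that matters here: it lets lookups run ahead of computation by a fixed number of batches without letting prefetched data grow without limit.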

3.5 Single-Instance PS Concurrency

The team developed a high-performance TBB-based HashTable and a BucketPool memory-pooling strategy, which reduce allocation overhead and improve end-to-end training speed by roughly 5%.
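Memory pooling in general works by recycling fixed-size blocks instead of returning them to the system allocator. The following sketch shows that generic technique; the class below borrows the name BucketPool from the text, but its free-list design and methods are assumptions, because the source does not explain how Meituan's BucketPool is built.

```python
class BucketPool:
    """Generic fixed-size block pool (illustrative; not Meituan's implementation)."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self._free = []            # released buckets waiting to be reused
        self.allocations = 0       # how many times fresh memory was requested

    def acquire(self):
        """Reuse a released bucket if one exists, otherwise allocate a new one."""
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self.bucket_size)

    def release(self, bucket):
        """Hand a bucket back to the pool instead of freeing it."""
        self._free.append(bucket)


pool = BucketPool(bucket_size=256)
a = pool.acquire()
pool.release(a)
b = pool.acquire()   # served from the free list, no new allocation
```

After the first bucket is released, later requests are served from the free list, so the count of fresh allocations stops growing once the working set is warm.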

3.6 Compute-Throughput Optimizations

The team merged the Unique and DynamicPartition operators and introduced an adaptive Unique implementation that is driven by heuristics and built on a Robin HashTable. Together these changes speed up the Unique operator by up to 51% and accelerate overall training by roughly 10%.
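Merging the two operators means deduplicating IDs and routing them to partitions in a single pass rather than two. The sketch below shows one way that could look in plain Python; the function name, the modulo routing rule, and the use of a built-in `dict` (rather than a Robin HashTable) are assumptions for illustration only.

```python
def unique_and_partition(ids, num_partitions):
    """Deduplicate `ids` and split the unique IDs across partitions in one pass.

    Returns (unique, inverse, partitions), where unique[inverse[i]] == ids[i].
    """
    first_seen = {}                                   # ID -> index in `unique`
    inverse = []
    partitions = [[] for _ in range(num_partitions)]
    for fid in ids:
        idx = first_seen.get(fid)
        if idx is None:
            idx = len(first_seen)
            first_seen[fid] = idx
            # Route each new ID as soon as it is discovered (assumed rule: modulo).
            partitions[fid % num_partitions].append(fid)
        inverse.append(idx)
    unique = list(first_seen)                         # dicts keep insertion order
    return unique, inverse, partitions


unique, inverse, partitions = unique_and_partition([7, 3, 7, 10, 3], num_partitions=2)
```

Doing both jobs in one loop avoids materializing the unique list and then scanning it a second time, which is the kind of saving an operator merge aims for.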

Large-Scale Sparse Algorithm Modeling

For Meituan's advertising business, the team designed a high-dimensional sparse feature encoding that improves the model's fitting capability and reduces the overhead caused by feature collisions.

Summary and Outlook

The customized TensorFlow makes it possible to train models with billions of parameters on billions of samples using CPU clusters. The team plans to extend it to GPUs (e.g., NVIDIA A100) for even more complex workloads, and Meituan will continue contributing to the TensorFlow Recommenders community.

Authors

The authors are members of Meituan's foundational R&D platform and advertising strategy team.
