Deep Customization of TensorFlow for Large-Scale Sparse Training at Meituan
Meituan heavily customized TensorFlow 1.x for large-scale sparse training: it replaced Variable-based embeddings with hash tables, improved parameter-server load balancing, moved communication to RDMA, split the graph into a pipelined embedding stage, built a high-performance hash table, and merged operators. The result is more than ten-fold better scalability, up to 51% speedup on individual operators, and billion-parameter models trained on CPU clusters, with GPU support planned.
Background
Meituan has heavily customized the open-source TensorFlow framework (based on TensorFlow 1.x). The changes cover large-scale sparse parameter support, training modes, distributed communication, pipelining, and operator optimization. In recommendation-system scenarios, the customized version scales more than 10× better than the stock framework and delivers significantly higher performance per unit of compute.
Challenges of Large-Scale Training
Training data grew from hundreds of millions to tens of billions, sparse parameters grew from millions to billions, and models became more complex. Together these pushed training time from hours to days. Further problems were the massive memory consumed by Variables, limited worker scalability, no support for dynamically adding sparse parameters during online learning, and frequent slow or failed nodes in large clusters.
Optimization Practices
3.1 Large-Scale Sparse Parameter Support
Variable-based embeddings were replaced with a HashTable implementation that expands automatically, reduces memory waste, and supports online learning while keeping the API compatible.
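As a rough illustration of why a hash table suits sparse embeddings, the sketch below allocates a row only when a feature ID is first seen, so no fixed-size table has to be sized up front. It is a toy in plain Python, not Meituan's implementation; the class name `HashTableEmbedding` and its methods are hypothetical.

```python
import random


class HashTableEmbedding:
    """Toy embedding store keyed by raw feature ID (hypothetical name)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rows = {}  # feature ID -> embedding vector
        self._rng = random.Random(seed)

    def lookup(self, ids):
        out = []
        for fid in ids:
            row = self._rows.get(fid)
            if row is None:
                # First time this ID is seen: allocate its row on demand.
                row = [self._rng.uniform(-0.05, 0.05) for _ in range(self.dim)]
                self._rows[fid] = row
            out.append(row)
        return out

    def __len__(self):
        return len(self._rows)


table = HashTableEmbedding(dim=4)
# A very large ID costs nothing extra: only IDs actually seen occupy memory.
vectors = table.lookup([10**12 + 7, 42, 42])
print(len(vectors), len(table))
```

Because rows appear on first lookup, new IDs arriving during online learning need no resizing or re-sharding step in this model.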
3.2 Distributed Load-Balancing
Simple round-robin slicing and heterogeneous hardware left the parameter servers (PS) unevenly loaded. One fix targeted the Adam optimizer: its β parameters were replicated on every PS so that no single PS became a hotspot, yielding a performance gain of about 9%.
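The idea of replicating Adam's β state can be sketched as follows, assuming the β parameters in question are the running powers β₁ᵗ and β₂ᵗ used for bias correction (the summary does not spell this out). Each shard keeps and advances its own copy, so no shard has to read them from a single owner. The class `AdamShard` is hypothetical.

```python
import math


class AdamShard:
    """One PS shard with its own parameters and its own copy of the beta powers."""

    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = dict(params)
        self.m = {k: 0.0 for k in params}
        self.v = {k: 0.0 for k in params}
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        # Local replicas: no cross-shard read is needed to compute the step size.
        self.beta1_power = 1.0
        self.beta2_power = 1.0

    def apply(self, grads):
        # Every shard advances its replica each step, even with no gradients,
        # so all replicas stay in lockstep.
        self.beta1_power *= self.beta1
        self.beta2_power *= self.beta2
        step = self.lr * math.sqrt(1 - self.beta2_power) / (1 - self.beta1_power)
        for k, g in grads.items():
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * g * g
            self.params[k] -= step * self.m[k] / (math.sqrt(self.v[k]) + self.eps)


shards = [AdamShard({"a": 1.0}), AdamShard({"b": 2.0})]
for _ in range(3):
    shards[0].apply({"a": 1.0})
    shards[1].apply({})  # no gradient routed here, but the replica still advances
```

The trade-off in this sketch is a few duplicated scalars per shard in exchange for removing a shared read on every update.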
3.3 Communication Optimizations
TCP/IP was replaced with RDMA (RoCE v2) to reduce latency and CPU overhead. On top of RDMA, the team optimized Memory Registration (MR), added a static MR allocator, load-balanced traffic across multiple RequestBuffers and completion queues (CQ), and introduced a Send-Driven data-exchange mode that cuts rendezvous overhead.
3.4 Latency Optimizations
Sparse parameters were aggregated, and an embedding pipeline split the graph into an Embedding Graph and a Main Graph so that communication overlaps with computation. A Pipeline Dataset abstraction keeps the split transparent to users.
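The overlap between the two graphs can be sketched with a one-step prefetch: while the main stage computes on batch i, the embedding stage is already fetching batch i+1. This is a minimal plain-Python model, not the actual TensorFlow graph split; `embedding_stage`, `main_stage`, and `run_pipelined` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def embedding_stage(batch):
    # Stand-in for the Embedding Graph: in a real system this would be
    # the communication-heavy embedding fetch from the parameter servers.
    return [float(x) * 0.5 for x in batch]


def main_stage(embedded):
    # Stand-in for the Main Graph: the dense computation on fetched embeddings.
    return sum(embedded)


def run_pipelined(batches):
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(embedding_stage, batches[0])
        for nxt in batches[1:]:
            embedded = pending.result()
            # Start fetching the next batch before computing on the current one.
            pending = pool.submit(embedding_stage, nxt)
            results.append(main_stage(embedded))
        results.append(main_stage(pending.result()))
    return results


batches = [[1, 2], [3, 4], [5, 6]]
outputs = run_pipelined(batches)
print(outputs)  # [1.5, 3.5, 5.5]
```

In a real trainer, prefetching means the embeddings for one step may be read before the previous step's update has been applied; the summary does not say how that is handled.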
3.5 Single-Instance PS Concurrency
A high-performance HashTable built on TBB, combined with a BucketPool memory-pooling strategy, reduced allocation overhead and improved end-to-end training speed by about 5%.
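Memory pooling of this kind generally works by allocating large chunks up front and recycling fixed-size slots through a free list, so that inserting or deleting a hash-table entry does not hit the general-purpose allocator. The sketch below shows that general pattern only; the internals of Meituan's BucketPool are not described in this summary, and every name and parameter here is an assumption.

```python
class BucketPool:
    """Toy fixed-size slot pool backed by large chunks (illustrative only)."""

    def __init__(self, bucket_size, buckets_per_chunk=1024):
        self.bucket_size = bucket_size
        self.buckets_per_chunk = buckets_per_chunk
        self.chunks = []  # large backing allocations
        self.free = []    # free slots as (chunk_index, byte_offset)

    def _grow(self):
        # One large allocation serves many buckets.
        self.chunks.append(bytearray(self.bucket_size * self.buckets_per_chunk))
        idx = len(self.chunks) - 1
        for i in range(self.buckets_per_chunk):
            self.free.append((idx, i * self.bucket_size))

    def acquire(self):
        if not self.free:
            self._grow()
        return self.free.pop()

    def release(self, handle):
        # The slot goes back on the free list instead of being deallocated.
        self.free.append(handle)

    def view(self, handle):
        idx, off = handle
        return memoryview(self.chunks[idx])[off:off + self.bucket_size]


pool = BucketPool(bucket_size=16, buckets_per_chunk=2)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()
chunks_before = len(pool.chunks)
pool.release(b)
d = pool.acquire()  # reuses the slot that b gave back; no new chunk is needed
```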
3.6 Compute-Throughput Optimizations
The Unique and DynamicPartition operators were merged, and a heuristic-driven adaptive Unique implementation based on a Robin HashTable was introduced. Together these achieved up to 51% speedup for the Unique operator and about 10% faster training overall.
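Merging the two operators amounts to deduplicating IDs and assigning them to shards in a single pass instead of two. The sketch below illustrates that idea in plain Python; it uses a built-in `dict` rather than a Robin HashTable, assumes modulo sharding, and leaves out the adaptive heuristic. The function name `unique_and_partition` is hypothetical.

```python
def unique_and_partition(ids, num_shards):
    """Deduplicate IDs and route each unique ID to a shard in one pass."""
    seen = {}                                 # id -> (shard, position in shard)
    shards = [[] for _ in range(num_shards)]  # unique IDs per shard
    restore = []                              # per input ID: where its value lives
    for fid in ids:
        loc = seen.get(fid)
        if loc is None:
            shard = fid % num_shards          # assumed sharding rule
            loc = (shard, len(shards[shard]))
            shards[shard].append(fid)
            seen[fid] = loc
        restore.append(loc)
    return shards, restore


ids = [7, 3, 7, 10, 3, 4]
shards, restore = unique_and_partition(ids, num_shards=2)
print(shards)  # [[10, 4], [7, 3]]

# The restore index maps per-shard results back to the original order.
rebuilt = [shards[s][p] for s, p in restore]
```

Each shard receives every ID at most once, and `restore` lets the caller scatter the fetched embeddings back to the original positions.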
Large-Scale Sparse Algorithm Modeling
For Meituan's advertising business, the team designed a high-dimensional sparse feature encoding that improves the model's fitting capability and reduces the overhead caused by feature collisions.
Summary and Outlook
The customized TensorFlow trains models with billions of parameters on billions of samples using CPU clusters. The team plans to extend it to GPUs (e.g., NVIDIA A100) for even more complex workloads, and Meituan will continue contributing to the TensorFlow Recommenders community.
Authors
The work is authored by members of Meituan's foundational R&D platform and advertising strategy team.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
