Deep Customization and Optimization of TensorFlow for Large-Scale Sparse Training at Meituan
This article details Meituan's internal, heavily customized TensorFlow 1.x implementation that addresses large‑scale sparse parameter support, distributed training challenges, communication bottlenecks, and pipeline optimizations, achieving over ten‑fold scalability improvements and significant per‑node performance gains in recommendation system workloads.
1 Background
TensorFlow, Google’s open‑source deep‑learning framework, is widely used in Meituan’s recommendation systems, but the official version lacks many features needed for industrial‑scale scenarios. Challenges include massive memory consumption for billions of sparse parameters, limited worker scalability, lack of online‑learning support, and handling slow or failed nodes.
2 Large‑Scale Training Optimization Challenges
2.1 Business‑driven challenges
Training data grew from hundreds of millions to tens of billions of samples, sparse parameters increased ten‑fold, and model complexity caused step‑time to rise over ten times, turning a day‑long experiment into several days.
2.2 System load analysis
Using Meituan’s CAT monitoring, a fine‑grained TensorFlow PS monitoring chain was built (see Fig.1). An automated experiment framework (Fig.2) collects metrics and generates reports.
Analysis of the Parameter Server (PS) architecture revealed communication pressure, PS concurrency pressure, and worker compute pressure. Scaling PS beyond a certain point increased step time due to link latency (Fig.5).
3 Optimization Practices
3.1 Large‑scale Sparse Parameter Support
Replaced Variable‑based embeddings with HashTable‑based storage while keeping the API compatible with native TensorFlow. The HashTable grows dynamically during training, eliminating the waste of pre‑allocating a fixed‑shape Variable; this dynamic growth of sparse parameters is also what enables online learning. Combined with custom optimizations, it sustains faster training at trillion‑parameter scale.
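The core idea can be sketched in a few lines: rather than indexing into a pre-allocated dense Variable, a hash table maps each feature ID to its embedding row, creating the row lazily on first lookup. This is an illustrative sketch only; the class name and the initialization scheme are hypothetical, not Meituan's actual implementation.

```python
import random

class DynamicEmbeddingTable:
    """Sketch of HashTable-based sparse embedding storage.

    Unlike a fixed-shape Variable, capacity grows on demand: a feature
    ID gets an embedding row only when first looked up, which also lets
    new IDs appear during online learning. Hypothetical names/init.
    """

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}  # feature_id -> list[float]

    def lookup(self, ids):
        # Insert-on-miss: unseen IDs are initialized lazily.
        rows = []
        for fid in ids:
            if fid not in self._table:
                self._table[fid] = [self._rng.uniform(-0.05, 0.05)
                                    for _ in range(self.dim)]
            rows.append(self._table[fid])
        return rows

    def size(self):
        return len(self._table)

table = DynamicEmbeddingTable(dim=4)
table.lookup([10, 20, 10])   # repeated ID 10 maps to the same row
print(table.size())          # -> 2
```

Because storage is keyed by ID, the table holds exactly as many rows as IDs actually seen, instead of a vocabulary-sized dense tensor.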
3.2 Distributed Load‑Balancing Optimization
Balanced the placement of sparse and dense parameters across PS instances, and mitigated the Adam optimizer hotspot by replicating the β accumulator parameters locally on each PS, yielding a ~9% performance gain and better scalability.
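One common way to balance placement is a greedy bin-packing pass: assign each parameter, largest first, to whichever PS currently carries the least load. This is a minimal sketch of that idea, not Meituan's placement algorithm, which additionally distinguishes sparse from dense parameters.

```python
import heapq

def place_params(params, num_ps):
    """Greedy load-balanced placement sketch.

    params: dict of parameter name -> size; returns name -> ps index.
    Largest parameters are placed first onto the lightest PS, kept in
    a min-heap keyed by accumulated load.
    """
    heap = [(0, ps) for ps in range(num_ps)]  # (load, ps_index)
    heapq.heapify(heap)
    assignment = {}
    for name, size in sorted(params.items(), key=lambda kv: -kv[1]):
        load, ps = heapq.heappop(heap)        # lightest PS so far
        assignment[name] = ps
        heapq.heappush(heap, (load + size, ps))
    return assignment

# Example: four parameters spread over two PS instances.
print(place_params({"a": 10, "b": 7, "c": 5, "d": 2}, num_ps=2))
```

With the example sizes, both PS instances end up with a load of 12, whereas naive round-robin placement would give 15 vs 9.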
3.3 Communication Optimization
Implemented RDMA‑based communication (RoCE V2) with memory‑registration (MR) optimizations, static MR allocator, multi‑RequestBuffer and CQ load‑balancing, and a Send‑Driven data‑exchange mode, delivering 20‑60% throughput improvements.
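The multi-RequestBuffer part of this can be illustrated without any RDMA machinery: instead of every sender thread contending on one pre-registered buffer, requests are spread round-robin over several buffers, each with its own lock. The sketch below shows only that contention-spreading idea; memory registration and the RoCE transport itself are omitted, and all names are hypothetical.

```python
import itertools
import threading

class RequestBufferPool:
    """Sketch of the multi-RequestBuffer idea.

    N buffers with N locks limit cross-thread contention to roughly
    1/N of the single-buffer case. In the real system these would be
    statically registered RDMA memory regions; here they are plain
    bytearrays for illustration.
    """

    def __init__(self, num_buffers, buf_size):
        self._buffers = [bytearray(buf_size) for _ in range(num_buffers)]
        self._locks = [threading.Lock() for _ in range(num_buffers)]
        self._next = itertools.cycle(range(num_buffers))

    def send(self, payload: bytes):
        i = next(self._next)          # round-robin buffer selection
        with self._locks[i]:          # only writers of buffer i contend
            self._buffers[i][:len(payload)] = payload
        return i
```

A usage like `pool.send(b"grad-chunk")` cycles through buffers 0, 1, 2, … so concurrent senders rarely touch the same lock.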
3.4 Latency Optimization
Aggregated HashTable values (embedding, m, v, counters) to reduce Lookup/Insert frequency, and introduced an Embedding pipeline that splits the graph into an Embedding Graph (EG) and Main Graph (MG), overlapping communication and computation for 20‑60% speedup.
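The EG/MG split amounts to a classic producer-consumer pipeline: one thread runs embedding lookups (the communication-heavy part) for batch N+1 while the main graph computes on batch N. A minimal sketch, with `embedding_graph` and `main_graph` as illustrative stand-ins for the two subgraphs:

```python
import queue
import threading

def run_pipelined(batches, embedding_graph, main_graph):
    """Sketch of the Embedding pipeline: the Embedding Graph (EG)
    thread prefetches embeddings for the next batch while the Main
    Graph (MG) consumes the current one, overlapping PS communication
    with computation. Function names are illustrative."""
    q = queue.Queue(maxsize=2)   # small staging buffer between EG and MG

    def eg_worker():
        for batch in batches:
            q.put(embedding_graph(batch))   # e.g. HashTable lookups on PS
        q.put(None)                         # end-of-stream sentinel

    threading.Thread(target=eg_worker, daemon=True).start()
    results = []
    while (item := q.get()) is not None:
        results.append(main_graph(item))    # dense forward/backward pass
    return results

out = run_pipelined([1, 2, 3], lambda b: b * 2, lambda e: e + 1)
print(out)  # -> [3, 5, 7]
```

The bounded queue caps staleness: MG never runs more than two batches behind the embeddings EG has fetched.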
3.5 Single‑Instance PS Concurrency Optimization
Adopted TBB concurrent hash map for high‑performance HashTable, added BucketPool with reuse queues to cut memory‑allocation overhead, improving end‑to‑end performance by ~5%.
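The BucketPool idea is a free list: fixed-size value buckets are recycled through a reuse queue instead of being allocated and freed on every insert, taking the allocator off the PS hot path. A minimal sketch under assumed names (the real structure is C++ and pairs with the TBB hash map):

```python
class BucketPool:
    """Sketch of a BucketPool with a reuse queue.

    Buckets released after an erase go back on the free list and are
    handed out again on the next acquire, so steady-state inserts do
    no fresh allocation. Illustrative only; names are hypothetical.
    """

    def __init__(self, bucket_size, prealloc=1024):
        self._bucket_size = bucket_size
        self._free = [bytearray(bucket_size) for _ in range(prealloc)]

    def acquire(self):
        # Reuse a recycled bucket when available, else allocate fresh.
        return self._free.pop() if self._free else bytearray(self._bucket_size)

    def release(self, bucket):
        self._free.append(bucket)   # return to the reuse queue
```

Pre-allocation plus reuse trades a bounded amount of idle memory for predictable, allocation-free inserts under load.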
3.6 Per‑Node Throughput Optimization
Optimized Unique and DynamicPartition operators, introduced heuristic‑driven adaptive Unique implementation using Robin HashTable, and merged Unique with Partition, achieving 51% acceleration for Unique and ~10% overall model speedup.
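The adaptive part can be shown with a small sketch: when the ID range is small, a dense array works as the lookup table; otherwise a hash map (a dict below, standing in for the Robin-Hood table) deduplicates. Returning the inverse indices alongside the uniques is what lets a fused Unique+Partition reuse a single pass. The threshold and names here are hypothetical.

```python
def unique_with_index(ids, dense_threshold=1_000_000):
    """Adaptive Unique sketch: dense array for small ID ranges,
    hash map otherwise. Returns (unique_ids, inverse_indices) where
    inverse_indices[i] is the slot of ids[i] in unique_ids."""
    if ids and max(ids) < dense_threshold:
        slot = [-1] * (max(ids) + 1)        # dense lookup table
        uniques, index = [], []
        for fid in ids:
            if slot[fid] < 0:
                slot[fid] = len(uniques)
                uniques.append(fid)
            index.append(slot[fid])
    else:
        seen, uniques, index = {}, [], []   # hash-map path
        for fid in ids:
            if fid not in seen:
                seen[fid] = len(uniques)
                uniques.append(fid)
            index.append(seen[fid])
    return uniques, index

print(unique_with_index([5, 3, 5, 9]))  # -> ([5, 3, 9], [0, 1, 0, 2])
```

Choosing the strategy per batch, rather than fixing one, is the heuristic-driven aspect: dense probing wins when IDs are compact, hashing wins when they are spread across a huge key space.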
4 Large‑Scale Sparse Algorithm Modeling
Redesigned high‑dimensional sparse feature encoding for Meituan’s advertising, reducing hash collisions and feature processing overhead, while platform‑level optimizations accelerated training throughput by 8‑10×.
5 Summary and Outlook
Meituan’s customized TensorFlow now supports trillion‑scale sparse parameters and samples, with extensive optimizations across memory, communication, and scheduling. Future work includes GPU‑based training on NVIDIA A100, further scaling, and active contribution to the TensorFlow Recommenders community.
6 Authors
Yifan, Jia‑heng, Zheng‑shao, Peng‑peng, Yong‑yu, Zheng‑yang, Huang‑jun (Meituan R&D Platform, ML training engine), and Hai‑tao (Meituan Takeaway Ads team).
DataFunTalk