
Deep Customization and Optimization of TensorFlow for Large-Scale Sparse Training at Meituan

This article details Meituan's internal, heavily customized TensorFlow 1.x implementation that addresses large‑scale sparse parameter support, distributed training challenges, communication bottlenecks, and pipeline optimizations, achieving over ten‑fold scalability improvements and significant per‑node performance gains in recommendation system workloads.

DataFunTalk

1 Background

TensorFlow, Google’s open‑source deep‑learning framework, is widely used in Meituan’s recommendation systems, but the official version lacks many features needed for industrial‑scale scenarios. Challenges include massive memory consumption for billions of sparse parameters, limited worker scalability, lack of online‑learning support, and handling slow or failed nodes.

2 Large‑Scale Training Optimization Challenges

2.1 Business‑driven challenges

Training data grew from hundreds of millions to tens of billions of samples, sparse parameters increased ten‑fold, and model complexity caused step‑time to rise over ten times, turning a day‑long experiment into several days.

2.2 System load analysis

Using Meituan's CAT monitoring system, the team built a fine-grained monitoring chain over the TensorFlow parameter-server (PS) runtime (Fig. 1), and an automated experiment framework (Fig. 2) collects metrics and generates reports.

Analysis of the PS architecture revealed three pressure points: communication, PS-side concurrency, and worker computation. Scaling the number of PS instances beyond a certain point actually increased step time, because each additional PS adds communication links whose latency workers must wait on (Fig. 5).

3 Optimization Practices

3.1 Large‑scale Sparse Parameter Support

Variable-based embedding storage was replaced with HashTable-based storage:

- The HashTable auto-scales during training, eliminating pre-allocation waste.
- Custom optimizations boost training speed for trillion-scale models.
- Online learning is supported through dynamic growth of sparse parameters.
- The API remains compatible with native TensorFlow.
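As a toy illustration of the storage model (pure Python, not Meituan's C++ implementation; `DynamicEmbeddingTable` is an invented name), a hash-table-backed embedding store creates entries lazily, so no dense vocabulary-sized tensor is pre-allocated and unseen feature IDs can be admitted during online learning:

```python
import random

class DynamicEmbeddingTable:
    """Toy sketch of HashTable-based embedding storage: entries are created
    lazily on first lookup, so no dense [vocab, dim] tensor needs to be
    pre-allocated and brand-new feature IDs can appear at any time."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}  # feature_id -> list[float]

    def lookup(self, feature_ids):
        rows = []
        for fid in feature_ids:
            if fid not in self._table:  # lazy init: the table grows with the data
                self._table[fid] = [self._rng.uniform(-0.05, 0.05)
                                    for _ in range(self.dim)]
            rows.append(self._table[fid])
        return rows

    def size(self):
        return len(self._table)

table = DynamicEmbeddingTable(dim=4)
table.lookup([10, 99, 10])  # 99 is a brand-new ID; it is inserted on the fly
```

In the real system the table lives on the PS side and lookups are batched RPCs; the point here is only that capacity tracks the data rather than a pre-declared vocabulary.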

3.2 Distributed Load‑Balancing Optimization

Sparse and dense parameters were placed evenly across PS instances, and the Adam optimizer hotspot was removed by replicating its shared β1/β2 power accumulators on every PS: because those two scalars are read by every parameter update, keeping them on a single PS had made that instance a communication hotspot. Together these changes yielded a ~9% performance gain and better scalability.
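A minimal sketch of the placement idea, assuming a greedy least-loaded heuristic (the function `place_shards` and the heuristic itself are illustrative, not Meituan's published algorithm):

```python
import heapq

def place_shards(shard_sizes, num_ps):
    """Greedy least-loaded placement: each shard (largest first) goes to the
    PS instance with the smallest accumulated load, approximating balanced
    parameter placement across parameter servers."""
    heap = [(0, ps) for ps in range(num_ps)]  # (current load, ps index)
    heapq.heapify(heap)
    placement = {}
    loads = [0] * num_ps
    for shard, size in sorted(shard_sizes.items(), key=lambda kv: -kv[1]):
        load, ps = heapq.heappop(heap)
        placement[shard] = ps
        loads[ps] = load + size
        heapq.heappush(heap, (loads[ps], ps))
    return placement, loads

# Hypothetical shard sizes (e.g. rows of a partitioned embedding table).
placement, loads = place_shards({"a": 8, "b": 4, "c": 4, "d": 2, "e": 2},
                                num_ps=2)
```

Small, universally-read state (like the Adam β accumulators) is the exception: it is cheaper to replicate it on every PS than to place it anywhere in particular.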

3.3 Communication Optimization

Communication was reimplemented over RDMA (RoCE v2), with memory-registration (MR) optimizations, a static MR allocator, multiple RequestBuffers with CQ (completion-queue) load balancing, and a send-driven data-exchange mode, together delivering 20-60% throughput improvements.
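The send-driven exchange can be illustrated with plain Python queues: the producer pushes each tensor as soon as it is ready instead of answering a per-tensor request from the consumer, removing one request message per tensor. This is a conceptual sketch only; real RDMA verbs code and the TensorFlow rendezvous look nothing like this.

```python
import queue
import threading

def send_driven_producer(out_q, tensors):
    """In send-driven mode the producer pushes each tensor as soon as it is
    ready, instead of waiting for an explicit request from the consumer."""
    for name, value in tensors.items():
        out_q.put((name, value))
    out_q.put(None)  # end-of-stream marker

def consumer(in_q):
    """The consumer only receives; it never issues per-tensor requests."""
    received = {}
    while True:
        item = in_q.get()
        if item is None:
            break
        name, value = item
        received[name] = value
    return received

q = queue.Queue()
tensors = {"emb_grad": [0.1, 0.2], "dense_grad": [0.3]}
t = threading.Thread(target=send_driven_producer, args=(q, tensors))
t.start()
result = consumer(q)
t.join()
```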

3.4 Latency Optimization

HashTable values were aggregated so that the embedding vector, the optimizer's m and v slots, and frequency counters live in a single record, cutting Lookup/Insert frequency. In addition, an embedding pipeline was introduced that splits the graph into an Embedding Graph (EG) and a Main Graph (MG) and overlaps embedding communication with main-graph computation, for a 20-60% speedup.
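The value-aggregation idea can be sketched as one record per feature ID holding everything a training step touches (`AggregatedSlotTable` is an invented name; the real store is a C++ HashTable on the PS):

```python
class AggregatedSlotTable:
    """Sketch of value aggregation: the embedding vector, its Adam slot
    variables (m, v), and a frequency counter live in one record, so a
    single Lookup fetches everything one training step needs instead of
    issuing separate lookups for the embedding and each optimizer slot."""

    def __init__(self, dim):
        self.dim = dim
        self._table = {}  # feature_id -> record dict

    def lookup(self, fid):
        rec = self._table.get(fid)
        if rec is None:
            rec = {"emb": [0.0] * self.dim,
                   "m": [0.0] * self.dim,
                   "v": [0.0] * self.dim,
                   "count": 0}
            self._table[fid] = rec
        rec["count"] += 1     # frequency counter updated in the same access
        return rec            # one round trip instead of three

slots = AggregatedSlotTable(dim=8)
rec = slots.lookup(42)
```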

3.5 Single‑Instance PS Concurrency Optimization

The single-instance PS HashTable was rebuilt on TBB's concurrent hash map for higher concurrency, and a BucketPool with reuse queues was added so hot-path inserts recycle memory instead of hitting the allocator, improving end-to-end performance by ~5%.
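TBB's `concurrent_hash_map` is a C++ component, but the BucketPool idea (recycle fixed-size buckets through a free list so the insert hot path avoids the allocator) can be sketched in a few lines of Python with invented names and sizes:

```python
class BucketPool:
    """Sketch of a bucket pool: fixed-size buckets are taken from a free
    list and returned on delete, so the hot path avoids repeated
    allocate/free cycles."""

    def __init__(self, bucket_bytes, prealloc=4):
        self.bucket_bytes = bucket_bytes
        self._free = [bytearray(bucket_bytes) for _ in range(prealloc)]
        self.allocations = 0  # counts real (non-reused) allocations

    def acquire(self):
        if self._free:
            return self._free.pop()   # fast path: reuse a recycled bucket
        self.allocations += 1         # slow path: genuinely allocate
        return bytearray(self.bucket_bytes)

    def release(self, bucket):
        self._free.append(bucket)     # recycle instead of freeing

pool = BucketPool(bucket_bytes=64, prealloc=1)
b = pool.acquire()
pool.release(b)
```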

3.6 Per‑Node Throughput Optimization

The Unique and DynamicPartition operators were optimized: a heuristic-driven adaptive Unique implementation based on a Robin Hood hash table was introduced, and Unique was fused with Partition so deduplication and shard assignment happen in one pass, achieving a 51% speedup for Unique and ~10% overall model speedup.
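The fusion can be sketched as a single pass that deduplicates feature IDs, records each input's index into the unique list, and buckets unique IDs by destination shard at the same time (a simplification; the real kernels are parallel C++ with an adaptive hash-table choice):

```python
def unique_with_partition(ids, num_parts):
    """One-pass sketch of fusing Unique with DynamicPartition: deduplicate
    feature IDs, record each input's position in the unique list (the
    inverse index), and bucket each new unique ID by destination PS shard
    (here simply id % num_parts) in the same loop."""
    index_of = {}                       # id -> position in `uniques`
    uniques, inverse = [], []
    parts = [[] for _ in range(num_parts)]
    for fid in ids:
        pos = index_of.get(fid)
        if pos is None:
            pos = len(uniques)
            index_of[fid] = pos
            uniques.append(fid)
            parts[fid % num_parts].append(fid)  # partition in the same pass
        inverse.append(pos)
    return uniques, inverse, parts

uniques, inverse, parts = unique_with_partition([5, 3, 5, 8], num_parts=2)
```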

4 Large‑Scale Sparse Algorithm Modeling

High-dimensional sparse feature encoding was redesigned for Meituan's advertising models, reducing hash collisions and feature-processing overhead; combined with the platform-level optimizations above, training throughput improved by 8-10×.
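The article does not give the exact encoding, but one plausible slot-aware scheme, shown purely as an assumption-labeled sketch (`encode_feature` is hypothetical), packs the slot ID into the high bits of a 64-bit key so cross-slot collisions become impossible and same-slot collisions stay rare in a 48-bit space:

```python
import hashlib

def encode_feature(slot_id, raw_value):
    """Hypothetical collision-resistant sparse feature encoding: hash the
    raw value into the low 48 bits of a 64-bit key and pack the slot ID
    (assumed < 2**16) into the high 16 bits. Features from different slots
    can then never collide, and the wide hash space keeps same-slot
    collisions far rarer than a small per-slot hash-bucket scheme."""
    digest = hashlib.md5(str(raw_value).encode()).digest()
    low = int.from_bytes(digest[:6], "big")  # 48 bits for the value hash
    return (slot_id << 48) | low             # high 16 bits: slot ID
```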

5 Summary and Outlook

Meituan’s customized TensorFlow now supports trillion‑scale sparse parameters and samples, with extensive optimizations across memory, communication, and scheduling. Future work includes GPU‑based training on NVIDIA A100, further scaling, and active contribution to the TensorFlow Recommenders community.

6 Authors

Yifan, Jia‑heng, Zheng‑shao, Peng‑peng, Yong‑yu, Zheng‑yang, Huang‑jun (Meituan R&D Platform, ML training engine), and Hai‑tao (Meituan Takeaway Ads team).


Tags: Performance Optimization, TensorFlow, recommendation systems, distributed training, sparse parameters
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
