Performance Optimization of Distributed TensorFlow for WDL Models at Meituan

Meituan‑Dianping identified data‑pipeline, network, and memory‑arena bottlenecks in distributed TensorFlow training of Wide & Deep recommendation models and resolved them by switching to tf.data pipelines, batching TFRecord reads, increasing MALLOC_ARENA_MAX, and moving embedding lookups to parameter servers, achieving 2–3× speedup and near‑linear scaling on up to 32 GPUs.

Meituan Technology Team
Meituan Technology Team
Meituan Technology Team
Performance Optimization of Distributed TensorFlow for WDL Models at Meituan

This article describes how Meituan‑Dianping identified and resolved performance bottlenecks when training Wide & Deep Learning (WDL) models with distributed TensorFlow.

Terminology TensorFlow – Google’s open‑source deep‑learning framework. OP – operation (TensorFlow operator). PS – Parameter Server. WDL – Wide & Deep Learning model for recommendation. AFO – AI Framework on YARN, a custom scheduler built on Hadoop/YARN.

Distributed TensorFlow Architecture TensorFlow uses a Parameter Server (PS) to store model parameters and Workers to perform training. The typical deployment follows a between‑graph (data‑parallel) model to avoid the slow in‑graph mode.

AFO Design AFO adds cluster resource management, fault‑tolerance, logging, and TensorBoard services on top of YARN. Its components include Application Master, AFO Child (execution engine), History Server, and AFO Client.

WDL Model Overview The model combines a linear “wide” part for memorization with a deep neural network for generalization. Sparse features are embedded, then concatenated with dense features and fed into a three‑layer ReLU network. The final prediction is produced by a logistic loss.

Performance Bottlenecks

Data input consumes >60% of training time due to inefficient TFRecordReader and Python GIL‑bound reader threads.

Network traffic spikes because embedding lookups transfer large tensors between PS and Workers.

Hadoop’s default MALLOC_ARENA_MAX limits glibc memory arenas, causing contention in PS UniqueOp.

Optimizations

Replace TFRecordReader with the TensorFlow tf.data.Dataset API (C++ threads) to eliminate GIL bottleneck.

Use TFRecordReader.read_up_to to batch reads, reducing read calls by orders of magnitude.

Set export MALLOC_ARENA_MAX="4" to a higher value (or remove it) to allow more concurrent mallocs.

Deploy embedding_lookup_sparse_with_distributed_aggregation so embedding ops run on the PS, cutting network traffic.

After these changes, data‑pipeline latency dropped from ~800 ms to <300 ms per step, and overall training speed increased 2–3×, achieving near‑linear scaling up to 32 GPUs. Removing the arena limit further boosted performance by ~10×.

Conclusion Targeted system‑level tuning of TensorFlow, Hadoop, and network configurations can dramatically accelerate distributed deep‑learning workloads, providing valuable engineering insights for large‑scale AI platforms.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Performance OptimizationTensorFlowDistributed TrainingAFOWDL
Meituan Technology Team
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.