Deploying and Optimizing TensorFlow Serving for High‑Performance CTR Prediction

Meituan's user-growth team built a Wide & Deep CTR prediction model, trained it offline on Spark-generated TFRecords, and deployed it with TensorFlow Serving on YARN. Request-side multithreading, offline one-hot preprocessing, XLA JIT compilation, and dedicated model-loading threads then cut end-to-end latency from ~18 ms to ~6 ms and eliminated latency spikes during model switches.

Meituan Technology Team

Oct 11, 2018

This article introduces the business scenario of Meituan's user growth team for ad click‑through‑rate (CTR) prediction and the offline training workflow, then details the end‑to‑end process of deploying a Wide & Deep (WDL) model with TensorFlow Serving and the performance optimizations applied to meet strict online latency requirements.

Business Scenario and Offline Process

In the ad ranking pipeline, each user may be matched with hundreds of ads, and the model must estimate the CTR of every ad within the 10 ms response window imposed by the ad exchange. Offline, Spark generates the training data as native TensorFlow TFRecord files. The model is a classic Wide & Deep architecture with about 350,000 parameters (an exported model of roughly 11 MB). It is trained on CPUs in a distributed setup that uses TensorFlow synchronous training with backup workers and the GreedyLoadBalancing strategy for placing variables on parameter servers. The Estimator API encapsulates data loading, distributed training, validation, and model export.
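The idea behind greedy load balancing can be illustrated without TensorFlow. The sketch below is a simplified, pure-Python illustration and not the library's implementation: each variable goes to the parameter server that currently holds the least load, and the result is compared with round-robin placement, which ignores variable size. The variable names and sizes are hypothetical and do not reflect the real model's layout.

```python
def round_robin(var_sizes, num_ps):
    """Assign variables to parameter servers in arrival order, ignoring size."""
    loads = [0] * num_ps
    placement = {}
    for i, (name, size) in enumerate(var_sizes):
        ps = i % num_ps
        placement[name] = ps
        loads[ps] += size
    return placement, loads


def greedy(var_sizes, num_ps):
    """Assign each variable to the currently least-loaded parameter server."""
    loads = [0] * num_ps
    placement = {}
    for name, size in var_sizes:
        ps = loads.index(min(loads))
        placement[name] = ps
        loads[ps] += size
    return placement, loads


# Hypothetical variable sizes (in parameters), not the real model's layout.
variables = [
    ("embedding", 200_000),
    ("bias_1", 256),
    ("dense_1", 65_536),
    ("bias_2", 256),
    ("dense_2", 65_536),
    ("bias_3", 256),
]

_, rr_loads = round_robin(variables, num_ps=2)
_, greedy_loads = greedy(variables, num_ps=2)
print("round robin:", rr_loads)
print("greedy:     ", greedy_loads)
```

With these made-up sizes, round robin piles every large variable onto the first server, while the greedy assignment spreads the load far more evenly.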

TensorFlow Serving Overview

TensorFlow Serving is a high-performance open-source library for serving machine-learning models over gRPC, with support for hot model updates and automatic version management. Meituan runs TensorFlow Serving instances on YARN, and these instances periodically scan HDFS for new model versions.
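TensorFlow Serving expects each model version in a numerically named subdirectory of the model's base path and, by default, serves the highest version it finds. The sketch below mimics that selection step in plain Python. It is only an illustration: it scans a temporary local directory standing in for the HDFS path, and the helper name latest_version is hypothetical.

```python
import os
import tempfile


def latest_version(base_path):
    """Return the highest numeric version directory under base_path, or None."""
    versions = [
        int(name)
        for name in os.listdir(base_path)
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name))
    ]
    return max(versions) if versions else None


# Demo on a temporary local directory standing in for the HDFS model path.
with tempfile.TemporaryDirectory() as base:
    for name in ("1", "2", "10", "tmp_export"):
        os.mkdir(os.path.join(base, name))
    found = latest_version(base)
    print(found)
```

Versions are compared as integers, so 10 ranks above 2, and directories without a purely numeric name are ignored.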

During online serving, a batch of up to 100 ads for a user is sent to TensorFlow Serving, which returns CTR estimates. Initial measurements showed a total latency of ~18 ms (5 ms request packaging, 3 ms network, 10 ms model inference), exceeding the 10 ms target.

Performance Optimizations

3.2.1 Request‑side Optimization

The team parallelized the per-ad preprocessing of the 100 ads in a request with OpenMP multithreading, which reduced request packaging time from 5 ms to ~2 ms.

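// Each iteration packages one ad independently, so iterations can run in parallel.
#pragma omp parallel for
for (int i = 0; i < request->ad_feat_size(); ++i) {
    tensorflow::Example example;  // declared inside the loop, so private to each thread
    data_processing();            // placeholder for the per-ad feature packaging
}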

3.2.2 Model Ops Optimization

Originally, raw string features were one-hot encoded inside the model with the high-level tf.feature_column APIs, which caused heavy CPU overhead: the forward pass accounted for 55.78% of total training time. The team moved this work offline, mapping the features to one-hot indices stored in a local feature_index file, and replaced the high-level APIs with lower-level ops. After the change, the forward pass dropped to 39.53% of total training time.
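A minimal sketch of the offline step, assuming a simple "column=value" key scheme: every distinct raw string feature receives an integer index once, offline, so the model only ever sees integers. The article does not describe the actual layout of the feature_index file, so the key format, the helper names, and the sample features here are all hypothetical.

```python
def build_feature_index(rows):
    """Map every distinct 'column=value' string to a stable integer index."""
    index = {}
    for row in rows:
        for column, value in sorted(row.items()):
            key = f"{column}={value}"
            if key not in index:
                index[key] = len(index)
    return index


def encode(row, index, unknown=-1):
    """Replace raw string features with their precomputed integer indices."""
    return [index.get(f"{c}={v}", unknown) for c, v in sorted(row.items())]


# Hypothetical raw samples; the real feature names are not given in the article.
samples = [
    {"city": "beijing", "os": "android"},
    {"city": "shanghai", "os": "ios"},
    {"city": "beijing", "os": "ios"},
]

feature_index = build_feature_index(samples)
encoded = [encode(s, feature_index) for s in samples]
print(feature_index)
print(encoded)
```

Values that never appeared when the index was built map to the unknown marker, so an online request containing a new value can still be encoded.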

Profiling was performed with tf.profiler:

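import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

# Session runs made inside this context are sampled by the profiler,
# and profile data is written under the given directory.
with tf.contrib.tfprof.ProfileContext(job_dir + '/tmp/train_dir') as pctx:
    estimator = tf.estimator.Estimator(model_fn=get_model_fn(job_dir),
                                        config=run_config,
                                        params=hparams)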

3.2.3 XLA and JIT Compilation

The team enabled XLA (Accelerated Linear Algebra) JIT compilation. XLA lowers the TensorFlow graph to its HLO (High Level Optimizer) intermediate representation, optimizes it, and then emits LLVM IR, from which optimized machine code is generated. Larger batch sizes benefited most from the reduced execution time, although JIT compilation adds a one-time compilation overhead.

3.2.4 Final Performance

After all optimizations, model inference latency decreased from ~10 ms to 1.1 ms, request packaging from 5 ms to 2 ms, and total end‑to‑end latency to ~6 ms.
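The latency budget can be checked with a few lines of arithmetic. The per-stage numbers below are the ones quoted in this article. The network time after optimization is an assumption: the article reports no new figure for it, so it is taken as unchanged at 3 ms.

```python
# Per-stage latency in milliseconds, taken from the figures quoted in the text.
before = {"request_packaging": 5.0, "network": 3.0, "inference": 10.0}
# Network time is assumed unchanged; the article reports no new figure for it.
after = {"request_packaging": 2.0, "network": 3.0, "inference": 1.1}

total_before = sum(before.values())
total_after = sum(after.values())
budget_ms = 10.0

print(f"before: {total_before:.1f} ms, after: {total_after:.1f} ms")
print(f"within budget: {total_after <= budget_ms}")
```

Under that assumption the stages sum to about 6.1 ms, consistent with the reported ~6 ms and inside the 10 ms window.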

Model Switch “Spike” Issue

Model updates caused request timeouts for two reasons: model loading and unloading shared a thread pool with request handling, and the graph was initialized lazily on the first request. The thread-pool problem is controlled by the following TensorFlow Serving options, shown here with their default value of 0:

uint32 num_load_threads = 0;
uint32 num_unload_threads = 0;

Setting both values to 1 creates dedicated thread pools for loading and unloading, so model switches no longer compete with request handling. In addition, running a warm-up inference right after a model loads removes the first-request latency. After both fixes, the latency spike during a model switch dropped from ~84 ms to ~4 ms.
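Both fixes can be pictured with a toy server in plain Python. This is a conceptual sketch and not TensorFlow Serving code: the Model and Server classes are hypothetical, the "lazy initialization" is a single flag, and the pool sizes are arbitrary. It shows a new version being loaded on its own single-thread pool, warmed up with one dummy request, and only then swapped in for live traffic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Model:
    """Toy model whose graph is initialized lazily on the first predict call."""

    def __init__(self, version):
        self.version = version
        self.initialized = False

    def predict(self, features):
        if not self.initialized:
            self.initialized = True  # stands in for expensive first-run setup
        return [0.5 for _ in features]


class Server:
    def __init__(self):
        self._lock = threading.Lock()
        self._current = None
        # Separate pools: loading a new version never occupies a request thread.
        self._request_pool = ThreadPoolExecutor(max_workers=4)
        self._load_pool = ThreadPoolExecutor(max_workers=1)

    def _load(self, version):
        model = Model(version)
        model.predict([{}])  # warm-up call before the model receives traffic
        with self._lock:
            self._current = model
        return model

    def load_async(self, version):
        return self._load_pool.submit(self._load, version)

    def predict_async(self, features):
        with self._lock:
            model = self._current
        if model is None:
            raise RuntimeError("no model loaded")
        return self._request_pool.submit(model.predict, features)

    def shutdown(self):
        self._request_pool.shutdown()
        self._load_pool.shutdown()


server = Server()
loaded = server.load_async(version=2).result()
scores = server.predict_async([{"ad": 1}, {"ad": 2}]).result()
server.shutdown()
print(loaded.version, loaded.initialized, scores)
```

Because the warm-up call runs before the swap, the first real request already sees an initialized model, and the load itself never takes a thread away from the request pool.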

Conclusion and Outlook

The deployment demonstrates a robust, high-performance online CTR prediction service built on TensorFlow Serving, backed by a complete offline-to-online pipeline. Future work includes faster model iteration (for example, incorporating reinforcement learning), deeper graph and operator optimizations, and using TensorFlow's What-If Tool for model analysis.
