GPU Optimization Practices for CTR Models at Meituan
Meituan accelerates CTR model inference with four techniques: TVM‑based operator fusion, optimized CPU‑GPU data transfers, manually tuned high‑frequency subgraphs, and dynamic CPU‑GPU offloading. On Tesla T4 GPUs this yields up to ten‑fold throughput gains over CPU, with latency stable up to 128 QPS and rising only modestly beyond that. TVM compilation remains slow, and support for very large models still needs improvement.
Click‑Through‑Rate (CTR) models are widely used in search, recommendation, and advertising. With the introduction of deep neural networks, inference demands have grown, prompting Meituan to explore GPU‑based acceleration to reduce latency, increase throughput, and cut costs.
The challenges are three‑fold: (1) Application layer – diverse model structures, large embedding tables that may exceed GPU memory, and the need for rapid online updates; (2) Framework layer – TensorFlow and PyTorch expose fine‑grained operators, causing extra overhead on both CPU and GPU; (3) Hardware layer – thousands of tiny operators translate into many short‑lived GPU kernels, leading to high launch overhead and memory‑bandwidth bottlenecks.
To address these issues, Meituan adopted four main optimization techniques: operator fusion, CPU‑GPU data‑transfer optimization, high‑frequency subgraph manual tuning, and dynamic CPU‑GPU offloading.
Operator fusion is realized via TVM, which automatically merges small operators into larger, semantically equivalent kernels, dramatically reducing kernel launch count and memory traffic. A TF‑TVM partitioning workflow extracts TVM‑compatible subgraphs while keeping the rest in TensorFlow, similar to TF‑TRT but with looser coupling to the TensorFlow source.
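The partitioning idea can be illustrated with a minimal Python sketch. It is not Meituan's implementation and does not call TVM or TensorFlow: it assumes the graph has already been flattened into a topologically ordered chain of operator types, and the names SUPPORTED_OPS, partition_ops, and count_launches are hypothetical. Real partitioning works on a DAG and must also check data dependencies between subgraphs.

```python
from itertools import groupby

# Hypothetical set of operator types the compiler backend is assumed to handle.
SUPPORTED_OPS = {"MatMul", "Add", "Relu", "Mul", "Sigmoid", "ConcatV2"}


def partition_ops(op_types, supported=SUPPORTED_OPS, min_size=2):
    """Split an ordered chain of op types into (backend, ops) segments.

    Runs of supported ops with at least `min_size` members are marked "tvm";
    everything else stays with the original framework ("tf").
    """
    segments = []
    for is_supported, run in groupby(op_types, key=lambda op: op in supported):
        run = list(run)
        backend = "tvm" if is_supported and len(run) >= min_size else "tf"
        if segments and segments[-1][0] == backend:
            # Merge with the previous segment when the backend is unchanged.
            segments[-1][1].extend(run)
        else:
            segments.append((backend, run))
    return segments


def count_launches(segments):
    """Estimate launches, assuming each "tvm" segment fuses into one kernel.

    This is an idealization: a real compiler may emit several kernels.
    """
    return sum(1 if backend == "tvm" else len(ops) for backend, ops in segments)


ops = ["StringSplit", "MatMul", "Add", "Relu", "Unique",
       "Mul", "Gather", "MatMul", "Sigmoid"]
segments = partition_ops(ops)
launches = count_launches(segments)
```

The min_size threshold reflects a common design choice: a lone supported operator is usually not worth the cost of crossing the framework boundary, so it stays in the original graph.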
CPU‑GPU data‑transfer overhead is mitigated by merging inputs of identical shape and dtype before the TVM subgraph, then splitting them afterward. This reduces the number of small H2D/D2H copies, though it must respect the 4 KB kernel‑argument limit for very large input counts.
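The merge-then-split step described above can be sketched in plain Python. This is an illustration only: the input layout (name mapped to a dtype string, a shape tuple, and a flat list of values) and the function names merge_inputs and split_inputs are assumptions, and real code would pack device buffers rather than Python lists.

```python
from collections import defaultdict


def merge_inputs(inputs):
    """Pack tensors that share (dtype, shape) into one flat buffer per group.

    `inputs` maps name -> (dtype, shape, flat_values); this layout is hypothetical.
    Returns {(dtype, shape): (names, merged_values)}.
    """
    groups = defaultdict(lambda: ([], []))
    for name, (dtype, shape, values) in inputs.items():
        names, merged = groups[(dtype, shape)]
        names.append(name)
        merged.extend(values)
    return dict(groups)


def split_inputs(merged):
    """Undo merge_inputs by slicing each group buffer back into named tensors."""
    out = {}
    for (dtype, shape), (names, values) in merged.items():
        size = 1
        for dim in shape:
            size *= dim  # number of elements per tensor in this group
        for i, name in enumerate(names):
            out[name] = (dtype, shape, values[i * size:(i + 1) * size])
    return out


inputs = {
    "user_id": ("int64", (2,), [1, 2]),
    "item_id": ("int64", (2,), [3, 4]),
    "price": ("float32", (1, 3), [0.5, 1.5, 2.5]),
}
merged = merge_inputs(inputs)      # three tensors become two transfer groups
restored = split_inputs(merged)    # round trip recovers the original tensors
```

In this toy case three host-to-device copies would shrink to two, one per (dtype, shape) group; the saving grows with the number of same-shaped features.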
For subgraphs unsupported by TVM, manual high‑performance GPU kernels are written. An example is the StringEmbedding pipeline, where a custom GPU implementation uses warp_shuffle and Scan/Reduce algorithms to replace a CPU‑bound sequence, cutting latency from 42 ms to 1.83 ms.
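The source does not detail how the scan is used inside the StringEmbedding kernel. As a generic illustration of the Scan primitive it names, the Python sketch below simulates a Hillis-Steele inclusive prefix sum and uses it to give each variable-length item its write offset in a packed output buffer, which is a common reason to run a scan on a GPU. The function names and the offsets use case are assumptions, and the sequential inner loop stands in for what parallel threads would do in one synchronized step.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum, simulated step by step.

    Each pass of the while loop corresponds to one synchronous parallel step;
    on a GPU the inner loop would run across threads, not sequentially.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = result[:]  # snapshot so every "thread" reads pre-step values
        for i in range(offset, len(result)):
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
    return result


def exclusive_offsets(lengths):
    """Return (start offset of each item, total length) for packed output."""
    if not lengths:
        return [], 0
    scanned = inclusive_scan(lengths)
    return [0] + scanned[:-1], scanned[-1]


token_lengths = [3, 1, 4, 1, 5]
offsets, total = exclusive_offsets(token_lengths)
```

With the offsets known up front, every item can be written to its slot independently, which removes the serial dependency that keeps such pipelines CPU-bound.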
CPU‑GPU offloading is driven by request batch size. Small‑batch requests are processed on CPU, while larger batches are routed to GPU. A batch‑bucketing strategy selects the nearest pre‑optimized kernel configuration, achieving balanced resource utilization (≈77 % GPU, 23 % CPU in production).
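A minimal sketch of batch-size routing with bucketing follows. The threshold and bucket values are hypothetical, as is the function name route_request; the source does not give the production settings. The sketch also assumes "nearest" means rounding up to the next bucket so the batch can be padded, which is one plausible reading.

```python
import bisect

# Hypothetical values: the source does not state the production settings.
GPU_BATCH_THRESHOLD = 8
BUCKETS = [8, 16, 32, 64, 128, 256]  # batch sizes with pre-tuned kernels, sorted


def route_request(batch_size, threshold=GPU_BATCH_THRESHOLD, buckets=BUCKETS):
    """Return (device, bucket) for a request.

    Small batches stay on CPU unchanged. Larger batches go to GPU and are
    rounded up to the next bucket so the batch can be padded to that size.
    Batches above the largest bucket get the largest bucket; the caller
    would need to split them into several GPU calls.
    """
    if batch_size < threshold:
        return ("cpu", batch_size)
    idx = bisect.bisect_left(buckets, batch_size)
    if idx == len(buckets):
        return ("gpu", buckets[-1])
    return ("gpu", buckets[idx])


decisions = [route_request(n) for n in (3, 8, 20, 1000)]
```

Bucketing keeps the number of compiled kernel configurations small: only one tuned kernel per bucket is needed rather than one per observed batch size, at the cost of some padding.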
Performance tests on a Tesla T4 show that GPU inference delivers up to 10× higher throughput than CPU, with latency remaining stable up to 128 QPS and only modestly increasing beyond that.
The overall architecture abstracts the optimization flow into a platform that automatically analyses model graphs, applies suitable strategies, and validates correctness and performance.
Remaining limitations include long TVM compilation times (≈20 min per model) and the need for better support of extremely large models and online model updates. Future work will focus on accelerating compilation and enhancing usability.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
