GPU Optimization Practices for Meituan Delivery Search and Recommendation Model Inference

Meituan’s delivery search and recommendation service moved from separate CPU-only models to a unified multi-task model served on a heterogeneous CPU-GPU architecture. System-level device placement, All-On-GPU lookup, FP16 mixed precision, operator fusion, and TensorRT and TVM compilation together delivered roughly a fourfold increase in inference throughput at stable cost.

Meituan Technology Team

GPU and other specialized chips have become essential for large‑scale machine learning, especially in the AI era. This article shares the design and deployment of a GPU‑based inference architecture for Meituan’s delivery search and recommendation services, aiming to help engineers working on similar applications.

1. Introduction

Recent years have seen rapid growth of machine learning, with GPUs offering high-performance, low-cost compute. Practitioners often wonder how to leverage GPUs for their business, how to transition from CPU-only pipelines, and what impact this has on model design.

2. Background

Meituan Delivery distributes traffic via search and recommendation across multiple entry points (home page, “golden” sections, in-store pages). CTR/CVR models are core to ranking and conversion, and the traditional approach maintained separate models per entry, leading to high maintenance cost and fragmented training data.

3. Model Design in the Delivery Scenario

The team moved from multiple single-task models to a unified “One Model to Serve All” architecture, integrating CTR, CVR, and CXR predictions, and combining scene-expert and attention networks to share knowledge across entry points.

4. Service Architecture Overview

The online inference service consists of three main components: Dispatch (feature extraction), Engine (GPU-accelerated inference on a GPU-BOX with 1 × Tesla T4 and 8 CPU cores), and Booster (offline optimizer that applies hand-crafted and DL-compiler optimizations).

5. GPU Optimization Practices

5.1 System Optimizations

Device placement: manually assign compute-heavy sub-graphs (Attention, MLP) to the GPU and light sub-graphs (Embedding lookup) to the CPU, then cut host-to-device (H2D) and device-to-host (D2H) transfers from thousands to a few by concatenating tensors before each transfer.
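The idea of concatenating tensors before transfer can be sketched in plain Python. This is only an illustration under stated assumptions: the `pack` and `unpack` helpers, the `layout` dictionary, and the feature names are hypothetical and not taken from the source, and no real host-to-device copy happens here. The point is that many small buffers become one contiguous buffer plus an offset table, so one copy replaces many.

```python
from array import array

def pack(tensors):
    """Concatenate named float vectors into one buffer and record each slice."""
    buffer = array("f")          # contiguous float32 storage
    layout = {}
    for name, values in tensors.items():
        start = len(buffer)
        buffer.extend(values)
        layout[name] = (start, len(buffer))
    return buffer, layout

def unpack(buffer, layout):
    """Recover each named vector from the packed buffer using the offset table."""
    return {name: list(buffer[s:e]) for name, (s, e) in layout.items()}

# Hypothetical CPU-side outputs: 1000 small embedding vectors.
features = {f"emb_{i}": [float(i), float(i) + 0.5] for i in range(1000)}

packed, layout = pack(features)
transfers_unpacked = len(features)   # one copy per tensor
transfers_packed = 1                 # one copy for the packed buffer
restored = unpack(packed, layout)
```

In a real graph the unpack step would be a split or slice on the device side, which is cheap compared with issuing a separate copy per tensor.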

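All-On-GPU: move the entire graph, including sparse lookup, to the GPU by implementing a GPU-based LookupTable op. This removes the remaining CPU-side work and is part of the configuration that reaches the 4× QPS increase (55→220) reported in Section 6.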

Operator fusion: merge thousands of graph nodes into a small number of fused kernels, so intermediate results stay in registers instead of being written to and read back from GPU memory, and kernel launch overhead drops.
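A rough way to picture operator fusion, using plain Python lists as stand-ins for GPU buffers. The multiply, add, ReLU chain is an assumed example, not the specific operators the team fused. The unfused version materializes two full intermediate buffers; the fused version keeps the intermediate in a local variable, which plays the role of a register.

```python
def unfused(x, w, b):
    """Three separate 'kernels'; each one writes a full output buffer."""
    t1 = [xi * w for xi in x]             # kernel 1: multiply
    t2 = [ti + b for ti in t1]            # kernel 2: add bias
    return [max(ti, 0.0) for ti in t2]    # kernel 3: ReLU

def fused(x, w, b):
    """One 'kernel'; the intermediate value never leaves a local variable."""
    out = []
    for xi in x:
        t = xi * w + b                    # stays local (a register on a GPU)
        out.append(max(t, 0.0))
    return out

x = [-2.0, -0.5, 0.0, 1.5, 3.0]
unfused_out = unfused(x, 2.0, 1.0)
fused_out = fused(x, 2.0, 1.0)
```

Both functions return the same values; the difference is three passes over memory and three launches versus one of each.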

5.2 Compute Optimizations

FP16 mixed-precision: run inference in half precision, which boosts throughput with no noticeable loss in model accuracy.
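To get a feel for how little precision FP16 gives up on typical values, the sketch below rounds inputs to IEEE 754 half precision with the standard library's `struct` module (format code `e`) and accumulates in full precision, which is the usual mixed-precision pattern. The weights and inputs are made-up numbers, and the helper names are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dot(a, b, cast=lambda v: v):
    """Dot product with inputs passed through `cast`; the sum stays in float64."""
    return sum(cast(x) * cast(y) for x, y in zip(a, b))

# Made-up values for illustration only.
weights = [0.1234, 0.5678, 0.9012, 0.3456]
inputs = [1.5, 2.25, 0.75, 4.0]

full = dot(weights, inputs)                  # full-precision reference
half = dot(weights, inputs, cast=to_fp16)    # FP16 inputs, wide accumulation
rel_err = abs(full - half) / abs(full)
```

Half precision keeps about three significant decimal digits and tops out at 65504, so whether the loss is negligible for a given model still has to be checked against its own accuracy metrics.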

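Broadcast reduction: postpone broadcasting user embeddings to the item batch size until the point where they interact with item features, so user-side work is done once per request rather than once per item, eliminating redundant lookups and computation.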
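The effect of deferring the broadcast can be shown with a toy counter. Everything here is a hypothetical stand-in: `user_tower` represents any user-only sub-network, and `score` represents the user-item interaction. With an early broadcast the user-side work runs once per item; with a late broadcast it runs once per request.

```python
calls = {"user_tower": 0}

def user_tower(user_vec):
    """Stand-in for a user-only sub-network (hypothetical)."""
    calls["user_tower"] += 1
    return [2.0 * v for v in user_vec]

def score(user_out, item_vec):
    """Stand-in for the user-item interaction: a dot product."""
    return sum(u * i for u, i in zip(user_out, item_vec))

def early_broadcast(user_vec, items):
    # The user vector is tiled to batch size first, so the tower runs per item.
    tiled = [user_vec for _ in items]
    return [score(user_tower(u), it) for u, it in zip(tiled, items)]

def late_broadcast(user_vec, items):
    # The tower runs once; its output is reused for every item.
    user_out = user_tower(user_vec)
    return [score(user_out, it) for it in items]

items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
early = early_broadcast([1.0, 2.0], items)
early_calls = calls["user_tower"]
calls["user_tower"] = 0
late = late_broadcast([1.0, 2.0], items)
late_calls = calls["user_tower"]
```

The scores are identical in both cases; only the number of user-side evaluations changes, and the saving grows with the number of candidate items per request.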

High‑performance libraries: leverage Tensor Cores via cuBLAS/cuDNN on T4, and Intel MKL on CPU when applicable.

5.3 DL‑Compiler‑Based Automatic Optimizations

TensorRT: manually partition the graph into sub-graphs and replace unsupported operators (e.g., replace the unsupported Select op with Multiply) to increase the share of the graph that TensorRT can compile.
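One way such a replacement can work is sketched below. The source does not say how the Select was rewritten, so this assumes the common case where the "else" branch is all zeros: selecting `x` where a condition holds and zero elsewhere is the same as multiplying `x` by a 0/1 mask. The function names are illustrative.

```python
def select(cond, x, y):
    """Element-wise select: take x where cond is true, otherwise y."""
    return [xi if c else yi for c, xi, yi in zip(cond, x, y)]

def select_as_multiply(cond, x):
    """Equivalent to select(cond, x, zeros) using only a multiply by a mask."""
    mask = [1.0 if c else 0.0 for c in cond]
    return [xi * m for xi, m in zip(x, mask)]

cond = [True, False, True]
x = [1.5, -3.0, 2.0]
zeros = [0.0, 0.0, 0.0]

via_select = select(cond, x, zeros)
via_multiply = select_as_multiply(cond, x)
```

The equivalence only holds for finite values: an infinity or NaN in a masked-out position turns into NaN under multiplication, whereas Select would return a clean zero, so the rewrite is safe only where inputs are known to be finite.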

TVM: integrate a TVMEngineOp to compile heavy sub‑graphs (Attention, MLP) with TVM, achieving higher operator coverage than TensorRT and further performance gains.

6. Performance Results

Benchmarks on a 32-core Xeon + Tesla T4 platform show:

CPU-only inference tops out at 55 QPS (at 76% CPU utilization).

Hand-crafted GPU optimizations raise QPS to 85 (≈55% improvement), but the CPU remains the bottleneck.

TensorRT + FP16 reduces latency by ~40% at the same QPS, yet the CPU still limits throughput.

TVM + All-On-GPU + FP16 cuts latency by ~70% and boosts QPS to 220 (≈4× overall), shifting the bottleneck to the GPU.
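The percentages quoted above follow directly from the three QPS figures. A short check, using only numbers stated in this section (the variable names are just labels):

```python
baseline_qps = 55       # CPU-only
handcrafted_qps = 85    # hand-crafted GPU optimizations
all_on_gpu_qps = 220    # TVM + All-On-GPU + FP16

handcrafted_gain = (handcrafted_qps - baseline_qps) / baseline_qps
overall_speedup = all_on_gpu_qps / baseline_qps

print(f"hand-crafted gain: {handcrafted_gain:.1%}")   # 54.5%
print(f"overall speedup:   {overall_speedup:.1f}x")   # 4.0x
```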

7. Conclusion

By evolving from isolated CTR/CVR models to a unified multi-task model and migrating inference from pure CPU to a CPU+GPU heterogeneous architecture, Meituan achieved a near-fourfold increase in inference throughput while keeping cost stable. The combination of manual system tuning, DL-compiler optimizations, and full-GPU execution proved effective for large-scale recommendation workloads.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
