Engineering Practices for Large-Scale Deep Learning Models in Meituan Takeaway Advertising

The article details Meituan's engineering journey from small DNNs to hundred‑gigabyte deep learning models for food‑delivery ads. It analyzes the online latency and offline efficiency challenges this growth created, and presents distributed storage, CPU/GPU acceleration (OpenVINO, TensorRT), CodeGen, and data‑pipeline optimizations that dramatically improve throughput, memory usage, and sample‑building speed.

Meituan Technology Team

Introduction

Meituan’s food‑delivery advertising system has evolved from shallow LR models to massive deep learning models with billions of parameters. This shift brings significant challenges across the entire serving pipeline, especially online latency and offline processing efficiency.

Background

CTR scenarios now use models ranging from a few megabytes to several hundred gigabytes. As data and model sizes grow, storage, communication, and computation costs increase, leading to longer processing times both online (latency) and offline (training and sample construction).

Analysis of Challenges

The authors identify three main pain points:

Online latency: More features increase I/O and feature‑computation time, while model size grows from megabytes to hundreds of gigabytes and per‑request FLOPs rise from the millions into the tens of millions, which CPU alone cannot serve within latency budgets.

Offline efficiency: Larger sample and feature volumes extend training time, and batch processing becomes a bottleneck, motivating a "batch‑to‑stream" conversion of the sample pipeline.

Pipeline issues: Full‑stack deployment, rollback, and monitoring become more complex with large models.

Model Inference

3.1 Distributed Storage

Parameters are split into Sparse (hundreds of GB to TB) and Dense (tens of MB) parts. Sparse parameters are moved from single‑machine memory to a distributed KV store. The transformation involves two steps: (1) network‑graph reconstruction, which replaces the native GatherV2 op with a custom distributed MtGatherV2 op, and (2) Sparse‑parameter export, which shards checkpoint files, stages them on HDFS, and imports them into the KV store's buckets.
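A minimal sketch of the graph‑reconstruction step, assuming a TF1‑style frozen GraphDef. The MtGatherV2 op name comes from the article; the rewrite helper and its handling of node attributes are illustrative assumptions.

```python
# Sketch: rewrite a frozen GraphDef so embedding lookups go through the
# distributed op instead of the in-process variable lookup.
import tensorflow as tf

def rewrite_gather_to_distributed(graph_def: tf.compat.v1.GraphDef) -> tf.compat.v1.GraphDef:
    out = tf.compat.v1.GraphDef()
    for node in graph_def.node:
        new_node = out.node.add()
        new_node.CopyFrom(node)
        if node.op == "GatherV2":
            # Swap the native lookup for the KV-backed distributed version;
            # the inputs (params, indices, axis) keep the same order.
            new_node.op = "MtGatherV2"
    return out
```

In the real system the custom op would issue batched lookups against the distributed KV store rather than reading a local variable; that client logic is omitted here.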

3.2 CPU Acceleration

Two techniques are used:

Instruction‑set optimization: recompiling TensorFlow with AVX2/AVX512 enabled, yielding a >30% throughput gain.

Accelerator libraries (TVM, OpenVINO): OpenVINO fuses linear operators (e.g., Conv+BN+ReLU) and calibrates data types down to FP16/INT8, achieving 40% higher throughput and 15% lower latency than the baseline CPU implementation; a serving sketch follows this list.
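A hedged sketch of serving the dense part through OpenVINO's Python runtime, assuming the model has already been converted to OpenVINO IR (where the operator fusion and FP16/INT8 calibration happen); the file name, input shape, and throughput hint are illustrative, not from the article.

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("ctr_dense.xml")   # hypothetical IR of the fused dense graph
compiled = core.compile_model(model, "CPU",
                              config={"PERFORMANCE_HINT": "THROUGHPUT"})

request = compiled.create_infer_request()
batch = np.random.rand(256, 1024).astype(np.float32)   # hypothetical dense input
result = request.infer({compiled.input(0): batch})
scores = result[compiled.output(0)]
```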

3.3 GPU Acceleration

GPU is employed for compute‑intensive layers (the MLP) while embedding look‑ups stay on the CPU. The authors adopt a two‑stage TensorFlow + TensorRT pipeline, selecting TensorRT for its deep operator optimizations and plugin extensibility. Key techniques, illustrated by the engine‑build sketch after this list, include:

Layer fusion (e.g., CBR fusion) to reduce kernel launches.

Kernel auto‑tuning across CUDA cores and Tensor cores.

Dynamic shape support to avoid unnecessary padding.

Multi‑model, multi‑context, and multi‑stream execution to improve utilization.

CUDA Graph to capture and replay kernel sequences, replacing many per‑kernel launches (each costing microseconds of CPU overhead) with a single graph launch.
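A hedged sketch of building a TensorRT engine with FP16 and a dynamic batch dimension for the MLP part, assuming it has been exported to ONNX; the file name, tensor name, and shape bounds are illustrative assumptions, not the article's values.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("mlp.onnx", "rb") as f:          # hypothetical ONNX export of the MLP
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)      # lets kernel auto-tuning target Tensor Cores

# Dynamic batch dimension so small requests avoid padding to a fixed maximum.
profile = builder.create_optimization_profile()
profile.set_shape("dense_input", (1, 1024), (128, 1024), (512, 1024))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
```

Layer fusion and kernel auto‑tuning happen inside build_serialized_network; multi‑context, multi‑stream execution and CUDA Graph capture are runtime concerns layered on top of the resulting engine.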

GPU cache (an embedding cache in GPU memory) provides a >10× speedup over the CPU cache, and a three‑level cache hierarchy (GPU → CPU → SSD/KV) serves >90% of requests without hitting slower storage.
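A minimal sketch of the three‑level lookup, assuming dict‑like GPU/CPU caches and a KV client with a batch_get method; the article describes the hierarchy but not this interface, so all names here are hypothetical.

```python
def lookup_embeddings(keys, gpu_cache, cpu_cache, kv_client):
    """Resolve embedding vectors level by level: GPU -> CPU -> remote KV."""
    result, missing = {}, []
    for k in keys:
        vec = gpu_cache.get(k)
        if vec is None:
            vec = cpu_cache.get(k)
            if vec is not None:
                gpu_cache[k] = vec      # promote a warm key into GPU memory
        if vec is None:
            missing.append(k)
        else:
            result[k] = vec
    if missing:                         # per the article, <10% of requests get here
        for k, vec in kv_client.batch_get(missing).items():
            cpu_cache[k] = vec
            result[k] = vec
    return result
```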

Feature Service CodeGen Optimization

Feature extraction, originally interpreted from DSL, is now compiled via a CodeGen pipeline inspired by Spark’s WholeStageCodeGen. The process consists of a FrontEnd that parses DSL into an AST/DAG, an Optimizer that performs common‑subexpression elimination and constant folding, and a BackEnd that emits bytecode. This reduces runtime overhead and CPU load, delivering noticeable throughput gains.
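A toy illustration of the Optimizer stage's common‑subexpression elimination, hashing canonical forms of a small expression DAG so identical subtrees are computed once; the real pipeline operates on the DSL's AST and emits bytecode, so this representation is purely illustrative.

```python
def cse(exprs):
    """Deduplicate subtrees; returns an evaluation plan and the root temps."""
    seen, plan = {}, []

    def visit(node):
        # node is ("op", child, ...) with leaves ("feat", name) / ("const", value)
        key = (node[0],) + tuple(visit(c) if isinstance(c, tuple) else c
                                 for c in node[1:])
        if key not in seen:
            seen[key] = f"t{len(seen)}"       # assign a temp to each unique subtree
            plan.append((seen[key], key))
        return seen[key]

    return plan, [visit(e) for e in exprs]

# log(price)+1 and log(price)*ctr share the log(price) subtree, computed once:
plan, roots = cse([("add", ("log", ("feat", "price")), ("const", 1)),
                   ("mul", ("log", ("feat", "price")), ("feat", "ctr"))])
```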

Sample Construction

To ensure online‑offline consistency, the system moves from full‑snapshot joins to a staged approach:

Streaming samples: Real‑time exposure/click streams are joined with feature snapshots in memory, reducing latency but consuming substantial Kafka resources.

KV‑cache solution: Feature snapshots are cached in Redis for a few minutes, and downstream jobs pull only the items they need, dramatically lowering memory pressure (a sketch follows).
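A hedged sketch of the KV‑cache flow, assuming Redis with a short TTL as described; the key scheme, TTL value, and JSON serialization are illustrative assumptions.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
SNAPSHOT_TTL_S = 300   # keep feature snapshots around for a few minutes

def write_snapshot(request_id: str, features: dict) -> None:
    # The prediction service writes the feature snapshot at serving time.
    r.setex(f"snap:{request_id}", SNAPSHOT_TTL_S, json.dumps(features))

def join_label(request_id: str, label: int):
    # A downstream job pulls only the snapshots it needs when an
    # exposure/click event arrives on the stream.
    raw = r.get(f"snap:{request_id}")
    return None if raw is None else {**json.loads(raw), "label": label}
```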

Further optimizations include:

Data splitting into context and structured feature stores, cutting storage pressure.

Pre‑filtering join keys with Bloom filters before joins, reducing I/O (see the Bloom‑filter sketch below).

Huffman‑tree‑based join ordering to minimize data shuffling.

These changes achieve >80% storage reduction and >200% improvement in sample‑building speed.
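A toy Bloom‑filter pre‑filter showing the idea behind the second bullet: keys that cannot appear on the other side of the join are dropped before any I/O happens. Sizes, hash counts, and data are illustrative, not production values.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # No false negatives: a False here safely skips the join lookup.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Build from the smaller side's join keys, then pre-filter the larger side:
bf = BloomFilter()
for key in ["req1", "req2"]:
    bf.add(key)
rows = [("req1", "features-a"), ("req9", "features-b")]
kept = [row for row in rows if bf.might_contain(row[0])]   # "req9" is dropped
```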

Data Preparation

The platform introduces “add”, “subtract”, “multiply”, and “divide” operations on features:

Add: Automated feature recommendation by injecting new features into model configs, retraining, and evaluating offline metrics.

Subtract: Feature scoring (e.g., WOE) and systematic removal, cutting feature count by 40% with negligible impact on business metrics (a WOE sketch follows this list).

Multiply: Data‑bank construction that shares samples and embeddings across business lines, boosting AUC by 0.4% and CPM by ~1%.

Divide: Cost‑aware feature selection that maximizes value under resource constraints, enabling traffic‑aware model switching.
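A small illustration of the feature scoring behind the "subtract" step, using Weight of Evidence (WOE), which the article names; the binning scheme, smoothing constant, and pandas interface are illustrative choices.

```python
import numpy as np
import pandas as pd

def woe_by_bin(feature: pd.Series, label: pd.Series, bins: int = 10) -> pd.Series:
    """WOE per quantile bin: ln( P(bin | positive) / P(bin | negative) )."""
    df = pd.DataFrame({"bin": pd.qcut(feature, bins, duplicates="drop"),
                       "label": label})
    grouped = df.groupby("bin", observed=True)["label"]
    pos = grouped.sum()                 # positives (e.g., clicks) per bin
    neg = grouped.count() - pos         # negatives per bin
    eps = 0.5                           # smoothing to avoid log(0)
    return np.log(((pos + eps) / (pos.sum() + eps)) /
                  ((neg + eps) / (neg.sum() + eps)))
```

Features whose WOE is near zero across all bins carry little signal and become candidates for removal.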

Summary and Outlook

The engineering practices described have reduced latency, cut storage costs, and accelerated model iteration for Meituan’s ad system. Future work includes full‑stack GPU‑ification, building a massive sample data lake, standardizing the end‑to‑end pipeline (MLOps), and automating data‑model matching.

