Deep Learning System Design and Parallel Computing Solutions at Meituan

Meituan built a custom deep‑learning platform that combines data‑parallel and hybrid parallelism across multi‑GPU/cluster hardware, uses coarse‑grained scheduling and Kaldi‑derived acoustic algorithms, and supports fast NLU model hot‑updates, achieving near‑linear GPU scaling and 6–7× speedups over traditional solutions.

Meituan Technology Team

Oct 25, 2018

Background : Deep learning is a core technology of the AI era and is computationally intensive. The article introduces Meituan's experience in designing systems for deep learning, covering computational requirements, industry solutions, hardware and communication architectures, and specific platforms for NLU and acoustic model training.

Computational Demand : Training a modern CNN such as SENet on ImageNet for 100 epochs takes on the order of 10^18 FLOPs. A single CPU core delivers roughly 10^10 FLOPS, and a multi‑core CPU about 160 GFLOPS usable, so the job would take more than 180 days on a CPU. A V100 GPU (≈14 TFLOPS peak, ~7 TFLOPS usable) can finish it in ~4 days.
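The article gives the training budget only as an order of magnitude, so the day counts can be checked by working backwards. The sketch below is a back‑of‑the‑envelope calculation, not a benchmark: the function names are made up for illustration, and the 2.5 × 10^18 figure is an assumption chosen because it is the budget that both quoted durations roughly imply.

```python
SECONDS_PER_DAY = 86_400

# Sustained throughput figures quoted in the text.
CPU_FLOPS = 160e9   # ~160 GFLOPS usable on a multi-core CPU
GPU_FLOPS = 7e12    # ~7 TFLOPS usable on a V100


def training_days(total_flops: float, flops_per_second: float) -> float:
    """Days needed to execute total_flops at a sustained rate."""
    return total_flops / flops_per_second / SECONDS_PER_DAY


def implied_total_flops(days: float, flops_per_second: float) -> float:
    """Work backwards: total FLOPs that a given duration and rate imply."""
    return days * SECONDS_PER_DAY * flops_per_second


# Both quoted durations point at roughly the same budget (about 2.4-2.5e18).
cpu_implied = implied_total_flops(180, CPU_FLOPS)
gpu_implied = implied_total_flops(4, GPU_FLOPS)

# Assumed budget for illustration only; the article says "order of 10^18".
ASSUMED_TOTAL_FLOPS = 2.5e18
cpu_days = training_days(ASSUMED_TOTAL_FLOPS, CPU_FLOPS)
gpu_days = training_days(ASSUMED_TOTAL_FLOPS, GPU_FLOPS)

print(f"implied budget: CPU {cpu_implied:.2e}, GPU {gpu_implied:.2e} FLOPs")
print(f"at 2.5e18 FLOPs: CPU {cpu_days:.0f} days, GPU {gpu_days:.1f} days")
```

The roughly 45× gap between 180 days and 4 days is simply the ratio of the two sustained throughput figures.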

Industry Parallel Solutions :

Data Parallelism – each device holds a full model copy and processes different data batches, synchronizing periodically.

Model Parallelism – model layers are split across devices; useful when a single device cannot store large layers (e.g., massive Softmax).

Stream Parallelism (often called pipeline parallelism) – different devices compute different layers, with data flowing from one device to the next so that communication and computation overlap.

Hybrid Parallelism – combines the above methods as needed.
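Of the schemes above, synchronous data parallelism is the easiest to show in a few lines. The following is a minimal sketch in plain Python, assuming a toy one‑parameter linear model and equally sized shards; the function names are illustrative and not taken from Meituan's system. Each simulated worker holds the same weight, computes a gradient on its own shard, and the averaged gradient drives one shared update.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def gradient(w: float, batch: List[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w: float, batch: List[Sample],
                       num_workers: int, lr: float) -> float:
    """One synchronous update: shard the batch, average the local gradients."""
    shard_size = len(batch) // num_workers
    assert shard_size * num_workers == len(batch), "sketch assumes equal shards"
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    # Every worker evaluates the same model copy (w) on different data.
    local_grads = [gradient(w, shard) for shard in shards]
    # Synchronization point: in a real system this is an all-reduce.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


def single_device_step(w: float, batch: List[Sample], lr: float) -> float:
    """Reference update computed on the whole batch by one device."""
    return w - lr * gradient(w, batch)


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # targets follow y = 3x
w_parallel = data_parallel_step(0.0, batch, num_workers=4, lr=0.01)
w_single = single_device_step(0.0, batch, lr=0.01)
print(w_parallel, w_single)
```

With equal shard sizes the averaged gradient equals the full‑batch gradient, which is why synchronous data parallelism can speed up training without changing the optimization result.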

Hardware Deployment Options :

Single‑machine single‑GPU.

Single‑machine multi‑GPU (1×4, 1×8, 1×10).

Multi‑machine multi‑GPU clusters with InfiniBand.

Custom accelerators (e.g., Google TPU).

Communication Solutions : For a ResNet model (230 MB, 11 GFLOPs per image, batch size 128), one batch takes about 0.23 s on a V100. Transfers over PCI‑e (10 GB/s) and over the network (10 GB/s) take far less time than the GPU compute, but in multi‑GPU data‑parallel training the collective operations (Broadcast/Reduce) that keep the model copies in sync become the bottleneck. NCCL's ring‑based collectives and NVLink interconnects are presented as ways to accelerate these operations.
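To make the ring idea concrete, here is a small single‑process simulation of a ring all‑reduce (a reduce‑scatter phase followed by an all‑gather phase). It is a sketch of the general technique, not NCCL's actual implementation: the lists stand in for per‑GPU gradient buffers, and the function names are hypothetical.

```python
from typing import Callable, List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce: every rank ends with the elementwise sum."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits into n equal chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def exchange(chunk_of_rank: Callable[[int], int], accumulate: bool) -> None:
        # Snapshot every outgoing chunk first: all ranks send simultaneously.
        outgoing = []
        for rank in range(n):
            lo = chunk_of_rank(rank) * chunk
            outgoing.append((lo, data[rank][lo:lo + chunk]))
        # Each rank delivers its chunk to the next rank on the ring.
        for rank, (lo, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            for k, value in enumerate(payload):
                if accumulate:
                    data[dst][lo + k] += value
                else:
                    data[dst][lo + k] = value

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, accumulate=True)
    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, accumulate=False)
    return data


def ring_bytes_sent_per_device(n: int, size_bytes: float) -> float:
    """Each device sends 2*(n-1) chunks of size_bytes/n in total."""
    return 2 * (n - 1) / n * size_bytes


grads = [[float(rank * 10 + i) for i in range(8)] for rank in range(4)]
reduced = ring_allreduce(grads)
sent = ring_bytes_sent_per_device(4, 230e6)  # 230 MB model from the text
print(reduced[0], sent)
```

For the 230 MB model on four GPUs, each device sends about 345 MB, roughly 35 ms at the quoted 10 GB/s. The per‑device traffic approaches twice the model size as more GPUs are added rather than growing with the GPU count, which is what makes the ring layout attractive.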

Meituan’s Custom Deep Learning Platform :

General platforms (TensorFlow, MXNet) lack domain‑specific features (e.g., speech feature extraction).

Graph‑based execution incurs fine‑grained scheduling overhead.

Kaldi’s acoustic training is too slow for production.

Therefore Meituan built a proprietary system that:

Uses coarser‑grained modeling units for simpler scheduling.

Adopts data parallelism with near‑linear scaling (4‑GPU speedup ≈ 3.8× under synchronous updates).

Integrates Kaldi’s specialized algorithms, achieving 6–7× speedup (20 h vs. 6–7 days on 800 h of data).
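The "near‑linear" claim can be quantified from the quoted 4‑GPU figure. The sketch below computes parallel efficiency and the Karp‑Flatt metric, a standard estimate of the fraction of work that does not parallelize; applying that metric here is our choice of analysis, not something the article itself reports.

```python
def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_devices


def karp_flatt_serial_fraction(speedup: float, num_devices: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1 / speedup - 1 / num_devices) / (1 - 1 / num_devices)


# Figures quoted in the text: about 3.8x on 4 GPUs with synchronous updates.
efficiency = parallel_efficiency(3.8, 4)
serial_fraction = karp_flatt_serial_fraction(3.8, 4)
print(f"efficiency {efficiency:.1%}, serial fraction {serial_fraction:.2%}")
```

An efficiency of 95% corresponds to under 2% of the per‑step time being spent on work, such as gradient synchronization, that does not shrink as GPUs are added.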

NLU Online System Design :

Business characteristics: frequent algorithmic changes, multi‑stage pipelines, need for hot model updates, and a data‑driven automatic iteration loop.

Algorithm abstraction: each algorithm depends on Slots and Resources; adapters convert inputs, operators execute, and parsers trigger downstream slots.

Hot‑update workflow: new queries are blocked while the model updates, then new queries use the new model while old queries finish with the previous version, after which the old model is released.
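The hot‑update workflow can be sketched with a lock and a per‑version in‑flight counter. This is a minimal illustration under stated assumptions, not Meituan's implementation: the class names `ModelVersion` and `HotSwapModel` are hypothetical, and the new model is assumed to be fully loaded before `update` is called, so new queries are blocked only for the brief swap.

```python
import threading
from contextlib import contextmanager


class ModelVersion:
    """One loaded model plus the bookkeeping needed to release it safely."""

    def __init__(self, name, predict_fn):
        self.name = name
        self.predict = predict_fn
        self.in_flight = 0     # queries currently using this version
        self.retired = False   # True once a newer version has replaced it
        self.released = False  # True once its resources have been freed

    def release(self):
        # A real system would free memory or file handles here.
        self.released = True


class HotSwapModel:
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = initial

    @contextmanager
    def acquire(self):
        # New queries wait on the lock while a swap is in progress.
        with self._lock:
            version = self._current
            version.in_flight += 1
        try:
            yield version  # the query keeps this version until it finishes
        finally:
            with self._lock:
                version.in_flight -= 1
                if version.retired and version.in_flight == 0:
                    version.release()

    def update(self, new_version):
        with self._lock:
            old = self._current
            self._current = new_version  # later queries see the new model
            old.retired = True
            if old.in_flight == 0:
                old.release()


v1 = ModelVersion("v1", lambda q: "v1:" + q)
v2 = ModelVersion("v2", lambda q: "v2:" + q)
model = HotSwapModel(v1)

with model.acquire() as old_query_model:      # a query that started on v1
    model.update(v2)                          # hot update arrives mid-query
    old_answer = old_query_model.predict("hi")
    with model.acquire() as new_query_model:  # a query arriving after the swap
        new_answer = new_query_model.predict("hi")
    released_while_busy = v1.released
released_after_drain = v1.released
```

The old version is released only when its last in‑flight query finishes, so no query ever sees a model disappear underneath it.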

Acoustic Model Training System (Mimir) :

Coarse‑grained modeling units simplify task scheduling.

Data‑parallel training on a single machine with multiple GPUs achieves near‑linear acceleration.

Ported Kaldi’s specialized training algorithms, delivering 6–7× speedup.

Includes domain‑specific feature extraction modules.

References include NVIDIA’s NCCL paper and a Chinese blog on deep learning platform evolution.

Author : Jian Peng, algorithm expert at Meituan, responsible for acoustic model research and system design.
