Efficient Deployment Architecture for Visual Inference Services: GPU Utilization Optimization
Meituan Visual's engineering team tackled low GPU utilization, a common bottleneck in online inference services, by splitting models structurally and deploying the parts as micro-services. GPU utilization rose from roughly 40% to nearly 100% and QPS rose about threefold (2.7× and 3.6× in the two case studies). The team then generalized the approach to other GPU-based services.
0. Introduction
Online visual inference services exhibited low GPU utilization (≈40%) and high CPU‑bound preprocessing/post‑processing latency. Profiling with NVIDIA Nsight Systems identified the CPU preprocessing stage as the bottleneck that forced the GPU to wait for data.
Splitting the model into CPU‑only preprocessing/post‑processing components and a GPU‑only backbone, then deploying each as independent micro‑services, was proposed as a generic solution.
1. Background
GPU demand for inference is growing rapidly; surveys show >55% of AI‑related inference resources are already in use, yet many services under‑utilize GPU capacity.
2. Characteristics and Challenges of Visual Model Services
2.1 Optimization tools and deployment frameworks
TensorRT, TF-TRT, TVM, OpenVINO – improve runtime performance through operator fusion, dynamic memory management, and precision calibration.
TensorFlow Serving, TorchServe, NVIDIA Triton – provide model loading, versioning, batching, and RPC/HTTP interfaces.
2.2 Visual model traits
Deep networks (e.g., ResNet‑50: 49 conv layers + 1 FC layer, ~25 M parameters, 3.8 × 10⁹ FLOPs) are GPU‑friendly.
Fixed input size (224×224) requires CPU preprocessing (decode, resize, crop).
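The resize and crop steps can be sketched as plain geometry. This is a minimal illustration, not the team's code: the function names are hypothetical, and the 256-pixel shorter side is an assumed convention (common for ImageNet-style models) that the article does not state.

```python
def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) after scaling the shorter side to `target`.

    `target=256` is an assumption for illustration, not a value from the article.
    """
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    if crop > width or crop > height:
        raise ValueError("crop window is larger than the image")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Example: a 1920x1080 frame is scaled down, then cropped to the 224x224 model input.
new_w, new_h = resize_shorter_side(1920, 1080)
box = center_crop_box(new_w, new_h)
```

Both steps, along with image decoding, are ordinary CPU work, which is why they sit outside the part of the model that GPU optimization tools accelerate.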
2.3 Problems
Optimization tools focus on the backbone and ignore CPU preprocessing (e.g., TensorFlow image-decoding ops such as tf.image.decode_jpeg still run on the CPU).
Deploying multiple models (detection → cropping → OCR) is difficult: TF‑Serving/TorchServe support a single format; Triton can handle multiple formats but requires complex custom backends and ensembles.
These issues create a "bucket effect" (the shortest stave of a barrel limits how much it holds): the CPU stages, as the slowest part of the pipeline, cap overall GPU performance.
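The bottleneck effect can be modeled in a few lines. This is a deliberately simplified sketch: it assumes stages run as a pipeline, that GPU utilization is proportional to the request rate the GPU actually receives, and that CPU throughput scales linearly with cores. All stage names and throughput numbers are hypothetical.

```python
import math


def pipeline_qps(stage_qps):
    """End-to-end throughput is capped by the slowest stage."""
    return min(stage_qps.values())


def gpu_utilization(stage_qps, gpu_stage="gpu_backbone"):
    """Fraction of the GPU stage's capacity that the pipeline can actually feed."""
    return min(1.0, pipeline_qps(stage_qps) / stage_qps[gpu_stage])


def cores_to_saturate(gpu_qps, per_core_qps):
    """CPU cores needed so that preprocessing keeps up with the GPU (linear scaling assumed)."""
    return math.ceil(gpu_qps / per_core_qps)


# Hypothetical capacities in requests per second.
stages = {"cpu_preprocess": 400, "gpu_backbone": 1000}
utilization = gpu_utilization(stages)        # the GPU waits on the CPU stage
needed_cores = cores_to_saturate(1000, 50)   # cores required to feed the GPU fully
```

Under this model, speeding up the backbone alone changes nothing as long as the CPU stage remains the minimum.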
3. GPU Service Optimization Practices
3.1 Image Classification Service
Workload: tens of millions of images per day for risk‑content filtering. Pipeline: CPU preprocessing (decode, resize, crop) → ResNet‑50 backbone. After TF‑TRT, only the backbone became a TensorRT engine; preprocessing stayed on CPU.
Nsight profiling showed GPU idle periods waiting for CPU‑prepared data, confirming the preprocessing stage as the bottleneck.
3.1.1 Optimization Methods
Increase CPU cores: Adding CPUs (up to 32 cores) raised GPU utilization to 88%, but this requires hardware beyond the typical configuration of 8 CPU cores per GPU.
Front-load preprocessing: Preprocess large images offline, encode the smaller results as lossless PNG, and feed those to the service. This doubled QPS but added latency and still left the GPU under-utilized.
Separate preprocessing micro-service: Deploy preprocessing on a dedicated CPU service and the backbone on a GPU service. The cropped input is small (224×224×3 bytes ≈ 147 KiB as 8-bit pixels), so network bandwidth is sufficient for up to 10 k QPS, and the CPU service can be scaled horizontally without being tied to the number of GPUs.
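The payload arithmetic behind the third option can be checked directly. The sketch below assumes 8-bit pixels (one byte per channel) and ignores RPC framing overhead; the helper names are hypothetical. A 224×224×3 tensor works out to 150,528 bytes, about 147 KiB.

```python
def tensor_bytes(height, width, channels, bytes_per_element=1):
    """Raw size of a dense image tensor; 1 byte per element assumes uint8 pixels."""
    return height * width * channels * bytes_per_element


def required_gbps(payload_bytes, qps):
    """Aggregate network bandwidth in Gbit/s needed to ship `qps` payloads per second."""
    return payload_bytes * 8 * qps / 1e9


payload = tensor_bytes(224, 224, 3)           # 150528 bytes
payload_kib = payload / 1024                  # 147.0 KiB
aggregate = required_gbps(payload, 10_000)    # about 12 Gbit/s across the whole service
```

The 10 k QPS figure is an aggregate, so the traffic is spread over every GPU instance in the service. Sending float32 tensors instead of uint8 would quadruple the payload, which is one reason to convert the data type on the GPU side.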
3.1.2 Results
Nsight traces showed the shortest CPU‑to‑GPU copy time for the separated preprocessing approach.
Benchmark (Intel Xeon Gold 5218, Tesla T4):
CPU 32 cores → GPU utilization 88%, QPS >2×.
Front‑loaded preprocessing → QPS ≈2×, GPU utilization still sub‑optimal.
Separated preprocessing → QPS ↑ 2.7×, GPU utilization 98% (near full load).
Conclusion: Decoupling CPU preprocessing from the GPU backbone fully exploits GPU capacity.
3.2 Detection + Classification Service
Pipeline: YOLOv5 detector + ResNet‑50 classifier, with CPU preprocessing and post‑processing (NMS, cropping). Original service showed 68% GPU utilization and limited QPS.
3.2.1 Optimization Method
Model split into four micro‑services:
CPU preprocessing service.
CPU post‑processing service (NMS, cropping).
GPU detector service (YOLOv5, TensorRT via Triton).
GPU classifier service (ResNet‑50, TensorRT via Triton).
Scheduler orchestrates the pipeline, providing a unified RPC interface (Thrift).
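The scheduler's role can be sketched as a thin orchestration layer. This is an illustrative outline only: each field stands in for an RPC client to one of the four services (the article uses Thrift, which is not shown here), and the class name, field names, and stub functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Scheduler:
    """Chains the four services; each callable stands in for a remote RPC stub."""
    preprocess: Callable[[bytes], Any]        # CPU service: decode, resize
    detect: Callable[[Any], Any]              # GPU service: detector
    postprocess: Callable[[Any, Any], List]   # CPU service: NMS, cropping
    classify: Callable[[Any], Any]            # GPU service: classifier

    def infer(self, image_bytes: bytes) -> List[Any]:
        tensor = self.preprocess(image_bytes)
        raw_boxes = self.detect(tensor)
        crops = self.postprocess(tensor, raw_boxes)
        # One classifier call per crop; a real deployment might batch these.
        return [self.classify(crop) for crop in crops]


# Toy stand-ins so the control flow can be exercised without any services.
scheduler = Scheduler(
    preprocess=lambda data: list(data),
    detect=lambda tensor: [(0, 2), (2, 4)],
    postprocess=lambda tensor, boxes: [tensor[a:b] for a, b in boxes],
    classify=lambda crop: sum(crop),
)
labels = scheduler.infer(bytes([1, 2, 3, 4]))
```

Because callers see only `infer`, each service behind it can be scaled or replaced without changing the client-facing interface.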
3.2.2 Results
Comparative tests:
CPU scaled to 32 cores → GPU utilization 90%, QPS ↑ 36%.
Triton Ensemble (all sub‑models on one machine) → negligible gain.
Micro‑service split with Triton → QPS ↑ 3.6×, GPU utilization 100%.
Adding CPU cores alone reduced preprocessing latency, but CPU-GPU transfer remained the dominant cost and limited QPS. The micro-service split decoupled the workloads, so the CPU services could scale independently and the GPU could be fully exploited.
4. Generalized Efficient Inference Deployment Architecture
Architecture: split any model into CPU‑bound preprocessing/post‑processing components and a GPU‑bound backbone; deploy each as separate micro‑services; use a scheduler to chain them. Underlying frameworks can be TF‑Serving, Triton, etc.
CPU services can be horizontally scaled to match traffic, preventing CPU bottlenecks.
GPU services contain only the backbone, achieving near‑100% utilization.
Overlapping CPU-GPU data transfer with GPU computation reduces GPU idle time.
Latency impact is minimal: classification service latency increased from 42 ms to 45 ms after micro‑service conversion (RPC via Thrift).
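The overlap between data transfer and computation can be illustrated with a bounded producer/consumer queue. This is a CPU-only analogy under stated assumptions: `transfer` and `compute` are hypothetical stand-ins for the host-to-device copy and the backbone forward pass, and a real service would rely on the serving framework or CUDA streams rather than Python threads.

```python
import queue
import threading


def run_overlapped(items, transfer, compute, depth=2):
    """Stage the next inputs in a background thread while the current one is computed."""
    staged = queue.Queue(maxsize=depth)   # bounded, so staging cannot run far ahead
    done = object()                       # sentinel marking the end of the stream

    def producer():
        try:
            for item in items:
                staged.put(transfer(item))
        finally:
            staged.put(done)              # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = staged.get()
        if batch is done:
            break
        results.append(compute(batch))
    worker.join()
    return results


outputs = run_overlapped([1, 2, 3], transfer=lambda x: x * 10, compute=lambda x: x + 1)
```

The queue depth bounds how many inputs are staged ahead of the compute step, trading a little memory for fewer stalls.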
5. Summary
Model-structure splitting and micro-service deployment raised GPU utilization to near 100% (from about 40% in the classification service and 68% in the detection-plus-classification service) and increased QPS roughly threefold (2.7× and 3.6×, respectively). The approach applies to other GPU-based inference services whose CPU-side stages hold back the GPU.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
