Efficient Deployment Architecture for Visual Inference Services: GPU Utilization Optimization

Meituan Visual's engineering team addressed a common bottleneck in online inference services, low GPU utilization, by splitting model structures and deploying the parts as micro‑services. GPU utilization rose from about 40% to nearly 100%, QPS roughly tripled, and the approach was then generalized to other GPU‑based services.

Meituan Technology Team

0. Introduction

Online visual inference services exhibited low GPU utilization (≈40%) and high CPU‑bound preprocessing/post‑processing latency. Profiling with NVIDIA Nsight Systems identified the CPU preprocessing stage as the bottleneck that forced the GPU to wait for data.

Splitting the model into CPU‑only preprocessing/post‑processing components and a GPU‑only backbone, then deploying each as independent micro‑services, was proposed as a generic solution.

1. Background

GPU demand for inference is growing rapidly: surveys indicate that inference services already account for more than 55% of AI‑related compute resources, yet many online services under‑utilize their GPU capacity.

2. Characteristics and Challenges of Visual Model Services

2.1 Optimization tools and deployment frameworks

Optimization tools: TensorRT, TF‑TRT, TVM, and OpenVINO speed up inference through operator fusion, dynamic memory management, and precision calibration.

Deployment frameworks: TensorFlow Serving, TorchServe, and NVIDIA Triton provide model loading, versioning, batching, and RPC/HTTP interfaces.

2.2 Visual model traits

Deep networks (e.g., ResNet‑50: 49 conv layers + 1 FC layer, ~25 M parameters, 3.8 × 10⁹ FLOPs) are GPU‑friendly.

Fixed input size (224×224) requires CPU preprocessing (decode, resize, crop).
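
As a rough illustration of the resize and crop geometry, the sketch below computes the resized dimensions and the center crop box for a 224×224 input. The 256‑pixel short side and the function name are assumptions (a common ImageNet‑style recipe), not details from the article, and image decoding is left out so the example has no dependencies.

```python
def resize_then_center_crop_box(width, height, short_side=256, crop=224):
    """Return the resized (w, h) and the (left, top, right, bottom) crop box.

    short_side=256 is an assumed value; the article only states that the
    final input is 224x224.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so that the shorter side becomes `short_side`, keeping aspect ratio.
    scale = short_side / min(width, height)
    new_w = max(crop, round(width * scale))
    new_h = max(crop, round(height * scale))
    # Center the crop window inside the resized image.
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return (new_w, new_h), (left, top, left + crop, top + crop)


if __name__ == "__main__":
    size, box = resize_then_center_crop_box(1920, 1080)
    print("resized to", size, "crop box", box)
```

All of this work runs on the CPU before the backbone sees any data, which is why it competes with the GPU for end‑to‑end throughput.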

2.3 Problems

Optimization tools focus on the backbone and leave CPU preprocessing untouched (e.g., image decoding with ops such as tf.image.decode_jpeg still runs on the CPU).

Deploying multi‑model pipelines (detection → cropping → OCR) is difficult: TF‑Serving and TorchServe each serve only their own framework's model format, while Triton handles multiple formats but requires complex custom backends and ensembles.

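These issues create a “bucket effect”: like a barrel that holds only as much water as its shortest stave allows, the service's throughput is capped by its CPU stages, however fast the GPU backbone is.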
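
The bottleneck argument can be made concrete with a toy model, sketched here under the assumption that the stages run as a steady‑state serial pipeline. The stage names and QPS figures are hypothetical and chosen only to reproduce a 40% utilization figure; they are not measurements from the article.

```python
def pipeline_throughput(stage_qps):
    """Return (bottleneck_stage, qps): a serial pipeline runs at its slowest stage."""
    if not stage_qps:
        raise ValueError("at least one stage is required")
    bottleneck = min(stage_qps, key=stage_qps.get)
    return bottleneck, stage_qps[bottleneck]


def gpu_utilization(stage_qps, gpu_stage="gpu_backbone"):
    """Fraction of the GPU stage's capacity that the pipeline actually uses."""
    _, qps = pipeline_throughput(stage_qps)
    return qps / stage_qps[gpu_stage]


if __name__ == "__main__":
    # Hypothetical capacities: CPU preprocessing is the shortest stave.
    stages = {"cpu_preprocess": 400.0, "gpu_backbone": 1000.0}
    print(pipeline_throughput(stages))
    print(gpu_utilization(stages))
```

In this simplified view, speeding up the backbone alone changes nothing until the CPU stage's capacity is raised.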

3. GPU Service Optimization Practices

3.1 Image Classification Service

Workload: tens of millions of images per day for risk‑content filtering. Pipeline: CPU preprocessing (decode, resize, crop) → ResNet‑50 backbone. After TF‑TRT, only the backbone became a TensorRT engine; preprocessing stayed on CPU.

Nsight profiling showed GPU idle periods waiting for CPU‑prepared data, confirming the preprocessing stage as the bottleneck.

3.1.1 Optimization Methods

Increase CPU cores: Adding CPUs (up to 32 cores) raised GPU utilization to 88%, but this exceeds the typical server configuration of 8 CPU cores per GPU.

Front‑load preprocessing: Preprocess large images offline, encode the result as lossless PNG, and feed the smaller images to the service. This doubled QPS but added latency and still left the GPU under‑utilized.

Separate preprocessing micro‑service: Deploy preprocessing on a dedicated CPU service and the backbone on a GPU service. The data sent per image after cropping is small (224×224×3, quoted as ≈ 143 KB; the raw uint8 tensor is 150,528 bytes), which fits bandwidth limits for ≤10 k QPS, and the CPU service can be scaled out horizontally as traffic grows.
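
The payload and bandwidth reasoning behind the separated service can be checked with a few lines of arithmetic. This is a back‑of‑the‑envelope sketch that assumes raw uint8 tensors and ignores RPC framing and serialization overhead; the function names are illustrative, and the network capacity to compare against is not given in the article.

```python
def tensor_bytes(height, width, channels, bytes_per_value=1):
    """Size of a dense image tensor; bytes_per_value=1 assumes uint8 pixels."""
    return height * width * channels * bytes_per_value


def required_gbps(payload_bytes, qps):
    """Network bandwidth in Gbit/s needed to ship one payload per request."""
    return payload_bytes * 8 * qps / 1e9


if __name__ == "__main__":
    payload = tensor_bytes(224, 224, 3)  # 150528 bytes
    for qps in (1_000, 5_000, 10_000):
        print(qps, "QPS needs about", round(required_gbps(payload, qps), 2), "Gbit/s")
```

Sending float32 tensors instead of uint8 would quadruple these figures, so it is usually worth keeping the wire format as compact as the model input allows.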

3.1.2 Results

Nsight traces showed the shortest CPU‑to‑GPU copy time for the separated preprocessing approach.

Benchmark (Intel Xeon Gold 5218, Tesla T4):

CPU 32 cores → GPU utilization 88%, QPS >2×.

Front‑loaded preprocessing → QPS ≈2×, GPU utilization still sub‑optimal.

Separated preprocessing → QPS ↑ 2.7×, GPU utilization 98% (near full load).

Conclusion: Decoupling CPU preprocessing from the GPU backbone fully exploits GPU capacity.

3.2 Detection + Classification Service

Pipeline: YOLOv5 detector + ResNet‑50 classifier, with CPU preprocessing and post‑processing (NMS, cropping). Original service showed 68% GPU utilization and limited QPS.
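
For readers unfamiliar with the NMS step in the CPU post‑processing, here is a minimal greedy, class‑agnostic version in pure Python. It is a sketch of the general technique, not the service's implementation: the 0.5 IoU threshold is an assumed default, and production YOLOv5 pipelines typically run a vectorized, per‑class variant.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # indices of the surviving boxes
```

The loop is quadratic in the number of candidate boxes and runs entirely on the CPU, which is one reason post‑processing can hold back a fast detector.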

3.2.1 Optimization Method

The model was split into four micro‑services:

CPU preprocessing service.

CPU post‑processing service (NMS, cropping).

GPU detector service (YOLOv5, TensorRT via Triton).

GPU classifier service (ResNet‑50, TensorRT via Triton).

A scheduler orchestrates the pipeline and exposes a unified RPC interface (Thrift).
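
The scheduler's role can be sketched as a thin orchestration layer that chains the four services. In the code below each service is represented by a plain Python callable standing in for an RPC client; the class name, field names, and stub behaviors are hypothetical, and no real Thrift or Triton client API is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]


@dataclass
class Pipeline:
    """Chains the four services; each field stands in for an RPC client."""
    preprocess: Callable[[bytes], object]                        # CPU service
    detect: Callable[[object], Sequence[Box]]                    # GPU detector
    postprocess: Callable[[object, Sequence[Box]], List[object]]  # CPU: NMS + cropping
    classify: Callable[[List[object]], List[str]]                # GPU classifier

    def run(self, image_bytes: bytes) -> List[str]:
        tensor = self.preprocess(image_bytes)
        raw_boxes = self.detect(tensor)
        crops = self.postprocess(tensor, raw_boxes)
        if not crops:
            return []  # nothing detected, so skip the classifier call
        return self.classify(crops)


if __name__ == "__main__":
    # Stub services used purely for illustration.
    demo = Pipeline(
        preprocess=lambda data: {"pixels": data},
        detect=lambda tensor: [(0, 0, 4, 4), (2, 2, 8, 8)],
        postprocess=lambda tensor, boxes: [("crop", b) for b in boxes],
        classify=lambda crops: ["label-%d" % i for i, _ in enumerate(crops)],
    )
    print(demo.run(b"raw image bytes"))
```

Keeping the scheduler free of model logic means each backing service can be scaled or replaced without changing the interface that callers see.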

3.2.2 Results

Comparative tests:

CPU scaled to 32 cores → GPU utilization 90%, QPS ↑ 36%.

Triton Ensemble (all sub‑models on one machine) → negligible gain.

Micro‑service split with Triton → QPS ↑ 3.6×, GPU utilization 100%.

Increasing CPU alone reduced preprocessing latency but CPU‑GPU transfer remained dominant, limiting QPS. The micro‑service split decoupled workloads, allowing independent scaling of CPU services and full GPU exploitation.
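
Independent scaling comes down to a simple sizing rule: provision enough CPU replicas that their combined throughput exceeds what the GPU service can consume. The helper below is a sketch under that assumption; the function name, the headroom factor, and the example figures are all hypothetical.

```python
import math


def cpu_replicas_needed(gpu_qps, per_replica_qps, headroom=0.25):
    """Number of CPU service replicas needed to keep one GPU service busy.

    headroom adds spare capacity for traffic bursts (0.25 = 25% extra);
    it is an assumed safety margin, not a figure from the article.
    """
    if gpu_qps <= 0 or per_replica_qps <= 0:
        raise ValueError("throughput values must be positive")
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    return math.ceil(gpu_qps * (1 + headroom) / per_replica_qps)


if __name__ == "__main__":
    # Hypothetical: the GPU backbone sustains 1000 QPS, one CPU replica 150 QPS.
    print(cpu_replicas_needed(1000, 150))
```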

4. Generalized Efficient Inference Deployment Architecture

Architecture: split any model into CPU‑bound preprocessing/post‑processing components and a GPU‑bound backbone; deploy each as separate micro‑services; use a scheduler to chain them. Underlying frameworks can be TF‑Serving, Triton, etc.

CPU services can be horizontally scaled to match traffic, preventing CPU bottlenecks.

GPU services contain only the backbone, achieving near‑100% utilization.

Overlapping CPU‑to‑GPU data transfer with GPU computation reduces GPU idle time.

Latency impact is minimal: classification service latency increased from 42 ms to 45 ms after micro‑service conversion (RPC via Thrift).
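
The idea of overlapping data preparation with computation can be illustrated with a bounded producer/consumer queue. This is a CPU‑only stand‑in using Python threads: in a real service the overlap would come from the serving framework and asynchronous device copies, and the function name and `prefetch` parameter here are assumptions made for the example.

```python
import queue
import threading


def run_overlapped(items, prepare, compute, prefetch=4):
    """Prepare upcoming items on a worker thread while earlier ones are computed.

    Error propagation from the worker is deliberately omitted for brevity.
    """
    buffer = queue.Queue(maxsize=prefetch)  # bounded, so the producer cannot run far ahead
    done = object()                         # sentinel marking the end of the stream

    def producer():
        try:
            for item in items:
                buffer.put(prepare(item))
        finally:
            buffer.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = buffer.get()
        if batch is done:
            break
        results.append(compute(batch))  # stands in for the GPU backbone call
    worker.join()
    return results


if __name__ == "__main__":
    print(run_overlapped(range(5), prepare=lambda x: x * 2, compute=lambda x: x + 1))
```

With a single producer and a FIFO queue the results keep their input order, and the consumer only waits when preparation falls behind, which is the idle time the architecture aims to remove.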

5. Summary

Splitting the model structure and deploying the parts as micro‑services raised GPU utilization from roughly 40% (68% for the detection‑plus‑classification service) to 98–100%, with QPS gains of 2.7× for single‑model classification and 3.6× for detection plus classification. The approach applies to GPU inference services whose throughput is limited by CPU‑side preprocessing or post‑processing.
