CPU-Based Optimization of Deep Learning Inference Services
To alleviate GPU scarcity, iQIYI's cloud platform migrated deep-learning inference services to CPUs and applied system-level (MKL-DNN, OpenVINO), application-level, and algorithm-level optimizations, including thread tuning, batch sizing, NUMA placement, pruning, and quantization. These changes delivered 1-9× speedups across thousands of CPU cores while preserving latency and accuracy.
Background
With the widespread adoption of artificial‑intelligence technology in iQIYI’s video business, the deployment of deep‑learning algorithms in the cloud has caused a rapid increase in demand for compute resources, especially GPUs. The cloud‑based deep‑learning platform team aims to improve deployment efficiency, reduce operating costs, and enable algorithm and business teams to launch AI services quickly.
From an infrastructure perspective, the main challenges are the scarcity of GPU resources and low GPU utilization. Heavy training and inference workloads often lead to GPU shortages, while CPU‑based inference suffers from performance limitations. Real‑time online services usually require exclusive GPU access, yet low QPS results in utilization often below 20%.
To address this, the team explored CPU‑based inference optimization, moving services from GPU to CPU to leverage abundant CPU servers and save GPU resources.
1. Deep Learning Inference Service and Optimization Process
1.1 What is a deep‑learning inference service?
An inference service deploys a trained deep‑learning model to the cloud and provides gRPC/HTTP interfaces. It includes model loading, version management, batch processing, multi‑stream support, and API encapsulation (see Figure 2).
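As a minimal sketch of the API-encapsulation side, a TensorFlow Serving-style REST predict request body can be built as below; the endpoint path and the two-feature input shape are hypothetical, not part of the article:

```python
import json

def build_predict_request(instances):
    """Build a TensorFlow Serving REST predict payload.

    TF Serving's REST API accepts a JSON body with an
    "instances" list, one entry per input example.
    """
    return json.dumps({"instances": instances})

# A hypothetical request for a model taking two scalar features.
body = build_predict_request([[1.0, 2.0], [3.0, 4.0]])
# POST this body to http://<host>:8501/v1/models/<model_name>:predict
```

The same payload shape works for batched requests: each entry in `instances` is one example, so batching at the client is just appending more entries.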
Common frameworks include TensorFlow Serving, NVIDIA TensorRT Inference Server, and Amazon Elastic Inference. iQIYI's Jarvis platform currently offers automatic deployment based on TensorFlow Serving and supports TensorFlow, Caffe, Caffe2, MXNet, and TensorRT models, with OpenVINO and PyTorch support planned.
1.2 What is the service‑optimization workflow?
The workflow (Figure 3) is an iterative process that first defines service type and key performance indicators, then selects appropriate optimization methods based on whether the service is compute‑intensive or I/O‑intensive, latency‑sensitive or throughput‑oriented.
1.3 What performance metrics are relevant?
Key metrics include latency, throughput, and model accuracy (Figure 5). Latency and throughput are the primary concerns for service deployment.
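As a sketch of how these two metrics are typically computed from raw measurements (the latency samples below are illustrative, and nearest-rank is just one common percentile convention):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def throughput(num_requests, wall_seconds):
    """Requests served per second over a measurement window."""
    return num_requests / wall_seconds

latencies_ms = [12, 15, 11, 40, 13, 14, 16, 12, 90, 13]  # illustrative samples
p99 = percentile(latencies_ms, 99)          # tail latency
qps = throughput(len(latencies_ms), 2.0)    # requests / second
```

Tail percentiles (p99) matter more than the mean for latency-sensitive services, since a single slow outlier dominates user-visible behavior.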
2. CPU‑Based Inference Service Optimization
2.1 Methods at different levels
Optimization can be categorized into system‑level, application‑level, and algorithm‑level, each with corresponding analysis tools (Figure 6).
System-level: Accelerate computation via SIMD-based compiler optimizations, OpenMP-enabled math libraries (e.g., MKL-DNN), or vendor-provided SDKs such as Intel OpenVINO.
Application-level: Optimize concurrency, pipeline design, data pre-/post-processing, and request handling to improve end-to-end performance.
Algorithm-level: Adjust hyper-parameters, prune networks, or apply quantization to reduce model size and compute cost.
2.2 System‑level optimization practice
Two main approaches are used:
Math‑library optimization with MKL‑DNN (example shown in Figure 7).
Inference SDK optimization with Intel OpenVINO, which converts models to an intermediate representation and provides loading/inference APIs (Figure 8).
2.3 Choosing between the two approaches
Figure 9 compares the two methods. Typically, MKL‑DNN is tried first; if performance is insufficient, OpenVINO is employed.
2.4 Factors affecting system‑level performance
OpenMP parameters: Set OMP_NUM_THREADS to the number of container CPU cores, KMP_BLOCKTIME=10, and KMP_AFFINITY=granularity=fine,verbose,compact,1,0.
CPU core count: Small batch sizes benefit from 8-16 cores; large batch sizes scale linearly up to more than 20 cores.
CPU model: Newer SIMD extensions (e.g., AVX-512 on Xeon Gold 6148) can double inference speed compared to older models.
Input data format: Prefer NCHW for image models; TensorFlow natively supports NHWC, but MKL-DNN adds NCHW support.
NUMA configuration: Same-node memory access can improve performance by 5-10%.
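As a minimal sketch, the OpenMP settings above can be applied from Python before the framework is imported; the core count of 8 is a stand-in for the container's actual CPU quota:

```python
import os

# Illustrative thread-affinity settings for an MKL-DNN build of
# TensorFlow. These must be set before the framework is imported,
# since OpenMP reads them at library load time.
CONTAINER_CORES = 8  # stand-in for the container's CPU quota

os.environ["OMP_NUM_THREADS"] = str(CONTAINER_CORES)
os.environ["KMP_BLOCKTIME"] = "10"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
```

Setting these in the deployment manifest (rather than in code) achieves the same effect and keeps the values alongside the container's CPU request.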
2.5 Application‑level optimization
Identify bottlenecks via timestamps or profiling tools such as Intel VTune. Typical techniques include concurrent design, data prefetch, I/O acceleration, and hardware-assisted codecs. In a video-quality-assessment service, for example, VTune reveals that OpenCV decoding dominates CPU usage (Figure 11); parallel decoding and batch processing then improve throughput (Figure 12), halving processing time for a 720-frame video.
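The parallel-decode-then-batch pattern can be sketched as below; `decode_frame` is a hypothetical stand-in for the real OpenCV decode step, and the worker count and batch size are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_frame(frame_id):
    """Stand-in for per-frame video decoding (OpenCV in the
    article's example); returns a decoded 'frame'."""
    return {"id": frame_id, "pixels": b"\x00" * 4}

def decode_parallel(frame_ids, workers=4):
    """Decode frames concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_frame, frame_ids))

def batches(frames, batch_size):
    """Group decoded frames into fixed-size inference batches."""
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]

frames = decode_parallel(range(720))  # 720-frame video, as in the example
groups = batches(frames, 32)          # feed the model 32 frames at a time
```

`pool.map` preserves input order, so batches stay in frame order even though decoding completes out of order.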
2.6 Algorithm‑level optimization
Common methods are batch‑size tuning, model pruning, and quantization. Model‑level changes usually require collaboration with algorithm engineers to maintain accuracy.
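Quantization can be illustrated with a minimal symmetric int8 scheme; this is a simplified sketch of the idea, not the actual calibration pipeline a framework's int8 path uses:

```python
def quantize_int8(weights):
    """Post-training symmetric int8 quantization: map float weights
    onto [-127, 127] with a single per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.0, 1.0]     # illustrative float32 weights
q, s = quantize_int8(w)        # int8 values + scale
w_hat = dequantize(q, s)       # approximate reconstruction
```

The model shrinks 4× (int8 vs. float32) and integer arithmetic is faster on AVX-512 VNNI-class CPUs, at the cost of a bounded rounding error per weight, which is why accuracy must be re-validated with the algorithm team.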
2.7 Impact of batch size on CPU performance
Latency‑sensitive services use small batch sizes; throughput‑oriented services use larger batches. Figure 13 shows that increasing batch size from 1 to 2 improves throughput with minimal latency impact, while larger jumps (e.g., 8→32) yield diminishing returns on throughput but increase latency.
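The trade-off can be sketched with hypothetical per-batch latencies; the numbers below are illustrative, not measurements from Figure 13:

```python
def metrics(batch_size, batch_latency_ms):
    """Throughput (samples/s) and per-request latency for one batch."""
    samples_per_sec = batch_size / (batch_latency_ms / 1000.0)
    return samples_per_sec, batch_latency_ms

# Hypothetical profile: batch latency grows sub-linearly for small
# batches, then nearly linearly once the CPU cores are saturated.
profile = {1: 10.0, 2: 11.0, 8: 30.0, 32: 110.0}

results = {bs: metrics(bs, lat) for bs, lat in profile.items()}
# Going 1 -> 2 nearly doubles throughput for +1 ms latency;
# going 8 -> 32 raises latency ~3.7x for only ~9% more throughput.
```

This is why latency-sensitive services stop at small batches while offline throughput-oriented jobs keep increasing batch size until throughput plateaus.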
Conclusion and Outlook
The described system‑level optimizations have been deployed in more than ten applications, scaling to thousands of CPU cores and achieving 1‑9× performance gains. Future work includes adding heterogeneous accelerators (VPU, FPGA), improving elastic scheduling, and automating parameter selection to further accelerate deep‑learning inference on the cloud.
iQIYI Technical Product Team