Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes
This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.
Background
ZhiZhuan, a second‑hand e‑commerce platform, applied AI to search recommendation, intelligent quality inspection, and smart customer service. During deployment the team discovered under‑utilized GPU resources, excessive CPU preprocessing, and duplicated online/offline processing logic that increased development and debugging costs.
Problem and Solution Approach
Current Situation
The original inference architecture separated CPU and GPU into independent micro‑services. Pre‑processing ran on CPU, often becoming a performance bottleneck, while GPU inference was isolated. This design limited horizontal scaling of the CPU side.
Issues
Iteration efficiency: custom pre‑ and post‑processing logic written in various languages required separate development effort, slowing iteration for algorithm engineers who primarily use Python.
Network communication: high‑resolution images for quality inspection caused heavy inter‑service traffic, increasing latency.
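To give a sense of scale for the network cost, the following is a rough back-of-the-envelope sketch. The image dimensions and link speed are hypothetical values chosen for illustration; the source does not report them.

```python
def raw_image_bytes(width, height, channels=3, bytes_per_channel=1):
    """Size of an uncompressed image buffer in bytes."""
    return width * height * channels * bytes_per_channel


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal time to move num_bytes over a link, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Hypothetical 4000x3000 RGB frame passed between services as a decoded array
# over a 10 Gbit/s link (both numbers are assumptions, not measurements).
size = raw_image_bytes(4000, 3000)
print(size / 1e6, "MB")
print(round(transfer_seconds(size, 10) * 1e3, 1), "ms")
```

Even under these idealized assumptions, each hop between a CPU service and a GPU service adds tens of milliseconds per image, which is one reason to keep preprocessing and inference in the same process.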
Solution Idea
Framework Survey
The team evaluated three serving frameworks—Triton, TorchServe, and TensorFlow Serving—considering performance, supported frameworks, ease of use, and community support. All met performance requirements, but TorchServe offered the best integration with PyTorch, simpler custom handler development, and sufficient support for ONNX, making it the preferred choice.
Framework Selection Rationale
TorchServe tightly integrates with the PyTorch ecosystem, reducing conversion and configuration effort compared with Triton.
Long‑term roadmap favors a framework that can later support multiple back‑ends; TorchServe satisfies short‑term simplicity while leaving room for future expansion.
TorchServe Practice
Usage and Tuning
Typical workflow: package model weights and custom pre/post‑processing code into a .mar archive, register the archive with TorchServe, and handle inference requests that trigger image download, preprocessing, inference, and post‑processing.
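Once an archive exists (the torch-model-archiver command follows), the register and request steps of this workflow go through TorchServe's HTTP APIs. The sketch below only builds the requests; it assumes the default management port 8081 and inference port 8080, and the helper names are illustrative rather than part of TorchServe.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # TorchServe default management port (assumed)
INFERENCE_URL = "http://localhost:8080"   # TorchServe default inference port (assumed)


def build_register_request(mar_file, initial_workers=1):
    """POST /models?url=<archive>&initial_workers=<n> registers a .mar from the model store."""
    query = urllib.parse.urlencode({"url": mar_file, "initial_workers": initial_workers})
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")


def build_predict_request(model_name, payload):
    """POST /predictions/<model_name> sends one inference payload (raw bytes)."""
    return urllib.request.Request(
        f"{INFERENCE_URL}/predictions/{model_name}", data=payload, method="POST"
    )
```

With a running TorchServe instance, passing either request to urllib.request.urlopen performs the call; the archive and model names are placeholders to be replaced with real ones.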
torch-model-archiver --model-name your_model_name --version 1.0 \
    --serialized-file path_to_your_model.pth \
    --handler custom_handler.py \
    --extra-files path_to_any_extra_files
Custom handlers inherit from BaseHandler and implement initialize, preprocess, inference, and postprocess. This mechanism saved roughly 32 person‑days of development effort.
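A minimal sketch of such a handler is shown below. The class name, the "pixels" payload field, and the averaging "model" are hypothetical stand-ins, not the team's code; a real handler would load the serialized model in initialize. The fallback BaseHandler exists only so the sketch can be exercised on a machine without TorchServe installed.

```python
import json

try:
    from ts.torch_handler.base_handler import BaseHandler
except ImportError:
    # TorchServe not installed: minimal stand-in so the sketch still runs.
    class BaseHandler:
        def __init__(self):
            self.model = None
            self.initialized = False

        def handle(self, data, context):
            return self.postprocess(self.inference(self.preprocess(data)))


class InspectionHandler(BaseHandler):
    """Hypothetical handler; names and payload format are illustrative only."""

    def initialize(self, context):
        # A real handler would load weights from the model directory given by
        # the context; a plain function stands in for the model here.
        self.model = lambda batch: [sum(x) / len(x) for x in batch]
        self.initialized = True

    def preprocess(self, data):
        # TorchServe passes a list of requests; each payload sits under
        # "data" or "body".
        batch = []
        for row in data:
            payload = row.get("data") or row.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            batch.append([float(v) for v in payload["pixels"]])
        return batch

    def inference(self, batch):
        return self.model(batch)

    def postprocess(self, outputs):
        # One response entry per request in the batch.
        return [{"score": round(o, 4)} for o in outputs]
```

Keeping all four steps in one Python class is what lets the same pre- and post-processing code serve both offline experiments and online requests.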
Torch‑TRT Integration
To accelerate the model backbone, the team used torch‑tensorrt to compile the PyTorch model into a TensorRT engine, leveraging layer fusion, kernel auto‑tuning, and mixed‑precision execution.
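import torch
import torch_tensorrt

# Load the full pickled model (not just a state_dict) and move it to the GPU in eval mode.
# On PyTorch 2.6+, torch.load needs weights_only=False to unpickle a whole module.
model = torch.load('path_to_your_model.pth', weights_only=False).eval().cuda()

# Compile through the TorchScript frontend so the result can be saved with torch.jit.save.
trt_model = torch_tensorrt.compile(
    model,
    ir='ts',
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    # FP32 only; add torch.half to this set to allow mixed-precision kernels.
    enabled_precisions={torch.float32},
)
torch.jit.save(trt_model, 'path_to_trt_model.ts')
Benchmark results showed: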
Base TorchServe: GPU 40‑80% utilization, CPU 20‑40%, QPS 10, 2 GB memory.
Torch‑TRT: GPU 10‑50% utilization, CPU 100%, QPS 17, 680 MB memory.
While throughput increased, CPU became the new bottleneck because preprocessing remained on CPU.
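One way to reason about this result is Amdahl's law: accelerating only the inference stage cannot speed up a request by more than the share of time that stage occupied. The sketch below is illustrative; the 60% inference share is a hypothetical value, since the source does not report the time split between preprocessing and inference.

```python
def amdahl_speedup(accelerated_fraction, stage_speedup):
    """Overall speedup when only a fraction of the work is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / stage_speedup)


# Hypothetical split: inference takes 60% of request time, preprocessing 40%.
inference_share = 0.6
for stage_speedup in (2, 5, 1000):
    overall = amdahl_speedup(inference_share, stage_speedup)
    print(f"inference {stage_speedup}x faster -> request {overall:.2f}x faster")
```

Under that assumption the overall gain can never reach 2.5x however fast the GPU stage becomes, because the CPU preprocessing share is left untouched.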
Pre‑ and Post‑Processing Optimization
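The team replaced CPU‑bound OpenCV and pandas operations with GPU‑accelerated counterparts (cvCuda and cuDF), with the goal of moving image decoding, filtering, and matrix calculations to the GPU. The example below shows the pattern with OpenCV's cv2.cuda module for a Gaussian filter; in this snippet the image is still decoded on the CPU by cv2.imread before being uploaded to GPU memory.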
import cv2

# Decode on the CPU, then upload the image to GPU memory.
img = cv2.imread('your_image.jpg')
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)

# Build a 5x5 Gaussian filter (sigma 1.5) and run it on the GPU.
gaussian_filter = cv2.cuda.createGaussianFilter(gpu_img.type(), -1, (5, 5), 1.5)
blurred_gpu = gaussian_filter.apply(gpu_img)

# Copy the result back to host memory for display.
blurred_img = blurred_gpu.download()
cv2.imshow('Blurred Image (cvCuda)', blurred_img)
cv2.waitKey(0)
cv2.destroyAllWindows()
Performance tests on a node with 2 × Intel Xeon Platinum 8168 CPUs and 1 × NVIDIA A100 GPU showed a four‑fold increase in QPS (from 10 to 40) and a shift of GPU utilization to 60‑80% while keeping CPU usage around 60%.
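The reported throughput figures can be put side by side with a few lines of arithmetic. The sketch below uses only the QPS numbers quoted in this article; the stage labels and the helper name are illustrative.

```python
def relative_gains(qps_by_stage):
    """Return (name, gain vs. first stage, gain vs. previous stage) for each stage."""
    names = list(qps_by_stage)
    baseline = qps_by_stage[names[0]]
    gains = []
    previous = baseline
    for name in names:
        qps = qps_by_stage[name]
        gains.append((name, qps / baseline, qps / previous))
        previous = qps
    return gains


reported_qps = {
    "Base TorchServe": 10,
    "Torch-TRT": 17,
    "After pre/post-processing optimization": 40,
}

for name, vs_baseline, vs_previous in relative_gains(reported_qps):
    print(f"{name}: {vs_baseline:.2f}x vs. baseline, {vs_previous:.2f}x vs. previous step")
```

Read this way, the preprocessing change contributed a larger step than the Torch‑TRT compilation did, which is consistent with CPU preprocessing having been the dominant bottleneck.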
Deployment on Kubernetes
TorchServe provides Helm charts for a lightweight, highly available Kubernetes deployment. The cluster includes a model‑store pod, monitoring via Prometheus and Grafana, and the TorchServe pod itself.
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
model-store-pod               1/1     Running   0          4h35m
torchserve-7d468f9894-fvmpj   1/1     Running   0          4h33m
... (other monitoring pods) ...
This setup enables automatic failover, load balancing, rolling updates, and secure configuration.
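Failover and rolling updates depend on the cluster knowing when a TorchServe pod is able to serve traffic. TorchServe exposes a /ping endpoint on its inference port for this purpose. The sketch below shows one way a probe script might interpret it; the function names, default URL, and timeout are assumptions for illustration, not part of the team's deployment.

```python
import json
import urllib.request


def is_healthy(ping_body):
    """TorchServe's /ping returns JSON such as {"status": "Healthy"}."""
    try:
        return json.loads(ping_body).get("status") == "Healthy"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check(base_url="http://localhost:8080", timeout=2.0):
    """Return True when the inference endpoint answers /ping as healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False
```

In a Kubernetes manifest the same endpoint would more commonly be wired up as an HTTP readiness or liveness probe; the Python version is mainly useful for smoke tests after a rollout.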
Future Work
The current solution balances development speed and system performance but still faces challenges such as CPU saturation during heavy preprocessing and the inability to achieve a true “write‑once, run‑anywhere” pipeline. Future plans include supporting multi‑model inference, LLM serving, and extending the cloud‑native platform beyond the initial prototype.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
