Optimizing Resource Utilization of 58.com Deep Learning Platform: Practices and Techniques
This article details how 58.com’s end‑to‑end deep‑learning platform was optimized for higher CPU and GPU inference performance using Intel MKL, OpenVINO, mixed TensorFlow deployment, GPU virtualization, and a Prometheus‑Grafana monitoring system, achieving a 37% reduction in GPU usage and a 146% increase in average GPU utilization.
58.com’s Deep Learning Platform integrates development experiments, model training, and online prediction into a one‑stop AI development environment that supports search, recommendation, image, NLP, speech, and risk‑control applications across the company.
The platform’s architecture consists of a resource layer (GPU, CPU), a storage layer (WFS, HDFS, MySQL), a cluster management layer built on Kubernetes, Docker, and Nvidia‑Docker, an algorithm layer with TensorFlow, PyTorch, and Caffe, and a user‑access layer providing web‑based task management, model deployment, and resource monitoring.
Operational challenges identified include low CPU inference performance that forced models onto GPUs, low GPU utilization for small‑traffic models, Kubernetes' inability to schedule GPU resources at any granularity finer than a whole card, and resource configurations that no longer matched actual workload demand.
To improve CPU inference, the team integrated Intel’s MKL‑DNN library into TensorFlow Serving, creating a TensorFlow‑Serving‑MKL image that automatically applies graph optimizations, resulting in reduced latency and higher CPU usage efficiency, especially for OCR models.
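An MKL‑DNN build of TensorFlow Serving is tuned mainly through OpenMP and TF Serving threading environment variables. The helper below is a minimal sketch of that tuning, not 58.com's actual configuration: the variable names are the standard Intel OpenMP / TF Serving knobs, while the function name and the values shown are illustrative assumptions.

```python
import os


def mkl_serving_env(physical_cores: int) -> dict:
    """Hypothetical helper: environment for an MKL-enabled TF Serving launch."""
    return {
        # One OpenMP worker thread per physical core.
        "OMP_NUM_THREADS": str(physical_cores),
        # Let threads spin briefly after finishing work to cut wake-up latency.
        "KMP_BLOCKTIME": "1",
        # Pin threads compactly across physical cores.
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        # TF Serving's own intra-/inter-op parallelism settings.
        "TENSORFLOW_INTRA_OP_PARALLELISM": str(physical_cores),
        "TENSORFLOW_INTER_OP_PARALLELISM": "2",
    }


# Merge the tuning knobs into the environment before launching the server.
env = {**os.environ, **mkl_serving_env(physical_cores=16)}
```

The right values depend on the host's core count and the model's op mix, so they are typically found by benchmarking rather than fixed once.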
For GPU inference, Intel OpenVINO was incorporated: the Model Optimizer converts TensorFlow/PyTorch models to IR files (XML/BIN) in an init container, and the main container then serves them via OpenVINO Model Server, enabling accelerated inference on CPUs, GPUs, or FPGAs.
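Once the IR files are produced, OpenVINO Model Server reads a JSON configuration pointing at them. The fragment below is a sketch of that configuration format; the model name and path are illustrative, not taken from the article:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "ocr_model",
        "base_path": "/models/ocr_model",
        "target_device": "CPU"
      }
    }
  ]
}
```

Each `base_path` directory holds versioned subdirectories containing the converted `.xml`/`.bin` pair, and `target_device` selects which hardware backend runs the inference.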
GPU utilization was further enhanced through two approaches: (1) TensorFlow mixed‑deployment, which packs multiple low‑traffic models into a single serving pod, allowing shared GPU resources; (2) GPU virtualization using vGPU technology via the open‑source GPU‑Manager plugin, which slices a physical GPU into 100 logical units and allocates them per pod, dramatically increasing card sharing capability.
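For approach (1), TensorFlow Serving natively supports loading several models in one process through a model‑config file, which is how multiple low‑traffic models can share one pod. The fragment below sketches that format (TF Serving's protobuf text syntax); the model names and paths are illustrative:

```proto
model_config_list {
  config {
    name: "model_a"
    base_path: "/models/model_a"
    model_platform: "tensorflow"
  }
  config {
    name: "model_b"
    base_path: "/models/model_b"
    model_platform: "tensorflow"
  }
}
```

The server is then started with `--model_config_file=models.config`. For approach (2), GPU‑Manager exposes fractional resources that pods request like ordinary Kubernetes resources; by its convention, `tencent.com/vcuda-core: 30` would claim 30 of the 100 logical compute slices of one physical card.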
A monitoring and alerting system based on Prometheus and Grafana collects per‑pod CPU/GPU metrics, filters relevant tasks, and notifies owners via SMS, WeChat, or email, prompting timely resource re‑allocation.
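An alerting pipeline like this is usually expressed as Prometheus recording/alerting rules over per‑pod metrics. The rule below is a sketch only: the metric name `DCGM_FI_DEV_GPU_UTIL` comes from NVIDIA's DCGM exporter and is an assumption about the metrics source, and the thresholds are illustrative rather than 58.com's actual policy.

```yaml
groups:
- name: gpu-utilization
  rules:
  - alert: GpuUnderutilized
    # Average GPU utilization below 10% over a day suggests the task
    # should be moved to shared (mixed-deployment or vGPU) resources.
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) < 10
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is underutilized"
```

Alertmanager would then route such alerts to the task owners over SMS, WeChat, or email, as described above.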
After applying these optimizations, the platform reduced GPU consumption by 37% and increased average GPU utilization by 146%, demonstrating the effectiveness of the combined CPU and GPU performance enhancements and resource‑aware scheduling.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.