Artificial Intelligence · 12 min read

Optimizing Resource Utilization of 58.com Deep Learning Platform: Practices and Techniques

This article details how 58.com’s end‑to‑end deep‑learning platform was optimized for higher CPU and GPU inference performance using Intel MKL, OpenVINO, mixed TensorFlow deployment, GPU virtualization, and a Prometheus‑Grafana monitoring system, achieving a 37% reduction in GPU usage and a 146% increase in average GPU utilization.

58 Tech

58.com’s Deep Learning Platform integrates development experiments, model training, and online prediction into a one‑stop AI development environment that supports search, recommendation, image, NLP, speech, and risk‑control applications across the company.

The platform’s architecture consists of a resource layer (GPU, CPU), a storage layer (WFS, HDFS, MySQL), a cluster management layer built on Kubernetes, Docker, and Nvidia‑Docker, an algorithm layer with TensorFlow, PyTorch, and Caffe, and a user‑access layer providing web‑based task management, model deployment, and resource monitoring.

Operational challenges included low CPU inference performance that forced models onto GPUs, low GPU utilization for small‑traffic models, Kubernetes' inability to schedule GPU resources at any granularity finer than a whole card, and stale resource configurations that no longer matched actual traffic.

To improve CPU inference, the team integrated Intel's MKL‑DNN library into TensorFlow Serving, building a TensorFlow‑Serving‑MKL image that applies graph optimizations automatically; this reduced latency and made CPU usage more efficient, especially for OCR models.
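A deployment of an MKL‑enabled serving image typically tunes threading through environment variables. The sketch below is illustrative only: the image tag, model path, and thread settings are assumptions, not 58.com's actual configuration.

```shell
# Illustrative launch of an MKL-enabled TensorFlow Serving image.
# OMP_NUM_THREADS is commonly set to the physical core count, and
# KMP_BLOCKTIME=0 frees worker threads right after each parallel region.
docker run -d --name ocr-serving -p 8501:8501 \
  -v /data/models/ocr:/models/ocr \
  -e MODEL_NAME=ocr \
  -e OMP_NUM_THREADS=8 \
  -e KMP_BLOCKTIME=0 \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  tensorflow/serving:latest-mkl
```

Thread-affinity settings like these generally matter most on multi-socket hosts, where pinning MKL threads avoids cross-socket memory traffic.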

For GPU inference, Intel OpenVINO was incorporated: the Model Optimizer converts TensorFlow/PyTorch models into IR files (an XML network topology plus a BIN weights file), which are then served via OpenVINO Model Server launched from an init container, enabling accelerated inference on CPUs, GPUs, or FPGAs.
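The two-step flow can be sketched as follows; the paths, model name, and ports are placeholders rather than the platform's real layout.

```shell
# Convert a TensorFlow SavedModel to OpenVINO IR; the Model Optimizer
# writes a pair of files (.xml topology + .bin weights) into --output_dir.
mo --saved_model_dir /data/models/ocr/1 --output_dir /data/ir/ocr/1

# Serve the IR files with OpenVINO Model Server (gRPC endpoint on 9000).
docker run -d -p 9000:9000 \
  -v /data/ir/ocr:/models/ocr \
  openvino/model_server:latest \
  --model_name ocr --model_path /models/ocr --port 9000
```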

GPU utilization was further enhanced through two approaches: (1) TensorFlow mixed‑deployment, which packs multiple low‑traffic models into a single serving pod, allowing shared GPU resources; (2) GPU virtualization using vGPU technology via the open‑source GPU‑Manager plugin, which slices a physical GPU into 100 logical units and allocates them per pod, dramatically increasing card sharing capability.
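With the open‑source GPU‑Manager plugin, a pod requests a slice of a card through Kubernetes extended resources, exposed as `tencent.com/vcuda-core` (hundredths of a card) and `tencent.com/vcuda-memory` (blocks of GPU memory). The fragment below is an illustrative request for roughly 30% of one GPU, not the platform's actual quotas.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-traffic-serving
spec:
  containers:
  - name: tf-serving
    image: tensorflow/serving:latest-gpu
    resources:
      limits:
        tencent.com/vcuda-core: 30    # 30/100 of one physical GPU's compute
        tencent.com/vcuda-memory: 16  # 16 blocks of GPU memory (256 MiB each)
```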

A monitoring and alerting system based on Prometheus and Grafana collects per‑pod CPU/GPU metrics, filters relevant tasks, and notifies owners via SMS, WeChat, or email, prompting timely resource re‑allocation.
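The filtering step can be sketched as a threshold check over the collected per‑pod metrics. The threshold, metric shape, and `notify` stub below are hypothetical, standing in for the SMS/WeChat/email hooks the article describes.

```python
# Sketch of the alerting filter: flag GPU pods whose average utilization
# stays below a threshold so owners can downsize or share cards.
LOW_UTIL_THRESHOLD = 20.0  # percent; hypothetical cutoff

def find_underutilized(pod_metrics, threshold=LOW_UTIL_THRESHOLD):
    """pod_metrics: {pod_name: [gpu_util_samples_percent]} -> sorted pod names."""
    flagged = []
    for pod, samples in pod_metrics.items():
        if samples and sum(samples) / len(samples) < threshold:
            flagged.append(pod)
    return sorted(flagged)

def notify(pods):
    # Placeholder for the SMS / WeChat / email notification hook.
    return [f"Pod {p}: average GPU utilization below {LOW_UTIL_THRESHOLD}%"
            for p in pods]

metrics = {
    "ocr-serving-0": [5.0, 8.0, 12.0],     # mostly idle
    "rank-serving-0": [70.0, 85.0, 90.0],  # busy
}
alerts = notify(find_underutilized(metrics))
```

In practice the same threshold logic is usually written as a Prometheus recording or alerting rule; the Python version just makes the decision explicit.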

After applying these optimizations, the platform reduced GPU consumption by 37% and increased average GPU utilization by 146%, demonstrating the effectiveness of the combined CPU and GPU performance enhancements and resource‑aware scheduling.
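As a sanity check on what those two relative figures mean together, here is the arithmetic on a hypothetical baseline (the article gives no absolute numbers; the fleet size and starting utilization below are assumptions):

```python
# Hypothetical baseline; the article reports only relative improvements.
baseline_gpus = 100   # assumed fleet size
baseline_util = 15.0  # assumed average utilization, percent

gpus_after = baseline_gpus * (1 - 0.37)  # 37% fewer cards in use
util_after = baseline_util * (1 + 1.46)  # a 146% increase means x2.46
```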

deep learning · kubernetes · resource optimization · TensorFlow · GPU virtualization · OpenVINO · Intel MKL
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
