
Resource Utilization Optimization Practices for the 58.com Machine Learning Platform (WPAI)

This article details the architecture of 58.com's WPAI machine learning platform and the optimizations applied to training task scheduling, elastic scaling of inference services, and offline‑online resource mixing, showing how these techniques significantly improve resource utilization and inference performance in both GPU and CPU environments.

58 Tech

Background – Since September 2017, 58.com has built the WPAI (Wuba Platform of AI) to provide a one‑stop machine learning development environment, integrating frameworks such as TensorFlow, PyTorch, Caffe, and PaddlePaddle, and managing GPU/CPU/NPU resources for both offline training and online inference.

WPAI consists of a basic compute platform that centralizes hardware resource management and an algorithm application platform (including WubaNLP, Phoenix image platform, and ranking learning platform) that offers ready‑to‑use models via a web interface, greatly improving developer productivity.

Additional subsystems such as the vector search platform vSearch and the AB testing platform SunDial further enhance AI engineering efficiency.

Problem Statement – The platform suffered from low resource utilization due to peak‑valley patterns in inference services, resource contention between departments, over‑provisioned requests, and uneven GPU model usage, leading to high operational costs.

To address these issues, WPAI introduced four optimization areas: training task resource scheduling, inference service elastic scaling, offline‑online resource mixing, and model inference acceleration.

1. Training Task Resource Scheduling – A unified quota system records purchased and borrowed resources per department. The scheduler now automatically adjusts resource requests based on recent usage (e.g., GPU quota = 90% of the average of the last three runs, CPU request = 50% of recent peaks). Priority‑based pre‑emptive scheduling and heterogeneous GPU allocation further balance load, resulting in a 51% increase in GPU usage and 38% increase in CPU usage for offline training.
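The automatic request adjustment described above can be sketched in a few lines. The function names, list shapes, and the exact smoothing windows here are illustrative assumptions; only the two ratios (90% of the three‑run average for GPU, 50% of recent peaks for CPU) come from the article.

```python
def suggest_gpu_quota(recent_gpu_usage):
    """GPU quota = 90% of the average usage over the last three runs."""
    last_three = recent_gpu_usage[-3:]
    return 0.9 * sum(last_three) / len(last_three)

def suggest_cpu_request(recent_cpu_peaks):
    """CPU request = 50% of the recent peak usage."""
    return 0.5 * max(recent_cpu_peaks)

# Example: three training runs averaged 6, 8, and 10 GPUs.
print(suggest_gpu_quota([6, 8, 10]))      # 7.2
print(suggest_cpu_request([20, 32, 28]))  # 16.0
```

In practice a scheduler would clamp these suggestions against the department's purchased quota before admitting the job.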

2. Inference Service Elastic Scaling – The system monitors real‑time metrics (GPU/CPU usage, memory, QPS, latency) and applies both rule‑based and XGBoost‑based predictive models. Scaling formulas are:

targetNodes = ceil(y / expectRate × NodeNum) for scaling out and targetNodes = floor(y / expectRate × NodeNum) for scaling in, where y is the current usage rate, expectRate is the target usage rate, and NodeNum is the current node count.

Automatic scaling reacts to traffic spikes without manual intervention, while intelligent shrinkage reclaims idle resources during low‑traffic periods.
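The two scaling formulas can be sketched directly; the function name and example thresholds are illustrative, but the ceil/floor asymmetry follows the article: rounding up when adding capacity and rounding down when releasing it keeps the adjustments conservative in each direction.

```python
import math

def target_nodes(y, expect_rate, node_num, scaling_out):
    """Target replica count from the article's formulas.

    y: current usage rate; expect_rate: target usage rate;
    node_num: current node count.
    """
    raw = y / expect_rate * node_num
    return math.ceil(raw) if scaling_out else math.floor(raw)

# 10 nodes at 85% usage against a 60% target rate -> scale out to 15.
print(target_nodes(0.85, 0.60, 10, scaling_out=True))   # 15
# 10 nodes at 25% usage against a 60% target rate -> scale in to 4.
print(target_nodes(0.25, 0.60, 10, scaling_out=False))  # 4
```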

3. Offline‑Online Resource Mixing – Two strategies were evaluated; the chosen approach dynamically reassigns whole physical nodes between online inference and offline training based on quota thresholds (minRate, maxRate, expectRate). Nodes transition through intermediate "offline unnormal" and "online unnormal" states to ensure pods migrate smoothly, with roughly 65% of offline jobs completing on reclaimed inference resources at only a 1.5% kill rate.
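A minimal sketch of the threshold logic, assuming minRate and maxRate bound the online pool's acceptable usage; the function name, the concrete threshold values, and the string return values are hypothetical, and the real system moves nodes through the intermediate "unnormal" states rather than instantaneously.

```python
def rebalance_decision(online_usage, min_rate=0.3, max_rate=0.8):
    """Decide whether to move whole physical nodes between pools.

    When the online inference pool runs hot (above max_rate), reclaim
    a node from offline training; when it runs cold (below min_rate),
    lend an idle inference node to offline training.
    """
    if online_usage > max_rate:
        return "move offline node -> online pool"
    if online_usage < min_rate:
        return "move online node -> offline pool"
    return "no change"

print(rebalance_decision(0.9))  # move offline node -> online pool
print(rebalance_decision(0.2))  # move online node -> offline pool
print(rebalance_decision(0.5))  # no change
```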

4. Model Inference Acceleration – For GPU inference, TensorRT + Triton Inference Server (TIS) is used, converting models to ONNX, applying layer fusion, kernel auto‑tuning, and INT8 quantization, yielding up to 6.6× QPS improvement on T4 GPUs. For CPU inference, Intel OpenVINO optimizes models via the Model Optimizer and Model Server, with custom operator extensions, similar‑operator replacement, and model reshaping to handle dynamic input shapes, achieving up to 3× QPS increase and 70% latency reduction.

The entire workflow, including data collection via Prometheus, streaming through Kafka, real‑time processing with Flink, and storage in HDFS/Elasticsearch, is open‑sourced as the dl_inference project.

Conclusion – By integrating advanced scheduling, elastic scaling, resource mixing, and inference acceleration, WPAI markedly improves overall cluster utilization and reduces operational costs, while the open‑source tooling enables broader community adoption.

Author – Chen Xingzhen, Senior Backend Architect and Head of AI Platform at 58.com TEG AI Lab.

Tags: machine learning, AI, Kubernetes, inference acceleration, resource optimization, elastic scaling
Written by 58 Tech, the official tech channel of 58.com: a platform for tech innovation, sharing, and communication.