How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

ITPUB

Apr 27, 2022

Background

WPAI (Wuba Platform of AI), under development since September 2017, is a unified compute platform that manages GPU, CPU and NPU resources and integrates TensorFlow, PyTorch, Caffe and PaddlePaddle. It also offers ready‑made NLP, image and ranking models through a web UI.

Training Task Resource Scheduling

The original department‑quota and resource‑borrowing mechanism led to over‑booking and low utilization. To address this, an offline training resource scheduler was introduced with three key strategies:

Automatic resource adjustment: GPU quota = max(average of last three GPU utilizations, 90% of peak memory usage); CPU limit = 50% of recent peaks, request = 50% of recent averages.

Priority‑based pre‑emptive scheduling: weighted scores (department resources = 1,000,000, borrowed = 1,000) with dynamic penalties based on utilization and wait time.

Heterogeneous GPU scheduling: prefers user‑selected GPU model, falls back to other models when unavailable.
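
The three strategies above can be sketched roughly as follows. This is a minimal illustration, not WPAI's actual scheduler. The function names, the units (GPU figures as fractions of one card, CPU figures in cores), the reading of "recent peaks" as the largest recent peak, and the linear form and coefficients of the penalty terms are all assumptions. Only the max/90%/50% rules and the base weights 1,000,000 and 1,000 come from the description above.

```python
from statistics import mean
from typing import Dict, List, Optional

DEPARTMENT_WEIGHT = 1_000_000  # base score for jobs on the department's own quota
BORROWED_WEIGHT = 1_000        # base score for jobs on borrowed resources


def adjusted_gpu_quota(recent_gpu_utils: List[float], peak_memory_usage: float) -> float:
    """max(average of the last three GPU utilizations, 90% of peak memory usage)."""
    return max(mean(recent_gpu_utils[-3:]), 0.9 * peak_memory_usage)


def adjusted_cpu(recent_peaks: List[float], recent_averages: List[float]) -> Dict[str, float]:
    """CPU limit = 50% of recent peaks, request = 50% of recent averages.

    Assumption: "recent peaks" is taken as the largest recent peak and
    "recent averages" as the mean of the recent averages.
    """
    return {"limit": 0.5 * max(recent_peaks), "request": 0.5 * mean(recent_averages)}


def priority_score(borrowed: bool, utilization: float, wait_minutes: float,
                   util_penalty: float = 500.0, wait_bonus: float = 1.0) -> float:
    """Base weight adjusted by utilization and wait time.

    Hypothetical form: the article only says that penalties depend on
    utilization and wait time, so these linear terms are illustrative.
    """
    base = BORROWED_WEIGHT if borrowed else DEPARTMENT_WEIGHT
    return base - util_penalty * (1.0 - utilization) + wait_bonus * wait_minutes


def pick_gpu_model(preferred: str, free_gpus: Dict[str, int], needed: int) -> Optional[str]:
    """Prefer the user-selected model; fall back to another model with enough free cards."""
    if free_gpus.get(preferred, 0) >= needed:
        return preferred
    for model, free in free_gpus.items():
        if model != preferred and free >= needed:
            return model
    return None  # nothing available: the job keeps waiting
```

With the default coefficients, a job on department resources always outranks a job on borrowed resources unless the wait time becomes extremely long, which matches the three‑orders‑of‑magnitude gap between the two base weights.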

Result: GPU utilization of the offline training cluster rose by 51% and CPU utilization by 38%.

Inference Service Elastic Scaling

Inference workloads show “peak‑valley” patterns. WPAI implements an automatic elastic scaling system that expands or contracts pods based on real‑time metrics and model‑based predictions.

Expansion formula: ceil(y/expectRate * NodeNum)

Contraction formula: floor(y/expectRate * NodeNum)

Metrics are collected via Prometheus → Kafka → Flink and stored in HDFS. XGBoost models trained on one month of data forecast resource usage for the next 1‑5 minutes. Scaling policies consider CPU, GPU, memory, QPS, latency and error rates.
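
The two scaling formulas can be written out as a short sketch. It assumes that y is the observed or predicted usage of the scaled metric, expectRate is the target usage, and NodeNum is the current number of pods; the article does not define these symbols, so this reading is a guess. The min_nodes guard is an added safeguard and not part of the original formula.

```python
import math


def scale_out_target(y: float, expect_rate: float, node_num: int) -> int:
    """Expansion: ceil(y / expectRate * NodeNum)."""
    return math.ceil(y / expect_rate * node_num)


def scale_in_target(y: float, expect_rate: float, node_num: int, min_nodes: int = 1) -> int:
    """Contraction: floor(y / expectRate * NodeNum).

    min_nodes is an assumed lower bound so a service never scales to zero.
    """
    return max(min_nodes, math.floor(y / expect_rate * node_num))


# 4 pods at 80% usage with a 50% target: ceil(0.8 / 0.5 * 4) = ceil(6.4)
print(scale_out_target(0.8, 0.5, 4))
# 4 pods at 20% usage with a 50% target: floor(0.2 / 0.5 * 4) = floor(1.6)
print(scale_in_target(0.2, 0.5, 4))
```

Rounding up on expansion and down on contraction biases the first toward spare capacity and the second toward releasing resources.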

Offline‑Online Resource Mixing

Idle inference resources at night are handed over to training jobs using a whole‑machine handover approach. Physical nodes are grouped by resource type (CPU, P40, T4, etc.), and each group is in one of two steady states: online normal (serving inference) or offline normal (running training).

A transition from online to offline is triggered when usage falls below the threshold minRate. A transition from offline to online is triggered when usage exceeds maxRate or when pods are pending. An intermediate unnormal state prevents new scheduling during the handover.

Long‑running training tasks (>12 h) are placed in a fixed‑resource pool; short tasks use the mixed pool and can be pre‑empted by inference pods. Currently 65% of offline training jobs complete on mixed resources, with a 1.5% kill rate.
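
The handover logic can be pictured as a small state machine. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes that the usage compared against minRate and maxRate is the utilization of the online (inference) side and that "pending pods" refers to inference pods waiting for resources.

```python
from enum import Enum
from typing import Optional

LONG_TASK_HOURS = 12  # tasks expected to run longer than this use the fixed pool


class GroupState(Enum):
    ONLINE_NORMAL = "online normal"    # serving inference
    OFFLINE_NORMAL = "offline normal"  # running training
    UNNORMAL = "unnormal"              # handover in progress


def handover_target(state: GroupState, online_usage: float, pending_pods: int,
                    min_rate: float, max_rate: float) -> Optional[GroupState]:
    """Return the state a node group should move to, or None to stay put."""
    if state is GroupState.ONLINE_NORMAL and online_usage < min_rate:
        return GroupState.OFFLINE_NORMAL
    if state is GroupState.OFFLINE_NORMAL and (online_usage > max_rate or pending_pods > 0):
        return GroupState.ONLINE_NORMAL
    return None


def accepts_new_pods(state: GroupState) -> bool:
    """Nodes in the intermediate state take no new pods while they are handed over."""
    return state is not GroupState.UNNORMAL


def training_pool(expected_hours: float) -> str:
    """Long tasks go to the fixed pool; short ones use the pre-emptible mixed pool."""
    return "fixed" if expected_hours > LONG_TASK_HOURS else "mixed"
```

A controller built on this would first set the group to the unnormal state, drain or kill the remaining pods, and only then switch to the target state.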

Model Inference Acceleration

GPU acceleration uses TensorRT + Triton Inference Server. Models are converted to ONNX, then optimized via kernel fusion, precision calibration (FP16/INT8) and dynamic memory management. INT8 quantization can increase QPS up to 6.6× and reduce latency by up to 67 %.

CPU acceleration uses OpenVINO Model Optimizer and Model Server. Optimizations include layer fusion, group‑convolution merging and custom‑op extensions. Workflows handle unsupported ops by custom implementation, op substitution or input reshaping, achieving up to 3× QPS improvement and 70 % latency reduction.

All acceleration pipelines are open‑sourced at https://github.com/wuba/dl_inference.

Conclusion

The combined optimizations (training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and hardware‑specific inference acceleration) significantly improve cluster resource utilization and reduce operational costs. Future work will refine these mechanisms and explore simultaneous mixing of offline and online workloads on the same physical machines.
