
Design and Implementation of the 58 Deep Learning Online Prediction Service

This article describes the architecture, components, and deployment strategies of the 58 deep learning online prediction service, covering TensorFlow‑Serving, custom model serving, traffic forwarding, load balancing, GPU configuration, resource monitoring, and the supporting web management platform.

58 Tech

Deep learning is a core AI technology widely used in image, speech, NLP, search, and recommendation applications. Deploying models for online inference is critical, and the 58 AI Platform (WPAI) provides a unified online prediction service to streamline model rollout.

The overall architecture consists of six layers: SCF entry layer, model instance layer (TensorFlow‑Serving and custom model serving), Kubernetes management layer, resource layer (GPU/CPU/Memory/Network), storage layer (HDFS, MySQL, InfluxDB), and a web management system. Hardware resources are managed by Kubernetes, while the algorithm layer packages models using frameworks such as TensorFlow and Caffe.

TensorFlow Online Prediction Service is built on TensorFlow‑Serving, offering gRPC APIs, GPU‑accelerated inference, batching, version management, and distributed model support.
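Batching in TensorFlow‑Serving is configured at server startup through a batching parameters file. The sketch below uses the text format accepted by the open-source tensorflow_model_server; the specific values are illustrative, not the service's actual settings:

```
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
```

The file is supplied when launching tensorflow_model_server via the --enable_batching and --batching_parameters_file flags.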

The SCF layer acts as the entry point, converting incoming requests into PredictRequest objects and translating PredictResponse objects back to client responses. Users provide custom JARs to implement request/response parsing for different model types.
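The translation performed by the user-provided plugin can be illustrated as follows. The real plugins are JARs running inside SCF; this is a minimal Python sketch, and the payload field names (model, instances, scores) are assumptions, not the actual protocol:

```python
import json


def to_predict_request(raw_body: str) -> dict:
    """Parse a client JSON payload into a PredictRequest-like structure.

    Field names here ("model", "instances") are hypothetical examples of
    what a user-defined parser might accept.
    """
    payload = json.loads(raw_body)
    return {
        "model_spec": {"name": payload["model"], "signature_name": "serving_default"},
        "inputs": {"input": payload["instances"]},
    }


def to_client_response(predict_response: dict) -> str:
    """Flatten a PredictResponse-like structure back into client JSON."""
    return json.dumps({"predictions": predict_response["outputs"]["scores"]})
```

A different model type would ship a different pair of functions, which is exactly the flexibility the custom-JAR mechanism provides.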

Custom Model Prediction supports models not based on TensorFlow. It uses Docker containers and gRPC, allowing C++, Java, Python, and Go implementations. The workflow includes defining a gRPC interface, implementing the server logic, packaging it as a container image, and providing a JAR for SCF to translate the generic protocol.
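The first step of that workflow, defining a gRPC interface, might look like the sketch below. The service and field names are illustrative, not the actual WPAI protocol definition:

```protobuf
// Hypothetical generic prediction interface for custom model serving.
syntax = "proto3";

service GenericPredict {
  rpc Predict (GenericRequest) returns (GenericReply) {}
}

message GenericRequest {
  string model_name = 1;
  bytes payload = 2;  // opaque bytes, parsed by the user-provided JAR on the SCF side
}

message GenericReply {
  int32 code = 1;
  bytes payload = 2;
}
```

Keeping the payload opaque lets one interface serve C++, Java, Python, and Go backends alike, with the per-model JAR handling serialization at the SCF boundary.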

Traffic Forwarding Design originally relied on the Kubernetes Service abstraction for load balancing, which added an extra network hop. To reduce latency, SCF now forwards requests directly to backend Pods using a weighted round‑robin algorithm that dynamically adjusts node weights based on health-check results.

On a successful health check, the weighted round‑robin algorithm recovers a node's effectiveWeight toward its configured weight:

effectiveWeight += (weight - effectiveWeight + 1) >> 1

and on a failed check it halves the effective weight:

effectiveWeight /= 2

Node changes are watched via the Kubernetes API to keep the candidate pool up‑to‑date.
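The health-check-driven weighted round robin described above can be sketched as follows. SCF itself is Java; this is a minimal Python illustration using the smooth weighted round-robin selection scheme (the same scheme nginx uses), with the success/failure weight adjustments taken directly from the formulas in the text. Field names are assumptions:

```python
class Node:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight              # configured weight
        self.effective_weight = weight    # lowered on failures, recovered on successes
        self.current_weight = 0


def select(nodes):
    """Smooth weighted round robin: raise every node's current weight by its
    effective weight, pick the highest, then subtract the total from the winner."""
    total = 0
    best = None
    for n in nodes:
        n.current_weight += n.effective_weight
        total += n.effective_weight
        if best is None or n.current_weight > best.current_weight:
            best = n
    best.current_weight -= total
    return best


def on_success(node):
    # Recover effective weight toward the configured weight (formula from the text).
    node.effective_weight += (node.weight - node.effective_weight + 1) >> 1


def on_failure(node):
    # Halve the effective weight when a health check fails.
    node.effective_weight //= 2
```

With weights 5/1/1, seven consecutive selections spread the heavier node's picks evenly rather than bursting them, which is the point of the smooth variant.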

Service Deployment leverages Kubernetes Deployment objects for rolling updates, version rollbacks, and resource limits. The web UI triggers deployment actions, which the backend executes through the Kubernetes Java client.
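A Deployment of the kind the backend would create might look like the fragment below. The image name, labels, and resource values are illustrative placeholders, not the platform's actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-demo          # hypothetical task name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # keep most replicas serving during an update
      maxSurge: 1
  selector:
    matchLabels:
      app: tf-serving-demo
  template:
    metadata:
      labels:
        app: tf-serving-demo
    spec:
      containers:
      - name: serving
        image: registry.example.com/tf-serving:v1   # hypothetical image
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
```

Rolling back is then a matter of re-applying the previous revision, which the Kubernetes Java client exposes to the web backend.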

GPU Deployment Configuration requires mounting the NVIDIA driver files inside the container (via nvidia-docker-plugin) and setting the alpha.kubernetes.io/nvidia-gpu resource limit. Example commands:

docker pull nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
nvidia-docker run nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 nvidia-smi

The driver files appear under /var/lib/nvidia-docker/volumes/nvidia_driver/ and are mounted to /usr/local/nvidia inside the container.
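Wired into a Pod spec, this mount arrangement might look like the fragment below (container name and image are illustrative; the host path carries a driver-version subdirectory that varies by machine):

```yaml
spec:
  containers:
  - name: serving-gpu
    image: nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
  volumes:
  - name: nvidia-driver
    hostPath:
      # plus the driver-version subdirectory present on the host
      path: /var/lib/nvidia-docker/volumes/nvidia_driver
```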

Service Resource Monitoring uses Heapster (with cAdvisor) to collect CPU, memory, network, and filesystem metrics, storing them in InfluxDB for visualization. GPU metrics are collected by a custom module inside each container and persisted to MySQL.
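The article does not show the in-container GPU collector. One common approach, sketched here as an assumption rather than the WPAI code, is to query nvidia-smi in CSV mode and parse the result before persisting it:

```python
import csv
import io
import subprocess


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (index, util %, mem used/total in MiB)
    into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue
        idx, util, used, total = (field.strip() for field in rec)
        rows.append({
            "index": int(idx),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def query_gpu_metrics():
    """Collect per-GPU metrics; requires nvidia-smi on PATH inside the container."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]).decode()
    return parse_gpu_csv(out)
```

A collector loop would call query_gpu_metrics() on a timer and write the rows to MySQL, mirroring how Heapster feeds the CPU/memory metrics into InfluxDB.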

Web Management Platform provides task management, one‑click model deployment, and resource dashboards. Users can configure task details, resource limits, and enable GPU usage. The platform displays real‑time and historical resource usage charts for CPU, memory, network, and GPU.

In summary, the 58 deep learning online prediction service integrates TensorFlow and custom model serving, efficient traffic forwarding, robust deployment pipelines, comprehensive GPU‑aware monitoring, and a user‑friendly web UI, supporting billions of daily inference requests across recommendation, search, advertising, and other AI‑driven applications.

Tags: deep learning, Kubernetes, load balancing, GPU, TensorFlow Serving, online prediction
Written by 58 Tech, the official tech channel of 58, a platform for tech innovation, sharing, and communication.