Design and Implementation of the 58 Deep Learning Online Prediction Service
This article describes the architecture, components, and deployment strategies of the 58 deep learning online prediction service, covering TensorFlow‑Serving, custom model serving, traffic forwarding, load balancing, GPU configuration, resource monitoring, and the supporting web management platform.
Deep learning is a core AI technology widely used in image, speech, NLP, search, and recommendation applications. Deploying models for online inference is critical, and the 58 AI Platform (WPAI) provides a unified online prediction service to streamline model rollout.
The overall architecture consists of six layers: SCF entry layer, model instance layer (TensorFlow‑Serving and custom model serving), Kubernetes management layer, resource layer (GPU/CPU/Memory/Network), storage layer (HDFS, MySQL, InfluxDB), and a web management system. Hardware resources are managed by Kubernetes, while the algorithm layer packages models using frameworks such as TensorFlow and Caffe.
TensorFlow Online Prediction Service is built on TensorFlow‑Serving, offering gRPC APIs, GPU‑accelerated inference, batching, version management, and distributed model support.
The SCF layer acts as the entry point, converting incoming requests into PredictRequest objects and translating PredictResponse objects back to client responses. Users provide custom JARs to implement request/response parsing for different model types.
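The adapter the SCF layer loads from a user-supplied JAR can be sketched as follows. This is an illustrative sketch only: the class and method names (ModelAdapter, CsvFeatureAdapter) and the simplified PredictRequest/PredictResponse stand-ins are assumptions, not the actual WPAI or TensorFlow-Serving API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the TensorFlow-Serving protobuf messages.
class PredictRequest {
    final Map<String, float[]> inputs = new HashMap<>();
}

class PredictResponse {
    final Map<String, float[]> outputs = new HashMap<>();
}

// Hypothetical per-model adapter contract: translate a generic client
// payload into a PredictRequest and a PredictResponse back into bytes.
interface ModelAdapter {
    PredictRequest toPredictRequest(byte[] clientPayload);
    byte[] fromPredictResponse(PredictResponse response);
}

// Example adapter for a model fed a comma-separated feature vector.
class CsvFeatureAdapter implements ModelAdapter {
    @Override
    public PredictRequest toPredictRequest(byte[] clientPayload) {
        String[] parts = new String(clientPayload).split(",");
        float[] features = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            features[i] = Float.parseFloat(parts[i].trim());
        }
        PredictRequest req = new PredictRequest();
        req.inputs.put("features", features);
        return req;
    }

    @Override
    public byte[] fromPredictResponse(PredictResponse response) {
        // Return only the first score as text; a real adapter would
        // serialize the full output tensor map.
        float[] scores = response.outputs.get("scores");
        return String.valueOf(scores[0]).getBytes();
    }
}
```

Keeping parsing in a pluggable JAR lets one generic SCF entry point serve many model types without redeployment.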
Custom Model Prediction supports models not based on TensorFlow. It uses Docker containers and gRPC, allowing C++, Java, Python, and Go implementations. The workflow includes defining a gRPC interface, implementing the server logic, packaging it as a container image, and providing a JAR for SCF to translate the generic protocol.
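The server side of that workflow reduces to a narrow contract: opaque bytes in, opaque bytes out, so any framework can sit behind the gRPC interface. The sketch below shows only that contract in plain Java; the interface name and the toy model are illustrative assumptions, and the real gRPC/protobuf definitions are not shown.

```java
// Hypothetical generic prediction contract a custom model server would
// implement behind gRPC: the payload encoding is owned by the matching
// SCF-side JAR, so the server stays framework-agnostic.
interface CustomModelServer {
    byte[] predict(byte[] requestPayload);
}

// Toy "model" standing in for real inference logic: it returns the
// payload length as text.
class LengthModelServer implements CustomModelServer {
    @Override
    public byte[] predict(byte[] requestPayload) {
        return String.valueOf(requestPayload.length).getBytes();
    }
}
```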
Traffic Forwarding Design originally used Kubernetes Service for load balancing, incurring two hops. To reduce latency, SCF now forwards traffic directly to backend Pods using a weighted round‑robin algorithm, which dynamically adjusts node weights based on health checks.
The weighted round‑robin algorithm updates effectiveWeight using:
effectiveWeight += (weight - effectiveWeight + 1) >> 1
and halves effectiveWeight on failures:
effectiveWeight /= 2
Node changes are watched via the Kubernetes API to keep the candidate pool up‑to‑date.
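The forwarding logic above can be sketched as a smooth weighted round-robin in Java, using exactly the two update rules given: a healthy response nudges effectiveWeight back toward the configured weight, and a failure halves it. The selection step follows the well-known nginx-style smooth variant; class and field names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class Backend {
    final String addr;
    final int weight;       // configured weight
    int effectiveWeight;    // adjusted by health checks
    int currentWeight;      // scratch value for smooth selection

    Backend(String addr, int weight) {
        this.addr = addr;
        this.weight = weight;
        this.effectiveWeight = weight;
    }

    // Recover toward the configured weight after a healthy response.
    void onSuccess() {
        if (effectiveWeight < weight) {
            effectiveWeight += (weight - effectiveWeight + 1) >> 1;
        }
    }

    // Halve the effective weight after a failed health check.
    void onFailure() {
        effectiveWeight /= 2;
    }
}

class SmoothWeightedRoundRobin {
    private final List<Backend> backends = new ArrayList<>();

    void add(Backend b) { backends.add(b); }

    // Each pick raises every node's currentWeight by its effectiveWeight,
    // takes the maximum, then charges the winner the total, so requests
    // are spread proportionally to effective weights.
    Backend pick() {
        int total = 0;
        Backend best = null;
        for (Backend b : backends) {
            b.currentWeight += b.effectiveWeight;
            total += b.effectiveWeight;
            if (best == null || b.currentWeight > best.currentWeight) {
                best = b;
            }
        }
        if (best != null) {
            best.currentWeight -= total;
        }
        return best;
    }
}
```

With weights 2 and 1, three consecutive picks route two requests to the first Pod and one to the second, interleaved rather than bursty.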
Service Deployment leverages Kubernetes Deployment objects for rolling updates, version rollbacks, and resource limits. The web UI triggers deployment actions, which the backend executes through the Kubernetes Java client.
GPU Deployment Configuration requires mounting NVIDIA driver files inside the container (using nvidia-docker-plugin) and setting alpha.kubernetes.io/nvidia-gpu limits. Example commands:
docker pull nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 && nvidia-docker run nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 nvidia-smi
The driver files appear under /var/lib/nvidia-docker/volumes/nvidia_driver/ and are mounted to /usr/local/nvidia inside the container.
Service Resource Monitoring uses Heapster (with cAdvisor) to collect CPU, memory, network, and filesystem metrics, storing them in InfluxDB for visualization. GPU metrics are collected by a custom module inside each container and persisted to MySQL.
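One plausible shape for that in-container GPU collector is to shell out to nvidia-smi in CSV mode and parse one line per GPU before persisting the values. The query flags in the comment are standard nvidia-smi options; the surrounding classes and the assumption that the collector works this way are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class GpuSample {
    final int utilizationPct;
    final int memoryUsedMiB;

    GpuSample(int utilizationPct, int memoryUsedMiB) {
        this.utilizationPct = utilizationPct;
        this.memoryUsedMiB = memoryUsedMiB;
    }
}

class GpuMetricsParser {
    // Parses the output of:
    //   nvidia-smi --query-gpu=utilization.gpu,memory.used \
    //              --format=csv,noheader,nounits
    // which emits one "util, memory" line per GPU.
    static List<GpuSample> parse(String csv) {
        List<GpuSample> samples = new ArrayList<>();
        for (String line : csv.trim().split("\n")) {
            String[] cols = line.split(",");
            samples.add(new GpuSample(
                Integer.parseInt(cols[0].trim()),
                Integer.parseInt(cols[1].trim())));
        }
        return samples;
    }
}
```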
Web Management Platform provides task management, one‑click model deployment, and resource dashboards. Users can configure task details, resource limits, and enable GPU usage. The platform displays real‑time and historical resource usage charts for CPU, memory, network, and GPU.
In summary, the 58 deep learning online prediction service integrates TensorFlow and custom model serving, efficient traffic forwarding, robust deployment pipelines, comprehensive GPU‑aware monitoring, and a user‑friendly web UI, supporting billions of daily inference requests across recommendation, search, advertising, and other AI‑driven applications.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.