Artificial Intelligence · 11 min read

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

The article presents a comprehensive overview of 58.com’s AI platform built on Kubernetes, detailing its layered architecture, resource scheduling, offline training pipelines, debugging environment, distributed TensorFlow/PyTorch training, performance benchmarks, and online inference services, highlighting how the system empowers various business units with scalable AI capabilities.

58 Tech

Background

AI algorithms improve efficiency and user experience and are driving industry-wide transformation; 58.com is accelerating AI adoption across its services.

Platform Overview

The Wuba Platform of AI (WPAI) provides deep learning, traditional machine learning, and vector retrieval capabilities, serving all business units.

Overall Architecture

1. Web Management Layer: visual UI for resource requests and for task, model, log, and resource monitoring.
2. Algorithm Layer: integrates TensorFlow and PyTorch; supports DNN/CNN/RNN on CPU/GPU, in single-node and distributed training.
3. Cluster Management Layer: uses Kubernetes, Docker, and Nvidia-Docker to schedule training and inference pods and manage CPU/GPU resources.
4. Hardware Layer: GPU models (K40, P40, T4, 2080Ti) managed by Kubernetes.
5. Image Registry: stores TensorFlow, PyTorch, and TensorFlow-Serving images.
6. Log Center: stores training and inference logs.
7. Monitoring Center: Prometheus + Grafana for pod/container metrics.
8. Online Inference Service: TensorFlow-Serving, gRPC, and 58's SCF framework provide unified inference APIs.

Cluster Resource Management

CPU, memory, and GPU resources (both offline and online) are unified under Kubernetes with ResourceQuota and PriorityClass, creating private pools for purchased resources and shared pools for dynamic allocation.
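As a minimal sketch of what such a split could look like (the namespace, pool names, and quota values below are hypothetical, not 58.com's actual configuration), a private pool can be modeled as a per-namespace ResourceQuota plus a PriorityClass that ranks its pods above shared-pool workloads:

```python
def resource_quota(namespace: str, gpus: int, cpus: str, memory: str) -> dict:
    """Build a ResourceQuota manifest capping a namespace's private pool."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpus,
                "requests.memory": memory,
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }

def priority_class(name: str, value: int) -> dict:
    """Build a PriorityClass so private-pool pods outrank shared-pool pods."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": f"Priority {value} for {name} workloads",
    }

# A business unit that purchased 8 GPUs gets a capped private pool,
# while its pods carry a higher priority than shared-pool jobs.
quota = resource_quota("bu-recsys", gpus=8, cpus="64", memory="256Gi")
private = priority_class("private-pool", 1000)
```

Shared-pool jobs would run under a lower-valued PriorityClass, so the scheduler can preempt them when a private pool reclaims its purchased capacity.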

Offline Training

Supports TensorFlow and PyTorch in debugging and training environments. Debugging uses Jupyter for online code editing and GPU debugging; the training environment offers single-node and distributed modes, with data stored in WFS and models in WOS. Workflow: upload data → debug with small samples → create training task → run training → view logs/TensorBoard → obtain model and metrics.
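The "create training task" step above could be described by a payload along these lines; the field names here are hypothetical, inferred from the workflow described in the article rather than taken from WPAI's real API:

```python
from dataclasses import dataclass

@dataclass
class TrainingTask:
    """Illustrative shape of a training-task definition (field names are
    guesses): which framework and mode to run, where the data lives (WFS),
    and where the trained model lands (WOS)."""
    name: str
    framework: str      # "tensorflow" or "pytorch"
    mode: str           # "single" or "distributed"
    data_path: str      # input dataset in WFS
    model_output: str   # trained model destination in WOS
    gpus: int = 1

    def validate(self) -> None:
        if self.framework not in ("tensorflow", "pytorch"):
            raise ValueError(f"unsupported framework: {self.framework}")
        if self.mode not in ("single", "distributed"):
            raise ValueError(f"unsupported mode: {self.mode}")

task = TrainingTask(
    name="resnet50-imagenet",
    framework="tensorflow",
    mode="distributed",
    data_path="wfs://datasets/imagenet",
    model_output="wos://models/resnet50",
    gpus=4,
)
task.validate()
```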

Distributed Training

TensorFlow distributes jobs via ReplicationControllers and Services; environment variables pass the ps/worker addresses to each pod. PyTorch follows a similar pattern, broadcasting parameters across workers. Example: ResNet-50 on ImageNet trained on 4 nodes (48 h, 75.1% accuracy) and on 8 nodes (22.7 h, 73.2% accuracy).
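In classic ps/worker TensorFlow, the addresses passed through environment variables typically take the form of a TF_CONFIG value. A minimal sketch (the service hostnames are made up; the injection of per-pod values is what the platform's controller would do):

```python
import json
import os

def make_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Compose the TF_CONFIG JSON a TensorFlow process reads to find
    its role (ps or worker) and the rest of the cluster."""
    return json.dumps({
        "cluster": {"ps": ps_hosts, "worker": worker_hosts},
        "task": {"type": task_type, "index": task_index},
    })

# e.g. the pod running worker 0 would receive something like:
os.environ["TF_CONFIG"] = make_tf_config(
    ps_hosts=["ps-0.train.svc:2222"],
    worker_hosts=["worker-0.train.svc:2222", "worker-1.train.svc:2222"],
    task_type="worker",
    task_index=0,
)
```

Each pod gets the same cluster spec but a different `task` entry, which is how a single job template fans out into distinct ps and worker roles.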

Online Inference

Provides a generic framework for diverse models. The SCF service offers the RPC entry point, hot-loads model protocol jars, and forwards requests to Kubernetes-deployed inference instances (CPU/GPU). TensorFlow-Serving and PyTorch (via gRPC/Seldon) handle model loading, load balancing, and scaling.
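The request-forwarding role of the SCF entry point can be sketched as a per-model router that spreads calls across the backing inference instances. This is a toy stand-in under assumed endpoint addresses, not the SCF framework's actual code:

```python
import itertools

class InferenceRouter:
    """Toy stand-in for an RPC entry point that round-robins requests
    across the Kubernetes-deployed inference instances of each model."""

    def __init__(self):
        self._backends = {}  # model name -> cycling iterator of endpoints

    def register(self, model: str, endpoints: list) -> None:
        self._backends[model] = itertools.cycle(endpoints)

    def route(self, model: str) -> str:
        if model not in self._backends:
            raise KeyError(f"no inference instances registered for {model}")
        return next(self._backends[model])

router = InferenceRouter()
router.register("ocr", ["10.0.0.1:8500", "10.0.0.2:8500"])
first = router.route("ocr")
second = router.route("ocr")
third = router.route("ocr")
```

In the real system, health checks and scaling would be handled by Kubernetes itself; the entry point mainly resolves a model name to a live backend and forwards the gRPC call.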

Optimization Measures

GPU inference is accelerated with TensorRT and mixed-model deployment on a single GPU; CPU inference uses Intel MKL-DNN, reducing latency by roughly 50% for OCR and low-quality text recognition.
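For MKL-DNN-enabled CPU inference, much of the tuning comes down to the Intel OpenMP threading knobs set before the framework starts. A small sketch (the values are illustrative defaults, not 58.com's tuned settings):

```python
import os

def configure_cpu_inference(physical_cores: int) -> dict:
    """Set the Intel OpenMP environment variables that MKL-DNN-enabled
    TensorFlow builds honor. Must run before the framework is imported."""
    env = {
        # one OpenMP thread per physical core, not per hyperthread
        "OMP_NUM_THREADS": str(physical_cores),
        # milliseconds a thread spins before sleeping after a parallel region
        "KMP_BLOCKTIME": "1",
        # pin threads to cores to avoid cache-thrashing migrations
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    os.environ.update(env)
    return env

env = configure_cpu_inference(physical_cores=16)
```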

Conclusion

The platform's architecture, resource management, training, and inference capabilities enable 58.com's business units to develop and deploy AI models efficiently, with ongoing enhancements planned under an open-collaboration principle.

Tags: Deep Learning, Kubernetes, Resource Management, TensorFlow, PyTorch, Distributed Training, Online Inference, AI Platform
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
