Scaling JD’s AI Platform: 5K+ Containers, GPU Management, and Multi‑Tenant Kubernetes

Since September 2016, JD’s AI foundation platform has leveraged Docker and Kubernetes to build a scalable machine‑learning infrastructure that now runs over 5,000 container instances, supports more than 20 AI services, and provides unified GPU, storage, networking, and multi‑tenant capabilities for both inference and training workloads.

21CTO

Sep 17, 2017

Architecture Overview

Since September 2016, the JD AI Foundation Platform has built its machine‑learning platform on Docker and Kubernetes, continuously improving networking, GPU management, storage, logging, monitoring, and permission control. The platform now manages over 5,000 container instances, runs 20+ AI inference services (50+ APIs), and supports back‑end training. It has remained stable during large‑scale events.

Core Stack

The foundation is centered on Docker + Kubernetes, with underlying resources including CPU, GPU, FPGA, InfiniBand, OPA high‑speed networks, and various file systems. On top sit machine‑learning frameworks and algorithm libraries, followed by business applications. Management components cover permission, task, workflow, monitoring, and logging services.

Design Principles

The platform follows a "Kubernetes schedules everything" philosophy, treating inference apps and training jobs uniformly. Its key characteristics are high availability; load balancing; application packaging and isolation; automatic scaling; support for machine‑learning and big‑data frameworks (TensorFlow, Caffe, XGBoost, MXNet, Hadoop, Spark); a range of hardware resource types (CPU, GPU, FPGA, InfiniBand, OPA); full utilization of cluster resources; data isolation for security; and multi‑tenant isolation at the network, filesystem, and kernel levels.

Networking

Third‑party CNI plugins were evaluated; Calico was chosen for its BGP‑based routing without NAT, offering performance comparable to bare‑metal. NetworkPolicy and Calico’s extended policies provide both ingress and egress controls, enabling per‑user namespaces and fine‑grained rules. For external RPC exposure, the Contiv VLAN mode (underlay network) was adopted, delivering near‑bare‑metal performance.
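As a rough illustration of the per-user namespace isolation described above, the sketch below builds a NetworkPolicy manifest that restricts both ingress and egress to pods in the same namespace. It assumes the current upstream `networking.k8s.io/v1` API, which is newer than what the platform would have had in 2017, and it does not show Calico's extended policy format. The namespace `user-alice` and the policy name `isolate-tenant` are hypothetical.

```python
import json


def namespace_isolation_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that only allows traffic between pods of one namespace."""
    # An empty podSelector inside a peer matches every pod in the policy's own namespace.
    same_namespace_peer = [{"podSelector": {}}]
    return {
        "apiVersion": "networking.k8s.io/v1",  # assumption: current upstream API version
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-tenant", "namespace": namespace},  # hypothetical names
        "spec": {
            "podSelector": {},  # the policy applies to every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": same_namespace_peer}],
            "egress": [{"to": same_namespace_peer}],
        },
    }


policy = namespace_isolation_policy("user-alice")
print(json.dumps(policy, indent=2))
```

A policy this strict also blocks DNS lookups to cluster services outside the namespace, so a real deployment would normally add an egress rule for DNS.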

Storage

Kubernetes storage is provided via plugins; GlusterFS was selected for elastic, horizontally scalable file storage, while SeaweedFS handles small‑file workloads (e.g., image data for training). HDFS remains essential, accelerated by Alluxio caching. Kerberos and Ranger enforce multi‑tenant security for HDFS, and GlusterFS volumes are mounted per‑container to isolate users.

GPU Resource Management

Because the platform ran on Kubernetes 1.4, which predates multi‑GPU support, the team developed a custom GPU manager. It handles GPU detection, driver mapping, and health checks, and it schedules workloads by GPU model, memory, and availability to maximize GPU utilization.
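The source does not publish the GPU manager's code, so the following is only a minimal sketch of what scheduling by model, memory, and availability could look like. The `Gpu` record, the `pick_gpus` function, the bin-packing tie-break, and the inventory values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Gpu:
    """Hypothetical inventory record for one physical GPU."""
    node: str
    index: int
    model: str
    memory_mib: int
    healthy: bool = True
    allocated: bool = False


def pick_gpus(gpus: List[Gpu], model: str, min_memory_mib: int, count: int) -> Optional[List[Gpu]]:
    """Return `count` usable GPUs from a single node, or None if no node fits."""
    free_by_node: Dict[str, List[Gpu]] = defaultdict(list)
    for gpu in gpus:
        usable = gpu.healthy and not gpu.allocated
        if usable and gpu.model == model and gpu.memory_mib >= min_memory_mib:
            free_by_node[gpu.node].append(gpu)
    fitting = [(node, free) for node, free in free_by_node.items() if len(free) >= count]
    if not fitting:
        return None
    # Assumed policy: pack onto the node with the fewest spare GPUs, ties broken by node name.
    _, free = min(fitting, key=lambda item: (len(item[1]), item[0]))
    return sorted(free, key=lambda g: g.index)[:count]


# Illustrative inventory; node names, models, and sizes are made up.
inventory = [
    Gpu("node-a", 0, "P40", 24576),
    Gpu("node-a", 1, "P40", 24576),
    Gpu("node-a", 2, "P40", 24576, allocated=True),
    Gpu("node-b", 0, "P40", 24576),
    Gpu("node-b", 1, "P40", 24576, healthy=False),
    Gpu("node-b", 2, "K80", 12288),
]
one_gpu = pick_gpus(inventory, model="P40", min_memory_mib=16000, count=1)
two_gpus = pick_gpus(inventory, model="P40", min_memory_mib=16000, count=2)
print(one_gpu)
print(two_gpus)
```

Packing small requests onto the fullest node that still fits keeps whole nodes free for larger multi-GPU jobs, which is one common way to raise utilization. The actual platform may use a different policy.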

Load Balancing

Inference services expose RPC and HTTP interfaces; RPC uses a service registry for load balancing, while HTTP relies on the Kubernetes Ingress controller (Nginx) to route traffic to pods.
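The article does not say which registry or balancing algorithm the RPC path uses. As a hedged sketch of the general idea, the in-memory registry below hands out endpoints round-robin; the `ServiceRegistry` class, the service name, and the addresses are hypothetical.

```python
from typing import Dict, List


class ServiceRegistry:
    """Toy in-memory registry that balances RPC calls round-robin (assumed algorithm)."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, List[str]] = {}
        self._next: Dict[str, int] = {}

    def register(self, service: str, endpoint: str) -> None:
        endpoints = self._endpoints.setdefault(service, [])
        if endpoint not in endpoints:
            endpoints.append(endpoint)

    def deregister(self, service: str, endpoint: str) -> None:
        endpoints = self._endpoints.get(service, [])
        if endpoint in endpoints:
            endpoints.remove(endpoint)

    def pick(self, service: str) -> str:
        """Return the next endpoint for `service`, cycling through the registered ones."""
        endpoints = self._endpoints.get(service, [])
        if not endpoints:
            raise LookupError(f"no live endpoints for {service!r}")
        position = self._next.get(service, 0) % len(endpoints)
        self._next[service] = position + 1
        return endpoints[position]


registry = ServiceRegistry()
registry.register("image-classify", "10.0.0.11:9000")
registry.register("image-classify", "10.0.0.12:9000")
picks = [registry.pick("image-classify") for _ in range(3)]
print(picks)
```

In a cluster, pods would register on startup and be removed when a health check fails, so clients only ever see live endpoints.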

CI/CD Pipeline

GitLab, Jenkins, and Harbor constitute the CI/CD stack. Code is pushed to GitLab; Jenkins then builds and packages Docker images, stores them in Harbor, and deploys them to the Kubernetes cluster.

Logging and Monitoring

Logging follows the EFK stack: container stdout → Docker daemon → Fluentd → Kafka → Fluentd → Elasticsearch → Kibana. Kafka provides buffering and enables downstream consumption. Monitoring uses Heapster, InfluxDB, and Grafana, with custom extensions to aggregate service‑level metrics.

Spark on Kubernetes

Spark is scheduled natively on Kubernetes rather than run as Spark Standalone inside Docker, an approach that costs performance and lacks multi‑tenant isolation. In the native design, the Driver and each Executor run as separate containers, which gives every job namespace isolation and lets different jobs use different Spark versions.
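To make the namespace isolation and version flexibility concrete, the sketch below assembles a `spark-submit` command for native Kubernetes scheduling. It assumes the configuration keys from upstream Apache Spark 2.3 and later (`spark.kubernetes.namespace`, `spark.kubernetes.container.image`, `spark.executor.instances`), which may differ from what the platform used in 2017. The API server address, namespace, image name, and jar path are hypothetical.

```python
import shlex
from typing import List


def spark_submit_command(api_server: str, namespace: str, image: str,
                         executors: int, main_class: str, app_jar: str) -> List[str]:
    """Build a spark-submit invocation that runs driver and executors as Kubernetes pods."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",  # the driver itself runs in a pod
        "--class", main_class,
        # Tenant isolation: all pods of this job land in the tenant's namespace.
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # Version flexibility: each job chooses its own Spark image.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


command = spark_submit_command(
    api_server="https://k8s.example.internal:6443",        # hypothetical address
    namespace="team-a",                                    # hypothetical tenant namespace
    image="registry.example.internal/spark:2.4.8",         # hypothetical image
    executors=4,
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # hypothetical path
)
print(" ".join(shlex.quote(part) for part in command))
```

Because the image is a per-job setting, two tenants can run different Spark versions side by side without sharing a standalone cluster.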

Compute‑Data Separation

The platform adopts a centralized storage model, separating compute from data and relying on high‑speed networking and I/O technologies (25 GbE, RDMA, SPDK) to connect them. Benchmarks show a performance loss of only about 3% for MLlib algorithms when storage is decoupled from compute, which makes a single unified architecture practical for diverse AI/ML frameworks.

Conclusion

Kubernetes provides a cloud‑native foundation for JD’s AI platform, supporting multi‑tenant, GPU‑aware, and big‑data workloads with robust networking, storage, and operational tooling.
