Scaling JD’s AI Platform: 5K+ Containers, GPU Management, and Multi‑Tenant Kubernetes
Since September 2016, JD’s AI foundation platform has leveraged Docker and Kubernetes to build a scalable machine‑learning infrastructure that now runs over 5,000 container instances, supports more than 20 AI services, and provides unified GPU, storage, networking, and multi‑tenant capabilities for both inference and training workloads.
Architecture Overview
The JD AI Foundation Platform has run its machine-learning workloads on Docker and Kubernetes since September 2016, with continuous improvements to networking, GPU management, storage, logging, monitoring, and permission control. It now manages more than 5,000 container instances and runs over 20 AI inference services exposing more than 50 APIs, alongside back-end training jobs, and it has remained stable during large-scale events.
Core Stack
The foundation is Docker plus Kubernetes, running on underlying resources that include CPU, GPU, FPGA, InfiniBand and Omni-Path (OPA) high-speed networks, and several file systems. Machine-learning frameworks and algorithm libraries sit on top of that layer, and business applications sit on top of those. Management components cover permissions, tasks, workflows, monitoring, and logging.
Design Principles
The platform follows a "Kubernetes schedules everything" philosophy and treats inference applications and training jobs uniformly. Its key characteristics are high availability, load balancing, application packaging and isolation, and automatic scaling. It supports machine-learning and big-data frameworks (TensorFlow, Caffe, XGBoost, MXNet, Hadoop, Spark) and a wide range of hardware resources (CPU, GPU, FPGA, InfiniBand, OPA), and it aims to use the full capacity of the cluster. For security, data is isolated per user, and tenants are isolated at the network, filesystem, and kernel levels.
Networking
Third‑party CNI plugins were evaluated; Calico was chosen for its BGP‑based routing without NAT, offering performance comparable to bare‑metal. NetworkPolicy and Calico’s extended policies provide both ingress and egress controls, enabling per‑user namespaces and fine‑grained rules. For external RPC exposure, the Contiv VLAN mode (underlay network) was adopted, delivering near‑bare‑metal performance.
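As a rough illustration of per-user namespace isolation with ingress and egress rules, the sketch below builds a NetworkPolicy manifest as a Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. It assumes the current networking.k8s.io/v1 schema, which may differ from the API version and Calico extensions the platform actually used; the policy name and the namespace "user-alice" are hypothetical.

```python
import json


def tenant_isolation_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that keeps pod traffic inside one tenant namespace.

    Assumption: the networking.k8s.io/v1 schema; names are illustrative only.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            # Allow traffic only from and to pods in the same namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
            "egress": [{"to": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(tenant_isolation_policy("user-alice"), indent=2))
```

A policy this strict also blocks DNS lookups and any shared services outside the namespace, so a real deployment would likely add explicit egress rules for those.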
Storage
Kubernetes storage is provided via plugins; GlusterFS was selected for elastic, horizontally scalable file storage, while SeaweedFS handles small‑file workloads (e.g., image data for training). HDFS remains essential, accelerated by Alluxio caching. Kerberos and Ranger enforce multi‑tenant security for HDFS, and GlusterFS volumes are mounted per‑container to isolate users.
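One way to picture per-container GlusterFS mounts is a pod manifest that mounts only the volume belonging to its user. The sketch below is an assumption about how that might look, not the platform's actual configuration: it uses the in-tree glusterfs volume type that existed in Kubernetes releases of that era (later removed upstream), and the endpoints name, volume naming scheme, and pod name are hypothetical.

```python
import json


def user_pod_with_volume(user: str, image: str,
                         endpoints: str = "glusterfs-cluster") -> dict:
    """Build a pod manifest that mounts only this user's GlusterFS volume.

    Assumptions: one namespace and one GlusterFS volume per user, named
    after the user; "glusterfs-cluster" is a hypothetical Endpoints object.
    """
    volume_name = f"{user}-data"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{user}-train", "namespace": user},
        "spec": {
            "containers": [{
                "name": "worker",
                "image": image,
                # The container sees only its own user's data under /data.
                "volumeMounts": [{"name": volume_name, "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": volume_name,
                "glusterfs": {
                    "endpoints": endpoints,
                    "path": f"vol-{user}",  # hypothetical per-user volume name
                    "readOnly": False,
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(user_pod_with_volume("alice", "example/trainer:latest"), indent=2))
```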
GPU Resource Management
The platform started on Kubernetes 1.4, which predated native multi-GPU support, so the team built a custom GPU manager. It handles GPU detection, driver mapping, and health checks, and it schedules workloads by GPU model, memory, and availability to maximize GPU utilization.
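The source does not describe the GPU manager's internals, so the following is only a minimal sketch of scheduling on model, memory, and availability. It filters out unhealthy or already allocated devices and then applies a best-fit rule (the smallest device that still satisfies the request), which is one plausible way to keep large GPUs free for large jobs. The Gpu type, the pick_gpu function, and the model names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    node: str
    index: int
    model: str
    free_memory_mib: int
    healthy: bool = True      # result of a periodic health check
    allocated: bool = False   # True once a workload holds the device


def pick_gpu(gpus: List[Gpu], model: str, memory_mib: int) -> Optional[Gpu]:
    """Return a matching GPU and mark it allocated, or None if nothing fits."""
    candidates = [
        g for g in gpus
        if g.healthy and not g.allocated
        and g.model == model and g.free_memory_mib >= memory_mib
    ]
    if not candidates:
        return None
    # Best fit: choose the device with the least memory that still suffices.
    best = min(candidates, key=lambda g: g.free_memory_mib)
    best.allocated = True
    return best


if __name__ == "__main__":
    inventory = [
        Gpu("node-1", 0, "P40", 24000),
        Gpu("node-1", 1, "P40", 8000),
        Gpu("node-2", 0, "K80", 12000),
    ]
    chosen = pick_gpu(inventory, "P40", 6000)
    print(chosen.node, chosen.index)
```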
Load Balancing
Inference services expose RPC and HTTP interfaces; RPC uses a service registry for load balancing, while HTTP relies on the Kubernetes Ingress controller (Nginx) to route traffic to pods.
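For the RPC path, the source says only that a service registry handles load balancing. The sketch below assumes a simple in-memory registry with round-robin selection to show the general idea; a production registry would also need health checks, leases, and a networked store. The ServiceRegistry class and the addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class ServiceRegistry:
    """Toy in-memory registry that hands out endpoints round-robin."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, List[str]] = defaultdict(list)
        self._cursors: Dict[str, int] = {}

    def register(self, service: str, address: str) -> None:
        # Called when a pod backing the service starts up.
        if address not in self._endpoints[service]:
            self._endpoints[service].append(address)

    def deregister(self, service: str, address: str) -> None:
        # Called when a pod stops or fails.
        if address in self._endpoints.get(service, []):
            self._endpoints[service].remove(address)

    def pick(self, service: str) -> str:
        """Return the next endpoint for the service in round-robin order."""
        endpoints = self._endpoints.get(service)
        if not endpoints:
            raise LookupError(f"no endpoints registered for {service!r}")
        i = self._cursors.get(service, 0) % len(endpoints)
        self._cursors[service] = i + 1
        return endpoints[i]


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("ocr", "10.0.0.1:9000")
    registry.register("ocr", "10.0.0.2:9000")
    for _ in range(3):
        print(registry.pick("ocr"))
```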
CI/CD Pipeline
GitLab, Jenkins, and Harbor constitute the CI/CD stack. Code is pushed to GitLab, Jenkins builds and packages Docker images, which are stored in Harbor and deployed to the Kubernetes cluster.
Logging and Monitoring
Logging follows the EFK stack: container stdout → Docker daemon → Fluentd → Kafka → Fluentd → Elasticsearch → Kibana. Kafka provides buffering and enables downstream consumption. Monitoring uses Heapster, InfluxDB, and Grafana, with custom extensions to aggregate service‑level metrics.
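The source does not say how the custom extensions aggregate service-level metrics. As one possible reading, the sketch below rolls per-pod samples up into per-service totals; the sample fields (service, cpu_millicores, memory_mib) and the function name are assumptions made for illustration, not the platform's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable


def aggregate_by_service(samples: Iterable[dict]) -> Dict[str, dict]:
    """Sum per-pod CPU and memory samples into one record per service."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"pods": 0, "cpu_millicores": 0, "memory_mib": 0}
    )
    for sample in samples:
        entry = totals[sample["service"]]
        entry["pods"] += 1
        entry["cpu_millicores"] += sample["cpu_millicores"]
        entry["memory_mib"] += sample["memory_mib"]
    return dict(totals)


if __name__ == "__main__":
    pod_samples = [
        {"service": "ocr", "pod": "ocr-1", "cpu_millicores": 500, "memory_mib": 1024},
        {"service": "ocr", "pod": "ocr-2", "cpu_millicores": 300, "memory_mib": 2048},
        {"service": "nlp", "pod": "nlp-1", "cpu_millicores": 200, "memory_mib": 512},
    ]
    for name, stats in aggregate_by_service(pod_samples).items():
        print(name, stats)
```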
Spark on Kubernetes
Spark is scheduled natively on Kubernetes. Running Spark Standalone inside Docker costs performance and provides no multi-tenant isolation, so the design instead runs the Driver and the Executors as separate containers, which gain namespace isolation and version flexibility.
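To make the Driver-and-Executor-as-containers design concrete, the sketch below assembles a spark-submit command using the configuration keys from upstream Spark's Kubernetes support. This is an assumption: the platform's own implementation predates that upstream feature and may have been configured differently. The API server address, image name, class, and jar path are hypothetical placeholders.

```python
from typing import List


def spark_submit_args(api_server: str, namespace: str, image: str,
                      executors: int, main_class: str, app_jar: str) -> List[str]:
    """Build a spark-submit command that targets a Kubernetes cluster.

    The namespace gives per-tenant isolation; the image pins the Spark version.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",  # the Driver itself runs in a pod
        "--class", main_class,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://k8s.example.internal:6443",  # hypothetical
        namespace="user-alice",                           # hypothetical
        image="harbor.example.internal/ai/spark:latest",  # hypothetical
        executors=4,
        main_class="org.apache.spark.examples.SparkPi",
        app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # hypothetical path
    )
    print(" ".join(args))
```

Because the Spark version lives in the container image, two tenants could run different Spark releases side by side simply by pointing at different images.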
Compute‑Data Separation
The platform adopts a centralized storage model that separates compute from data, relying on high-speed networking and storage technologies (25 GbE, RDMA, SPDK) to keep remote access fast. Benchmarks show only about 3% performance loss for MLlib algorithms when storage is decoupled from compute, which makes a unified architecture for diverse AI/ML frameworks practical.
Conclusion
Kubernetes provides a cloud‑native foundation for JD’s AI platform, supporting multi‑tenant, GPU‑aware, and big‑data workloads with robust networking, storage, and operational tooling.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
