JD.com’s Large‑Scale Kubernetes Refactoring and Operational Lessons
This article shares JD.com’s extensive experience redesigning Kubernetes for massive production use, covering custom DNS and load‑balancing, scaling clusters to ten‑thousand nodes, adapting controllers, building the Archimedes scheduler, and practical insights on resource isolation, deployment, and high‑traffic elasticity.
Over the past year Kubernetes has surged in popularity due to its simple architecture and flexibility, and JD.com believes it will become a universal infrastructure standard. In late 2016 JD launched JDOS 2.0, a next‑generation container engine that migrated from OpenStack to a Kubernetes‑based stack, creating a complete and efficient PaaS platform.
The article describes JD’s deep reconstruction of Kubernetes in large‑scale production, including the development of a container‑friendly DNS, load balancer, file system, and image registry. JD built its own high‑performance DNS and LB to integrate with existing data‑center services, handling up to 8 million queries per second (QPS), far surpassing traditional solutions.
To manage clusters of 8,000–10,000 nodes, JD adopted a “reduction‑first” approach, simplifying Kubernetes rather than adding features. They moved ConfigMaps out of etcd, introduced caching in front of the API, and heavily refactored the controllers, emphasizing strict testing before enabling any controller in a large environment.
JD’s custom scheduler, named Archimedes, addresses resource‑usage peaks by shifting idle online‑service capacity to big‑data offline jobs, while acknowledging that Kubernetes alone cannot solve data‑center resource utilization. The team also implemented a local rebuild mechanism that bypasses the scheduler for urgent container migrations, improving deployment speed and stability.
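The core accounting behind lending online capacity to offline jobs can be sketched in a few lines: sum the gap between what online services have reserved and what they actually use, keep a safety cushion for bursts, and offer the remainder to batch work. This is a minimal sketch under assumed names and numbers, not the Archimedes scheduler itself.

```go
package main

import "fmt"

// OnlineService describes an online workload's reservation and its real usage.
type OnlineService struct {
	Name         string
	AllocatedCPU float64 // cores reserved for the service
	UsedCPU      float64 // cores actually in use right now
}

// ReclaimableCPU sums idle capacity across online services, keeping a
// safety margin so latency-sensitive workloads can still absorb bursts.
func ReclaimableCPU(services []OnlineService, safetyMargin float64) float64 {
	idle := 0.0
	for _, s := range services {
		if free := s.AllocatedCPU - s.UsedCPU; free > 0 {
			idle += free
		}
	}
	idle -= safetyMargin
	if idle < 0 {
		return 0
	}
	return idle
}

func main() {
	online := []OnlineService{
		{Name: "web", AllocatedCPU: 16, UsedCPU: 4}, // night-time trough
		{Name: "search", AllocatedCPU: 8, UsedCPU: 6},
	}
	// Lend the idle cores (minus a 2-core cushion) to offline batch jobs.
	fmt.Println(ReclaimableCPU(online, 2)) // 12
}
```

A real colocation scheduler would also watch latency signals and claw capacity back the moment online traffic climbs; the arithmetic above is only the starting point.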
Operational challenges such as API load spikes, node heartbeat reliability, and priority‑based eviction were tackled by building a dedicated eviction system that tags pods with priority, tolerations, and replica counts, ensuring critical workloads survive during resource contention.
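The eviction policy described above amounts to an ordering problem: skip pods that tolerate the pressure, then evict low‑priority, well‑replicated pods before critical singletons. The sketch below illustrates that ordering; the field names and weighting are assumptions for the example, not JD’s eviction system.

```go
package main

import (
	"fmt"
	"sort"
)

// PodInfo carries the tags the eviction system consults: priority,
// whether the pod tolerates the resource pressure, and its replica count.
type PodInfo struct {
	Name      string
	Priority  int  // higher = more critical, evicted last
	Tolerates bool // pods tolerating the pressure are never evicted
	Replicas  int  // well-replicated pods are safer to evict first
}

// EvictionOrder returns the evictable pods sorted so the least critical,
// most redundant pods go first; critical singletons survive contention.
func EvictionOrder(pods []PodInfo) []PodInfo {
	var victims []PodInfo
	for _, p := range pods {
		if !p.Tolerates {
			victims = append(victims, p)
		}
	}
	sort.Slice(victims, func(i, j int) bool {
		if victims[i].Priority != victims[j].Priority {
			return victims[i].Priority < victims[j].Priority // low priority first
		}
		return victims[i].Replicas > victims[j].Replicas // many replicas first
	})
	return victims
}

func main() {
	pods := []PodInfo{
		{Name: "checkout", Priority: 9, Tolerates: false, Replicas: 3},
		{Name: "batch-report", Priority: 1, Tolerates: false, Replicas: 10},
		{Name: "node-agent", Priority: 5, Tolerates: true, Replicas: 1},
	}
	for _, p := range EvictionOrder(pods) {
		fmt.Println(p.Name) // batch-report, then checkout; node-agent is skipped
	}
}
```

Upstream Kubernetes has since grown similar machinery (PriorityClasses and taint-based eviction), but a dedicated system like the one described gives operators explicit control over who dies first under contention.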
Finally, JD reflects on lessons learned: Kubernetes is evolving toward an OpenStack‑like scale, requiring modularization (CNI, CRI, CSI), careful etcd management, and auxiliary health‑check systems. The experience highlights the need for custom extensions, performance‑aware scheduling, and pragmatic trade‑offs when operating Kubernetes at massive scale.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.