JDOS Operations Platform: Managing Millions of Containers at JD.com
This article describes how JD.com built and operates the JDOS Operations Platform to manage a Docker and Kubernetes fleet of millions of containers. It details the challenges of massive scale and the architectural components — the configuration center, operation center, inspection system, gossip‑based communication, and an intelligent alerting system — that together enable efficient, automated, and reliable large‑scale container operations.
JD.com operates one of the world’s largest Docker and Kubernetes clusters, with millions of containers, achieving a fully containerized environment called “All in Containers”. Managing such scale with only two dedicated operators is made possible by the JDOS Operations Platform, the core system that safeguards the massive container fleet.
The rapid growth of JD’s container deployment since 2014 exposed severe management challenges: human efficiency dropped sharply, fault diagnosis time increased from 5 to 30 minutes, and traditional tools could not cope with a million‑plus container count.
To address these issues, JD developed the JDOS Operations Platform, focusing on three dimensions: online operations, environment standardization, and intelligent alerting. The platform’s functional tree is illustrated in Figure 3.
The system’s architecture (Figure 4) centers on a configuration center that stores all cluster metadata—OS versions, Kubernetes component versions, zone information, kernel parameters, hardware details, and usage purpose. Both the inspection system and operation center rely on this configuration data to enforce consistency.
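To make the consistency‑enforcement idea concrete, here is a minimal sketch of how an inspection pass might compare a node's reported facts against a configuration‑center baseline. The field names (`os_version`, `k8s_version`, `kernel_params`) and values are illustrative assumptions; the article does not describe JD's actual schema.

```python
# Hypothetical baseline pulled from the configuration center.
EXPECTED = {
    "os_version": "CentOS 7.4",
    "k8s_version": "v1.9.3",
    "kernel_params": {"vm.swappiness": 0},
}

def find_drift(node_facts: dict, expected: dict = EXPECTED) -> list:
    """Return (key, expected, actual) tuples for every field on the node
    that deviates from the configuration-center baseline."""
    drift = []
    for key, want in expected.items():
        got = node_facts.get(key)
        if got != want:
            drift.append((key, want, got))
    return drift

# A node running an out-of-date Kubernetes version gets flagged:
node = {"os_version": "CentOS 7.4", "k8s_version": "v1.8.2",
        "kernel_params": {"vm.swappiness": 0}}
print(find_drift(node))  # → [('k8s_version', 'v1.9.3', 'v1.8.2')]
```

The inspection system would run a check like this per node and hand mismatches to the operation center for remediation.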
The operation center handles node‑level actions such as version upgrades, log cleanup, password rotation, scaling, node recovery, and new cluster deployment, providing a UI‑driven workflow that eliminates manual command errors.
Monitoring and visualization are provided by an information display center, which shows resource usage, health status, alerts, load‑balancing traffic, DNS resolution, and real‑time scheduling information for each Kubernetes cluster.
Initially, JD’s inspection system was built on Ansible due to its simplicity and Python compatibility. However, as the cluster grew to tens of thousands of nodes, Ansible’s performance became a bottleneck, taking up to 40 minutes for a full inspection.
To overcome the single‑point‑of‑failure and latency issues, JD adopted a distributed inspection architecture based on the Gossip protocol, implementing a weakly consistent AP system using HashiCorp’s Serf. Serf’s optimized gossip algorithm provides rapid convergence—messages propagate across 100 k nodes in under 2 seconds.
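The sub‑two‑second convergence claim follows from the epidemic nature of gossip: the informed population multiplies each round, so rounds grow logarithmically with cluster size. A back‑of‑the‑envelope estimate, assuming an idealized fanout of 3 and a 200 ms gossip interval (both assumptions here, not figures from the article):

```python
import math

def gossip_rounds(n_nodes: int, fanout: int = 3) -> int:
    """Rounds until a rumor plausibly reaches all nodes, in an idealized
    model where each informed node relays to `fanout` fresh peers per
    round (no duplicate targets, no packet loss)."""
    # Informed population multiplies by (1 + fanout) each round.
    return math.ceil(math.log(n_nodes, 1 + fanout))

rounds = gossip_rounds(100_000)
print(rounds, "rounds ≈", rounds * 0.2, "seconds")  # 9 rounds ≈ 1.8 s
```

Real gossip is lossier than this model, but the logarithmic scaling explains why propagation stays fast even at 100k nodes.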
Serf’s query mechanism enables execution of inspection scripts on target nodes and returns success/failure results via UDP, while failed nodes are re‑checked with Ansible for detailed diagnostics. This hybrid approach reduces a 10 k‑node inspection from 40 minutes to about 3 minutes.
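A sketch of that hybrid flow: broadcast a check with `serf query`, partition nodes by their replies, and hand the failures to Ansible. The `inspect` query name, the `"ok"` reply convention, and the event handler on each node are all assumptions for illustration; `serf query -format=json` returns a JSON document whose `Responses` map pairs node names with reply payloads.

```python
import json
import subprocess

def split_responses(result: dict) -> tuple:
    """Partition nodes into (passed, failed) based on each node's reply
    payload from a serf query result."""
    passed, failed = [], []
    for node, reply in result.get("Responses", {}).items():
        (passed if reply.strip() == "ok" else failed).append(node)
    return passed, failed

def run_inspection(script_name: str) -> tuple:
    """Broadcast an inspection via `serf query`; assumes every node runs
    a Serf event handler that executes the named check and replies "ok"
    or an error string."""
    out = subprocess.run(
        ["serf", "query", "-format=json", "inspect", script_name],
        capture_output=True, text=True, check=True,
    ).stdout
    return split_responses(json.loads(out))

# Failed nodes would then be re-checked with Ansible for detailed
# diagnostics, e.g.:
#   ansible failed_hosts -m script -a "inspect_detail.sh"
```

Because the fast path is a single UDP‑based broadcast, the expensive Ansible machinery runs only against the small set of nodes that actually failed.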
Given the massive alert volume generated by millions of containers, JD introduced an intelligent alert system (Figure 8). Container and host metrics are collected by the Nodemonitor agent, stored in a TSDB, and then aggregated for smart alerting, which reduces alert noise and predicts root causes.
Alert convergence is achieved through multi‑level correlation analysis—from data center down to container level—allowing the decision system to identify the true source of an alarm, reduce duplicate alerts, and provide pre‑emptive fault predictions based on health scoring.
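The host‑to‑container level of that correlation can be sketched as a simple rollup: when most containers on a host are alerting, the likelier root cause is the host itself, so the alerts collapse into one. The 80% threshold and the data shapes are illustrative assumptions, not JD's actual decision rules.

```python
from collections import defaultdict

def converge_alerts(container_alerts, containers_per_host, threshold=0.8):
    """Collapse container-level alerts into one host-level alert when the
    alerting fraction on a host crosses `threshold`; otherwise pass the
    container alerts through unchanged."""
    by_host = defaultdict(list)
    for host, container in container_alerts:
        by_host[host].append(container)
    converged = []
    for host, alerting in by_host.items():
        if len(alerting) / containers_per_host[host] >= threshold:
            converged.append(("host", host))           # root cause: the host
        else:
            converged.extend(("container", c) for c in alerting)
    return converged

alerts = [("node-1", "c1"), ("node-1", "c2"), ("node-1", "c3"),
          ("node-2", "c9")]
print(converge_alerts(alerts, {"node-1": 3, "node-2": 10}))
# → [('host', 'node-1'), ('container', 'c9')]
```

The same rollup applied recursively (containers → hosts → racks → data centers) is what turns a flood of symptoms into a single actionable alarm.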
Overall, the JDOS Operations Platform combines configuration management, automated inspection, gossip‑based communication, and intelligent alerting to maintain high consistency, rapid fault detection, and efficient operations across JD’s ultra‑large container infrastructure.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.