How JD Built a Cloud‑Native Monitoring & Logging System for Massive Scale
This article explains the fundamental differences between traditional and cloud‑native monitoring systems, outlines the challenges each faces, and details JD.com's evolution from physical servers to JDOS 2.0, describing its modular architecture, deployment model, and ongoing optimization efforts.
Cloud Native Monitoring Systems: What Makes Them Different?
At a time when "cloud native" has become a buzzword, enterprises often default to cloud‑native architectures for new internal systems. This article compares cloud‑native monitoring systems with traditional ones and shares insights from JD.com architect Han Chao on building a cloud‑native monitoring‑logging platform.
Why Cloud‑Native Monitoring Differs
Traditional and cloud‑native monitoring share the same goals: observability, alerting, and tracing. What differs is the environment. Traditional monitoring operates within monolithic or VM‑based environments, while cloud‑native monitoring must itself be deployed and operated according to cloud‑native principles, integrate tightly with Kubernetes (K8s) or a PaaS platform, and avoid complex, invasive integrations.
Challenges of Traditional Monitoring
Scale: The system must handle tens of thousands of hosts and millions of applications while operational effort stays roughly constant, O(1), as the fleet grows.
Speed: Deployment of monitoring should match the rapid rollout of new hosts and services.
Reliability: Monitoring must be more timely and stable than the applications it protects.
Efficiency: CPU and memory overhead of the monitoring stack must be minimized.
Challenges Specific to Cloud‑Native Monitoring
Rapid deployment and auto‑scaling are required to match the characteristics of cloud‑native workloads.
In a cloud‑native PaaS, where both the platform and the applications are standardized, collector agents carry a higher risk of being intrusive.
Large‑scale systems need "extreme optimization" while still adhering to the standards of automation and standardization.
JD.com's Journey to a Cloud‑Native Monitoring Platform
Initially, JD.com ran all applications on physical machines, leading to resource waste, inflexible scheduling, and hours‑long migration during failures. Starting in 2014, JD adopted Docker on an OpenStack + NovaDocker stack, creating the first‑generation container engine JDOS 1.0. All applications moved into containers, with clusters of up to 10,000 nodes.
By 2016, JDOS 1.0 scaled to 100,000 nodes, prompting the launch of JDOS 2.0 and a migration from OpenStack to native Kubernetes for the JD.com Mall application stack.
Core Components of JD's Cloud‑Native Monitoring‑Logging System
Collector agents (deployed as K8s DaemonSets) – low‑intrusion, one per node.
Ingress proxies.
Storage module – a self‑developed high‑performance store built by JD's infrastructure team.
Computation module.
Service control center.
Alarm subsystem with highly customizable rules, channels (SMS, email, IM), and a rule‑engine.
These modules are exposed as services, allowing API‑driven extensions such as business‑specific plugins and AIOps integrations.
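The alarm subsystem listed above pairs customizable rules with delivery channels. As a rough illustration only, the sketch below models a rule as a predicate over one metric plus the channels it notifies. The names `Rule`, `evaluate`, and the metric names are hypothetical and are not taken from JD's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical alarm rule: fires when predicate(value) is True."""
    name: str
    metric: str
    predicate: Callable[[float], bool]
    channels: Tuple[str, ...]  # e.g. ("sms", "email", "im")


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> Dict[str, List[str]]:
    """Return a mapping of channel -> names of the rules that fired."""
    fired: Dict[str, List[str]] = {}
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None:
            continue  # metric absent from this sample, nothing to check
        if rule.predicate(value):
            for channel in rule.channels:
                fired.setdefault(channel, []).append(rule.name)
    return fired


# Illustrative rules; thresholds and metric names are assumptions.
RULES = [
    Rule("cpu_high", "cpu_usage_percent", lambda v: v > 90.0, ("sms", "im")),
    Rule("disk_almost_full", "disk_used_percent", lambda v: v >= 95.0, ("email",)),
]

result = evaluate(RULES, {"cpu_usage_percent": 97.5, "disk_used_percent": 40.0})
print(result)
```

Keeping rules as plain data with a predicate is one way to let users add or change rules without touching the evaluation loop; a production rule engine would also need persistence, deduplication, and delivery retries.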
Deployment Model and Service‑Oriented Design
From the K8s perspective, the system runs as ordinary DaemonSet and ReplicaSet Pods and requires no changes to the underlying cluster.
Within the JDOS container platform, the monitoring‑logging system appears as a plug‑in, loosely coupled to the platform.
From the application viewpoint, it is a "zero‑perception" component that does not require developers to modify their code.
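To make the DaemonSet deployment model concrete, here is a minimal sketch that builds a Kubernetes `apps/v1` DaemonSet manifest for a one‑per‑node collector and prints it as JSON. The image name, namespace, labels, mount path, and resource limits are all assumptions for illustration; the article does not describe JD's actual manifests.

```python
import json


def collector_daemonset(image: str,
                        cpu_limit: str = "200m",
                        memory_limit: str = "256Mi") -> dict:
    """Build a DaemonSet manifest for a node-level collector (illustrative)."""
    labels = {"app": "collector-agent"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {
            "name": "collector-agent",
            "namespace": "monitoring",  # assumed namespace
            "labels": labels,
        },
        "spec": {
            # The selector must match the Pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "collector",
                        "image": image,
                        # Cap the agent's footprint to limit overhead on the node.
                        "resources": {
                            "limits": {"cpu": cpu_limit, "memory": memory_limit},
                        },
                        # Read host logs without modifying application Pods.
                        "volumeMounts": [{
                            "name": "varlog",
                            "mountPath": "/var/log",
                            "readOnly": True,
                        }],
                    }],
                    "volumes": [{
                        "name": "varlog",
                        "hostPath": {"path": "/var/log"},
                    }],
                },
            },
        },
    }


manifest = collector_daemonset("registry.example.com/collector-agent:1.0")
print(json.dumps(manifest, indent=2))
```

Because the collector reads from the host through a read‑only mount, application Pods need no sidecar or code change, which is one plausible reading of the "zero‑perception" property described above.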
Operational Challenges and Solutions
Large‑scale upgrades are performed with temporary redundancy and dual‑write strategies to ensure seamless cut‑over. A "low‑intrusion" collector design avoids ordering constraints during rollout. The alarm system balances flexibility with noise reduction through automatic merging, timeline adjustments, and user‑driven rule configuration.
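The dual‑write cut‑over mentioned above can be sketched as follows. This is a simplified model under stated assumptions, not JD's implementation: `InMemoryStore` and `DualWriter` are hypothetical names, and a real migration would also backfill historical data and verify consistency at far larger scale.

```python
from typing import Dict, Iterable, List


class InMemoryStore:
    """Stand-in for a metrics store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._data: Dict[str, float] = {}

    def write(self, key: str, value: float) -> None:
        self._data[key] = value

    def read(self, key: str) -> float:
        return self._data[key]  # raises KeyError if the key is missing


class DualWriter:
    """Write to both stores; read from the old one until cut_over() succeeds."""

    def __init__(self, old: InMemoryStore, new: InMemoryStore) -> None:
        self.old = old
        self.new = new
        self.read_from_new = False
        self.new_write_failures = 0

    def write(self, key: str, value: float) -> None:
        self.old.write(key, value)  # the old store remains the source of truth
        try:
            self.new.write(key, value)
        except Exception:
            # A failure on the new path must not break production writes.
            self.new_write_failures += 1

    def read(self, key: str) -> float:
        store = self.new if self.read_from_new else self.old
        return store.read(key)

    def mismatches(self, keys: Iterable[str]) -> List[str]:
        """Return the keys whose values differ or are missing in the new store."""
        bad = []
        for key in keys:
            try:
                if self.old.read(key) != self.new.read(key):
                    bad.append(key)
            except KeyError:
                bad.append(key)
        return bad

    def cut_over(self, keys: Iterable[str]) -> None:
        """Switch reads to the new store only if it agrees with the old one."""
        if self.new_write_failures or self.mismatches(keys):
            raise RuntimeError("new store is inconsistent; refusing to cut over")
        self.read_from_new = True


writer = DualWriter(InMemoryStore(), InMemoryStore())
writer.write("host-1/cpu", 42.0)
writer.cut_over(["host-1/cpu"])
print(writer.read("host-1/cpu"))
```

The temporary redundancy is the cost of this approach: both stores hold the data during the migration window, and the old one can be retired once reads have moved over.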
Key Strengths and Ongoing Optimizations
The system’s main advantages are a flexible architecture (supporting varied topologies, deployment models, and standards) and continuous evolution (maintaining low operational cost, supporting progressive upgrades, and adopting emerging technologies). Future work includes handling multi‑cluster, multi‑region deployments, integrating middleware observability (databases, caches, queues), and further optimizing storage efficiency and query performance.
Overall, JD.com's cloud‑native monitoring‑logging platform demonstrates how a large e‑commerce operation can transition from legacy monitoring to a modern, Kubernetes‑native observability stack that scales to hundreds of thousands of nodes while remaining extensible and low‑impact.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.