
Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four-layer monitoring standard (basic, liveness, performance, business), breaks down functional modules and data flows, and presents JD Cloud's practical design, its alarm-convergence project, and future directions for AI-driven observability.

Qunar Tech Salon

Monitoring is the lifeline of operations: its goal is to discover, locate, and resolve issues quickly, reducing mean time to repair (MTTR). JD Cloud defines a unified monitoring standard covering four layers (basic, liveness, performance, and business) so that data collection is comprehensive and leaves no gaps.

Based on a JD Cloud and Intel joint online class presented by architect Yan Zhijie, the article addresses three questions: why operations rely on monitoring, how monitoring systems are typically designed, and how business-driven, high-availability monitoring differs.

Key capabilities required for an effective monitoring system include data collection, data processing (aggregation, enrichment), anomaly detection with alerting, dashboard‑based diagnosis, rapid mitigation via a playbook platform, and high availability of the monitoring infrastructure itself.

Functionally, a typical monitoring system consists of modules for collection, computation, storage, alerting, algorithms, and business interfaces. Data enters through collection (agents, APIs, probes) and then feeds three processing streams: computation (e.g., aggregating Nginx page views, or PV), storage (for dashboards), and alerting (for anomaly detection). Advanced analytics then provide root-cause recommendations.
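To make the computation stream concrete, here is a minimal sketch of aggregating Nginx page views per minute. It assumes access-log lines in Nginx's default "combined" format, where the timestamp sits in square brackets; the function name `pv_per_minute` is hypothetical and not taken from JD Cloud's system.

```python
from collections import Counter
from datetime import datetime


def pv_per_minute(log_lines):
    """Count page views per minute from Nginx access-log lines.

    Assumes the default "combined" log format, e.g.
    1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612
    """
    counts = Counter()
    for line in log_lines:
        # The timestamp sits between the first '[' and the next ']'.
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # not an access-log entry; skip it
        try:
            ts = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # malformed timestamp; skip it
        # Truncate to the minute to form the aggregation bucket.
        counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(counts)
```

In a real pipeline the resulting per-minute counts would be forwarded to both the storage and alerting streams rather than returned as a dictionary.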

From a data‑centric perspective, monitoring is treated as a data‑processing pipeline: abstraction (CMDB), collection, aggregation, storage, and alerting. Standardizing data (metrics, logs, traces) enables reusable libraries (F(x)) across modules, simplifying design and enhancing reuse.
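The idea of a reusable F(x) over standardized data can be sketched as follows. This is an illustration under assumptions, not the design presented in the talk: the `Point` record and the `apply_fx` helper are hypothetical names, and the schema (metric, tags, timestamp, value) is one common way to standardize metric data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable, List


@dataclass
class Point:
    """A standardized metric sample (hypothetical schema)."""
    metric: str
    timestamp: int
    value: float
    tags: Dict[str, str] = field(default_factory=dict)


def apply_fx(
    points: Iterable[Point],
    group_key: Callable[[Point], Hashable],
    fx: Callable[[List[float]], float],
) -> Dict[Hashable, float]:
    """Group points by group_key, then apply the same F(x) to each group."""
    groups: Dict[Hashable, List[float]] = {}
    for p in points:
        groups.setdefault(group_key(p), []).append(p.value)
    return {key: fx(values) for key, values in groups.items()}
```

Because every module sees the same `Point` shape, the same helper can serve the computation module (for example `fx=sum` to total PV per host) and the alerting module (for example `fx=max` to compare a peak against a threshold).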

JD Cloud's monitoring standard emphasizes four layers: basic (CPU, memory, host health), liveness (process and port status), performance (PV, latency, errors, capacity), and business (user-side experience). Implementation includes scoring mechanisms for managers, one-click configuration recommendations for operators, change-visualization integration, and a standardized playbook platform.
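The article does not say how the scoring mechanism is computed. As one possible interpretation, the sketch below scores a service by how many of the four layers have at least one monitor configured and reports the layers that are missing. The function name `coverage_score`, the input shape, and the equal weighting of layers are all assumptions.

```python
# The four layers of the monitoring standard described above.
LAYERS = ("basic", "liveness", "performance", "business")


def coverage_score(configured):
    """Return (score, missing_layers) for one service.

    `configured` maps a layer name to the list of monitors set up for it.
    The score is the percentage of layers with at least one monitor;
    equal weighting per layer is an assumption made for illustration.
    """
    missing = [layer for layer in LAYERS if not configured.get(layer)]
    covered = len(LAYERS) - len(missing)
    return 100.0 * covered / len(LAYERS), missing
```

A manager-facing view could rank services by this score, while the list of missing layers is the kind of input a configuration recommendation for operators could start from.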

A practical case study, the alarm-convergence project, shows how JD Cloud quantified alarm volume, introduced dashboards that assign each alarm to a responsible owner, merged duplicate alerts, and provided mobile-friendly views, dramatically improving alert relevance.
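Merging duplicate alerts can be illustrated with a simple time-window rule. This is a minimal sketch, not the convergence logic JD Cloud actually deployed: the alert fields (`owner`, `service`, `rule`, `ts`) and the five-minute default window are assumptions.

```python
def converge(alerts, window_seconds=300):
    """Merge alerts with the same (owner, service, rule) that fire within
    window_seconds of the first alert in their group.

    Each alert is a dict with keys "owner", "service", "rule", and "ts"
    (a Unix timestamp in seconds). Returns one merged record per group,
    in order of first occurrence.
    """
    merged = []
    open_groups = {}  # key -> most recent merged record for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["owner"], alert["service"], alert["rule"])
        group = open_groups.get(key)
        if group is not None and alert["ts"] - group["first_ts"] <= window_seconds:
            # Duplicate inside the window: fold it into the existing record.
            group["count"] += 1
            group["last_ts"] = alert["ts"]
        else:
            # First alert for this key, or the window has expired.
            group = {
                "owner": alert["owner"],
                "service": alert["service"],
                "rule": alert["rule"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            }
            open_groups[key] = group
            merged.append(group)
    return merged
```

Comparing `len(alerts)` with `len(converge(alerts))` gives a simple way to quantify how much alarm volume a given window removes, and grouping the merged records by `owner` yields a per-owner view.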

Looking ahead, the speaker envisions moving from rule‑based processing to statistical and machine‑learning methods, leveraging cloud‑native observability, semantic log analysis, multi‑metric correlation, and AI‑driven automation to close the monitoring loop.
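The shift from rule-based to statistical detection can be illustrated by contrasting a fixed threshold with a z-score check against recent history. This is a generic textbook technique chosen for illustration, not a method attributed to the speaker; the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, value, z_threshold=3.0):
    """Flag `value` if it deviates from recent history by more than
    z_threshold sample standard deviations.

    Unlike a fixed rule such as `value > 500`, the effective limit adapts
    to whatever level and spread the metric has shown recently.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

For example, with a recent history of `[100, 102, 98, 101, 99]`, a new value of 150 is flagged while 101 is not, without anyone having to pick a per-metric threshold by hand.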
