Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights
This article shares JD’s large‑scale monitoring system (MDC) design, covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.
MDC Architecture Overview
The JD Monitoring Data Center (MDC) is a self‑developed platform that monitors both physical machines and containers. It provides metric collection, alerting, and reporting services to users. The system is divided into three logical components:
Dashboard – Web UI for visualizing metrics.
Controller – Manages collection tasks, schedules execution, and exposes data/report APIs.
Agent – Performs actual data collection, alert generation, and data processing.
A VIP layer in front of the Controller offers a unified entry point, simplifying Agent discovery and external API calls.
Agent Internal Architecture
The Agent is a logical unit that runs on a host or container and consists of four sub‑components communicating via an internal RabbitMQ message queue:
Central – Receives collection tasks and dispatches them to Sniper.
Sniper – Lightweight collector that only handles data acquisition. Different collection types are loaded as plugins.
Filter – Filters collected data and generates alerts.
Collector – Aggregates processed data, stores it in cache and persistent storage.
Physical‑machine metrics are gathered via SNMP and IPMI; container metrics are obtained through a DockerPull agent that exposes a RESTful API consumed by Sniper.
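The flow through the four sub‑components can be sketched as a small in‑process pipeline. This is a minimal illustration, not MDC's code: standard‑library queues stand in for the RabbitMQ broker, the plugin table, the `Task` and `Sample` types, and the threshold format are all hypothetical, and the plugins return canned values instead of doing real SNMP or REST calls.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    target: str  # host or container to collect from
    kind: str    # name of the Sniper plugin to use


@dataclass
class Sample:
    target: str
    metric: str
    value: float


# Hypothetical Sniper plugins, one per collection type. They return canned
# values here; real plugins would query SNMP/IPMI or the container REST API.
PLUGINS: Dict[str, Callable[[str], List[Sample]]] = {
    "snmp": lambda target: [Sample(target, "cpu_percent", 93.0)],
    "docker": lambda target: [Sample(target, "mem_percent", 41.0)],
}


def run_pipeline(tasks: List[Task],
                 thresholds: Dict[str, float]) -> Tuple[List[Sample], List[str]]:
    task_q: "queue.Queue[Task]" = queue.Queue()     # Central -> Sniper
    raw_q: "queue.Queue[Sample]" = queue.Queue()    # Sniper -> Filter
    clean_q: "queue.Queue[Sample]" = queue.Queue()  # Filter -> Collector
    alerts: List[str] = []
    stored: List[Sample] = []

    # Central: receive tasks and dispatch them to Sniper.
    for task in tasks:
        task_q.put(task)

    # Sniper: acquisition only, delegated to the plugin for the task type.
    while not task_q.empty():
        task = task_q.get()
        for sample in PLUGINS[task.kind](task.target):
            raw_q.put(sample)

    # Filter: raise an alert when a sample exceeds its threshold.
    while not raw_q.empty():
        sample = raw_q.get()
        limit = thresholds.get(sample.metric)
        if limit is not None and sample.value > limit:
            alerts.append(f"{sample.target}: {sample.metric} above {limit}")
        clean_q.put(sample)

    # Collector: aggregate processed data (a list stands in for cache/storage).
    while not clean_q.empty():
        stored.append(clean_q.get())

    return stored, alerts
```

Calling `run_pipeline([Task("host-1", "snmp"), Task("c-1", "docker")], {"cpu_percent": 90.0})` stores both samples and raises one alert for `host-1`. Keeping Sniper limited to acquisition means a new collection type only needs a new plugin entry.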
Massive‑Scale Monitoring Practices
Design goals for large‑scale workloads include high performance, low overhead, strong scalability, and high availability. The following practices were adopted:
Pull‑based collection model: SNMP/IPMI for physical hosts, DockerPull agents for containers.
Package all Agent sub‑components and RabbitMQ into a single logical Agent deployed on a host or container, reducing external network traffic.
Deploy Controller and Agents as Docker containers on an elastic cluster platform; the Controller monitors performance metrics and triggers automatic scaling via the platform’s API.
Performance Optimizations
SNMP: The pure‑Python pysnmp library could not meet the concurrency requirements. Switching to the C‑based Net‑SNMP implementation dramatically increased throughput.
IPMI: Raw IPMI queries can be slow, sometimes taking minutes to return. A caching layer was added to the IPMI module; the cache is refreshed periodically, and requests are answered immediately from the cached values.
Slow collection handling: Mitigation works at two layers. Within a task, collection degrades adaptively so that the data stays usable; across tasks, slow tasks are split and migrated to preserve overall throughput.
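The IPMI caching idea can be sketched as follows. This is an illustrative outline under stated assumptions, not the MDC module: `CachedReader`, the `fetch` callable, and the 60‑second default interval are hypothetical, and `fetch` stands in for whatever slow IPMI query is actually used.

```python
import threading
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


class CachedReader:
    """Serve readings from memory while a background loop refreshes them."""

    def __init__(self, fetch: Callable[[str], dict],
                 refresh_interval: float = 60.0) -> None:
        self._fetch = fetch                  # slow call, e.g. an IPMI query
        self._interval = refresh_interval    # assumed value, tune per site
        self._cache: Dict[str, Tuple[float, dict]] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def refresh(self, hosts: Iterable[str]) -> None:
        """Fetch each host once and store the result with a timestamp."""
        for host in hosts:
            try:
                reading = self._fetch(host)
            except Exception:
                continue  # keep the previous reading if this fetch fails
            with self._lock:
                self._cache[host] = (time.monotonic(), reading)

    def start(self, hosts: Iterable[str]) -> threading.Thread:
        """Refresh periodically in a daemon thread until stop() is called."""
        host_list = list(hosts)

        def loop() -> None:
            while not self._stop.is_set():
                self.refresh(host_list)
                self._stop.wait(self._interval)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def get(self, host: str) -> Optional[dict]:
        """Return the cached reading immediately; never calls fetch."""
        with self._lock:
            entry = self._cache.get(host)
        return None if entry is None else entry[1]

    def age(self, host: str) -> Optional[float]:
        """Seconds since the cached reading was taken, or None if absent."""
        with self._lock:
            entry = self._cache.get(host)
        return None if entry is None else time.monotonic() - entry[0]
```

The trade‑off is staleness: a reading can be as old as the refresh interval plus the time one refresh pass takes, which is why the sketch exposes `age` alongside `get`.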
Low‑Overhead Design
All Agent sub‑components and their RabbitMQ broker are co‑located on the same physical host or container. Because each Agent carries its own broker, no shared RabbitMQ instance becomes a bottleneck as the cluster grows, and inter‑component traffic stays local, which minimizes network load.
High‑Availability Strategies
Three levels of HA are implemented:
Process‑level: automatic restart of crashed processes.
Service‑level: the overall service remains available even if individual instances fail.
Collection‑task‑level: ensures no data loss when agents or tasks are disrupted.
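The source does not describe how collection‑task‑level HA is implemented. One common approach, shown here purely as an assumption, is to treat an Agent as failed when its heartbeat goes stale and to move its tasks to the least‑loaded live Agents. The function name, the heartbeat map, and the 30‑second timeout are hypothetical.

```python
from typing import Dict, List


def reassign_tasks(assignments: Dict[str, List[str]],
                   last_heartbeat: Dict[str, float],
                   now: float,
                   timeout: float = 30.0) -> Dict[str, List[str]]:
    """Return new assignments with tasks of stale agents moved to live ones.

    assignments maps agent id -> task ids; last_heartbeat maps agent id ->
    time of the last heartbeat (same clock as `now`). An agent with no
    recorded heartbeat is treated as failed.
    """
    live = [
        agent for agent in assignments
        if now - last_heartbeat.get(agent, float("-inf")) <= timeout
    ]
    if not live:
        raise RuntimeError("no live agents available")

    result = {agent: list(assignments[agent]) for agent in live}
    for agent, tasks in assignments.items():
        if agent in result:
            continue  # agent is alive, keep its tasks where they are
        for task in tasks:
            # Pick the live agent with the fewest tasks; break ties by id.
            target = min(result, key=lambda a: (len(result[a]), a))
            result[target].append(task)
    return result
```

For example, with `a3` silent for 50 seconds, its tasks are redistributed and `a3` no longer appears in the returned mapping, so no collection task is left without an owner.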
Agents and Controllers run in Docker containers managed by the elastic cluster. Agents periodically report performance metrics to the Controller, which performs scaling actions based on predefined thresholds.
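A threshold‑based scaling decision of this kind might look like the sketch below. The metric names (`cpu_percent`, `queue_depth`), the threshold values, and the one‑step scaling policy are assumptions for illustration; the source does not state which metrics or thresholds MDC uses, and the call to the cluster platform's API is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentReport:
    agent_id: str
    cpu_percent: float  # assumed metric
    queue_depth: int    # assumed metric: pending collection tasks


def desired_agent_count(reports: List[AgentReport],
                        current: int,
                        min_agents: int = 1,
                        max_agents: int = 100,
                        cpu_high: float = 80.0,
                        cpu_low: float = 20.0,
                        queue_high: int = 1000) -> int:
    """Return how many Agents should run, given the latest reports."""
    if not reports:
        return current  # no data, so make no change

    avg_cpu = sum(r.cpu_percent for r in reports) / len(reports)
    max_queue = max(r.queue_depth for r in reports)

    desired = current
    if avg_cpu > cpu_high or max_queue > queue_high:
        desired = current + 1   # scale out one step
    elif avg_cpu < cpu_low and max_queue == 0:
        desired = current - 1   # scale in one step

    # Clamp to the allowed range.
    return max(min_agents, min(max_agents, desired))
```

The gap between the high and low thresholds acts as a dead band, so the Agent count does not oscillate when load hovers near a single cutoff.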
Operational Details
Container discovery automatically imports new containers into the system, creates or merges collection tasks, and assigns them to idle Agents. Agent upgrades are performed with an Ansible‑based tool.
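The create‑or‑merge step can be sketched as below. The merge key is an assumption: this example keeps one collection task per host and merges newly discovered containers on that host into it, which may differ from MDC's actual rule. "Idle" is approximated as the Agent with the fewest assigned tasks, and all names are hypothetical.

```python
from typing import Dict, List, Set


def import_containers(tasks: Dict[str, Set[str]],
                      discovered: Dict[str, str]) -> List[str]:
    """Merge discovered containers into per-host tasks.

    tasks maps host -> container ids already covered by that host's task.
    discovered maps container id -> host. Returns hosts that got a new task.
    """
    created: List[str] = []
    for container_id, host in sorted(discovered.items()):
        if host in tasks:
            tasks[host].add(container_id)   # merge into the existing task
        else:
            tasks[host] = {container_id}    # create a new task for the host
            created.append(host)
    return created


def assign_new_tasks(created: List[str],
                     agent_task_counts: Dict[str, int]) -> Dict[str, str]:
    """Place each new task on the Agent that currently has the fewest tasks."""
    placement: Dict[str, str] = {}
    for host in created:
        agent = min(agent_task_counts,
                    key=lambda a: (agent_task_counts[a], a))
        placement[host] = agent
        agent_task_counts[agent] += 1       # account for the new task
    return placement
```

Merging by host keeps the task count proportional to the number of hosts rather than the number of containers, which matters when containers are created and destroyed frequently.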
RabbitMQ is used without any custom modifications. Producers and consumers reside on the same machine, so messages never cross the network, which avoids the reliability issues a remote broker would introduce.
dbaplus Community
