Alibaba IDC and Network Monitoring System Architecture and Practices
The article details Alibaba's globally distributed IDC and network monitoring systems, describing their fully distributed data collection, centralized computation, storage strategies, alarm mechanisms, and frontend visualization that together enable real‑time infrastructure and network health management for large‑scale operations.
1. IDC Monitoring System
Alibaba's global IDC infrastructure supports massive e-commerce events, so its monitoring must provide end-to-end visibility, from power and cooling down to server metrics. It does this with fully distributed data collection and centralized computation, backed by multi-datacenter disaster recovery.
System Architecture
Data flows are split into IT‑related and infrastructure‑related branches, as illustrated in the diagram.
Collection
The system builds a physical information tree covering every data‑center, room, rack, and server, and deploys monitoring agents (Master and Monitor) that schedule and execute metric collection scripts on each host.
Collected metrics are sent from servers up through the monitor agents to a central collector, completing the data ingestion pipeline.
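As a minimal sketch of the physical information tree described above, the following Python builds a data-center / room / rack / server hierarchy from flat inventory rows and lists the servers under any node. The `Node` class, the `build_tree` helper, and all host names are hypothetical, chosen for illustration; the article does not describe the actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Tuple

# Levels of the physical tree, top to bottom, as named in the article.
LEVELS = ("datacenter", "room", "rack", "server")


@dataclass
class Node:
    name: str
    level: str
    children: Dict[str, "Node"] = field(default_factory=dict)

    def servers(self) -> Iterator[str]:
        """Yield the names of all servers at or below this node."""
        if self.level == "server":
            yield self.name
        for child in self.children.values():
            yield from child.servers()


def build_tree(rows: Iterable[Tuple[str, str, str, str]]) -> Node:
    """Build the tree from (datacenter, room, rack, server) rows."""
    root = Node("root", "root")
    for row in rows:
        node = root
        for level, name in zip(LEVELS, row):
            # Reuse the existing child if present, otherwise create it.
            node = node.children.setdefault(name, Node(name, level))
    return root


# Hypothetical inventory rows.
rows = [
    ("dc-a", "room-1", "rack-01", "host-001"),
    ("dc-a", "room-1", "rack-01", "host-002"),
    ("dc-a", "room-2", "rack-07", "host-003"),
]
tree = build_tree(rows)
room1 = tree.children["dc-a"].children["room-1"]
room1_servers = sorted(room1.servers())
all_servers = sorted(tree.servers())
```

A scheduler could walk such a tree to decide which hosts each monitor agent is responsible for, though how Alibaba assigns hosts to agents is not stated in the article.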
Computation
Incoming real-time data is grouped by business-specific dimensions such as cabinet, rack, room, and data-center, and aggregated into metrics such as total power per cabinet or inlet temperature per room.
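The grouping step can be sketched as follows. This is an illustration only: the sample fields (`room`, `cabinet`, `power_w`, `inlet_c`) are assumed names, and using the mean as the per-room inlet temperature is an assumption, since the article does not say which aggregate is computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-server samples from one collection round.
samples = [
    {"room": "room-1", "cabinet": "cab-01", "power_w": 300.0, "inlet_c": 22.0},
    {"room": "room-1", "cabinet": "cab-01", "power_w": 320.0, "inlet_c": 24.0},
    {"room": "room-1", "cabinet": "cab-02", "power_w": 280.0, "inlet_c": 23.0},
    {"room": "room-2", "cabinet": "cab-01", "power_w": 410.0, "inlet_c": 26.0},
]


def total_power_per_cabinet(samples):
    """Sum server power draw for each (room, cabinet) pair."""
    totals = defaultdict(float)
    for s in samples:
        totals[(s["room"], s["cabinet"])] += s["power_w"]
    return dict(totals)


def mean_inlet_per_room(samples):
    """Average inlet temperature per room (mean is an assumed choice)."""
    by_room = defaultdict(list)
    for s in samples:
        by_room[s["room"]].append(s["inlet_c"])
    return {room: mean(values) for room, values in by_room.items()}


power = total_power_per_cabinet(samples)
inlet = mean_inlet_per_room(samples)
```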
Alarm
For HVAC alarms, the system compares two consecutive temperature samples and triggers alerts not only on fixed thresholds but also on rapid temperature rise, allowing early warning before thresholds are breached.
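A minimal sketch of this two-sample check is shown below. The threshold of 30 °C and the rise limit of 1 °C per minute are placeholder values, and the `Sample` type and `check_temperature` function are hypothetical names; the article gives neither the real limits nor the implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    ts: float      # timestamp in seconds
    temp_c: float  # temperature in degrees Celsius


def check_temperature(prev: Sample, curr: Sample,
                      threshold_c: float = 30.0,
                      max_rise_c_per_min: float = 1.0) -> List[str]:
    """Compare two consecutive samples and return the alarms that fire."""
    alarms = []
    # Fixed-threshold alarm on the latest sample.
    if curr.temp_c >= threshold_c:
        alarms.append("threshold")
    # Rate-of-rise alarm; skipped if timestamps are not strictly increasing.
    dt_min = (curr.ts - prev.ts) / 60.0
    if dt_min > 0:
        rise_per_min = (curr.temp_c - prev.temp_c) / dt_min
        if rise_per_min >= max_rise_c_per_min:
            alarms.append("rapid_rise")
    return alarms


# Rising fast but still below the threshold: early warning only.
early = check_temperature(Sample(0, 24.0), Sample(60, 26.0))
# Above the threshold but nearly flat: threshold alarm only.
hot = check_temperature(Sample(0, 30.8), Sample(60, 31.0))
# Stable and cool: no alarm.
calm = check_temperature(Sample(0, 24.0), Sample(60, 24.1))
```

The first case is the one the rate check exists for: the room is still at 26 °C, well under the fixed limit, but a 2 °C jump in one minute is reported before the threshold is reached.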
2. Network Monitoring System
Alibaba's network comprises tens of thousands of devices across hundreds of data‑centers; rapid, accurate fault detection and convergence are essential for business continuity.
System Architecture
The network monitoring system consists of four parts: collection, computation, storage, and frontend. It gathers data such as PING, SNMP, SYSLOG, AAA, LVS, and ANAT, processes it to generate real‑time alerts, stores most data in HBase, and presents it via a web UI.
Collection
Cross‑PING domains are deployed in each security zone to probe devices in other zones, reducing false positives. Each collection node can probe up to 50,000 targets per second, and redundancy is provided through backup domains.
SNMP collection uses a single‑request‑per‑device policy with retry logic, automatically discovers ports and relationships, and aggregates metrics such as traffic, error packets, CPU, memory, and fan status.
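SNMP interface counters are cumulative, so a traffic figure is normally derived from the difference between two polls. The sketch below shows that general technique, including counter wrap-around; it is not taken from the article, and the function name `rate_bps` and its parameters are assumptions for illustration.

```python
def rate_bps(prev_octets: int, curr_octets: int, dt_s: float,
             counter_bits: int = 64) -> float:
    """Convert two cumulative octet counter readings into bits per second.

    counter_bits is 32 or 64 depending on which counter was polled.
    """
    if dt_s <= 0:
        raise ValueError("polling interval must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped past its maximum between the two polls.
        # A device reboot also resets counters and looks the same here;
        # a real collector would need to tell the two cases apart.
        delta += 2 ** counter_bits
    return delta * 8 / dt_s


# 7,500 octets in 60 seconds.
normal = rate_bps(1_000, 8_500, 60)
# A 32-bit counter that wrapped: 100 octets before the wrap, 400 after.
wrapped = rate_bps(2 ** 32 - 100, 400, 10, counter_bits=32)
```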
Syslog, AAA, LVS, and ANAT logs are forwarded from network devices to collection agents via intermediate machines.
Computation
A custom distributed framework elects a master node to schedule tasks; if the master fails, a new election is triggered and its tasks are reassigned. The computation layer aggregates the best PING results, ranks ports by SNMP traffic, and consolidates sub-port flows according to defined port sets.
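The three computations named above can be sketched in a few lines. Everything here is illustrative: the record layouts and function names are hypothetical, and treating "best" as lowest packet loss, then lowest round-trip time, is an assumption about how results from several PING domains are combined.

```python
def best_ping(results):
    """Pick the best probe per target from (target, domain, loss_pct, rtt_ms).

    "Best" is assumed to mean lowest loss, with RTT as the tie-breaker.
    """
    best = {}
    for target, domain, loss_pct, rtt_ms in results:
        candidate = (loss_pct, rtt_ms, domain)
        if target not in best or candidate < best[target]:
            best[target] = candidate
    return best


def top_ports(port_bps, n):
    """Rank ports by traffic, highest first, and keep the top n."""
    return sorted(port_bps.items(), key=lambda kv: kv[1], reverse=True)[:n]


def port_set_traffic(port_bps, port_sets):
    """Sum member-port traffic for each named port set."""
    return {
        name: sum(port_bps.get(port, 0) for port in members)
        for name, members in port_sets.items()
    }


# Hypothetical probe results from two PING domains.
results = [
    ("sw-1", "domain-a", 100.0, 0.0),  # domain-a cannot reach sw-1
    ("sw-1", "domain-b", 0.0, 1.2),    # domain-b can, so sw-1 is healthy
    ("sw-2", "domain-a", 0.0, 0.9),
    ("sw-2", "domain-b", 0.0, 0.7),
]
best = best_ping(results)

# Hypothetical per-port traffic in bits per second.
port_bps = {"eth1": 400, "eth2": 900, "eth3": 250, "eth4": 600}
ranked = top_ports(port_bps, 2)
sets = port_set_traffic(port_bps, {"uplink": ["eth1", "eth2"], "peer": ["eth3", "eth9"]})
```

Taking the best result across domains is one way to keep a single unreachable probe path from being reported as a device failure, which matches the false-positive reduction the collection section describes.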
Storage
Network-quality data is stored at sub-second granularity, while most other metrics are kept at minute granularity in HBase. Write throughput (TPS) is expected to grow as finer-grained monitoring expands.
Frontend
The UI displays data‑center and device health, port status and traffic, custom port‑set flows, LVS/ANAT traffic, and alerts, and supports user‑defined dashboards.
Alarm
Real‑time analysis and convergence algorithms generate root‑cause alerts for link up/down, port and port‑set watermarks, error packets, Cisco FEX status, device board status, PING loss, OTN, etc.
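The article does not describe the convergence algorithm itself. As one plausible illustration, the sketch below folds alerts from downstream devices under the topmost upstream device that is also alerting, so that a single root-cause group is reported. The `parent_of` topology map, the alert fields, and the device names are all hypothetical.

```python
def converge(alerts, parent_of):
    """Group alerts under the topmost alerting ancestor of each device.

    alerts:    list of dicts with at least a "device" key
    parent_of: maps a device to its upstream device (absent if none)
    Returns a dict mapping root device to the alerts folded under it.
    """
    alerting = {a["device"] for a in alerts}
    groups = {}
    for alert in alerts:
        device = alert["device"]
        root = device
        seen = {device}  # guards against loops in the topology map
        node = parent_of.get(device)
        while node is not None and node not in seen:
            if node in alerting:
                root = node
            seen.add(node)
            node = parent_of.get(node)
        groups.setdefault(root, []).append(alert)
    return groups


# Hypothetical topology: two access switches behind one aggregation switch.
parent_of = {"access-1": "agg-1", "access-2": "agg-1", "agg-1": "core-1"}
alerts = [
    {"device": "agg-1", "type": "link_down"},
    {"device": "access-1", "type": "ping_loss"},
    {"device": "access-2", "type": "ping_loss"},
    {"device": "access-9", "type": "error_packets"},
]
groups = converge(alerts, parent_of)
```

Here the two PING-loss alerts are reported under the `agg-1` link-down alert instead of as separate incidents, while the unrelated `access-9` alert stays on its own.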
Alibaba's intelligent monitoring system provides comprehensive real‑time visibility, enabling early risk detection and mitigation during large‑scale events such as Double‑11.
Alibaba Cloud Infrastructure
For uninterrupted computing services