
Alibaba IDC and Network Monitoring System Architecture and Practices

The article details Alibaba's globally distributed IDC and network monitoring systems, describing their fully distributed data collection, centralized computation, storage strategies, alarm mechanisms, and frontend visualization that together enable real‑time infrastructure and network health management for large‑scale operations.

Alibaba Cloud Infrastructure

Mar 15, 2017


1. IDC Monitoring System

Alibaba's global IDC infrastructure supports massive e-commerce events. Its monitoring must provide end-to-end visibility, from power and cooling down to server metrics. To achieve this, it combines fully distributed data collection with centralized computation and multi-datacenter disaster recovery.

System Architecture

Data flows are split into IT‑related and infrastructure‑related branches, as illustrated in the diagram.

Collection

The system builds a physical information tree covering every data center, room, rack, and server. It then deploys monitoring agents (Master and Monitor) that schedule and run metric collection scripts on each host.
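The physical information tree can be pictured with a small sketch. This is an illustration only: the `Node` class, the `build_tree` helper, and all host and rack names are hypothetical, and the article does not describe how the tree is actually stored.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """One level of the physical tree: datacenter, room, rack, or server."""
    name: str
    kind: str
    children: Dict[str, "Node"] = field(default_factory=dict)

    def child(self, name: str, kind: str) -> "Node":
        # Create the child on first use and reuse it afterwards.
        if name not in self.children:
            self.children[name] = Node(name, kind)
        return self.children[name]

    def servers(self) -> Iterator["Node"]:
        # Depth-first walk that yields only the leaf servers.
        if self.kind == "server":
            yield self
        for c in self.children.values():
            yield from c.servers()


LEVELS = ["room", "rack", "server"]  # levels below the data center


def build_tree(dc_name: str, paths: List[Tuple[str, str, str]]) -> Node:
    """Build a tree from (room, rack, server) paths (hypothetical input format)."""
    root = Node(dc_name, "datacenter")
    for path in paths:
        node = root
        for kind, name in zip(LEVELS, path):
            node = node.child(name, kind)
    return root


tree = build_tree("dc-a", [
    ("room-1", "rack-01", "host-001"),
    ("room-1", "rack-01", "host-002"),
    ("room-1", "rack-02", "host-003"),
    ("room-2", "rack-01", "host-004"),
])
room1_hosts = sorted(s.name for s in tree.children["room-1"].servers())
```

With a tree like this, selecting the hosts that should receive collection scripts for a given room or rack is a subtree walk.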

Collected metrics are sent from servers up through the monitor agents to a central collector, completing the data ingestion pipeline.

Computation

Incoming real-time data is grouped by business-specific dimensions such as cabinet, rack, room, and data center, then aggregated into metrics such as total power per cabinet or inlet temperature per room.
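A minimal sketch of this kind of dimensional aggregation follows. The record layout, the metric names `power_w` and `inlet_temp_c`, and the `aggregate` helper are assumptions made for illustration, not the system's real schema. Averaging the inlet temperature is also an assumption, since the article does not say which statistic is used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flat samples as they might arrive from the collectors.
samples = [
    {"dc": "dc-a", "room": "room-1", "cabinet": "cab-01", "metric": "power_w", "value": 320.0},
    {"dc": "dc-a", "room": "room-1", "cabinet": "cab-01", "metric": "power_w", "value": 280.0},
    {"dc": "dc-a", "room": "room-1", "cabinet": "cab-02", "metric": "power_w", "value": 410.0},
    {"dc": "dc-a", "room": "room-1", "cabinet": "cab-01", "metric": "inlet_temp_c", "value": 22.0},
    {"dc": "dc-a", "room": "room-1", "cabinet": "cab-02", "metric": "inlet_temp_c", "value": 24.0},
]


def aggregate(samples, metric, dims, reducer):
    """Group samples of one metric by the given dimensions and reduce each group."""
    groups = defaultdict(list)
    for s in samples:
        if s["metric"] == metric:
            key = tuple(s[d] for d in dims)
            groups[key].append(s["value"])
    return {key: reducer(values) for key, values in groups.items()}


# Total power per cabinet: sum over all servers in the cabinet.
power_per_cabinet = aggregate(samples, "power_w", ("dc", "room", "cabinet"), sum)

# Inlet temperature per room: mean over all sensors in the room.
inlet_per_room = aggregate(samples, "inlet_temp_c", ("dc", "room"), mean)
```

The same helper covers rack-level or data-center-level rollups by changing the `dims` tuple.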

Alarm

For HVAC alarms, the system compares two consecutive temperature samples and triggers alerts not only on fixed thresholds but also on rapid temperature rise, allowing early warning before thresholds are breached.
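The two-sample comparison can be sketched as follows. The threshold of 30 °C, the rise limit of 2 °C per minute, and the names `Sample` and `check_temperature` are illustrative assumptions, since the article gives no concrete values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    ts: float      # sample time in seconds
    temp_c: float  # temperature in degrees Celsius


def check_temperature(prev: Optional[Sample], cur: Sample,
                      max_temp_c: float = 30.0,
                      max_rise_c_per_min: float = 2.0) -> List[str]:
    """Return the alert kinds raised by the current sample."""
    alerts = []
    # Classic fixed-threshold check.
    if cur.temp_c >= max_temp_c:
        alerts.append("threshold")
    # Rate-of-rise check between two consecutive samples.
    if prev is not None and cur.ts > prev.ts:
        rise_per_min = (cur.temp_c - prev.temp_c) / (cur.ts - prev.ts) * 60.0
        if rise_per_min >= max_rise_c_per_min:
            alerts.append("rapid_rise")
    return alerts


# 24 C to 27 C within one minute: still below 30 C, but rising fast.
early_warning = check_temperature(Sample(0, 24.0), Sample(60, 27.0))
```

In this example the rise check fires while the reading is still 3 °C below the fixed threshold, which is the early-warning effect described above.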

2. Network Monitoring System

Alibaba's network comprises tens of thousands of devices across hundreds of data‑centers; rapid, accurate fault detection and convergence are essential for business continuity.

System Architecture

The network monitoring system consists of four parts: collection, computation, storage, and frontend. It gathers data such as PING, SNMP, SYSLOG, AAA, LVS, and ANAT, processes it to generate real‑time alerts, stores most data in HBase, and presents it via a web UI.

Collection

Cross‑PING domains are deployed in each security zone to probe devices in other zones, reducing false positives. Each collection node can probe up to 50,000 targets per second, and redundancy is provided through backup domains.
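One way cross-zone probing could reduce false positives is to declare a device unreachable only when every probing zone sees loss. The sketch below assumes exactly that rule. The article does not spell out the decision logic, so the `classify` function, the 50% loss threshold, and the zone names are hypothetical.

```python
from typing import Dict, List, Optional


def loss_ratio(rtts: List[Optional[float]]) -> float:
    """Fraction of lost probes; None marks a probe with no reply."""
    if not rtts:
        return 1.0
    return sum(1 for r in rtts if r is None) / len(rtts)


def classify(results_by_zone: Dict[str, List[Optional[float]]],
             loss_threshold: float = 0.5) -> str:
    """Combine probe results for one target as seen from several source zones."""
    if not results_by_zone:
        return "no_data"
    ratios = {zone: loss_ratio(rtts) for zone, rtts in results_by_zone.items()}
    bad = [zone for zone, ratio in ratios.items() if ratio >= loss_threshold]
    if len(bad) == len(ratios):
        return "unreachable"   # every vantage point agrees
    if bad:
        return "suspect_path"  # likely a path or prober problem, not the device
    return "healthy"


verdict = classify({
    "zone-a": [1.2, None, 1.1],     # RTTs in ms, one lost probe
    "zone-b": [None, None, None],   # all probes lost from this zone
})
```

Here the target still answers from `zone-a`, so the result is `suspect_path` rather than a device-down alert.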

SNMP collection uses a single‑request‑per‑device policy with retry logic, automatically discovers ports and relationships, and aggregates metrics such as traffic, error packets, CPU, memory, and fan status.
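The retry behaviour might look like the sketch below, which reads "single request per device" as "at most one outstanding request per device at a time". That reading is an assumption. The `fetch` callable stands in for a real SNMP client and is hypothetical, as are the attempt count and the returned field names.

```python
import time
from typing import Callable, Dict, Optional


def poll_device(fetch: Callable[[str], Dict[str, int]],
                device: str,
                max_attempts: int = 3,
                backoff_s: float = 0.0) -> Optional[Dict[str, int]]:
    """Poll one device sequentially, retrying on timeout.

    Requests are issued one after another, so a device never has more
    than one request in flight.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(device)
        except TimeoutError:
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between tries
    return None  # caller decides whether a missing sample raises an alert


# Fake client used only to exercise the retry path: fails twice, then answers.
calls = {"n": 0}


def flaky_fetch(device: str) -> Dict[str, int]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("no response")
    return {"if_in_octets": 1000, "cpu_pct": 12}


result = poll_device(flaky_fetch, "switch-01")
```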

Syslog, AAA, LVS, and ANAT logs are forwarded from network devices to collection agents via intermediate machines.

Computation

A custom distributed framework elects a master node to schedule tasks; if the master fails, a new election is triggered and tasks are reassigned. The computation layer aggregates the best PING results, ranks ports by SNMP traffic, and consolidates sub-port flows according to defined port sets.
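The port ranking and port-set consolidation steps can be illustrated with a short sketch. The device names, port names, port-set names, and traffic figures below are invented for the example, and treating a port set as a plain sum of its member ports is an assumption.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, str]  # (device, interface)

# Hypothetical per-port traffic in bits per second, derived from SNMP counters.
port_bps: Dict[Port, float] = {
    ("sw-01", "eth1/1"): 4.0e9,
    ("sw-01", "eth1/2"): 3.5e9,
    ("sw-02", "eth1/1"): 1.0e9,
    ("sw-02", "eth1/2"): 6.0e9,
}

# A port set groups member ports whose traffic should be viewed as one flow.
port_sets: Dict[str, List[Port]] = {
    "uplink-to-core": [("sw-01", "eth1/1"), ("sw-01", "eth1/2")],
    "dc-interconnect": [("sw-02", "eth1/1"), ("sw-02", "eth1/2")],
}


def port_set_traffic(port_bps: Dict[Port, float],
                     port_sets: Dict[str, List[Port]]) -> Dict[str, float]:
    """Sum member-port traffic for every defined port set."""
    return {
        name: sum(port_bps.get(port, 0.0) for port in members)
        for name, members in port_sets.items()
    }


def top_ports(port_bps: Dict[Port, float], n: int) -> List[Tuple[Port, float]]:
    """Rank ports by traffic, busiest first."""
    return sorted(port_bps.items(), key=lambda kv: kv[1], reverse=True)[:n]


set_traffic = port_set_traffic(port_bps, port_sets)
busiest = top_ports(port_bps, 2)
```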

Storage

Network-quality data is stored at sub-second granularity, while most other metrics are kept at minute granularity in HBase. TPS is expected to increase as finer-grained monitoring expands.

Frontend

The UI displays data‑center and device health, port status and traffic, custom port‑set flows, LVS/ANAT traffic, and alerts, and supports user‑defined dashboards.

Alarm

Real‑time analysis and convergence algorithms generate root‑cause alerts for link up/down, port and port‑set watermarks, error packets, Cisco FEX status, device board status, PING loss, OTN, etc.
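Alert convergence can take many forms. The sketch below shows one simple rule, assumed here for illustration: when a device has a PING-loss alert, that alert becomes the root cause and the other alerts on the same device are folded under it. The alert fields and type names are hypothetical, and the article does not describe the actual algorithms.

```python
from collections import defaultdict
from typing import Dict, List


def converge(alerts: List[Dict]) -> List[Dict]:
    """Collapse per-device alerts under a single root-cause alert when possible."""
    by_device = defaultdict(list)
    for alert in alerts:
        by_device[alert["device"]].append(alert)

    roots = []
    for device, group in by_device.items():
        unreachable = [a for a in group if a["type"] == "ping_loss"]
        if unreachable:
            # Device unreachable: report it once and count what was folded in.
            root = dict(unreachable[0])
            root["suppressed"] = len(group) - 1
            roots.append(root)
        else:
            # No device-level cause: pass every alert through unchanged.
            roots.extend(dict(a, suppressed=0) for a in group)
    return roots


raw_alerts = [
    {"device": "sw-01", "type": "ping_loss"},
    {"device": "sw-01", "type": "link_down", "port": "eth1/1"},
    {"device": "sw-01", "type": "link_down", "port": "eth1/2"},
    {"device": "sw-02", "type": "error_packets", "port": "eth1/5"},
]
root_alerts = converge(raw_alerts)
```

Four raw alerts become two: one root-cause alert for `sw-01` with two suppressed children, and the unrelated error-packet alert on `sw-02`.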

Alibaba's intelligent monitoring system provides comprehensive real‑time visibility, enabling early risk detection and mitigation during large‑scale events such as Double‑11.

