Network Quality Monitoring Center: Architecture, Design, and Implementation for Large-Scale Data Center Latency Measurement
The Network Quality Monitoring Center is a large‑scale system that deploys lightweight agents on every server to issue coordinated ICMP ping probes. A controller generates and distributes topology‑aware PingLists, and a storage‑and‑analysis module aggregates latency and loss data for real‑time visualization, alerting, and troubleshooting. The design also addresses load balancing, ingestion concurrency, and future extensions such as UDP/TCP probes.
Overview
The Network Quality Monitoring Center is a large‑scale system for measuring and analyzing network latency and packet loss in data‑center environments. Agents deployed on servers issue five ICMP ping probes to each target server, collect end‑to‑end latency and loss metrics, and push the results to a storage and analysis module for aggregation and alerting. A controller distributes PingLists (the set of target IPs for each agent) via the internal data‑center messaging channel.
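The per‑target summary an agent reports can be sketched as follows. This is a minimal illustration, not the real agent: the probe results are passed in as plain numbers, with `None` standing in for a lost packet, so the aggregation logic runs without actually sending ICMP traffic.

```python
# Minimal sketch of how an agent might collapse one round of five ICMP
# probes into the latency/loss metrics it reports upstream.
from statistics import mean
from typing import Dict, List, Optional

PROBES_PER_TARGET = 5  # the five ICMP probes per target described above

def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, float]:
    """Collapse one round of probes into latency/loss metrics.

    rtts_ms holds one entry per probe: a round-trip time in milliseconds,
    or None for a lost packet.
    """
    replies = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(replies) / len(rtts_ms)
    return {
        "avg_ms": mean(replies) if replies else float("nan"),
        "max_ms": max(replies) if replies else float("nan"),
        "loss": loss,
    }

# Example: five probes against one target, one packet lost.
print(summarize_probes([0.4, 0.5, None, 0.6, 0.5]))
```

A real agent would obtain the round‑trip times from raw ICMP sockets or the system `ping` utility and attach the target IP and a timestamp before pushing the record to storage.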
Background
Traditional three‑tier data‑center networks (core, aggregation, access) make fault isolation difficult as scale grows from dozens to tens of thousands of servers. Determining whether a performance issue originates in the network (congestion, packet loss) or on an overloaded server CPU requires systematic measurement. The Monitoring Center was created to simplify network operations and provide timely visibility into network health.
Components
Agent: Runs on each physical server, receives PingLists, performs Ping probes, and reports results. It must keep CPU usage below 5% while covering the majority of servers in the data center.
Controller: Acts as the task scheduler, generating PingLists based on topology, weighting ToR (Top‑of‑Rack) switches, and distributing them to agents. It also refreshes network topology periodically.
Storage & Analysis Module: Collects all Ping data, stores it, aggregates it at 10‑minute and 1‑hour intervals, and produces visualizations and alerts.
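The 10‑minute aggregation performed by the storage module can be sketched as simple time‑bucket averaging; the 1‑hour rollup works the same way with a larger window. The function names and input shape are illustrative, not the real pipeline.

```python
# Sketch of time-bucket aggregation at the 10-minute granularity; the
# 1-hour rollup is identical with WINDOW_S = 3600.
from collections import defaultdict

WINDOW_S = 600  # 10 minutes

def bucket(ts: int) -> int:
    """Map a unix timestamp to the start of its 10-minute window."""
    return ts - ts % WINDOW_S

def aggregate(samples):
    """samples: iterable of (timestamp, latency_ms) -> {window_start: avg_ms}"""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, latency in samples:
        acc = sums[bucket(ts)]
        acc[0] += latency
        acc[1] += 1
    return {w: s / n for w, (s, n) in sums.items()}

print(aggregate([(1000, 0.4), (1100, 0.6), (1700, 1.0)]))
```

In production this bucketing would typically run inside the storage engine (e.g. a time‑series database's downsampling) rather than in application code, but the windowing logic is the same.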
Design & Implementation
PingList generation follows three principles to avoid O(n²) probing: random intra‑ToR pairs, two servers per ToR pinging servers in other ToRs, and cross‑data‑center probes from a few core‑ToR servers. This reduces probe traffic while ensuring coverage.
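The three principles above can be sketched in code. The topology shape (`{tor_name: [server_ips]}`), the choice of the first two servers per ToR, and the `core_tor`/`remote_dc_ips` parameters are all illustrative assumptions, not the actual Producer implementation.

```python
# Hedged sketch of the three PingList principles: (1) a random pair inside
# each ToR, (2) two servers per ToR probing a server in every other ToR,
# (3) a few core-ToR servers probing other data centers.
import random

def build_pinglist(tors, core_tor, remote_dc_ips, rng=None):
    """tors: {tor_name: [server_ips]} for one data center."""
    rng = rng or random.Random(0)
    pinglist = {}  # source ip -> list of target ips

    def add(src, dst):
        pinglist.setdefault(src, []).append(dst)

    tor_names = list(tors)
    for name, servers in tors.items():
        # 1) random intra-ToR pair
        if len(servers) >= 2:
            a, b = rng.sample(servers, 2)
            add(a, b)
        # 2) two designated servers probe one server in every other ToR
        for src in servers[:2]:
            for other in tor_names:
                if other != name:
                    add(src, rng.choice(tors[other]))
    # 3) cross-data-center probes from the core ToR
    for src in tors[core_tor][:2]:
        for ip in remote_dc_ips:
            add(src, ip)
    return pinglist
```

With T ToRs this yields on the order of T² probe pairs instead of n² (n servers), which is why the scheme scales while still touching every ToR‑to‑ToR link.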
The controller consists of Timer, Watcher, Producer, Consumer, and Topology Keeper. Timer triggers updates; Watcher fetches topology metadata; Producer selects servers based on weighted ToR topology; Consumer delivers the PingList to agents; Topology Keeper discovers topology via SNMP.
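A minimal control‑loop sketch ties these components together. Only the orchestration order comes from the description above (Timer triggers, Watcher fetches topology, Producer builds PingLists, Consumer dispatches, with Topology Keeper refreshing topology via SNMP out of band); the function interfaces are assumptions, not the real vivo implementation.

```python
# Toy pass through one controller cycle; each stage is injected as a
# callable so the orchestration order can be shown without real backends.
def run_controller_cycle(fetch_topology, build_pinglists, dispatch):
    """One Timer tick: refresh topology, regenerate PingLists, push to agents."""
    topology = fetch_topology()            # Watcher
    pinglists = build_pinglists(topology)  # Producer (weighted-ToR selection)
    for agent, targets in pinglists.items():
        dispatch(agent, targets)           # Consumer

# Demo with stand-in implementations:
sent = {}
run_controller_cycle(
    fetch_topology=lambda: {"tor-1": ["10.0.1.1"], "tor-2": ["10.0.2.1"]},
    build_pinglists=lambda topo: {"10.0.1.1": ["10.0.2.1"]},
    dispatch=lambda agent, targets: sent.setdefault(agent, targets),
)
print(sent)  # {'10.0.1.1': ['10.0.2.1']}
```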
Figures in the original document illustrate the overall architecture, controller internals, latency & packet‑loss overviews, real‑time latency matrices, and top‑10 latency trends.
Latency & Packet‑Loss Analysis
The analysis module aggregates Ping data, visualizes latency and loss across data centers, and highlights abnormal links for rapid troubleshooting. Real‑time matrix views help pinpoint problematic ToR links.
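The ToR‑to‑ToR matrix view can be sketched as follows: rows and columns are ToRs, cells hold average latency, and cells above a threshold flag a problematic link. The threshold value and data shapes are illustrative assumptions.

```python
# Sketch of the latency matrix behind the real-time ToR view, plus a
# simple threshold check that surfaces abnormal links.
from collections import defaultdict

def latency_matrix(samples, threshold_ms=1.0):
    """samples: iterable of (src_tor, dst_tor, latency_ms).

    Returns ({(src, dst): avg_ms}, [links whose average exceeds threshold]).
    """
    acc = defaultdict(lambda: [0.0, 0])
    for src, dst, ms in samples:
        cell = acc[(src, dst)]
        cell[0] += ms
        cell[1] += 1
    matrix = {link: s / n for link, (s, n) in acc.items()}
    hot = [link for link, avg in matrix.items() if avg > threshold_ms]
    return matrix, hot

samples = [("tor-1", "tor-2", 0.4), ("tor-1", "tor-2", 0.6),
           ("tor-1", "tor-3", 3.2)]
matrix, hot = latency_matrix(samples)
print(hot)  # [('tor-1', 'tor-3')]
```

A dashboard would render `matrix` as a heatmap and highlight the `hot` links, which matches the troubleshooting workflow described above.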
Challenges
• Agent deployment faces load imbalance on heavily utilized servers, limiting coverage.
• High‑concurrency ingestion of Ping data strains the Web API and storage pipeline.
• Component failures can cripple the entire monitoring pipeline.
• ICMP‑only probing cannot precisely locate fault paths, and visualization dimensions are limited.
Future Optimizations
Planned improvements include supporting UDP/TCP probes, richer health‑check mechanisms, prioritized PingList generation based on server metadata, enhanced alert thresholds, and scaling the backend architecture to handle higher concurrency and data diversity.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.