Network Quality Monitoring Center: Architecture, Design, and Implementation for Large-Scale Data Center Latency Measurement
The Network Quality Monitoring Center is a large‑scale system that deploys lightweight agents on every server to issue coordinated ICMP ping probes. A controller generates and distributes topology‑aware PingLists, and a storage‑and‑analysis module aggregates latency and loss data for real‑time visualization, alerting, and troubleshooting. The design also addresses load balancing, ingestion concurrency, and future extensions such as UDP/TCP probes.
Overview
The Network Quality Monitoring Center is a large‑scale system for measuring and analyzing network latency and packet loss in data‑center environments. Agents deployed on servers issue five ICMP ping probes to each target server, collect end‑to‑end latency and loss metrics, and push the results to a storage and analysis module for aggregation and alerting. A controller distributes PingLists (the set of target IPs for each agent) via the internal data‑center messaging channel.
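The per‑target summary an agent reports can be sketched as follows. This is a minimal illustration, not the real agent: the probe results are passed in as plain numbers, with `None` standing in for a lost packet, so the aggregation logic runs without actually sending ICMP traffic.

```python
# Minimal sketch of how an agent might collapse one round of five ICMP
# probes into the latency/loss metrics it reports upstream.
from statistics import mean
from typing import Dict, List, Optional

PROBES_PER_TARGET = 5  # the five ICMP probes per target described above

def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, float]:
    """Collapse one round of probes into latency/loss metrics.

    rtts_ms holds one entry per probe: a round-trip time in milliseconds,
    or None for a lost packet.
    """
    replies = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(replies) / len(rtts_ms)
    return {
        "avg_ms": mean(replies) if replies else float("nan"),
        "max_ms": max(replies) if replies else float("nan"),
        "loss": loss,
    }

# Example: five probes against one target, one packet lost.
print(summarize_probes([0.4, 0.5, None, 0.6, 0.5]))
```

A real agent would obtain the round‑trip times from raw ICMP sockets or the system `ping` utility and attach the target IP and a timestamp before pushing the record to storage.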
Background
Traditional three‑tier data‑center networks (core, aggregation, access) make fault isolation difficult as scale grows from dozens to tens of thousands of servers. Determining whether a performance issue originates in the network (congestion, packet loss) or on an overloaded server CPU requires systematic measurement. The Monitoring Center was created to simplify network operations and provide timely visibility into network health.
Components
Agent: Runs on each physical server, receives PingLists, performs Ping probes, and reports results. It must keep CPU usage below 5% while covering the majority of servers in the data center.
Controller: Acts as the task scheduler, generating PingLists based on topology, weighting ToR (Top‑of‑Rack) switches, and distributing them to agents. It also refreshes network topology periodically.
Storage & Analysis Module: Collects all Ping data, stores it, aggregates it at 10‑minute and 1‑hour intervals, and produces visualizations and alerts.
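The 10‑minute aggregation performed by the storage module can be sketched as simple time‑bucket averaging; the 1‑hour rollup works the same way with a larger window. The function names and input shape are illustrative, not the real pipeline.

```python
# Sketch of time-bucket aggregation at the 10-minute granularity; the
# 1-hour rollup is identical with WINDOW_S = 3600.
from collections import defaultdict

WINDOW_S = 600  # 10 minutes

def bucket(ts: int) -> int:
    """Map a unix timestamp to the start of its 10-minute window."""
    return ts - ts % WINDOW_S

def aggregate(samples):
    """samples: iterable of (timestamp, latency_ms) -> {window_start: avg_ms}"""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, latency in samples:
        acc = sums[bucket(ts)]
        acc[0] += latency
        acc[1] += 1
    return {w: s / n for w, (s, n) in sums.items()}

print(aggregate([(1000, 0.4), (1100, 0.6), (1700, 1.0)]))
```

In production this bucketing would typically run inside the storage engine (e.g. a time‑series database's downsampling) rather than in application code, but the windowing logic is the same.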
Design & Implementation
PingList generation follows three principles to avoid O(n²) probing: random intra‑ToR pairs, two servers per ToR pinging servers in other ToRs, and cross‑data‑center probes from a few core‑ToR servers. This reduces probe traffic while ensuring coverage.
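The three principles above can be sketched in code. The topology shape (`{tor_name: [server_ips]}`), the choice of the first two servers per ToR, and the `core_tor`/`remote_dc_ips` parameters are all illustrative assumptions, not the actual Producer implementation.

```python
# Hedged sketch of the three PingList principles: (1) a random pair inside
# each ToR, (2) two servers per ToR probing a server in every other ToR,
# (3) a few core-ToR servers probing other data centers.
import random

def build_pinglist(tors, core_tor, remote_dc_ips, rng=None):
    """tors: {tor_name: [server_ips]} for one data center."""
    rng = rng or random.Random(0)
    pinglist = {}  # source ip -> list of target ips

    def add(src, dst):
        pinglist.setdefault(src, []).append(dst)

    tor_names = list(tors)
    for name, servers in tors.items():
        # 1) random intra-ToR pair
        if len(servers) >= 2:
            a, b = rng.sample(servers, 2)
            add(a, b)
        # 2) two designated servers probe one server in every other ToR
        for src in servers[:2]:
            for other in tor_names:
                if other != name:
                    add(src, rng.choice(tors[other]))
    # 3) cross-data-center probes from the core ToR
    for src in tors[core_tor][:2]:
        for ip in remote_dc_ips:
            add(src, ip)
    return pinglist
```

With T ToRs this yields on the order of T² probe pairs instead of n² (n servers), which is why the scheme scales while still touching every ToR‑to‑ToR link.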
The controller consists of Timer, Watcher, Producer, Consumer, and Topology Keeper. Timer triggers updates; Watcher fetches topology metadata; Producer selects servers based on weighted ToR topology; Consumer delivers the PingList to agents; Topology Keeper discovers topology via SNMP.
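A minimal control‑loop sketch ties these components together. Only the orchestration order comes from the description above (Timer triggers, Watcher fetches topology, Producer builds PingLists, Consumer dispatches, with Topology Keeper refreshing topology via SNMP out of band); the function interfaces are assumptions, not the real vivo implementation.

```python
# Toy pass through one controller cycle; each stage is injected as a
# callable so the orchestration order can be shown without real backends.
def run_controller_cycle(fetch_topology, build_pinglists, dispatch):
    """One Timer tick: refresh topology, regenerate PingLists, push to agents."""
    topology = fetch_topology()            # Watcher
    pinglists = build_pinglists(topology)  # Producer (weighted-ToR selection)
    for agent, targets in pinglists.items():
        dispatch(agent, targets)           # Consumer

# Demo with stand-in implementations:
sent = {}
run_controller_cycle(
    fetch_topology=lambda: {"tor-1": ["10.0.1.1"], "tor-2": ["10.0.2.1"]},
    build_pinglists=lambda topo: {"10.0.1.1": ["10.0.2.1"]},
    dispatch=lambda agent, targets: sent.setdefault(agent, targets),
)
print(sent)  # {'10.0.1.1': ['10.0.2.1']}
```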
Figures in the original document illustrate the overall architecture, controller internals, latency & packet‑loss overviews, real‑time latency matrices, and top‑10 latency trends.
Latency & Packet‑Loss Analysis
The analysis module aggregates Ping data, visualizes latency and loss across data centers, and highlights abnormal links for rapid troubleshooting. Real‑time matrix views help pinpoint problematic ToR links.
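The ToR‑to‑ToR matrix view can be sketched as follows: rows and columns are ToRs, cells hold average latency, and cells above a threshold flag a problematic link. The threshold value and data shapes are illustrative assumptions.

```python
# Sketch of the latency matrix behind the real-time ToR view, plus a
# simple threshold check that surfaces abnormal links.
from collections import defaultdict

def latency_matrix(samples, threshold_ms=1.0):
    """samples: iterable of (src_tor, dst_tor, latency_ms).

    Returns ({(src, dst): avg_ms}, [links whose average exceeds threshold]).
    """
    acc = defaultdict(lambda: [0.0, 0])
    for src, dst, ms in samples:
        cell = acc[(src, dst)]
        cell[0] += ms
        cell[1] += 1
    matrix = {link: s / n for link, (s, n) in acc.items()}
    hot = [link for link, avg in matrix.items() if avg > threshold_ms]
    return matrix, hot

samples = [("tor-1", "tor-2", 0.4), ("tor-1", "tor-2", 0.6),
           ("tor-1", "tor-3", 3.2)]
matrix, hot = latency_matrix(samples)
print(hot)  # [('tor-1', 'tor-3')]
```

A dashboard would render `matrix` as a heatmap and highlight the `hot` links, which matches the troubleshooting workflow described above.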
Challenges
• Agent deployment faces load imbalance on heavily utilized servers, limiting coverage.
• High‑concurrency ingestion of Ping data strains the Web API and storage pipeline.
• Component failures can cripple the entire monitoring pipeline.
• ICMP‑only probing cannot precisely locate fault paths, and visualization dimensions are limited.
Future Optimizations
Planned improvements include supporting UDP/TCP probes, richer health‑check mechanisms, prioritized PingList generation based on server metadata, enhanced alert thresholds, and scaling the backend architecture to handle higher concurrency and data diversity.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.