
External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article describes how 360 implements an external network quality monitoring system that uses CDN nodes as source hosts to perform minute‑level, end‑to‑end ping measurements, stores results in time‑series and other databases, analyzes them to detect VIP, data‑center or ISP faults, and generates visualized alerts and reports for operations teams.

360 Tech Engineering

The article introduces the need for rapid fault localization when users access 360 servers through ISP links, provincial data centers, and VIP endpoints, and explains that a dedicated external network quality monitoring system was built to address this challenge.

Key system characteristics include minute‑level real‑time detection, end‑to‑end continuous monitoring across all three major ISPs nationwide, full coverage of provincial networks, on‑demand task dispatch, and proactive alerting with precise fault type and impact scope identification.

The monitoring framework selects CDN machines as source hosts because they are positioned between users and backend servers; these CDN nodes periodically ping VIP addresses, collect latency and packet‑loss metrics, and store them in a time‑series database.
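As a rough illustration of the measurement step, the sketch below extracts packet loss and average latency from the summary that the Linux `ping` utility prints. The source does not say how probe output is parsed, so the `PingResult` type, the `parse_ping_summary` helper, and the sample address are assumptions made for this example only.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PingResult:
    vip: str
    loss_pct: float
    avg_rtt_ms: Optional[float]  # None when every packet was lost


# Patterns for the summary lines printed by Linux iputils ping.
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
_RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)")  # min/avg/max


def parse_ping_summary(vip: str, output: str) -> PingResult:
    """Hypothetical helper: turn ping's text summary into metrics."""
    loss = _LOSS_RE.search(output)
    if loss is None:
        raise ValueError("no packet-loss line found in ping output")
    rtt = _RTT_RE.search(output)
    avg = float(rtt.group(2)) if rtt else None
    return PingResult(vip, float(loss.group(1)), avg)


# Sample output using a documentation-only address (assumption).
SAMPLE = """\
--- 203.0.113.10 ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4005ms
rtt min/avg/max/mdev = 10.1/12.5/15.0/1.8 ms
"""

result = parse_ping_summary("203.0.113.10", SAMPLE)
print(result)
```

When all packets are lost, `ping` omits the `rtt` line, which is why the average latency is modeled as optional here.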

Data collection is performed by Wonder‑Agent, which is deployed on over 100,000 internal machines. Each agent retrieves the VIP list from a central gateway, executes the ping tasks, and reports the results back to the gateway. This gateway‑coordinated flow avoids ping storms and reduces load on both the VIPs and the agents.
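The source does not describe how ping storms are avoided. One common technique, shown here purely as an assumption, is to give every agent a stable start offset within the probe interval so that probes against a given VIP are spread across the minute instead of arriving together. The function name, the agent identifiers, and the 60-second interval are all hypothetical.

```python
import hashlib


def start_offset_seconds(agent_id: str, vip: str, interval: int = 60) -> int:
    """Return a deterministic offset in [0, interval) for this agent/VIP pair.

    Hashing the pair gives each agent a fixed slot inside the interval,
    so the offsets spread out without any coordination between agents.
    """
    digest = hashlib.sha256(f"{agent_id}|{vip}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % interval


# An agent would wait for its offset once, then probe every `interval` seconds.
offset = start_offset_seconds("cdn-node-001", "203.0.113.10")
print(f"first probe after {offset}s, then every 60s")
```

Because the offset depends only on the identifiers, an agent keeps the same slot across restarts, which keeps the measurement series evenly spaced.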

Collected data are stored in four layers: InfluxDB for real‑time metrics, MongoDB for aggregated fault data, MySQL for VIP whitelist information, and in‑memory structures for metadata such as province‑VIP mappings.

Alert strategies classify faults into three categories—VIP faults, data‑center faults, and ISP faults—based on packet‑loss thresholds and the proportion of affected provinces or VIPs, triggering alarms when conditions are met.
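The three fault categories can be sketched as ratio checks over one round of probe samples. The thresholds below (30% loss, 50% of sources, 80% of VIPs) and the exact rule shapes are assumptions for illustration; the source states only that packet-loss thresholds and the proportion of affected provinces or VIPs are used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Sample(NamedTuple):
    province: str
    isp: str
    datacenter: str
    vip: str
    loss_pct: float


# Hypothetical thresholds, not taken from the source.
LOSS_THRESHOLD = 30.0  # a sample counts as bad at or above this loss
SOURCE_RATIO = 0.5     # share of bad samples that marks a VIP or ISP as faulty
VIP_RATIO = 0.8        # share of faulty VIPs that marks a data center as faulty


def _bad_share(samples: List[Sample]) -> float:
    bad = sum(1 for s in samples if s.loss_pct >= LOSS_THRESHOLD)
    return bad / len(samples)


def classify(samples: Iterable[Sample]) -> Dict[str, List[str]]:
    by_vip, by_isp, vips_per_dc = defaultdict(list), defaultdict(list), defaultdict(set)
    for s in samples:
        by_vip[s.vip].append(s)
        by_isp[s.isp].append(s)
        vips_per_dc[s.datacenter].add(s.vip)

    faulty_vips = {v for v, group in by_vip.items() if _bad_share(group) >= SOURCE_RATIO}
    faulty_dcs = {
        dc for dc, vips in vips_per_dc.items()
        if len(vips & faulty_vips) / len(vips) >= VIP_RATIO
    }
    faulty_isps = {i for i, group in by_isp.items() if _bad_share(group) >= SOURCE_RATIO}
    return {
        "vip": sorted(faulty_vips),
        "datacenter": sorted(faulty_dcs),
        "isp": sorted(faulty_isps),
    }


# One VIP is unreachable from every source; everything else is healthy.
samples = [
    Sample("p1", "telecom", "dc1", "vip-a", 100.0),
    Sample("p2", "unicom", "dc1", "vip-a", 100.0),
    Sample("p1", "telecom", "dc1", "vip-b", 0.0),
    Sample("p2", "unicom", "dc1", "vip-b", 0.0),
    Sample("p1", "telecom", "dc2", "vip-c", 0.0),
    Sample("p2", "unicom", "dc2", "vip-c", 0.0),
]
print(classify(samples))
```

A production rule set would also need precedence between the categories, for example suppressing per-VIP alarms once a data-center fault is raised; that is left out of this sketch.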

Data analysis includes smoothing to filter out spikes, merging metric sets from multiple VIPs to form a clean dataset, applying the fault rules, generating alarm content, and supporting whitelist management.
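The smoothing method is not named in the source. A rolling median is one simple way to drop single-sample spikes while preserving sustained loss, and is used below only as an assumed stand-in; the function name and window size are illustrative.

```python
from statistics import median
from typing import List, Sequence


def rolling_median(values: Sequence[float], window: int = 3) -> List[float]:
    """Replace each point with the median of its neighbourhood.

    The window is centred on the point and truncated at the series edges.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


spike = [0, 0, 80, 0, 0]        # one-minute blip in packet loss (%)
sustained = [0, 60, 70, 80, 0]  # loss lasting several minutes

print(rolling_median(spike))
print(rolling_median(sustained))
```

With a window of three, an isolated spike is removed entirely, while several consecutive high-loss minutes survive smoothing and can still trip the fault rules.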

Visualization is provided via Grafana and a custom web UI, displaying metrics such as average, maximum, and minimum latency and packet‑loss rates from CDN nodes to VIPs, as well as nationwide ISP‑to‑data‑center latency maps and VIP whitelist status.

The system roadmap aims to add on‑demand detection task APIs, support multiple probe types (TCP, ICMP, HTTP), incorporate machine‑learning‑based anomaly detection, integrate automatic VIP switching, and monitor Layer 7 (application‑layer) URL latency and status.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
