External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies
The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.
Background Introduction
When users access 360 servers, traffic traverses ISP links, provincial data centers, and finally reaches the VIP; any failure in these segments can cause connectivity issues. The article explores how an external network quality monitoring system enables rapid, accurate fault localization for operations teams.
System Features
Real‑time detection at minute granularity.
End‑to‑end continuous monitoring reflecting true user‑to‑VIP network quality.
Full coverage of all provinces across the three major ISPs.
On‑demand task dispatch with flexible integration.
Proactive alarm generation for fast fault identification and impact assessment.
Multi‑perspective visualizations (CDN‑to‑VIP, ISP‑to‑VIP, data‑center‑to‑data‑center latency, packet loss, etc.).
System Framework
Source Host Selection – CDN Machines
To perform external probing, CDN nodes are chosen as source hosts because user machines cannot serve as probes and CDN nodes provide a strategic position to differentiate VIP, data‑center, and ISP faults.
System Principle
CDN machines periodically ping VIPs and store the results in a time‑series database. A fault‑determination module then analyzes the packet‑loss and average‑latency metrics to generate alerts.
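As a minimal sketch of how the two alerting metrics could be extracted from one probe, the snippet below parses the summary lines printed by the Linux iputils `ping` command. The function name, the regular expressions, and the sample text are illustrative assumptions; the article does not describe how the agent actually implements its ping module.

```python
import re

# Sample summary in the format printed by Linux iputils ping (illustrative data).
SAMPLE = """\
5 packets transmitted, 4 received, 20% packet loss, time 4005ms
rtt min/avg/max/mdev = 10.1/12.5/15.0/1.8 ms
"""

LOSS_RE = re.compile(r"([\d.]+)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_summary(text):
    """Return (loss_percent, avg_latency_ms).

    avg_latency_ms is None when no reply arrived, because ping then
    prints no rtt line.
    """
    loss_match = LOSS_RE.search(text)
    if loss_match is None:
        raise ValueError("no packet-loss line found")
    loss = float(loss_match.group(1))
    rtt_match = RTT_RE.search(text)
    avg = float(rtt_match.group(2)) if rtt_match else None
    return loss, avg
```

Returning `None` for latency on total loss keeps unreachable VIPs from being recorded with a misleading latency of zero.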
System Architecture Diagram
The overall architecture consists of three layers: Presentation, Data Collection, and Data Analysis.
Presentation layer: Grafana and a custom web UI (e.g., Watchman) for visualization.
Data collection layer: Uses the internal Wonder‑Agent and a big‑data gateway. Roughly 100,000 agents decide whether to launch the ping module based on host eligibility, and the eligible ones pull their VIP lists from the gateway.
Data analysis & alarm layer: InfluxDB‑HA provides minute‑level time‑series data; analysis modules apply fault‑judgment logic and generate alerts.
Task Dispatch
The Center gateway generates a VIP list for each CDN machine and partitions the lists so that machines in the same data center do not probe overlapping VIPs. Agents pull their lists hourly. This scheme reduces ping storms and data‑collection pressure.
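One way to picture the partitioning step is a round‑robin split, sketched below. The function name, host names, and the round‑robin scheme itself are assumptions for illustration; the article states only that lists do not overlap within a data center.

```python
def partition_vips(vips, hosts_by_datacenter):
    """Give each CDN host a slice of the VIP list.

    Hosts in the same data center receive disjoint slices that together
    cover every VIP; hosts in different data centers are independent.
    """
    assignments = {}
    for datacenter, hosts in hosts_by_datacenter.items():
        ordered_hosts = sorted(hosts)  # stable order so slices are repeatable
        for index, host in enumerate(ordered_hosts):
            # Every len(ordered_hosts)-th VIP, starting at this host's index.
            assignments[host] = vips[index::len(ordered_hosts)]
    return assignments


# Hypothetical inventory used only for this example.
vips = ["v1", "v2", "v3", "v4", "v5"]
hosts = {"dc-a": ["cdn-a1", "cdn-a2"], "dc-b": ["cdn-b1"]}
plan = partition_vips(vips, hosts)
```

With this split, each VIP is still probed once from every data center, while no two machines in the same data center ping the same VIP.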
Agent Data Collection
Agents execute ping operations periodically to obtain VIP availability and report results to the gateway.
Storage
InfluxDB time‑series database for real‑time metrics (latency, loss, etc.).
MongoDB for aggregated fault data (ISP, VIP, data‑center failures).
MySQL for whitelist data (VIP whitelisting).
In‑memory metadata (province‑VIP, data‑center‑VIP mappings).
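For the real‑time metrics, a probe result could be serialized in InfluxDB line protocol before it is written. The sketch below is illustrative only: the measurement name `vip_ping` and the tag and field names are hypothetical, and the article does not describe the actual schema.

```python
def escape(value):
    """Escape commas, equals signs, and spaces, as line protocol requires
    for tag keys, tag values, and field keys."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tags fields timestamp (nanoseconds)."""
    tag_part = ",".join(
        f"{escape(key)}={escape(val)}" for key, val in sorted(tags.items())
    )
    # All fields are written as floats in this sketch.
    field_part = ",".join(
        f"{escape(key)}={float(val)}" for key, val in fields.items()
    )
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "vip_ping",  # hypothetical measurement name
    {"isp": "telecom", "province": "beijing", "vip": "203.0.113.10"},
    {"loss": 20.0, "avg_ms": 12.5},
    1700000000000000000,
)
```

Keeping ISP, province, and VIP as tags rather than fields would let queries group by those dimensions, which matches the per‑province and per‑ISP views described elsewhere in the article.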
Alarm Strategy
Fault Strategies
Three fault‑judgment rules are applied:
VIP fault: an alert is triggered when one or more VIPs show >50% packet loss from >50% of provinces for a given ISP.
Data‑center fault: ≥2 ISPs are affected, or ≥⅓ of the VIPs in a data center show >50% loss.
ISP fault: ≥⅓ of provinces or ≥⅓ of VIPs under an ISP show >50% loss (higher thresholds for stronger confidence).
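The three rules can be sketched as simple threshold predicates. This is one reading of the rules, not the system's actual code: the function names are hypothetical, and the inputs are assumed to be already aggregated upstream (one loss percentage per province or per VIP for the scope being judged).

```python
from fractions import Fraction

LOSS_THRESHOLD = 50.0  # percent; a reading above this counts as lossy


def share_lossy(losses):
    """Exact fraction of readings whose loss exceeds LOSS_THRESHOLD."""
    if not losses:
        return Fraction(0)
    lossy = sum(1 for loss in losses if loss > LOSS_THRESHOLD)
    return Fraction(lossy, len(losses))


def vip_fault(province_losses):
    """One VIP under one ISP: more than half of the provinces are lossy."""
    return share_lossy(province_losses) > Fraction(1, 2)


def datacenter_fault(affected_isp_count, vip_losses):
    """At least two ISPs affected, or at least a third of the VIPs lossy."""
    return affected_isp_count >= 2 or share_lossy(vip_losses) >= Fraction(1, 3)


def isp_fault(province_losses, vip_losses):
    """At least a third of provinces, or a third of VIPs, lossy for one ISP."""
    third = Fraction(1, 3)
    return share_lossy(province_losses) >= third or share_lossy(vip_losses) >= third
```

`Fraction` is used so that boundary cases such as exactly one lossy province out of three compare exactly against ⅓ instead of depending on floating‑point rounding.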
Data Analysis & Alerting
1) Smooth filtering removes spike data.
2) Apply fault strategies on the cleaned dataset.
3) Generate alert content based on business aggregation.
4) Send alerts to designated business names or department IDs.
5) Allow whitelisting to suppress non‑critical alerts.
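The article does not name the smoothing algorithm used in step 1. As a stand‑in, the sketch below uses a sliding median filter, a common way to suppress isolated spikes in a latency or loss series; the function name and the window size of 3 are assumptions.

```python
from statistics import median


def median_smooth(values, window=3):
    """Replace each point with the median of its neighborhood.

    The window is truncated at the edges of the series, so the output
    has the same length as the input.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Illustrative per-minute latency series (ms) with a single spike at 95.
smoothed = median_smooth([10, 11, 95, 12, 10])
```

A single outlier cannot dominate a median window, so a one‑minute spike is removed, while a sustained rise lasting several minutes still passes through to the fault strategies.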
Visualization
Metrics displayed include average, maximum, minimum latency and packet loss from CDN machines to VIPs, as well as nationwide average latency per province per ISP.
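The per‑province, per‑ISP view amounts to a group‑by aggregation. A minimal sketch follows, assuming each sample is a hypothetical `(province, isp, latency_ms)` tuple; in the real system this aggregation would more likely be done by queries against the time‑series database.

```python
from collections import defaultdict
from statistics import mean


def latency_by_province_isp(samples):
    """Group latency samples by (province, isp) and summarize each group."""
    grouped = defaultdict(list)
    for province, isp, latency_ms in samples:
        grouped[(province, isp)].append(latency_ms)
    return {
        key: {"avg": mean(vals), "max": max(vals), "min": min(vals)}
        for key, vals in grouped.items()
    }


# Illustrative samples only.
stats = latency_by_province_isp([
    ("beijing", "telecom", 10.0),
    ("beijing", "telecom", 20.0),
    ("hebei", "unicom", 30.0),
])
```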
VIP Whitelisting
Whitelist management allows certain VIPs to be excluded from alerting.
Alert Content
Sample alert screenshots illustrate the format and details sent to stakeholders.
System Planning
The following enhancements are planned:
Provide an on‑demand detection task interface (detection is currently passive only).
Support multiple probing methods (TCP ping, ICMP ping, HTTP ping).
Add anomaly detection using machine‑learning for smarter latency analysis.
Integrate StackStorm for automatic VIP failover.
Monitor Layer‑7 (application‑layer) URL access latency and status.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.