External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.

360 Quality & Efficiency

Feb 28, 2020

Background Introduction

When users access 360 servers, traffic traverses ISP links, provincial data centers, and finally reaches the VIP; any failure in these segments can cause connectivity issues. The article explores how an external network quality monitoring system enables rapid, accurate fault localization for operations teams.

System Features

Real‑time detection at minute granularity.

End‑to‑end continuous monitoring reflecting true user‑to‑VIP network quality.

Full coverage of all provinces across the three major ISPs.

On‑demand task dispatch with flexible integration.

Proactive alarm generation for fast fault identification and impact assessment.

Multi‑perspective visualizations (CDN‑to‑VIP, ISP‑to‑VIP, data‑center‑to‑data‑center latency, packet loss, etc.).

System Framework

Source Host Selection – CDN Machines

To perform external probing, CDN nodes are chosen as source hosts because user machines cannot serve as probes and CDN nodes provide a strategic position to differentiate VIP, data‑center, and ISP faults.

System Principle

CDN machines periodically ping the VIPs and store the results in a time‑series database. A fault‑determination module then analyzes the data and generates alerts based on packet loss and average latency.

System Architecture Diagram

The overall architecture consists of three layers: presentation, data collection, and data analysis and alarm.

Presentation layer: Grafana and a custom web UI (e.g., Watchman) for visualization.

Data collection layer: Uses internal Wonder‑Agent and a big‑data gateway; ~100,000 agents launch ping modules based on host eligibility and pull VIP lists from the gateway.

Data analysis & alarm layer: InfluxDB‑HA provides minute‑level time‑series data; analysis modules apply fault‑judgment logic and generate alerts.

Task Dispatch

The Center gateway generates a VIP list for each CDN machine and partitions the lists so that machines in the same data center do not probe the same VIPs. Agents pull their lists hourly. This avoids ping storms and reduces data‑collection pressure.
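The partitioning step can be sketched as below. This is a minimal illustration that assumes a round-robin split; the source does not say how the gateway actually divides the list. The function name `partition_vips`, the host names, and the addresses are all hypothetical.

```python
from collections import defaultdict


def partition_vips(vips, hosts_by_dc):
    """Assign every VIP to exactly one CDN host per data center.

    vips: list of VIP addresses to probe.
    hosts_by_dc: dict mapping data-center name -> list of CDN host names.
    Returns a dict mapping host name -> list of VIPs that host should ping.
    Round-robin is an assumption, not the documented algorithm.
    """
    assignments = defaultdict(list)
    ordered_vips = sorted(vips)  # stable order so hourly pulls are repeatable
    for hosts in hosts_by_dc.values():
        if not hosts:
            continue  # nothing to assign in an empty data center
        for i, vip in enumerate(ordered_vips):
            assignments[hosts[i % len(hosts)]].append(vip)
    return dict(assignments)


# Hypothetical inventory: two hosts in one data center, one in another.
plan = partition_vips(
    ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    {"dc-a": ["cdn-a1", "cdn-a2"], "dc-b": ["cdn-b1"]},
)
```

Within `dc-a` the two hosts receive disjoint lists that together cover every VIP, while the single host in `dc-b` probes all of them.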

Agent Data Collection

Agents execute ping operations periodically to obtain VIP availability and report results to the gateway.
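A single probe of this kind might look like the sketch below. It assumes the agent shells out to the Linux iputils `ping` command and parses its summary lines; the Wonder‑Agent ping module itself is not described in the source, and `parse_ping_summary` and `ping_vip` are hypothetical names.

```python
import re
import subprocess

# Patterns for the summary printed by Linux iputils ping (assumed format).
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)")  # min/avg/max


def parse_ping_summary(output):
    """Extract packet loss and latency from ping's summary text."""
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    result = {
        # No summary at all is treated as total loss.
        "loss_pct": float(loss.group(1)) if loss else 100.0,
        "min_ms": None,
        "avg_ms": None,
        "max_ms": None,
    }
    if rtt:  # the rtt line is absent when every packet is lost
        result["min_ms"], result["avg_ms"], result["max_ms"] = (
            float(g) for g in rtt.groups()
        )
    return result


def ping_vip(vip, count=5, timeout_s=10):
    """Ping one VIP and return the parsed metrics."""
    try:
        proc = subprocess.run(
            ["ping", "-c", str(count), "-q", vip],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return parse_ping_summary("")
    return parse_ping_summary(proc.stdout)
```

The parser is kept separate from the subprocess call so it can be exercised on captured output without network access.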

Storage

InfluxDB time‑series database for real‑time metrics (latency, loss, etc.).

MongoDB for aggregated fault data (ISP, VIP, data‑center failures).

MySQL for whitelist data (VIP whitelisting).

In‑memory metadata (province‑VIP, data‑center‑VIP mappings).

Alarm Strategy

Fault Strategies

Three fault‑judgment rules are applied:

VIP fault: One or more VIPs with >50% packet loss in >50% of provinces for a given ISP triggers an alert.

Data‑center fault: triggered when ≥2 ISPs are affected, or when ≥⅓ of the VIPs in a data center show >50% loss.

ISP fault: ≥⅓ of provinces or ≥⅓ of VIPs under an ISP show >50% loss (higher thresholds for stronger confidence).
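The three rules can be expressed as simple threshold checks. The sketch below is an interpretation under stated assumptions: the input layout (tuples of VIP, ISP, province, loss) is hypothetical, a province counts as lossy for a VIP if any of its samples exceeds the threshold, and the data-center and ISP checks take pre-computed counts because the source does not say how those counts are derived.

```python
from collections import defaultdict


def vip_faults(samples, loss_threshold=50.0, province_ratio=0.5):
    """Return (vip, isp) pairs where >50% of provinces see >50% loss.

    samples: iterable of (vip, isp, province, loss_pct) tuples (assumed shape).
    """
    seen = defaultdict(set)   # (vip, isp) -> provinces that probed it
    lossy = defaultdict(set)  # (vip, isp) -> provinces above the threshold
    for vip, isp, province, loss_pct in samples:
        seen[(vip, isp)].add(province)
        if loss_pct > loss_threshold:
            lossy[(vip, isp)].add(province)
    return sorted(
        key for key in seen
        if len(lossy[key]) > province_ratio * len(seen[key])
    )


def at_least_one_third(part, total):
    """True when part/total >= 1/3, using integer math to avoid rounding."""
    return total > 0 and 3 * part >= total


def is_dc_fault(affected_isps, lossy_vips, total_vips):
    """Data-center rule: >=2 ISPs affected, or >=1/3 of its VIPs lossy."""
    return affected_isps >= 2 or at_least_one_third(lossy_vips, total_vips)


def is_isp_fault(lossy_provinces, total_provinces, lossy_vips, total_vips):
    """ISP rule: >=1/3 of provinces, or >=1/3 of VIPs, lossy under that ISP."""
    return (at_least_one_third(lossy_provinces, total_provinces)
            or at_least_one_third(lossy_vips, total_vips))
```

For example, a VIP that shows 80%, 60%, and 10% loss from three provinces on one ISP is flagged, because two of three provinces exceed the loss threshold.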

Data Analysis & Alerting

1) Smooth filtering removes spike data.

2) Apply fault strategies on the cleaned dataset.

3) Generate alert content based on business aggregation.

4) Send alerts to designated business names or department IDs.

5) Allow whitelisting to suppress non‑critical alerts.
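The first and last steps of this pipeline can be sketched as follows. The source does not name the smoothing filter, so a sliding median is used here as one plausible choice. The function names and the alert dictionary layout are hypothetical.

```python
from statistics import median


def median_smooth(values, window=3):
    """Replace each point with the median of its neighbourhood.

    A one-minute spike is suppressed, while a sustained rise survives.
    The window is truncated at the edges of the series.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def drop_whitelisted(alerts, whitelist):
    """Suppress alerts whose VIP is on the whitelist."""
    return [a for a in alerts if a["vip"] not in whitelist]


# A single-minute spike disappears; three bad minutes in a row do not.
spike = median_smooth([0, 0, 100, 0, 0])
outage = median_smooth([0, 100, 100, 100, 0])
```

With a window of three minutes, an isolated loss reading is discarded before the fault rules run, which trades about one minute of detection delay for fewer false alerts.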

Visualization

Metrics displayed include average, maximum, minimum latency and packet loss from CDN machines to VIPs, as well as nationwide average latency per province per ISP.
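The per-province, per-ISP view amounts to a group-by over the probe results. In production this would more likely be an InfluxDB query behind Grafana; the plain-Python sketch below only shows the aggregation itself, with a hypothetical sample layout.

```python
from collections import defaultdict
from statistics import mean


def latency_by_province_isp(samples):
    """Aggregate average latency into avg/max/min per (province, isp).

    samples: iterable of dicts with "province", "isp", and "avg_ms" keys
    (assumed shape). Probes with no latency (total loss) are skipped.
    """
    groups = defaultdict(list)
    for s in samples:
        if s["avg_ms"] is not None:
            groups[(s["province"], s["isp"])].append(s["avg_ms"])
    return {
        key: {"avg": mean(vals), "max": max(vals), "min": min(vals)}
        for key, vals in groups.items()
    }


stats = latency_by_province_isp([
    {"province": "beijing", "isp": "telecom", "avg_ms": 10.0},
    {"province": "beijing", "isp": "telecom", "avg_ms": 20.0},
    {"province": "beijing", "isp": "unicom", "avg_ms": None},
])
```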

VIP Whitelisting

Whitelist management allows certain VIPs to be excluded from alerting.

Alert Content

Sample alert screenshots illustrate the format and details sent to stakeholders.

System Planning

The roadmap covers five directions:

Provide on‑demand detection task interface (currently only passive).

Support multiple probing methods (TCP ping, ICMP ping, HTTP ping).

Add anomaly detection using machine‑learning for smarter latency analysis.

Integrate StackStorm for automatic VIP failover.

Monitor Layer‑7 (application‑layer) URL access latency and status.

