How Proactive Link Monitoring Transforms Cloud Network Reliability
This article explains Huawei Cloud Stack's proactive link monitoring system, detailing its point‑line‑plane architecture, golden metrics of packet loss and latency, detection techniques, system components, and key innovations such as strategy optimization, alarm aggregation, and visualized performance dashboards for cloud data‑center networks.
Background
In cloud data‑center environments, IaaS cloud networking is the communication foundation for all services. Reliable cloud networking requires comprehensive, high‑performance, real‑time monitoring that covers every forwarding element, path, and service. Existing tools do not meet all of these monitoring demands, so some faults go undetected in real deployments.
Typical Failure Cases
Case 1: After an upgrade of cloud platform network components, a latent malfunction in an endpoint forwarding element interrupted traffic in a rarely exercised scenario, causing a business outage of more than an hour.
Case 2: A physical network change introduced a routing rejection issue that was invisible to physical‑network monitoring but caused a two‑hour service disruption.
Case 3: A sudden traffic burst from one tenant saturated a gateway, raising latency for other tenants by up to 20 ms; network monitoring did not detect the delay until those tenants reported failures.
All cases share the characteristic that individual element metrics appear normal while the combined network service experiences problems.
Proactive Link Monitoring System (Point → Line → Plane)
The system monitors:
Point: Physical and software elements, tracking CPU, memory, packet I/O, error/packet loss, forwarding tables, and resource usage to verify KPI health.
Line: Physical links, virtual links, and tenant traffic flows.
Plane: Unified view of cloud services, consolidating management, data, and tenant components into a single monitoring pane.
By automatically generating monitoring objects based on service topology, the system can detect faults quickly, turning uncertain fault detection into deterministic identification and reducing fault‑localization time from hours to minutes.
Golden Metrics: Packet Loss and Latency
Packet loss and latency directly reflect network forwarding capability and user experience. High loss leads to retransmissions and instability; high latency makes applications sluggish. These metrics must be measured actively or passively across devices, fabrics, data centers, and cross‑DC links.
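As a minimal sketch of how the two golden metrics could be derived from active probe results, the snippet below computes a loss rate and a nearest-rank latency percentile. The `ProbeResult` type and both function names are hypothetical and chosen for illustration; the source does not describe how the system actually computes these values.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """One probe attempt (hypothetical type).

    rtt_ms is None when no reply arrived, which is counted as a lost probe.
    """
    rtt_ms: Optional[float]


def loss_rate(results: Sequence[ProbeResult]) -> float:
    """Fraction of probes that received no reply, in the range [0, 1]."""
    if not results:
        raise ValueError("no probe results")
    lost = sum(1 for r in results if r.rtt_ms is None)
    return lost / len(results)


def latency_percentile(results: Sequence[ProbeResult], pct: float) -> float:
    """Nearest-rank percentile of round-trip time over answered probes."""
    rtts = sorted(r.rtt_ms for r in results if r.rtt_ms is not None)
    if not rtts:
        raise ValueError("no answered probes")
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(rtts)))
    return rtts[rank - 1]
```

Reporting a high percentile alongside the median is a common design choice, because a latency increase that affects only some tenants, as in Case 3, can be invisible in an average.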
Link Detection Techniques
Traditional black‑box probing (ICMP/TCP) only reports end‑to‑end reachability and cannot pinpoint internal failures when services remain reachable. Colored‑packet probing mirrors traffic at each element, capturing per‑node packet counts and delays, enabling precise fault isolation.
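The fault-isolation idea can be illustrated with a small sketch: if every element on a path reports how many colored packets it saw, the first adjacent pair whose count drops marks where loss begins. The function name, the `tolerance` parameter, and the element names in the test data are assumptions made for illustration, not part of the described system.

```python
from typing import Optional, Sequence, Tuple


def first_lossy_hop(
    path_counts: Sequence[Tuple[str, int]],
    tolerance: int = 0,
) -> Optional[Tuple[str, str, int]]:
    """Locate the first hop on a path where colored packets go missing.

    path_counts lists (element name, packets seen) in forwarding order.
    Returns (upstream, downstream, dropped) for the first adjacent pair whose
    count falls by more than `tolerance`, or None if no such pair exists.
    """
    for (up, up_count), (down, down_count) in zip(path_counts, path_counts[1:]):
        dropped = up_count - down_count
        if dropped > tolerance:
            return up, down, dropped
    return None
```

A result such as `("vswitch-1", "gateway", 3)` narrows the fault to the link between the two elements or to the downstream element itself, which end-to-end reachability probing alone cannot do.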
System Architecture
The architecture consists of a Server side and multiple Agents.
Monitoring Scenarios: Daily continuous monitoring and upgrade‑specific monitoring.
Network Topology: Full map of switches, compute nodes, and software‑element ports.
Strategy List: Five‑tuple definitions (source IP, destination IP, protocol, source port, destination port) for each probe.
Probe Controller: Dispatches probing tasks based on the strategy list.
Probe Analyzer: Collects probe results and feeds them back to refine the strategies for better coverage.
Probe Agent: Injects colored packets and captures mirrored traffic on each node.
ERSPAN: Physical switches mirror colored packets to the Analyzer for unified analysis.
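To make the relationship between the Strategy List and the Probe Controller concrete, here is a minimal sketch in which each strategy is a five-tuple and the controller groups strategies by the Agent that owns the source IP. The `ProbeStrategy` type, the `dispatch` function, and the IP-to-Agent mapping are hypothetical; the source does not specify how tasks are assigned to Agents.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ProbeStrategy:
    """One probe definition expressed as a five-tuple (hypothetical type)."""
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


def dispatch(
    strategies: Iterable[ProbeStrategy],
    agent_by_ip: Dict[str, str],
) -> Dict[str, List[ProbeStrategy]]:
    """Group strategies into per-Agent task lists.

    Assumption: the Agent responsible for a probe is the one registered
    for the probe's source IP.
    """
    tasks: Dict[str, List[ProbeStrategy]] = defaultdict(list)
    for strategy in strategies:
        agent = agent_by_ip.get(strategy.src_ip)
        if agent is None:
            raise KeyError(f"no agent registered for source IP {strategy.src_ip}")
        tasks[agent].append(strategy)
    return dict(tasks)
```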
Key Innovations
1. Strategy Optimization
Initial strategies may miss certain elements; the Analyzer iteratively refines the five‑tuple set based on probe feedback, eventually covering all elements or revealing permanently isolated nodes.
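One way to picture this refinement loop is as a greedy coverage problem: keep adding the strategy whose observed path covers the most still-uncovered elements, and stop when nothing new can be reached. This is an illustrative substitute, not the Analyzer's documented algorithm; the function name and the assumption that each strategy's traversed elements are already known from probe feedback are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def refine_strategies(
    all_elements: Set[str],
    observed: Dict[str, FrozenSet[str]],
    initial: List[str],
) -> Tuple[List[str], Set[str]]:
    """Greedily extend a strategy set until coverage stops improving.

    observed maps a strategy id to the elements its probe traversed.
    Returns (selected strategy ids, elements no strategy can reach).
    """
    selected = list(initial)
    covered: Set[str] = set()
    for strategy_id in selected:
        covered |= observed[strategy_id]
    remaining = set(observed) - set(selected)

    while True:
        uncovered = all_elements - covered
        if not uncovered or not remaining:
            break
        # sorted() makes tie-breaking deterministic (lowest id wins).
        best = max(sorted(remaining), key=lambda s: len(observed[s] & uncovered))
        if not observed[best] & uncovered:
            break  # no remaining strategy reaches anything new
        selected.append(best)
        remaining.discard(best)
        covered |= observed[best]

    return selected, all_elements - covered
```

Elements left in the second return value correspond to the nodes that no probe path reaches, which the text describes as permanently isolated.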
2. Alarm Aggregation
When a single element fails, many probes converge on it, generating duplicate alarms. The system aggregates these into a single fault event, reducing noise and speeding up root‑cause identification.
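A minimal sketch of this aggregation, assuming each alarm already names the suspected element: alarms for the same element that arrive within a time window are merged into one fault event. The `Alarm` and `FaultEvent` types and the 60-second default window are assumptions for illustration; the source does not state the aggregation key or window the system uses.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alarm:
    """A single probe alarm (hypothetical type)."""
    probe_id: str
    element: str
    timestamp: float  # seconds


@dataclass
class FaultEvent:
    """Aggregated alarms that point at the same element (hypothetical type)."""
    element: str
    first_seen: float
    last_seen: float
    probe_ids: List[str]


def aggregate(alarms: Iterable[Alarm], window_s: float = 60.0) -> List[FaultEvent]:
    """Merge alarms on the same element that arrive within window_s of each other."""
    events: List[FaultEvent] = []
    latest: Dict[str, FaultEvent] = {}  # element -> most recent event for it
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        event = latest.get(alarm.element)
        if event is not None and alarm.timestamp - event.last_seen <= window_s:
            # Same element, still inside the window: extend the existing event.
            event.last_seen = alarm.timestamp
            event.probe_ids.append(alarm.probe_id)
        else:
            event = FaultEvent(alarm.element, alarm.timestamp, alarm.timestamp, [alarm.probe_id])
            latest[alarm.element] = event
            events.append(event)
    return events
```

Keeping the contributing probe ids on the event preserves the evidence for root-cause analysis while operators see one notification per faulty element.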
3. Visualized Metrics
Dashboards display latency and loss for virtual links and individual elements. Interactive topology maps highlight healthy paths in green and problematic ones in red, allowing users to drill down into specific time windows (30 min, 1 h, 1 day, 1 month).
Overall, proactive link monitoring enhances cloud network observability by covering both service‑level and element‑level SLAs, enabling early detection of performance degradation and ensuring high‑quality user experiences.
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.