Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

58 Tech

Mar 25, 2019

The 58 monitoring platform safeguards service stability by monitoring network devices, servers, ports, processes, business clusters, and user‑side load times around the clock across the entire enterprise.

As monitoring coverage expands, the number of alerts rises, leading to duplicate, persistent, or correlated alerts that overwhelm engineers and hinder rapid problem resolution.

Alarm Convergence

1. Alarm Judgment – To curb alert storms at the source, the platform compares real‑time metrics against thresholds and supports multi‑point judgment: an alert fires only when a configurable number of consecutive points (n) exceed the threshold, or when a certain number of points within a history window (m) do. This filters out noise from transient spikes.

2. Alarm Interval and Escalation – Repeated alerts are throttled by a minimum interval (e.g., 5 minutes) and a maximum repeat count. If the issue persists beyond a set period (e.g., 30 minutes), the alert is upgraded to a higher severity; if it is still unresolved after a day, a daily reminder is sent. The alert lifecycle has four states: PROBLEM, UPGRADE, REMINDER, and OK, where OK marks recovery after three consecutive normal points.
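
The multi‑point judgment in item 1 can be sketched as follows. This is a minimal illustration, not the platform's implementation: the class name, the parameter names, and the reading of the window rule as "at least min_breaches of the last window points" are all assumptions.

```python
from collections import deque


class MultiPointJudge:
    """Fire only when enough recent points breach the threshold.

    Hypothetical sketch: names and defaults are illustrative only.
    """

    def __init__(self, threshold, consecutive=3, window=5, min_breaches=4):
        self.threshold = threshold
        self.consecutive = consecutive        # breaches required in a row
        self.min_breaches = min_breaches      # breaches required inside the window
        self.history = deque(maxlen=window)   # breach flags for the last `window` points
        self.streak = 0                       # current run of consecutive breaches

    def observe(self, value):
        """Record one data point and return True if an alert should fire."""
        breached = value > self.threshold
        self.streak = self.streak + 1 if breached else 0
        self.history.append(breached)
        by_streak = self.streak >= self.consecutive
        by_window = sum(self.history) >= self.min_breaches
        return by_streak or by_window
```

With these assumed defaults, a metric that alternates above and below the threshold never fires, while three breaches in a row, or four breaches among the last five points, do.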
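
The interval, escalation, and recovery rules in item 2 can be read as a small state machine. The sketch below is one possible interpretation under stated assumptions: timestamps are in seconds, routine repeats stop once the alert has been upgraded, and the class and parameter names are hypothetical.

```python
class AlertNotifier:
    """Decide which notification, if any, to send for one alert rule.

    Hypothetical sketch: the exact ordering of rules is an assumption.
    """

    def __init__(self, min_interval=300, max_repeats=3, upgrade_after=1800,
                 reminder_every=86400, recovery_points=3):
        self.min_interval = min_interval        # e.g. 5 minutes between repeats
        self.max_repeats = max_repeats          # cap on PROBLEM notifications
        self.upgrade_after = upgrade_after      # e.g. 30 minutes before UPGRADE
        self.reminder_every = reminder_every    # daily REMINDER
        self.recovery_points = recovery_points  # normal points needed for OK
        self._reset()

    def _reset(self):
        self.first_seen = None
        self.last_sent = None
        self.last_reminder = None
        self.repeats = 0
        self.upgraded = False
        self.normal_streak = 0

    def evaluate(self, ts, abnormal):
        """Return 'PROBLEM', 'UPGRADE', 'REMINDER', 'OK', or None."""
        if not abnormal:
            if self.first_seen is None:
                return None                     # nothing was wrong
            self.normal_streak += 1
            if self.normal_streak >= self.recovery_points:
                self._reset()
                return "OK"
            return None
        self.normal_streak = 0
        if self.first_seen is None:             # first abnormal point
            self.first_seen = self.last_sent = ts
            self.repeats = 1
            return "PROBLEM"
        age = ts - self.first_seen
        if age >= self.reminder_every:          # unresolved for a day or more
            anchor = self.first_seen if self.last_reminder is None else self.last_reminder
            if ts - anchor >= self.reminder_every:
                self.last_reminder = ts
                return "REMINDER"
            return None
        if not self.upgraded and age >= self.upgrade_after:
            self.upgraded = True
            return "UPGRADE"
        if self.upgraded:
            return None                         # assumed: no routine repeats after upgrade
        if self.repeats < self.max_repeats and ts - self.last_sent >= self.min_interval:
            self.repeats += 1
            self.last_sent = ts
            return "PROBLEM"
        return None
```

Feeding this one abnormal point per minute with the assumed defaults yields PROBLEM at 0, 300, and 600 seconds, then UPGRADE at 1800 seconds; three normal points afterwards produce OK.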

Alarm Merging

When large‑scale incidents generate many alerts (e.g., high load across a cluster or network failures affecting many servers), the platform merges them to highlight common root causes. Instead of fixed merging rules, it adopts a decision‑tree approach that minimizes the Gini index to select the most discriminative dimension for splitting.

The Gini index measures impurity; a lower value indicates a more homogeneous group. The platform computes the Gini index for each dimension using the distribution of alert attributes, chooses the dimension with the smallest index, splits the data, and repeats recursively until a stopping condition is met, forming an alarm merging tree.
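
The recursive procedure above can be sketched in a few lines. This is one plausible reading rather than the platform's actual algorithm: it assumes each alert is a dictionary of dimension values, takes the Gini index of a dimension to be 1 minus the sum of squared value proportions, and stops when a group has fewer than min_size alerts or no dimensions remain. The function names and the stopping rule are hypothetical.

```python
from collections import Counter, defaultdict


def gini(values):
    """Gini impurity 1 - sum(p_k ** 2) over the distinct values in `values`."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def build_merge_tree(alerts, dimensions, min_size=2):
    """Recursively split alerts on the dimension with the lowest Gini index."""
    # Assumed stopping condition: group too small, or nothing left to split on.
    if len(alerts) < min_size or not dimensions:
        return {"count": len(alerts)}
    # The most homogeneous dimension is the one most alerts have in common.
    best = min(dimensions, key=lambda d: gini([a[d] for a in alerts]))
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[best]].append(alert)
    remaining = [d for d in dimensions if d != best]
    return {
        "count": len(alerts),
        "split_on": best,
        "children": {
            value: build_merge_tree(group, remaining, min_size)
            for value, group in groups.items()
        },
    }


# Illustrative data only: four load alerts spread over two clusters.
alerts = [
    {"metric": "load", "cluster": "a", "host": "h1"},
    {"metric": "load", "cluster": "a", "host": "h2"},
    {"metric": "load", "cluster": "b", "host": "h3"},
    {"metric": "load", "cluster": "b", "host": "h4"},
]
tree = build_merge_tree(alerts, ["metric", "cluster", "host"])
```

In this toy input every alert shares the same metric, so that dimension has a Gini index of 0 and is chosen first; the cluster dimension is chosen next, and the host dimension, where every value is unique, comes last.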

For example, 50 raw alerts can be grouped into five merged alerts. Overall, merging reduces the total number of messages by about 70% while preserving essential information such as affected clusters and abnormal ratios.

Fault Self‑Healing

Common faults such as disk‑space exhaustion or missing processes have predefined remediation scripts. When an alert triggers, the platform can execute a user‑defined recovery command. If the command succeeds, no alert is sent; otherwise, the normal alert flow proceeds. This automation reduces interruptions for engineers and accelerates fault resolution.
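
The "heal first, alert only on failure" flow might look like the sketch below. It assumes that success means the recovery command exits with status 0 within a timeout; the function name, the send_alert callback, and the timeout value are hypothetical and not taken from the platform.

```python
import subprocess


def handle_alert(alert, recovery_command, send_alert, timeout=30):
    """Run the recovery command; fall back to the normal alert flow if it fails.

    `recovery_command` is an argument list such as ["systemctl", "restart", "myapp"]
    (an illustrative command, not one named by the source).
    Returns True if the fault was healed and no alert was sent.
    """
    try:
        result = subprocess.run(
            recovery_command,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        healed = result.returncode == 0   # assumed success criterion
    except (OSError, subprocess.TimeoutExpired):
        healed = False                    # missing command or hung script
    if not healed:
        send_alert(alert)                 # normal alert flow proceeds
    return healed
```

Passing the command as an argument list rather than a shell string avoids shell‑injection problems when alert fields are interpolated into the command.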

Conclusion

The 58 monitoring platform is a critical component for ensuring service stability. By continuously evolving its monitoring architecture, it balances rapid detection, comprehensive coverage, and alert quality, leveraging advanced techniques such as alarm convergence, Gini‑based merging, and automated self‑healing to keep operations efficient and reliable.
