
Design and Architecture of a Unified Alert Convergence System for Monitoring

This article presents a unified alert convergence system that centralizes metric calculation, detection, and alarm handling across monitoring subsystems. It combines convergence, claiming, silencing, and escalation with a Redis‑based delayed queue, integrates via Kafka or REST, and aims to reduce alarm fatigue, improve MTTA/MTTR, and lay the groundwork for future AIOps capabilities.

vivo Internet Technology

Background: Monitoring systems consist of detection and alarm components. Detection discovers anomalies, while alarms notify responsible personnel. In the first generation (vivo monitoring 1.0), each subsystem maintained its own calculation, storage, detection, and alarm convergence logic, leading to data silos and limited applicability.

Previously, monitoring was divided into basic monitoring, generic monitoring, tracing, log monitoring, and synthetic monitoring. The goal of unified monitoring is to centralize metric calculation, storage, detection, alarm generation, and visualization.

Current Situation: In the 1.0 architecture, each subsystem performed alarm convergence and then interfaced with an old alarm center (see Figure 1). This resulted in duplicated rule sets and redundant functionality.

Key Concepts in the monitoring workflow:

Anomaly : When one or more metric values exceed a defined threshold within a detection window, an anomaly is generated. Example: a 6‑3 rule (6 data points, at least 3 exceed the threshold) creates an anomaly when satisfied.

Problem : A collection of similar anomalies occurring over a continuous period forms a problem. One problem may correspond to many anomalies.

Alarm : Notification (SMS, email, etc.) sent to users when a problem is detected.

Recovery : When a problem’s anomalies no longer meet the detection rule, the problem is considered recovered, and a recovery notice is sent.
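
The 6‑3 rule described above can be sketched as a sliding-window check. This is a minimal illustration, not the system's actual detector: the function names, the strict "greater than" comparison, and the sample values are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Iterable, List


def violates_rule(window: Iterable[float], threshold: float, min_exceed: int) -> bool:
    """Return True when at least `min_exceed` points in the window exceed the threshold."""
    return sum(1 for value in window if value > threshold) >= min_exceed


def detect(values: Iterable[float], threshold: float,
           window_size: int = 6, min_exceed: int = 3) -> List[bool]:
    """Evaluate the rule on a sliding window; one result per full window."""
    window: Deque[float] = deque(maxlen=window_size)
    results: List[bool] = []
    for value in values:
        window.append(value)
        # Only judge once the window holds `window_size` points.
        if len(window) == window_size:
            results.append(violates_rule(window, threshold, min_exceed))
    return results


# Hypothetical metric samples: three spikes, then the metric settles.
samples = [1, 9, 1, 9, 1, 9, 1, 1, 1, 1, 1]
flags = detect(samples, threshold=5)
print(flags)  # [True, True, False, False, False, False]
```

In this reading, consecutive True results belong to the same problem, and the first False after them corresponds to recovery, since the window no longer satisfies the detection rule.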

Important operational metrics (Figure 2) include MTTD, MTTA, MTTF, MTTR, MTBF. The article focuses on MTTA (Mean Time To Acknowledge) and MTTR (Mean Time To Repair).

MTTA measures the average time the operations or development team takes to acknowledge an issue after it is detected. Formula (Figure 4): MTTA = Σ t[i] / Σ r[i], where t[i] is the total response time for the incidents of the i‑th service and r[i] is the total number of incidents for that service.

MTTR measures the average time to restore a service to normal operation after an incident. Formula (Figure 5): MTTR = Σ t[ri] / Σ r[i], where t[ri] is the total time from alarm to restoration for the i‑th service and r[i] is the number of alarms for that service.
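
As a worked example of the two formulas, the sketch below reads t[i] as the total time accumulated for the i‑th service and r[i] as that service's incident count, which is one interpretation of the definitions above. The helper name `mean_time` and all figures are hypothetical.

```python
from typing import Sequence


def mean_time(total_times: Sequence[float], counts: Sequence[int]) -> float:
    """Sum of per-service times divided by the sum of per-service incident counts."""
    total_count = sum(counts)
    if total_count == 0:
        raise ValueError("no incidents recorded")
    return sum(total_times) / total_count


# Hypothetical figures for two services, in minutes.
ack_minutes = [30.0, 12.0]      # total time to acknowledge, per service
repair_minutes = [120.0, 48.0]  # total time to restore, per service
incidents = [4, 2]              # number of incidents, per service

mtta = mean_time(ack_minutes, incidents)     # 42 / 6
mttr = mean_time(repair_minutes, incidents)  # 168 / 6
print(mtta, mttr)  # 7.0 28.0
```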

Alarm fatigue arises when a flood of alerts overwhelms operators; as in the "cry wolf" fable, operators stop trusting alerts and the alerts lose their effectiveness.

Functional Design (Section 4) addresses the reduction of alarm volume and improvement of alarm accuracy through several mechanisms:

Alarm Convergence : Combine multiple alerts into a single notification using strategies such as first‑alert waiting time, alarm interval, anomaly convergence dimensions, and message merge dimensions (see Figure 8).

Alarm Claiming : Once an alert is claimed by a person, subsequent identical alerts are sent only to the claimant.

Alarm Silencing : Suppress alerts for a known problem during a defined period or during deployments.

Alarm Callback : Trigger a callback interface to automatically remediate the service when an alert fires.

False‑Alarm Tagging : Mark alerts as false positives to improve detection models.

Alarm Escalation : If an alert remains unresolved beyond a threshold, automatically elevate it to higher‑level personnel to reduce MTTA.
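
How silencing, claiming, and escalation interact when choosing recipients can be sketched as a single decision function. This is an illustrative sketch only: the `Problem` fields, the order of the checks, and the choice to escalate only while a problem is unclaimed are assumptions, not details confirmed by the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Problem:
    key: str                          # identifies "identical" alerts
    owners: List[str]                 # default recipients
    opened_at: float                  # epoch seconds when the problem was created
    claimed_by: Optional[str] = None  # set once someone claims the alert
    # (start, end) epoch-second windows during which alerts are suppressed
    silences: List[Tuple[float, float]] = field(default_factory=list)


def recipients(problem: Problem, now: float,
               escalation_after: float, managers: List[str]) -> List[str]:
    """Decide who should be notified for this problem at time `now`."""
    # Silencing: inside any silence window, notify nobody.
    if any(start <= now < end for start, end in problem.silences):
        return []
    # Claiming: a claimed problem only notifies its claimant.
    if problem.claimed_by is not None:
        return [problem.claimed_by]
    # Escalation (assumed trigger): still unclaimed after the threshold,
    # so higher-level personnel are added to the recipients.
    if now - problem.opened_at >= escalation_after:
        return problem.owners + managers
    return list(problem.owners)


problem = Problem(key="cpu-high", owners=["alice"], opened_at=0.0)
print(recipients(problem, now=10.0, escalation_after=600.0, managers=["lead"]))
```

Checking silences first means a deployment window suppresses even claimed or escalated alerts; a real system might order these checks differently.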

Architecture (Section 5) proposes a loosely coupled unified alarm service that can be integrated via Kafka or RESTful APIs. Core components include:

Convergence Service (receives anomalies, creates problems, stores them in MySQL).

Redis delayed queue to control when aggregated messages are released.

Message assembly module that formats the final alert.

Configurable dispatch channels (SMS, email, etc.).
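
The convergence step, in which incoming anomalies are grouped into problems, can be sketched as follows. The class and field names are hypothetical, an in-memory dictionary stands in for the MySQL storage, and the Kafka or REST intake is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Anomaly:
    labels: Dict[str, str]  # e.g. service, host, rule (hypothetical label names)
    timestamp: float
    value: float


@dataclass
class Problem:
    key: Tuple[str, ...]
    anomalies: List[Anomaly] = field(default_factory=list)


class ConvergenceService:
    """Groups anomalies into problems by a configured set of dimensions."""

    def __init__(self, dimensions: List[str]) -> None:
        self.dimensions = dimensions
        # In-memory stand-in for persistent problem storage.
        self.open_problems: Dict[Tuple[str, ...], Problem] = {}

    def ingest(self, anomaly: Anomaly) -> Tuple[Problem, bool]:
        """Attach the anomaly to its problem; return (problem, is_new)."""
        key = tuple(anomaly.labels.get(d, "") for d in self.dimensions)
        problem = self.open_problems.get(key)
        is_new = problem is None
        if problem is None:
            problem = Problem(key=key)
            self.open_problems[key] = problem
        problem.anomalies.append(anomaly)
        return problem, is_new


service = ConvergenceService(dimensions=["service", "rule"])
first, first_is_new = service.ingest(
    Anomaly({"service": "pay", "rule": "cpu", "host": "a"}, 1.0, 95.0))
second, second_is_new = service.ingest(
    Anomaly({"service": "pay", "rule": "cpu", "host": "b"}, 2.0, 97.0))
print(first is second, first_is_new, second_is_new)  # True True False
```

Because `host` is not one of the configured dimensions here, anomalies from different hosts of the same service and rule converge into one problem.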

Core Implementation (Section 6) emphasizes the use of a Redis delayed queue for alarm convergence. The queue orders problems by their expiration timestamp; a listener extracts expired items, assembles the final message, and dispatches it.
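
The delayed-queue behaviour can be illustrated with an in-memory stand-in. The article does not say which Redis structure is used; a common approach is a sorted set scored by expiration timestamp (ZADD to schedule, ZRANGEBYSCORE to find due items, ZREM to remove them), and the heap below only mimics that ordering. The class and method names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class DelayedQueue:
    """In-memory stand-in for a delayed queue ordered by expiration time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []  # (expire_at, problem_id)
        self._expire_at: Dict[str, float] = {}    # current deadline per problem

    def schedule(self, problem_id: str, expire_at: float) -> None:
        """Add a problem; re-scheduling replaces its earlier deadline."""
        self._expire_at[problem_id] = expire_at
        heapq.heappush(self._heap, (expire_at, problem_id))

    def pop_expired(self, now: float) -> List[str]:
        """Remove and return problem ids whose deadline is <= now, earliest first."""
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            expire_at, problem_id = heapq.heappop(self._heap)
            # Skip stale heap entries left behind by a re-schedule.
            if self._expire_at.get(problem_id) == expire_at:
                del self._expire_at[problem_id]
                due.append(problem_id)
        return due


queue = DelayedQueue()
queue.schedule("problem-1", expire_at=10.0)
queue.schedule("problem-2", expire_at=5.0)
print(queue.pop_expired(now=10.0))  # ['problem-2', 'problem-1']
```

A listener would call `pop_expired` periodically with the current time, then assemble and dispatch one merged message per returned problem.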

Future Outlook (Section 7) suggests enhancing the system with AIOps techniques for smarter convergence and root‑cause analysis, while continuing to improve data flow and automation.

References are provided for MTTR/MTTA definitions, alarm management best practices, and Redis delayed queue implementations.
