How 360’s Unified Alert Service Boosts System Reliability and Cuts MTTR

This article explains the importance, pain points, architecture, core capabilities, and future roadmap of the 360 Zhihui Cloud "Yunzhou" unified alert service, showing how it improves observability, reduces alert noise, and accelerates incident response for modern cloud‑native systems.

360 Zhihui Cloud Developer

Feb 27, 2025

In observability systems, alerting is a crucial pillar for ensuring business system stability and reliability. Timely detection of anomalies or potential risks enables rapid notification of administrators and operators, helping to prevent issue escalation and lower MTTR.

Background

1.1 Importance of Alerting Service

Business systems stay stable by collecting monitoring data through instrumentation and push mechanisms and visualizing it for operators. When anomalies appear, operators step in to locate and fix the problem, completing a data → person → action loop. However, roughly 99% of users cannot realistically watch monitoring data 24/7; alerts bridge this gap by notifying them when failures occur.

1.2 Pain Points of Alerting Service

As systems become more complex, alert overload becomes a major issue: excessive or irrelevant alerts drown out critical ones, reducing efficiency and risking missed failures. In short, "too many alerts are equivalent to no alerts."

To address this, the Yunzhou observability product provides an efficient unified alert system whose alerts are timely, comprehensive, non‑redundant, and routed to the right people.

Alert Service Architecture

Data Collection: The system gathers metrics from tens of thousands of servers, dozens of cloud services, business instrumentation, and user‑cloud interactions using a self‑developed Arkit probe combined with the open‑source Prometheus ecosystem, plus built‑in exporters, Telegraf, JMX, OpenTelemetry, scripts, logs, synthetic checks, and custom collectors.

Data Storage: Metric Store, built on an open‑source time‑series database, provides dual‑write, periodic data aging, high availability, and optimized read/write performance, with WebUI and PromQL query support. Log Store, a 360‑developed data warehouse that separates storage from compute, holds logs and traces and supports massive real‑time ingestion, analysis, and federated SQL queries.

Alert Module: Users configure alert rules, strategies, notification targets, silencing, callbacks, convergence, mute periods, and confirmations via a unified web UI. The module focuses on the “last mile” of monitoring, ensuring reliable delivery of actionable alerts.

Event Center: Aggregates, processes, and analyzes historical alerts from various sources, offering detailed retrieval, statistical summaries, trend analysis, and AI‑driven root‑cause analysis.

Core Alert Capabilities

3.1 Alert Convergence

With tens of thousands of daily alerts, convergence reduces noise by filtering out transient or duplicate events. Three main techniques are used:

Alert Duration Determination: Users set a minimum continuous duration; only alerts persisting longer than this threshold are fired, reducing false positives.

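The duration setting interacts with the rule's detection interval and the data collection cycle. Because the condition is only checked at each detection run, an alert can fire later than the configured duration alone would suggest.

Example: CPU usage is collected every minute, and a rule checked every 2 minutes fires when usage stays above 90% for at least 3 minutes.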

(1) 09:00 CPU < 90% → inactive
(2) 09:02 CPU > 90% → pending, first trigger time recorded
(3) 09:04 CPU still > 90% → duration 2 min < 3 min → still pending
(4) 09:06 CPU still > 90% → duration 4 min ≥ 3 min → firing, alert sent
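The timeline above can be reproduced with a small state machine. This is a minimal sketch, not the Yunzhou implementation: the class name `DurationRule`, its fields, and the sample CPU values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DurationRule:
    """Hypothetical rule: fire only after the breach has lasted `hold`."""
    threshold: float
    hold: timedelta
    pending_since: Optional[datetime] = None
    state: str = "inactive"

    def evaluate(self, now: datetime, value: float) -> str:
        if value <= self.threshold:
            # Condition cleared: forget the first trigger time.
            self.pending_since = None
            self.state = "inactive"
        elif self.pending_since is None:
            # First breach: record the first trigger time.
            self.pending_since = now
            self.state = "pending"
        elif now - self.pending_since >= self.hold:
            # Breach has lasted long enough: send the alert.
            self.state = "firing"
        else:
            self.state = "pending"
        return self.state


rule = DurationRule(threshold=90.0, hold=timedelta(minutes=3))
# One sample per 2-minute check; the CPU values are made up.
samples = [("09:00", 72.0), ("09:02", 95.0), ("09:04", 96.0), ("09:06", 97.0)]
states = [rule.evaluate(datetime.strptime(t, "%H:%M"), v) for t, v in samples]
print(states)  # ['inactive', 'pending', 'pending', 'firing']
```

Note that the alert fires at 09:06 rather than 09:05: the 3-minute mark falls between two checks, so the first check that can observe it is the one at 09:06.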

First Alert Wait: After an anomaly is detected, the system waits a configurable period before sending the alert, allowing multiple related events to be merged.
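One way to picture the first alert wait is as a per-group batching window. The sketch below assumes that "related" events share a grouping key (the article does not say how relatedness is defined) and that the function name, tuple layout, and sample events are all hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]        # (timestamp in seconds, group key, message)
Batch = Tuple[float, str, List[str]]  # (send time, group key, merged messages)


def merge_with_first_wait(events: List[Event], wait: float) -> List[Batch]:
    """Hold the first event of each group for `wait` seconds and merge
    every event of the same group that arrives before the send time."""
    open_batches: Dict[str, Tuple[float, List[str]]] = {}
    sent: List[Batch] = []
    for ts, key, message in sorted(events):
        current = open_batches.get(key)
        if current is not None and ts <= current[0]:
            current[1].append(message)  # still inside the wait window
            continue
        if current is not None:
            sent.append((current[0], key, current[1]))  # window already closed
        open_batches[key] = (ts + wait, [message])
    for key, (send_at, messages) in open_batches.items():
        sent.append((send_at, key, messages))
    return sorted(sent, key=lambda batch: batch[0])


events = [
    (0.0, "db-cluster", "node-1 down"),
    (5.0, "db-cluster", "node-2 down"),
    (8.0, "web", "5xx spike"),
    (50.0, "db-cluster", "replica lag"),
]
batches = merge_with_first_wait(events, wait=30.0)
for send_at, key, messages in batches:
    print(send_at, key, messages)
```

With a 30-second wait, the two database events that arrive 5 seconds apart go out as one notification, while the event at 50 seconds starts a new window.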

Alert Silence (Interval): Before an alert is resolved, the system can throttle repeated notifications based on a configured interval.
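The interval-based throttling can be sketched as a lookup of the last send time per alert. The class `RepeatThrottle`, its method names, and the alert identifier are illustrative assumptions, not part of the Yunzhou API.

```python
from typing import Dict


class RepeatThrottle:
    """Hypothetical throttle: at most one notification per `interval`
    seconds for each alert that is still unresolved."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, alert_id: str, now: float) -> bool:
        last = self._last_sent.get(alert_id)
        if last is not None and now - last < self.interval:
            return False  # notified too recently, stay quiet
        self._last_sent[alert_id] = now
        return True

    def resolve(self, alert_id: str) -> None:
        # Once resolved, the next occurrence notifies immediately.
        self._last_sent.pop(alert_id, None)


throttle = RepeatThrottle(interval=600)
# The alert keeps firing and is re-checked every 120 seconds for 20 minutes.
sent_at = [t for t in range(0, 1201, 120)
           if throttle.should_notify("cpu-high@host-a", t)]
print(sent_at)  # [0, 600, 1200]
```

Eleven checks produce only three notifications, which is the noise reduction the interval is meant to deliver.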

3.2 Alert Confirmation

When an alert is claimed by a person, subsequent identical alerts are suppressed, preventing duplicate notifications and reducing handling effort.
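A minimal sketch of claim-based suppression follows. It assumes that "identical" means the same rule name and the same label set, combined into a fingerprint string; the article does not specify how identity is determined, and `ClaimRegistry` and `fingerprint` are hypothetical names.

```python
from typing import Dict, Optional


def fingerprint(rule: str, labels: Dict[str, str]) -> str:
    """Build a stable identity from the rule name and sorted labels."""
    return rule + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


class ClaimRegistry:
    """Tracks which alerts have been claimed, and by whom."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def claim(self, fp: str, owner: str) -> None:
        self._owners[fp] = owner

    def release(self, fp: str) -> None:
        self._owners.pop(fp, None)

    def owner(self, fp: str) -> Optional[str]:
        return self._owners.get(fp)

    def should_notify(self, fp: str) -> bool:
        # Claimed alerts are already being handled, so repeats are suppressed.
        return fp not in self._owners


registry = ClaimRegistry()
cpu_a = fingerprint("cpu_high", {"host": "host-a", "app": "checkout"})
cpu_b = fingerprint("cpu_high", {"host": "host-b", "app": "checkout"})

registry.claim(cpu_a, owner="alice")
print(registry.should_notify(cpu_a))  # False
print(registry.should_notify(cpu_b))  # True
```

Only the claimed alert is suppressed; the same rule firing on a different host still notifies, because its fingerprint differs.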

3.3 Alert Escalation

If an alert remains unacknowledged and unresolved for a configured time, it is automatically escalated to higher‑level personnel.
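Escalation can be sketched as picking a tier from an ordered contact chain based on how long the alert has gone unhandled. The chain contents, the fixed step between tiers, and the names `OpenAlert` and `current_tier` are assumptions for illustration; the article only says that escalation happens after a configured time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OpenAlert:
    fired_at: float             # seconds since some reference point
    acknowledged: bool = False
    resolved: bool = False


def current_tier(alert: OpenAlert, now: float, chain: List[str], step: float) -> str:
    """Return who should be notified at `now`.

    Each `step` seconds without acknowledgement or resolution moves the
    alert one tier up the chain, stopping at the last tier.
    """
    if alert.acknowledged or alert.resolved:
        return chain[0]  # someone is handling it: no escalation
    level = int((now - alert.fired_at) // step)
    return chain[min(level, len(chain) - 1)]


chain = ["on-call engineer", "team lead", "department head"]  # hypothetical tiers
alert = OpenAlert(fired_at=0.0)

print(current_tier(alert, now=600.0, chain=chain, step=900.0))   # on-call engineer
print(current_tier(alert, now=900.0, chain=chain, step=900.0))   # team lead
print(current_tier(alert, now=5000.0, chain=chain, step=900.0))  # department head
```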

3.4 Alert Silencing

Specific problems can be silenced for a period or schedule, preventing unnecessary alerts during deployments or known outages.
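A silence can be modeled as a set of label matchers plus a time window. The sketch below covers a one-off period only (recurring schedules are left out), and the `Silence` class, its fields, and the label names are assumptions rather than the product's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]  # every listed label must match exactly
    starts_at: datetime
    ends_at: datetime

    def mutes(self, labels: Dict[str, str], now: datetime) -> bool:
        in_window = self.starts_at <= now < self.ends_at
        matches = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and matches


# Silence the (hypothetical) checkout application during a deployment window.
deploy_silence = Silence(
    matchers={"app": "checkout"},
    starts_at=datetime(2025, 2, 27, 2, 0),
    ends_at=datetime(2025, 2, 27, 3, 0),
)

alert_labels = {"app": "checkout", "host": "web-1"}
print(deploy_silence.mutes(alert_labels, datetime(2025, 2, 27, 2, 30)))       # True
print(deploy_silence.mutes(alert_labels, datetime(2025, 2, 27, 3, 0)))        # False
print(deploy_silence.mutes({"app": "search"}, datetime(2025, 2, 27, 2, 30)))  # False
```

The end of the window is exclusive here, so alerts resume exactly at 03:00; that boundary choice is also an assumption.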

3.5 Alert Callback

Configured callbacks invoke external interfaces when an alert fires, enabling automated remediation actions to restore service quickly.
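As an illustration, a callback could be an HTTP POST carrying the alert as JSON. The article does not describe the real callback contract, so the payload shape, the function name `build_callback_request`, and the URL below are all hypothetical. The sketch only builds the request so that it runs without a network connection.

```python
import json
import urllib.request
from typing import Dict


def build_callback_request(url: str, alert: Dict[str, str]) -> urllib.request.Request:
    """Prepare a JSON POST describing a firing alert (assumed payload shape)."""
    body = json.dumps({"status": "firing", "alert": alert}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_callback_request(
    "https://example.com/hooks/restart-service",  # placeholder endpoint
    {"rule": "cpu_high", "host": "host-a"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending is left to the caller, for example:
#   urllib.request.urlopen(request, timeout=5)
```

In practice the receiving service would run the remediation action, such as restarting a process or scaling out, and the caller would want a timeout and retry policy around the send.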

Alert Product Features

4.1 Configuring Alert Rules

The platform offers a unified entry for rule configuration, flexible data filtering, default rule templates, and support for callback interfaces. Built‑in availability monitoring rules are provided out‑of‑the‑box.

4.2 Alert Strategies

Strategies define notification levels, targets, time windows, and mute intervals, ensuring alerts are sent promptly to the right people.
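A strategy can be pictured as a record that maps an alert level and a time-of-day window to notification targets. This sketch leaves out mute intervals and uses invented names (`Strategy`, `route`) and invented targets; it is not the Yunzhou configuration schema.

```python
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass
class Strategy:
    level: str            # alert level this strategy handles
    targets: List[str]    # who or what gets notified
    window_start: time    # start of the daily window (inclusive)
    window_end: time      # end of the daily window (exclusive)

    def applies(self, level: str, at: time) -> bool:
        return level == self.level and self.window_start <= at < self.window_end


def route(strategies: List[Strategy], level: str, at: time) -> List[str]:
    """Collect the targets of every strategy that applies."""
    targets: List[str] = []
    for strategy in strategies:
        if strategy.applies(level, at):
            targets.extend(strategy.targets)
    return targets


strategies = [
    Strategy("critical", ["on-call phone"], time(0, 0), time(23, 59, 59)),
    Strategy("warning", ["team chat"], time(9, 0), time(18, 0)),
]

print(route(strategies, "critical", time(3, 0)))  # ['on-call phone']
print(route(strategies, "warning", time(3, 0)))   # []
print(route(strategies, "warning", time(10, 0)))  # ['team chat']
```

Critical alerts reach the on-call phone at any hour, while warnings only reach the team chat during working hours.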

4.3 Alert Silencing

Silencing can be applied to specific hosts, rules, or applications, with configurable time periods.

4.4 Event Center

The center enables historical alert query, analysis, and AI‑driven root‑cause investigation.

4.5 Alert Service Self‑Monitoring

The alert service monitors its own health via external metrics stored in a dedicated Metric Store, ensuring the alerting pipeline itself remains observable.

Future Plans

Simpler configuration through intelligent, context‑aware threshold bands, reducing manual setup.

More specific targets by introducing a "scenario" concept for higher‑value business alerts.

More precise decision support, moving from raw alerts to AI‑assisted remediation recommendations.
