
Design, Implementation, and Governance of an Alert Management Platform

The article details Bilibili's alert‑management platform: its background, a comparison of cloud‑platform and self‑built solutions, the closed‑loop design, the distributed architecture, rule configuration, noise reduction, automated root‑cause analysis, and the governance practices that cut median weekly alerts from 1,000 to under 80. It also outlines planned enhancements.

Bilibili Tech

Sep 8, 2023


This article presents a comprehensive overview of an alert (alarm) management platform used at Bilibili, covering its background, design, implementation details, governance practices, and future directions.

Background : Alerts fire when monitoring data crosses a configured threshold or otherwise deviates from its expected range. They are essential for detecting faults early and for meeting stability goals such as the 1‑minute detection target in the 1‑5‑10 reliability model (detect within 1 minute, locate within 5, recover within 10). The complexity comes from covering multiple business lines, regions, and user roles while integrating both internal and external alert sources.

Solution Comparison :

Cloud Platform Solution – Provides multi‑tenant isolation, strong integration, notification, collaboration, and post‑incident analysis capabilities, but configuration efficiency is lower.

Self‑Built (Factory) Solution – Allows one‑click rule configuration, standardized alert packaging, and seamless integration with both IDC and cloud workloads.

Top‑Level Design :

The platform follows a closed‑loop model that includes alert definition, governance, notification, escalation, and feedback. Key components include:

Alert definition & governance stages to improve recall and accuracy.

Alert grading standards to reduce noise.

Operational mechanisms for continuous improvement.

Architecture :

The platform consists of a product layer (rule integration, notification, collaboration, analysis) and an engine layer (distributed alert calculation, notification channels). It stores raw monitoring data and historical alerts, and it exposes APIs for downstream services.

Detailed Implementation :

Alert Rule Integration & Configuration : Supports rule creation for both ToB users (scenario platforms) and ToC users (business developers). The original article illustrates both flows with UI screenshots.

Distributed Alert Calculation : Uses a PromQL‑compatible engine, global and local schedulers, and containerized compute nodes for low latency (30 s interval, detection within 1 min).
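A minimal sketch of how a global scheduler might spread rules across compute nodes. The article does not say how the platform assigns rules, so the rendezvous-hashing scheme, the function names, and the "consecutive breaches" reading of the 1-minute target are all assumptions made for illustration.

```python
import hashlib

EVAL_INTERVAL_S = 30  # evaluation interval quoted in the article


def _score(rule_id: str, node: str) -> int:
    """Deterministic weight for a (rule, node) pair."""
    digest = hashlib.sha256(f"{rule_id}|{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign_rules(rule_ids, nodes):
    """Global scheduler step (assumed): give each rule to exactly one node.

    Rendezvous hashing picks the node with the highest score, so removing
    a node only moves the rules that node owned.
    """
    if not nodes:
        raise ValueError("at least one compute node is required")
    plan = {node: [] for node in nodes}
    for rule_id in rule_ids:
        owner = max(nodes, key=lambda n: _score(rule_id, n))
        plan[owner].append(rule_id)
    return plan


def worst_case_detection_s(consecutive_breaches: int) -> int:
    """Upper bound on rule-evaluation delay, ignoring data collection lag.

    Assumes a rule fires after `consecutive_breaches` evaluations in a row
    and that the fault starts just after an evaluation tick.
    """
    return EVAL_INTERVAL_S * consecutive_breaches
```

Under these assumptions, a rule that needs two consecutive breaching evaluations at a 30 s interval is bounded at 60 s, which is one way to read the "detection within 1 min" figure.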

Alert Integration : Ingests third‑party alerts (cloud monitoring, logs, coredump) and routes them to appropriate groups.
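The ingestion step can be pictured as "normalize, then route". The sketch below is illustrative only: the field mappings, the routing table, and the group names are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # "cloud", "log", or "coredump"
    service: str
    severity: str
    summary: str


# Hypothetical mappings: internal field name -> field name used by each source.
FIELD_MAPS = {
    "cloud": {"service": "resource", "severity": "level", "summary": "message"},
    "log": {"service": "app", "severity": "sev", "summary": "line"},
    "coredump": {"service": "binary", "severity": "sev", "summary": "signal"},
}

# Hypothetical routing table: (source, service, group); None matches anything.
ROUTES = [
    ("coredump", None, "runtime-oncall"),
    (None, "payments", "payments-oncall"),
]
DEFAULT_GROUP = "platform-triage"


def normalize(source: str, payload: dict) -> Alert:
    """Convert a third-party payload into the internal alert shape."""
    mapping = FIELD_MAPS[source]
    fields = {name: str(payload[ext]) for name, ext in mapping.items()}
    return Alert(source=source, **fields)


def route(alert: Alert) -> str:
    """Return the first matching alert group, or the default group."""
    for src, svc, group in ROUTES:
        if src in (None, alert.source) and svc in (None, alert.service):
            return group
    return DEFAULT_GROUP
```

Keeping normalization separate from routing means a new alert source only needs a new field mapping, while routing rules stay source-agnostic.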

Alert Groups : Configurable personnel (static, on‑call, bots) and channels (WeChat, phone, email) with escalation policies.
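An escalation policy of this kind can be modeled as an ordered list of steps, each with a delay, recipients, and a channel. The step timings and recipient names below are invented for the example; the article does not publish the platform's actual policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step applies
    recipients: tuple
    channel: str        # "wechat", "phone", or "email"


# Hypothetical policy: chat first, then a call, then a wider call.
POLICY = (
    EscalationStep(0, ("oncall-primary",), "wechat"),
    EscalationStep(5, ("oncall-primary",), "phone"),
    EscalationStep(15, ("oncall-secondary", "team-lead"), "phone"),
)


def due_steps(policy, minutes_unacked: int, acknowledged: bool):
    """Return every step whose delay has elapsed; none once acknowledged."""
    if acknowledged:
        return []
    return [step for step in policy if step.after_minutes <= minutes_unacked]
```

For example, `due_steps(POLICY, 6, False)` returns the first two steps, and acknowledging the alert stops escalation entirely.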

Notification Templates : Supports text and card formats, customizable per scenario.
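As a rough illustration of per-scenario text and card rendering, the sketch below uses Python's standard `string.Template`. The scenario names, template wording, and card layout are assumptions; the platform's real template format is not described in the article.

```python
from string import Template

# Hypothetical per-scenario text templates with a default fallback.
TEXT_TEMPLATES = {
    "default": Template("[$severity] $service: $summary (value=$value)"),
    "database": Template("[$severity] DB $service: $summary"),
}


def render_text(alert: dict, scenario: str = "default") -> str:
    """Render a one-line text notification for the given scenario."""
    template = TEXT_TEMPLATES.get(scenario, TEXT_TEMPLATES["default"])
    return template.substitute(alert)


def render_card(alert: dict) -> dict:
    """Render a structured card that a chat client could display."""
    return {
        "title": f"[{alert['severity']}] {alert['service']}",
        "fields": [
            {"name": "Summary", "value": alert["summary"]},
            {"name": "Value", "value": str(alert["value"])},
        ],
    }
```

Both renderers read the same alert dictionary, so a scenario can switch between text and card output without changing the alert payload.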

Noise Reduction : Includes merging, suppression, pre‑silence, dynamic interval, and do‑not‑disturb features.
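The features above can be sketched as small, independent functions. The grouping keys, severity levels, backoff parameters, and quiet hours here are illustrative assumptions rather than the platform's actual settings, and "suppression" is read as "a more severe alert hides less severe ones on the same target".

```python
from collections import defaultdict

SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}  # hypothetical levels


def merge_alerts(alerts, keys=("service", "rule")):
    """Merging: group alerts that share the same key fields."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[k] for k in keys)].append(alert)
    return dict(groups)


def suppress_lower_severities(firing):
    """Suppression: keep only the most severe alert per (service, metric)."""
    best = {}
    for alert in firing:
        key = (alert["service"], alert["metric"])
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[alert["severity"]] < SEVERITY_RANK[kept["severity"]]:
            best[key] = alert
    return list(best.values())


def next_interval_s(sent_count: int, base_s: int = 60, factor: int = 2,
                    cap_s: int = 3600) -> int:
    """Dynamic interval: wait longer between repeats of a firing alert."""
    return min(base_s * factor ** sent_count, cap_s)


def in_do_not_disturb(hour: int, start: int = 23, end: int = 7) -> bool:
    """Do-not-disturb: True if `hour` (0-23) is inside the quiet window.

    The window may wrap past midnight, as the default 23:00-07:00 does.
    """
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end
```

With the default parameters, repeat notifications back off from 60 s to 120 s, 240 s, and so on, capped at one hour.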

Alert Handling : Acknowledgement, silencing, one‑click group creation, related information display (trend charts, dashboard links).

Root‑Cause Analysis : Provides automated RCA suggestions using multi‑dimensional metric drilling, trace correlation, and knowledge‑graph scoring.
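To give a feel for multi-dimensional metric drilling, the sketch below ranks dimension values by how much of an error increase they account for. It is a deliberately simplified stand-in: the scoring rule, the field names, and the sample dimensions are assumptions, and it does not cover the trace correlation or knowledge-graph scoring the article mentions.

```python
from collections import defaultdict


def drill_down(rows, dimensions):
    """Rank (dimension, value) pairs by their share of the total increase.

    rows: dicts holding the dimension keys plus "baseline" and "current"
    error counts for one combination of dimension values.
    Returns (dimension, value, share) tuples, largest share first.
    """
    total_delta = sum(r["current"] - r["baseline"] for r in rows)
    if total_delta <= 0:
        return []  # nothing got worse, so there is nothing to explain
    deltas = defaultdict(float)
    for r in rows:
        for dim in dimensions:
            deltas[(dim, r[dim])] += r["current"] - r["baseline"]
    ranked = [(dim, val, d / total_delta) for (dim, val), d in deltas.items()]
    return sorted(ranked, key=lambda item: item[2], reverse=True)
```

If errors rise on every zone but only for one release version, that version's share approaches 1.0 while each zone explains only part of the increase, which points the investigation at the release rather than the infrastructure.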

Analysis & Operations : Search, dashboards, statistical reports, and subscription mechanisms for teams.

Governance Practice :

Problem analysis identified excessive default alerts, over‑notification, inaccurate recipients, lack of operational mechanisms, and incomplete tooling.

Target set: Reduce median alerts from 1,000/week to 80/week.

Actions taken: Data‑driven analysis, rule‑level silencing, personal reject, multi‑threshold suppression, dynamic intervals, subscription integration, dashboards, and optimization of default alert items and recipients.

Results after ~6 months: Median alerts down to 74/week (7.4% of original), total notifications reduced from >3 M to ~300 k (10%), per‑person notifications from 1.6 k to 200 (12.5%).
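The percentages quoted in the results can be checked with a few lines of arithmetic. The figures are the ones reported above; ">3 M" and "~300 k" are treated as exactly 3,000,000 and 300,000, which is an approximation.

```python
def remaining_pct(before: float, after: float) -> float:
    """Return `after` as a percentage of `before`, rounded to one decimal."""
    return round(after / before * 100, 1)


# (before, after) pairs taken from the reported results.
results = {
    "median weekly alerts": (1_000, 74),
    "total notifications": (3_000_000, 300_000),
    "notifications per person": (1_600, 200),
}

for name, (before, after) in results.items():
    print(f"{name}: {remaining_pct(before, after)}% of the original")
```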

Future Outlook : Planned work includes stricter admission criteria for alert integration, regular alert operations reviews, more efficient configuration (including Grafana integration), better root‑cause aggregation, intelligent noise reduction based on service dependencies and semantic similarity, fine‑grained tuning of rules, and reliability and performance guarantees for the platform itself.
