KeMonitor Alert Platform: Systematic Alert Governance and Practices
The article presents a case study of KeMonitor, a one‑stop monitoring and alert platform built by Beike (贝壳找房) to unify fragmented alerts. It covers lifecycle‑based governance, standardized alert metadata, graded subscription, on‑call escalation, silencing, self‑healing, and post‑mortem analysis, all aimed at improving incident response efficiency and reducing alert fatigue.
KeMonitor is a one‑stop monitoring and alert platform developed by Beike (贝壳找房) to address the fragmented alert landscape that existed between 2018 and 2020. At that time, multiple systems (CAT, Prometheus, ELK, SkyWalking, etc.) each had their own alerting capabilities, which forced developers to switch between many tools.
The main problems and challenges were dispersed alert entry points, an excessive number of daily alerts, missed critical alerts, and the lack of systematic SOPs for alert handling.
The solution follows a full‑lifecycle management model that covers the pre‑incident, during‑incident, and post‑incident stages.
Pre‑stage: focuses on alert completeness by defining a health‑score metric and standardizing alert metadata (source, level, scenario, background knowledge, fault‑location steps, mitigation suggestions).
Alert levels: level 0 (critical, service‑wide impact) and level 1 (potentially critical, e.g., MySQL slow queries), with detailed sub‑categories for impact scope.
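The standardized metadata and the two levels described above can be pictured as a small record type. This is a minimal sketch, not KeMonitor's actual schema: the class names, field names, and the `missing_fields` helper are assumptions made for illustration, and only the two levels named in the article are modelled.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class AlertLevel(IntEnum):
    """Only the two levels named in the article are modelled (hypothetical names)."""
    CRITICAL = 0              # service-wide impact
    POTENTIALLY_CRITICAL = 1  # e.g. MySQL slow queries


@dataclass
class AlertMetadata:
    """Hypothetical record holding the metadata fields listed in the article."""
    source: str       # originating system, e.g. "Prometheus" or "CAT"
    level: AlertLevel
    scenario: str
    background: str = ""                                      # background knowledge
    location_steps: List[str] = field(default_factory=list)   # how to locate the fault
    mitigation: List[str] = field(default_factory=list)       # mitigation suggestions

    def missing_fields(self) -> List[str]:
        """Return the names of the handling-guidance fields that are still empty."""
        guidance = {
            "background": self.background,
            "location_steps": self.location_steps,
            "mitigation": self.mitigation,
        }
        return [name for name, value in guidance.items() if not value]


alert = AlertMetadata(
    source="Prometheus",
    level=AlertLevel.POTENTIALLY_CRITICAL,
    scenario="MySQL slow queries",
)
print(alert.missing_fields())
```

A check like `missing_fields` is one way a platform could flag rules that fire without telling the recipient what to do next.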
Completeness measurement: health‑score activities are used to ensure that the required monitoring scenarios are covered.
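The article does not give the health‑score formula. As a rough illustration only, the sketch below assumes the score is the percentage of required monitoring scenarios that have at least one alert rule configured; the function name and the scenario names are hypothetical.

```python
from typing import Set


def health_score(required: Set[str], configured: Set[str]) -> float:
    """Percentage of required scenarios covered by a configured rule.

    Assumed formula for illustration; the real KeMonitor metric may differ.
    """
    if not required:
        return 100.0  # nothing is required, so nothing is missing
    covered = required & configured
    return round(100.0 * len(covered) / len(required), 1)


required = {"cpu", "error_rate", "slow_query", "latency"}
configured = {"cpu", "latency", "disk"}

print(health_score(required, configured))
print(sorted(required - configured))  # scenarios that still need a rule
```

Under this assumption, rules for scenarios outside the required set (here `disk`) do not raise the score, so the metric rewards coverage rather than rule count.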
Subscription (during‑stage) introduces unified alert recipients and delivery methods, graded application groups, and robot reminders to reduce missed alerts.
Unified recipients are mapped to services and departments, ensuring at least two current developers are assigned to each service.
Delivery is tiered: critical alerts go to high‑priority channels, while less urgent alerts are routed to lower‑priority groups.
Robot reminders provide @‑mentions and quick links for follow‑up, mitigation, and resolution actions.
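The subscription rules above (tiered delivery and at least two developers per service) can be sketched as two small functions. The channel names, function names, and sample services are hypothetical, and mapping level 0 to the high‑priority tier is an assumption based on the level definitions in the article.

```python
from typing import Dict, List

# Hypothetical channel names; the article only distinguishes
# high-priority channels from lower-priority groups.
HIGH_PRIORITY_CHANNELS = ["phone_call", "im_high_priority_group"]
LOW_PRIORITY_CHANNELS = ["im_low_priority_group"]


def delivery_channels(level: int) -> List[str]:
    """Route level-0 (critical) alerts to the high-priority tier, others lower."""
    if level == 0:
        return list(HIGH_PRIORITY_CHANNELS)
    return list(LOW_PRIORITY_CHANNELS)


def understaffed_services(recipients: Dict[str, List[str]], minimum: int = 2) -> List[str]:
    """Return services with fewer than `minimum` distinct assigned developers."""
    return sorted(
        service
        for service, developers in recipients.items()
        if len(set(developers)) < minimum
    )


recipients = {"order-service": ["alice", "bob"], "search-service": ["carol"]}
print(delivery_channels(0))
print(understaffed_services(recipients))
```

Running the staffing check periodically is one way to catch services whose recipients have left the team before an alert goes unanswered.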
On‑call escalation implements multi‑level escalation (first‑line on‑call, manager, director) with a default 3‑minute response window at each level.
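The escalation chain can be illustrated with a short function that reports who should have been notified after an alert has gone unacknowledged for a given time. This is a sketch under one assumption not stated in the article: the 3‑minute windows run back to back, so the manager is notified at 3 minutes and the director at 6. The names are hypothetical.

```python
from typing import List

ESCALATION_CHAIN = ["first_line_on_call", "manager", "director"]
RESPONSE_WINDOW_MINUTES = 3  # default window at each level, per the article


def notified_levels(minutes_unacknowledged: float,
                    window: float = RESPONSE_WINDOW_MINUTES) -> List[str]:
    """Return every level notified so far for an unacknowledged alert."""
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed time cannot be negative")
    if window <= 0:
        raise ValueError("window must be positive")
    # One level is notified immediately, then one more per elapsed window.
    steps = int(minutes_unacknowledged // window) + 1
    return ESCALATION_CHAIN[:min(steps, len(ESCALATION_CHAIN))]


for minutes in (0, 2.5, 3, 6, 30):
    print(minutes, notified_levels(minutes))
```

Once the top of the chain is reached, the list simply stops growing; what happens after the director's window expires is not described in the article.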
Alert silencing and self‑healing allow temporary or permanent suppression of specific alerts (e.g., during maintenance) and embed quick‑action links for common remediation steps such as restart, circuit breaking, or service degradation.
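Temporary and permanent silencing can be modelled with a rule that carries an optional end time. The sketch below covers only the silencing half of this paragraph; the `SilenceRule` type, its fields, and the service and alert names are assumptions for illustration, not KeMonitor's data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class SilenceRule:
    """Hypothetical silence rule; `end=None` means a permanent silence."""
    service: str
    alert_name: str
    start: datetime
    end: Optional[datetime] = None

    def matches(self, service: str, alert_name: str, at: datetime) -> bool:
        """True if this rule suppresses the given alert at time `at`."""
        if (service, alert_name) != (self.service, self.alert_name):
            return False
        if at < self.start:
            return False
        # The window is half-open: the end time itself is no longer silenced.
        return self.end is None or at < self.end


def is_silenced(rules: Iterable[SilenceRule], service: str,
                alert_name: str, at: datetime) -> bool:
    """True if any rule suppresses the alert at the given time."""
    return any(rule.matches(service, alert_name, at) for rule in rules)


rules = [
    # Temporary: a two-hour maintenance window.
    SilenceRule("order-service", "cpu_high",
                datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    # Permanent: no end time.
    SilenceRule("search-service", "disk_full", datetime(2024, 1, 1)),
]
print(is_silenced(rules, "order-service", "cpu_high", datetime(2024, 1, 1, 1, 0)))
```

Keeping the end time optional lets one rule type serve both maintenance windows and long‑lived suppressions.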
Post‑incident review aggregates alert histories, enables batch processing, and records resolution conclusions (e.g., "false alarm", "external fault", "planned", "resolved"). This supports systematic analysis of long‑term trends and helps refine monitoring rules.
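One way to use recorded conclusions for rule refinement is to count them per rule and flag rules that mostly produce false alarms. This is a minimal sketch: the conclusion labels come from the article, while the function names, the sample history, and the 50% threshold are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Each history entry is (rule_name, conclusion).
History = Iterable[Tuple[str, str]]


def conclusion_summary(history: History) -> Dict[str, Counter]:
    """Count resolution conclusions per alert rule."""
    per_rule: Dict[str, Counter] = defaultdict(Counter)
    for rule, conclusion in history:
        per_rule[rule][conclusion] += 1
    return per_rule


def noisy_rules(history: History, threshold: float = 0.5) -> List[str]:
    """Rules whose share of "false alarm" conclusions meets the threshold."""
    summary = conclusion_summary(history)
    return sorted(
        rule
        for rule, counts in summary.items()
        if counts["false alarm"] / sum(counts.values()) >= threshold
    )


history = [
    ("cpu_high", "false alarm"),
    ("cpu_high", "false alarm"),
    ("cpu_high", "resolved"),
    ("slow_query", "resolved"),
    ("slow_query", "planned"),
]
print(noisy_rules(history))
```

Rules flagged this way are candidates for threshold tuning or silencing, which is the kind of refinement the review stage is meant to support.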
Self‑service integration offers one‑click import of alert rules and end‑to‑end configuration, reducing onboarding time from days to minutes.
Conclusion: By unifying alert reception, standardizing metadata, providing graded delivery, robot‑assisted SOPs, and comprehensive post‑mortem analysis, KeMonitor significantly reduces alert fatigue, improves response efficiency, and lays the groundwork for future template‑driven, "no‑code" monitoring capabilities.
Beike Product & Technology