My Philosophy on Alerting: Principles for Effective Monitoring and Incident Management

This article translates and expands on an essay drawn from the author’s seven years of experience with monitoring and alerting. It presents symptom‑based principles, practical guidelines for rule design and incident handling, and operational processes for building a robust, low‑noise alerting system.

High Availability Architecture

Nov 6, 2020

The author shares a translated version of the seminal essay “My Philosophy on Alerting,” outlining seven years of experience in building monitoring and alerting systems and offering a set of guiding principles for designing effective alerts.

Key principles for alert rules include:

Alerts must be urgent, important, actionable, and truthful.

Rules should indicate that a service is experiencing or about to experience a problem.

Prioritize removing noise; over‑monitoring is harder to resolve than under‑monitoring.

Classify problems into availability, latency, correctness (integrity, freshness, durability), and functional issues.

Describe symptoms rather than causes to capture issues more comprehensively.

Include cause information on symptom‑based dashboards but avoid alerting directly on causes.

Higher‑level alerts can cover broader issues but should not be so generic that they lose diagnostic value.

Implement automated processes to handle low‑priority alerts quietly during on‑call periods.
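As a rough illustration of the principles above, the sketch below routes a rule to a pager only when it is urgent, important, actionable, and truthful, and sends everything else to quieter channels. The `AlertRule` fields, the `Route` values, and the routing policy itself are assumptions made for this example; they are not taken from the original essay or from any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake a human now
    TICKET = "ticket"  # handle quietly during working hours
    REVIEW = "review"  # rule needs fixing or retiring before it alerts anyone


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical description of a rule; all four flags are judgment calls."""
    name: str
    urgent: bool
    important: bool
    actionable: bool
    truthful: bool  # the rule fires only when the problem is real


def route(rule: AlertRule) -> Route:
    # A rule nobody can act on, or that often fires falsely, is noise:
    # send it back for review instead of delivering it to a person.
    if not (rule.actionable and rule.truthful):
        return Route.REVIEW
    # Only rules that satisfy every criterion are allowed to page.
    if rule.urgent and rule.important:
        return Route.PAGE
    # Real and actionable, but it can wait: file it quietly.
    return Route.TICKET


if __name__ == "__main__":
    rule = AlertRule("TooMany500StatusCodes", True, True, True, True)
    print(rule.name, "->", route(rule).value)
```

Keeping the policy in one small function makes it easy to audit which rules are allowed to interrupt the person on call.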

Introduction

After seven years of supporting a variety of services, from small to large and from fast‑moving products to core infrastructure, the author has formed a philosophy that emphasizes actionable, low‑noise alerts and the importance of human judgment.

User‑Centric Monitoring

The author advocates “symptom‑based monitoring” rather than “cause‑based monitoring,” stressing that users care about outcomes (e.g., query failures, missing data) rather than internal components like MySQL.

Basic availability and correctness: no 500 errors, no hanging requests, no missing assets.

Latency: responses must be fast.

Data integrity/freshness/durability: user data should be safe and up‑to‑date.

Functionality: all critical features must work.
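The first two symptom classes above can be checked directly from what users experience. The following is a minimal sketch, assuming a window of request records is available in memory; the `Request` type, the function names, and both thresholds (1% errors, 0.5 s at the 95th percentile) are placeholders chosen for illustration, not values from the essay.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Request:
    status: int       # HTTP status code returned to the user
    latency_s: float  # time the user waited, in seconds


def error_ratio(requests: Sequence[Request]) -> float:
    """Fraction of requests that ended in a 5xx response."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if 500 <= r.status <= 599)
    return failures / len(requests)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; raises ValueError on empty input."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def firing_symptoms(
    requests: Sequence[Request],
    max_error_ratio: float = 0.01,    # assumed threshold
    max_p95_latency_s: float = 0.5,   # assumed threshold
) -> List[str]:
    """Return the names of the user-visible symptoms present in the window."""
    symptoms: List[str] = []
    if not requests:
        return symptoms
    if error_ratio(requests) > max_error_ratio:
        symptoms.append("availability")
    if percentile([r.latency_s for r in requests], 95) > max_p95_latency_s:
        symptoms.append("latency")
    return symptoms
```

Nothing in this check mentions a database or any other internal component: it fires on what the user sees, whatever the cause turns out to be.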

Why Cause‑Based Alerts Are Often Poor

Focusing on causes can create redundant alerts, increase false positives, and add maintenance overhead. However, cause‑based alerts are sometimes necessary for pre‑emptive issues like nearing quota limits.
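A quota warning is one case where alerting on a cause before any symptom appears can make sense. The sketch below assumes usage grows roughly linearly and warns when the projected time to exhaustion falls inside a lead time; the linear model, the function names, and the seven-day default are illustrative assumptions only.

```python
def days_until_quota_exhausted(used: float, limit: float, growth_per_day: float) -> float:
    """Linear projection of how many days remain before usage reaches the limit."""
    if used >= limit:
        return 0.0
    if growth_per_day <= 0:
        # Usage is flat or shrinking, so the limit is never reached.
        return float("inf")
    return (limit - used) / growth_per_day


def should_warn_about_quota(
    used: float,
    limit: float,
    growth_per_day: float,
    lead_time_days: float = 7.0,  # assumed time needed to request more quota
) -> bool:
    """True when the projected exhaustion date is within the lead time."""
    return days_until_quota_exhausted(used, limit, growth_per_day) <= lead_time_days


if __name__ == "__main__":
    # 900 of 1000 units used, growing by 20 per day: 5 days left.
    print(days_until_quota_exhausted(900, 1000, 20))
    print(should_warn_about_quota(900, 1000, 20))
```

Because nothing is broken yet, a warning like this is usually better delivered as a ticket than as a page.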

Alerting from the Edge

Client‑side metrics provide valuable insight into user‑perceived latency and errors, often offering a more robust view than server‑side metrics alone.

When Causes Are Useful

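Cause‑based rules can help pinpoint known defects quickly, provided symptom rules already cover the user‑visible problem. Rather than paging on their own, the cause‑based rules that are currently firing should appear as a concise summary inside the symptom alert’s message, for example: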

TooMany500StatusCodes
Served 10.7% 5xx results in the last 3 minutes!
Also firing:
JanitorProcessNotKeepingUp
UserDatabaseShardDown
FreshnessIndexBehind
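A message like the one above could be assembled by a small helper that appends the currently firing cause rules to the symptom alert. This is only a sketch: `format_alert` and its parameters are hypothetical names, and how the list of firing cause rules is obtained depends on the monitoring system in use.

```python
from typing import Sequence


def format_alert(symptom: str, detail: str, firing_causes: Sequence[str]) -> str:
    """Build one message: the symptom first, then any cause rules that are firing."""
    lines = [symptom, detail]
    if firing_causes:
        # Causes are context for the responder, not separate pages.
        lines.append("Also firing:")
        lines.extend(firing_causes)
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_alert(
            "TooMany500StatusCodes",
            "Served 10.7% 5xx results in the last 3 minutes!",
            ["JanitorProcessNotKeepingUp", "UserDatabaseShardDown", "FreshnessIndexBehind"],
        )
    )
```

The responder receives a single page for the symptom and can see the likely causes at a glance, instead of being paged once per cause.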

Operational Playbooks

Each symptom‑based alert should have a corresponding playbook entry describing its meaning and remediation steps.

Tracking and Accountability

All alerts must be tracked; low‑accuracy alerts (<50% true positives) should be retired or downgraded, and regular reviews should be conducted.
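The review described above can be approximated from a log of past alerts, each marked as real or not. The sketch below assumes such a log exists as `(rule_name, was_real)` pairs; the function name, the data shape, and the verdict strings are invented for this example, while the 50% cut-off comes from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def review_alert_accuracy(
    history: Iterable[Tuple[str, bool]],
    min_precision: float = 0.5,
) -> Dict[str, str]:
    """Map each rule to 'keep' or 'retire_or_downgrade' based on its true-positive rate."""
    fired: Dict[str, int] = defaultdict(int)
    real: Dict[str, int] = defaultdict(int)
    for rule_name, was_real in history:
        fired[rule_name] += 1
        if was_real:
            real[rule_name] += 1

    verdicts: Dict[str, str] = {}
    for rule_name, count in fired.items():
        precision = real[rule_name] / count
        # Rules below the cut-off are candidates for removal or a lower priority.
        verdicts[rule_name] = "keep" if precision >= min_precision else "retire_or_downgrade"
    return verdicts
```

Running a check like this as part of a regular review keeps the decision to retire a rule tied to recorded outcomes rather than to memory of the last noisy night.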

Exceptions to the Rules

The author acknowledges scenarios where strict symptom‑based principles may be relaxed, such as rare but critical cause‑based alerts or when monitoring granularity is limited.

Conclusion

By adopting symptom‑focused, low‑noise alerting, maintaining clear operational procedures, and ensuring accountability, teams can reduce alert fatigue and improve incident response effectiveness.


