Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains the fundamental flaws of typical alert systems, distinguishes between business rule and reliability monitoring, outlines essential metrics and strategies for effective alerts, and presents simple yet powerful anomaly‑detection algorithms to ensure alerts are actionable and reduce noise.

Programmer DD

Jun 7, 2019

Nature of Alerts

Most system alerts are poorly designed. A good alert must be actionable: whoever receives it should be able to assess the impact immediately and respond in proportion to its severity. A simple threshold such as CPU > 90% rarely meets that bar.

Alert Targets

Alerts can be divided into two categories:

Business rule monitoring – ensures software behaves according to business constraints (e.g., detecting cheating in games).

System reliability monitoring – checks hardware and service health (e.g., server crashes, overload).

Metrics and Strategy

Effective monitoring answers three questions:

Is the work getting done? (request volume and success rate)

Is the user having a good experience? (response latency)

Where is the problem or bottleneck?

For databases, key metrics include CPU usage, network bandwidth, request count, response count, error response count, and request latency, but the most informative are absolute request volume and the proportion of successful responses.
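As a minimal sketch of the two metrics singled out above, the following Python computes request volume and success rate for one time window and checks them against static thresholds. The `WindowCounts` type, the `evaluate` function, and the threshold values are hypothetical and chosen only for illustration; they do not come from the original article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Counters for one time window (hypothetical shape, e.g. one minute)."""
    requests: int
    successes: int


def success_rate(w: WindowCounts) -> float:
    """Proportion of requests answered successfully; 1.0 when there is no traffic."""
    return w.successes / w.requests if w.requests else 1.0


def evaluate(w: WindowCounts,
             min_requests: int = 100,          # assumed threshold
             min_success_rate: float = 0.99    # assumed threshold
             ) -> list[str]:
    """Return the alert reasons for this window; an empty list means healthy."""
    reasons = []
    if w.requests < min_requests:
        reasons.append(f"request volume {w.requests} below {min_requests}")
    if success_rate(w) < min_success_rate:
        reasons.append(
            f"success rate {success_rate(w):.2%} below {min_success_rate:.2%}"
        )
    return reasons
```

An idle window is reported through the volume check rather than the success-rate check, which is why `success_rate` returns 1.0 when no requests arrived.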

Because services depend on each other, a four‑layer hierarchy is useful:

Product strategy & marketing – determines request arrival rate.

Application layer (web layer) – glue code.

Service layer – databases, RPC services, etc.

Hardware layer – CPU, memory, disk, network.

Tracking a single metric in isolation is insufficient; the dependency chain means that a drop in one layer may be caused by upstream demand changes.
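One way to illustrate this, assuming request counts are available for both an upstream layer and the service layer, is to compare their relative changes before deciding whether a drop is a fault. The function name `classify_change` and the `tolerance` value are assumptions made for this sketch.

```python
def classify_change(upstream_before: float, upstream_now: float,
                    service_before: float, service_now: float,
                    tolerance: float = 0.1) -> str:
    """Compare the relative change in a service layer with its upstream demand.

    `tolerance` is an assumed allowance for normal divergence between layers.
    """
    if upstream_before <= 0 or service_before <= 0:
        return "unknown"  # no usable reference point
    upstream_change = upstream_now / upstream_before
    service_change = service_now / service_before
    if abs(service_change - upstream_change) <= tolerance:
        return "follows upstream demand"
    if service_change < upstream_change:
        return "drop not explained by demand"
    return "rise not explained by demand"
```

If upstream arrivals halve and the service layer's requests halve with them, the drop follows demand; if upstream is flat while the service layer halves, the cause is more likely inside the service layer or below it.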

Theory vs. Reality

Complex algorithms are rarely needed if the right metrics are collected; static thresholds often suffice. However, three situations require algorithms:

When error counts cannot be directly collected (automatic log classification).

When success rates are unavailable (anomaly detection on request/response counts).

When only aggregate totals are available (factor fitting).

Anomaly Detection

Four intuitive approaches are discussed:

Curve smoothness – detect sudden deviations from recent trends using moving averages or regression.

Absolute value periodicity – compare current values against a baseline derived from historical minima (e.g., 0.6 × minimum of past 14 days).

Amplitude periodicity – analyze differences between consecutive points (or longer intervals) to capture rapid changes.

Curve rebound detection – confirm faults when a dip is followed by a clear recovery.
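The first two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's own implementation: the function names and the `max_drop` value are hypothetical, and the history passed to `below_historical_floor` is assumed to hold one sample per day, taken at the same time of day, for the past 14 days. The 0.6 factor is the example value from the list.

```python
from __future__ import annotations

from statistics import mean


def deviates_from_trend(recent: list[float], current: float,
                        max_drop: float = 0.3) -> bool:
    """Curve smoothness: True when `current` falls more than `max_drop`
    (as a fraction) below the moving average of `recent`."""
    baseline = mean(recent)
    return baseline > 0 and current < baseline * (1 - max_drop)


def below_historical_floor(same_time_history: list[float], current: float,
                           factor: float = 0.6) -> bool:
    """Absolute value periodicity: True when `current` is below `factor`
    times the minimum of the historical samples."""
    return current < factor * min(same_time_history)
```

For example, with recent values averaging 100, a reading of 60 is flagged by `deviates_from_trend` while a reading of 95 is not. With a historical minimum of 80, `below_historical_floor` flags anything under 48.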

Each method has strengths and weaknesses; combining multiple signals reduces false positives.
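A rough sketch of combining two signals, assuming a short window of recent samples: the amplitude check looks at the fall between points a fixed lag apart, and the rebound check looks for a low point flanked by clearly higher values. The names `largest_drop`, `has_rebound`, `confirmed_fault`, and the `dip_ratio` value are hypothetical, and `historical_max_drop` stands for a figure that would be derived from past data.

```python
from __future__ import annotations


def largest_drop(series: list[float], lag: int = 1) -> float:
    """Amplitude: largest fall between points `lag` samples apart (0.0 if none)."""
    drops = [series[i - lag] - series[i] for i in range(lag, len(series))]
    return max([0.0] + drops)


def has_rebound(series: list[float], dip_ratio: float = 0.7) -> bool:
    """Rebound: True when the lowest interior point sits below `dip_ratio`
    times both the first and the last value of the window."""
    if len(series) < 3:
        return False
    low = min(series[1:-1])
    return low < dip_ratio * series[0] and low < dip_ratio * series[-1]


def confirmed_fault(series: list[float], historical_max_drop: float,
                    lag: int = 1) -> bool:
    """Combine signals: a drop steeper than any seen historically AND a rebound."""
    return largest_drop(series, lag) > historical_max_drop and has_rebound(series)
```

Requiring both signals trades speed for precision: a rebound can only be seen after recovery, so this combination suits confirming an incident rather than paging at its first moment.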

Summary

High‑quality alerts must be actionable.

Do not let data‑collection difficulty dictate metric choice.

Avoid defaulting to CPU usage alerts; focus on request volume and success rate.

Collect the right metrics, and simple static thresholds often suffice.

When needed, anomaly‑detection algorithms are straightforward to implement.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
