Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains why many system alerts are poorly designed, describes the true purpose of alerts as actionable notifications, distinguishes business rule monitoring from reliability monitoring, and presents practical metrics, strategies, and simple anomaly‑detection algorithms to create high‑quality, actionable alerts for reliable operations.

21CTO

The Essence of Alerts

Few alerts are well designed, and designing good ones is hard. Poorly designed alerts flood operators, who learn to dismiss them on sight. A typical rule‑based alert such as "CPU usage > 90%" rarely carries high‑quality information. A high‑quality alert lets the operator assess the scope of impact immediately and calls for a response graded to its severity; in other words, it must be actionable.

Essentially, an alert treats a human as a service: when automation cannot handle a situation, an alert notifies a person to intervene.

If an alert is triggered but requires no action, it becomes a DDoS attack on the operator’s sanity.

Many problems that trigger alerts can instead be handled entirely by automation, such as replacing a failed server.

In small systems, a manual failover to a cold standby may suffice; larger systems need hot standby and automatic failover; very large systems require fully automated pipelines.

Alert Targets

Alert targets fall into two categories:

Business rule monitoring

System reliability monitoring

Business rule monitoring examples include game cheating detection, such as limiting damage output or win‑rate thresholds. It checks whether software follows business rules, i.e., correctness.

System reliability monitoring checks hardware or service health, like server crashes or overloads.

Typical backend services can be modeled as layered dependencies: product/marketing → application layer → service layer (databases, RPC services) → hardware layer (CPU, memory, disk, network).

Metrics at each layer include request rate, success rate, error rate, latency, queue length, etc. Because services depend on each other, a single metric rarely isolates a fault; multiple related metrics must be considered.

Monitoring Metrics and Strategies

Effective monitoring aims to answer three questions:

Is the work getting done?

Is the user having a good experience?

Where is the problem or bottleneck?

For databases, the most relevant metrics are absolute request count and the proportion of successful responses, rather than raw CPU usage.

Similar request‑count and success‑rate metrics apply to other services.
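
The "is the work getting done?" check can be sketched with a static threshold on those two numbers. This is a minimal illustration, not a prescribed implementation: the WindowStats shape, the function name, and the default thresholds (100 requests, 99% success) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts for one service over one evaluation window (hypothetical shape)."""
    requests: int
    successes: int


def work_is_getting_done(stats: WindowStats,
                         min_requests: int = 100,
                         min_success_rate: float = 0.99) -> bool:
    """Return True when both request volume and success ratio look healthy.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if stats.requests == 0 or stats.requests < min_requests:
        # Too little traffic: callers may be unable to reach the service at all.
        return False
    return stats.successes / stats.requests >= min_success_rate
```

Checking the absolute request count alongside the ratio matters because a service that receives no traffic can still report a perfect success rate.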

For user experience, monitor average queue time, total latency, and percentile latency values.
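
To show why percentile values are worth tracking next to the average, here is a small sketch that summarizes a batch of latency samples. The nearest-rank percentile method and the function names are choices made for this example; the article does not specify how percentiles should be computed.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Summarize latency samples (milliseconds) as mean, median, and 99th percentile."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


# 98 fast requests and 2 very slow ones: the mean looks fine, the tail does not.
summary = latency_summary([10] * 98 + [1000] * 2)
```

In this sample the mean is about 30 ms while the 99th percentile is 1000 ms, so an alert on the mean alone would hide the slow requests that some users actually experience.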

Fault localization should be automated: each layer generates its own alerts, top‑level alerts trigger automatic correlation across dependencies, and the system pinpoints the root cause.
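
One simple way to picture that correlation step: starting from the alerting top-level service, follow the dependency graph downward and report the alerting services that have no alerting dependency of their own. The sketch below assumes the dependency map and the set of currently alerting services are already available; both data shapes and the function name are hypothetical.

```python
def locate_root_causes(top, depends_on, alerting):
    """Return alerting services reachable from `top` whose own dependencies are all healthy.

    top        -- name of the top-level service that raised the alert
    depends_on -- dict mapping a service to the services it calls (assumed input)
    alerting   -- set of services that currently have an active alert (assumed input)
    """
    roots, seen, stack = [], set(), [top]
    while stack:
        service = stack.pop()
        if service in seen or service not in alerting:
            continue
        seen.add(service)
        bad_deps = [d for d in depends_on.get(service, []) if d in alerting]
        if bad_deps:
            # The fault is probably further down; keep descending.
            stack.extend(bad_deps)
        else:
            # Nothing below this service is alerting, so treat it as a likely root cause.
            roots.append(service)
    return sorted(roots)


graph = {"app": ["rpc", "db"], "rpc": ["db"], "db": ["disk"]}
causes = locate_root_causes("app", graph, alerting={"app", "rpc", "db"})
```

Here the application, the RPC service, and the database are all alerting, but only the database has no alerting dependency, so it is the one reported.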

Common pitfalls include using easy‑to‑collect metrics (e.g., CPU) instead of meaningful ones, and blindly following operator‑requested alerts rather than true business‑oriented alerts.

Theory vs Reality

Complex algorithms are unnecessary when the right metrics are collected; static thresholds often suffice. However, algorithms become necessary when:

Errors cannot be directly counted and require log classification.

Success rates are not directly measurable and need anomaly detection on request/response counts.

Only aggregate totals are available, requiring factor fitting.

Log classification and advanced analytics (e.g., Sumo Logic) help when raw metrics are insufficient.

Anomaly Detection

When only request counts are available without reference thresholds, detection must rely on pattern changes.

Four intuitive approaches to curve analysis:

Curve smoothness – faults break recent trends.

Absolute value periodicity – curves from the same time of day on different days overlap closely.

Amplitude periodicity – the curves share similar wave patterns even when their baselines differ.

Presence of a long dip – a sustained drop indicates a fault.

These insights lead to simple algorithms without heavy mathematics.

Curve Smoothness Detection

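Using a recent time window (e.g., one hour), compute an exponentially weighted moving average (EWMA):

s_t = α·x_t + (1 − α)·s_{t−1}

Here x_t is the newest sample, s_t is the smoothed value, and α (between 0 and 1) controls how quickly old samples are forgotten. Then calculate the variance of the points around the EWMA; if a new point falls outside bounds derived from that variance, raise an alert. This method is sensitive to sudden changes but can miss faults that develop gradually, because the average drifts along with them.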
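
A minimal sketch of such a detector follows. The article does not fix the smoothing factor or the width of the bounds, so alpha=0.3, a band of k=3 standard deviations, and a 10-point warm-up are assumptions; the class name and the incremental variance update are also choices made for this example.

```python
import math


class EwmaDetector:
    """Flag points that fall outside mean ± k·std of an exponentially weighted average."""

    def __init__(self, alpha=0.3, k=3.0, warmup=10):
        # alpha, k, and warmup are illustrative defaults, not values from the article.
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Return True if x is anomalous, then fold x into the running state."""
        if self.mean is None:
            self.mean, self.n = float(x), 1
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        anomalous = self.n >= self.warmup and abs(deviation) > self.k * std
        # Equivalent to s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
        self.mean += self.alpha * deviation
        # Exponentially weighted variance around the moving average.
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.n += 1
        return anomalous


detector = EwmaDetector()
steady = [detector.update(x) for x in [100, 102, 98, 101, 99] * 4]
sudden_drop = detector.update(40)
```

With this input the steady, slightly noisy series raises no flags, while the sudden drop to 40 does. Note that the anomalous point is still folded into the state; a production detector might choose to exclude it.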

Absolute Value Time Periodicity

Leverage daily patterns: for each minute of the day, take the lowest value recorded at that minute over the past 14 days, multiply it by a factor (e.g., 0.6), and use the result as a dynamic lower bound. Alert when the current value falls below this bound. To keep a past incident from contaminating the baseline, use the second‑smallest value rather than the absolute minimum.
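
The rule above can be sketched in a few lines. The example assumes the history is held as one list per day with one value per minute of the day; that layout and the function names are hypothetical.

```python
def lower_bound(history, minute, factor=0.6):
    """Dynamic lower bound for one minute of the day.

    history -- list of per-day lists, one value per minute of the day
               (e.g. the past 14 days); this layout is an assumption
    minute  -- index of the minute within the day
    factor  -- safety margin applied to the historical low
    """
    values = sorted(day[minute] for day in history)
    # Use the second-smallest value so a single past incident does not
    # drag the baseline down.
    baseline = values[1] if len(values) > 1 else values[0]
    return baseline * factor


def is_below_baseline(current, history, minute, factor=0.6):
    """Return True when the current value falls under the dynamic lower bound."""
    return current < lower_bound(history, minute, factor)


# 14 days of flat traffic at 1000 requests/minute, with one past incident at minute 600.
history = [[1000] * 1440 for _ in range(14)]
history[3][600] = 50
```

With this history the bound at minute 600 is about 600 rather than 30, because the one-off incident value of 50 is skipped.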

Amplitude‑Based Time Periodicity

Instead of raw values, examine changes: the absolute change Δx = x(t) − x(t−1) or the relative change Δx / x(t−1). Compare these deltas against percentiles of the historical deltas to detect abnormal spikes or drops. This check is more sensitive than a threshold on absolute values.
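
As a rough sketch of the delta comparison, the code below computes relative changes from a historical series and flags a new change that falls outside the 1st to 99th percentile band. The percentile cut-offs and function names are assumptions for illustration; percentiles come from Python's standard statistics.quantiles.

```python
import statistics


def relative_changes(series):
    """Relative change between consecutive points, skipping zero denominators."""
    return [(b - a) / a for a, b in zip(series, series[1:]) if a != 0]


def is_abnormal_change(prev, curr, history_changes, low_p=1, high_p=99):
    """Return True when the change prev -> curr is outside the historical band.

    low_p and high_p are percentile cut-offs (1..99); the defaults are
    illustrative assumptions, not values from the article.
    """
    if prev == 0:
        return False  # relative change is undefined
    change = (curr - prev) / prev
    cuts = statistics.quantiles(history_changes, n=100)  # 99 cut points
    return change < cuts[low_p - 1] or change > cuts[high_p - 1]


# A gently oscillating history: changes stay within roughly ±5%.
past = relative_changes([100, 105, 100, 95, 100] * 50)
```

Against this history, a move from 100 to 103 is unremarkable, while a drop from 100 to 60 is far outside anything seen before and is flagged.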

Anomaly Detection via Curve Recovery

Observe whether a dip is followed by a clear recovery. A rapid rebound after a dip strongly indicates a fault, but such detection is less useful for real‑time alerting because the issue may already be resolved.
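
For after-the-fact confirmation, a dip-and-recovery scan might look like the sketch below. It treats a dip as a run of points below a fraction of the last pre-dip value that later climbs back above that level; the 50% drop ratio, the minimum length of 3 points, and the function name are assumptions made for the example.

```python
def find_recovered_dips(series, drop=0.5, min_len=3):
    """Return (start, end) index pairs of dips that were followed by a recovery.

    A dip starts when a point falls below drop * (previous point) and ends at
    the last point before the series climbs back above that level. Dips that
    never recover within the series, or are shorter than min_len, are ignored.
    """
    dips = []
    i = 1
    while i < len(series):
        reference = series[i - 1]
        if reference > 0 and series[i] < drop * reference:
            j = i
            while j < len(series) and series[j] < drop * reference:
                j += 1
            # j == len(series) means the curve has not recovered yet.
            if j < len(series) and j - i >= min_len:
                dips.append((i, j - 1))
            i = j + 1
        else:
            i += 1
    return dips


dips = find_recovered_dips([100, 100, 100, 20, 15, 10, 20, 100, 100])
```

In this series the scan reports one dip spanning indexes 3 to 6. A dip that is still in progress at the end of the series is not reported, which is exactly why this check suits confirmation rather than real-time alerting.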

Combining multiple detection methods reduces false positives and improves fault confirmation.
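
One simple way to combine detectors is a vote: confirm a fault only when several independent checks agree. The sketch below assumes each detector has already produced a boolean; the detector names and the two-vote threshold are hypothetical.

```python
def confirm_fault(signals, min_agree=2):
    """Return (confirmed, fired) for a set of detector results.

    signals   -- dict mapping detector name to True/False (assumed input shape)
    min_agree -- number of detectors that must fire to confirm a fault
    """
    fired = sorted(name for name, hit in signals.items() if hit)
    return len(fired) >= min_agree, fired


confirmed, fired = confirm_fault({"ewma": True, "baseline": True, "delta": False})
```

Here two of the three detectors fire, so the fault is confirmed. Raising min_agree trades sensitivity for fewer false positives.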

Key Takeaways

High‑quality alerts must be actionable.

Do not choose metrics based on collection ease; prioritize business‑relevant signals.

Avoid blindly adopting alerts that operators request; judge each one by the value it delivers, and in particular avoid over‑reliance on CPU usage.

Work‑completion metrics: request count + success rate.

User‑experience metrics: response latency.

When the right metrics are collected, simple static thresholds often suffice.

Simple anomaly‑detection algorithms (EWMA, dynamic baselines, delta analysis) are effective and easy to implement.

Author: Wen Tao