Understanding Prometheus Alerting: When Alerts Fire and Why They May Not
This article offers a clear, beginner-friendly explanation of the principles behind Prometheus alerts: when they trigger, why they sometimes stay silent, and how Alertmanager's routing tree and notification pipeline work together to manage alert noise through grouping, silencing, and deduplication.
Alerts are an essential yet complex part of any monitoring system. When an abnormal condition occurs, an alert should notify the appropriate person or channel, but in practice many issues arise:

- Transient metric spikes that disappear quickly.
- Too many alerts causing fatigue and overload.
- Thresholds set too sensitively.
- Unclear alert meanings.
- Massive bursts of alerts during network outages.
These problems show that alerting is more than a simple compute‑plus‑notify task; organizational and management practices are also required.
Alert rule example (with the `for` parameter):

```yaml
- alert: KubeAPILatencyHigh
  annotations:
    message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
  expr: |
    cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
  for: 10m
  labels:
    severity: critical
```

This rule fires when the kube-apiserver 99th-percentile latency exceeds 4 seconds for a continuous period of at least 10 minutes.
The for field defines a *Pending Duration* that helps filter out short‑lived spikes, ensuring that only sustained problems reach on‑call personnel.
Even if a metric crosses the threshold, the alert will not fire if the duration is insufficient, as illustrated by the accompanying diagrams (omitted here).
Two common questions are explored:

1. Why does an alert not fire despite a prolonged threshold breach?
2. Why does an alert fire when the metric does not appear to exceed the threshold?
Both stem from Prometheus's storage and evaluation model. Metrics are scraped at a configured `scrape_interval`, producing sparse (timestamp, value) samples. Alert rules are evaluated at fixed intervals, generating sparse evaluation points. Grafana's range queries may sample at different steps, leading to apparent mismatches between what the alert rule "sees" and what a graph displays.
The article walks through a timeline example showing how sampling points at 40 s, 80 s, 120 s, and 160 s affect alert state transitions from Pending to Firing and finally to resolved.
To mitigate these issues, the author suggests:

- Accepting that Prometheus provides approximate data due to sparse sampling.
- Inspecting the built-in `ALERTS` metric to see the full lifecycle of each alert.
- Using recording rules to materialize the computed values as new metrics, then alerting on those stable series.
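Concretely, the lifecycle of the rule above can be inspected by querying `ALERTS{alertname="KubeAPILatencyHigh"}`, whose `alertstate` label distinguishes `pending` from `firing`. A recording rule for the latency series might look roughly like the following; the `histogram_quantile` expression here is an assumption inferred from the recorded metric's name, not taken from the article:

```yaml
groups:
  - name: apiserver.rules
    rules:
      # Materialize the 99th-percentile latency as its own series so the
      # alert rule evaluates a precomputed, stable metric.
      - record: cluster_quantile:apiserver_request_latencies:histogram_quantile
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) by (le, verb, resource, subresource))
        labels:
          quantile: "0.99"
```

The alert rule then compares this recorded series against the threshold, so what the alert evaluates and what a dashboard graphs are the same data.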
Alertmanager is introduced as the component that receives alerts from Prometheus and handles routing, grouping, silencing, inhibition, deduplication, and final delivery.
Routing Tree – a multi-branch tree where each node defines matching criteria, inherits configuration from its parent, and optionally continues matching for multi-recipient routing. Example Go struct:

```go
// Node containing routing logic
type Route struct {
	parent    *Route
	RouteOpts RouteOpts
	Matchers  types.Matchers
	Continue  bool
	Routes    []*Route
}
```

The matching algorithm performs a depth-first search, returning the deepest matching node unless `Continue` is true.
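That depth-first matching can be sketched in a self-contained form. This is a simplified illustration, not Alertmanager's actual code: matchers are reduced to plain label equality and `RouteOpts` to a receiver name:

```go
package main

import "fmt"

// Labels is a simplified label set attached to an alert.
type Labels map[string]string

// Route is a pared-down routing node: equality matchers only.
type Route struct {
	Receiver string
	Matchers Labels
	Continue bool
	Routes   []*Route
}

// matches reports whether every matcher equals the corresponding label.
func (r *Route) matches(lset Labels) bool {
	for k, v := range r.Matchers {
		if lset[k] != v {
			return false
		}
	}
	return true
}

// Match walks the tree depth-first. A matching child hides its parent
// (the deepest match wins); Continue lets later siblings also match,
// which is how one alert can reach multiple receivers.
func (r *Route) Match(lset Labels) []*Route {
	if !r.matches(lset) {
		return nil
	}
	var all []*Route
	for _, child := range r.Routes {
		matched := child.Match(lset)
		all = append(all, matched...)
		if len(matched) > 0 && !child.Continue {
			break
		}
	}
	if len(all) == 0 {
		all = append(all, r) // no child matched: this node handles the alert
	}
	return all
}

// exampleTree mirrors the sample configuration below in miniature.
func exampleTree() *Route {
	return &Route{
		Receiver: "default-receiver",
		Routes: []*Route{
			{Receiver: "database-pager", Matchers: Labels{"service": "mysql"}},
			{Receiver: "frontend-pager", Matchers: Labels{"team": "frontend"}},
		},
	}
}

func main() {
	for _, r := range exampleTree().Match(Labels{"service": "mysql"}) {
		fmt.Println(r.Receiver)
	}
}
```

An alert labeled `service=mysql` resolves to `database-pager`, while an alert matching no child falls back to the root's `default-receiver`, showing the inheritance behavior the configuration relies on.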
A sample Alertmanager configuration demonstrates inheritance and specific routing for database and frontend alerts:

```yaml
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend
```

Notification Pipeline – after routing, alerts pass through a chain of stages (e.g., InhibitStage, SilenceStage, DedupStage, NotifySetStage). The pipeline uses a MultiStage implementation:
```go
// A Stage processes alerts under the constraints of the given context.
type Stage interface {
	Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error)
}

// A MultiStage executes a series of stages sequentially.
type MultiStage []Stage

func (ms MultiStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var err error
	for _, s := range ms {
		if len(alerts) == 0 {
			return ctx, nil, nil
		}
		ctx, alerts, err = s.Exec(ctx, l, alerts...)
		if err != nil {
			return ctx, nil, err
		}
	}
	return ctx, alerts, nil
}
```

The pipeline handles grouping, waiting (`group_wait`), periodic execution (`group_interval`), deduplication based on `repeat_interval`, silencing, and inhibition, ensuring high-quality alert delivery.
In conclusion, Alertmanager’s design focuses on alert governance: the Routing Tree classifies alerts and defines routing logic, while the Notification Pipeline applies suppression, silencing, and deduplication to improve alert quality, though it cannot solve every operational pain point.