Understanding Prometheus Alerting: When Alerts Fire and Why They May Not
This article offers a clear, beginner-friendly explanation of the principles behind Prometheus alerts: when they trigger, why they sometimes stay silent, and how Alertmanager's routing tree and notification pipeline work together to manage alert noise through grouping, silencing, and deduplication.
Alerts are an essential yet complex part of any monitoring system. When an abnormal condition occurs, an alert should notify the appropriate person or channel, but in practice many issues arise:

- Transient metric spikes that disappear quickly.
- Too many alerts causing fatigue and overload.
- Thresholds set too sensitively.
- Unclear alert meanings.
- Massive bursts of alerts during network outages.
These problems show that alerting is more than a simple compute‑plus‑notify task; organizational and management practices are also required.
Alert rule example (with the `for` parameter):

```yaml
- alert: KubeAPILatencyHigh
  annotations:
    message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
  expr: |
    cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
  for: 10m
  labels:
    severity: critical
```

This rule fires when the kube-apiserver 99th-percentile latency exceeds 4 seconds for a continuous period of at least 10 minutes.
The for field defines a *Pending Duration* that helps filter out short‑lived spikes, ensuring that only sustained problems reach on‑call personnel.
Even if a metric crosses the threshold, the alert will not fire if the duration is insufficient, as illustrated by the accompanying diagrams (omitted here).
Two common questions are explored:

1. Why does an alert not fire despite a prolonged threshold breach?
2. Why does an alert fire when the metric does not appear to exceed the threshold?
Both stem from Prometheus's storage and evaluation model. Metrics are scraped at a configured `scrape_interval`, producing sparse (timestamp, value) samples. Alert rules are evaluated at fixed intervals, generating sparse evaluation points. Grafana's range queries may sample at different steps, leading to apparent mismatches between what the alert rule "sees" and what a graph displays.
The article walks through a timeline example showing how sampling points at 40 s, 80 s, 120 s, and 160 s affect alert state transitions from Pending to Firing and finally to resolved.
To mitigate these issues, the author suggests:

- Accepting that Prometheus provides approximate data due to sparse sampling.
- Inspecting the built-in `ALERTS` metric to see the full lifecycle of each alert.
- Using recording rules to materialize the computed values as new metrics, then alerting on those stable series.
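Concretely, the lifecycle of the rule above can be inspected by querying `ALERTS{alertname="KubeAPILatencyHigh"}`, whose `alertstate` label distinguishes `pending` from `firing`. A recording rule for the latency series might look roughly like the following; the `histogram_quantile` expression here is an assumption inferred from the recorded metric's name, not taken from the article:

```yaml
groups:
  - name: apiserver.rules
    rules:
      # Materialize the 99th-percentile latency as its own series so the
      # alert rule evaluates a precomputed, stable metric.
      - record: cluster_quantile:apiserver_request_latencies:histogram_quantile
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) by (le, verb, resource, subresource))
        labels:
          quantile: "0.99"
```

The alert rule then compares this recorded series against the threshold, so what the alert evaluates and what a dashboard graphs are the same data.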
Alertmanager is introduced as the component that receives alerts from Prometheus and handles routing, grouping, silencing, inhibition, deduplication, and final delivery.
Routing Tree – a multi-branch tree where each node defines matching criteria, inherits configuration from its parent, and optionally continues matching for multi-recipient routing. Example Go struct:

```go
// Node containing routing logic
type Route struct {
	parent    *Route
	RouteOpts RouteOpts
	Matchers  types.Matchers
	Continue  bool
	Routes    []*Route
}
```

The matching algorithm performs a depth-first search, returning the deepest matching node unless `Continue` is true.
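That depth-first matching can be sketched in a self-contained form. This is a simplified illustration, not Alertmanager's actual code: matchers are reduced to plain label equality and `RouteOpts` to a receiver name:

```go
package main

import "fmt"

// Labels is a simplified label set attached to an alert.
type Labels map[string]string

// Route is a pared-down routing node: equality matchers only.
type Route struct {
	Receiver string
	Matchers Labels
	Continue bool
	Routes   []*Route
}

// matches reports whether every matcher equals the corresponding label.
func (r *Route) matches(lset Labels) bool {
	for k, v := range r.Matchers {
		if lset[k] != v {
			return false
		}
	}
	return true
}

// Match walks the tree depth-first. A matching child hides its parent
// (the deepest match wins); Continue lets later siblings also match,
// which is how one alert can reach multiple receivers.
func (r *Route) Match(lset Labels) []*Route {
	if !r.matches(lset) {
		return nil
	}
	var all []*Route
	for _, child := range r.Routes {
		matched := child.Match(lset)
		all = append(all, matched...)
		if len(matched) > 0 && !child.Continue {
			break
		}
	}
	if len(all) == 0 {
		all = append(all, r) // no child matched: this node handles the alert
	}
	return all
}

// exampleTree mirrors the sample configuration below in miniature.
func exampleTree() *Route {
	return &Route{
		Receiver: "default-receiver",
		Routes: []*Route{
			{Receiver: "database-pager", Matchers: Labels{"service": "mysql"}},
			{Receiver: "frontend-pager", Matchers: Labels{"team": "frontend"}},
		},
	}
}

func main() {
	for _, r := range exampleTree().Match(Labels{"service": "mysql"}) {
		fmt.Println(r.Receiver)
	}
}
```

An alert labeled `service=mysql` resolves to `database-pager`, while an alert matching no child falls back to the root's `default-receiver`, showing the inheritance behavior the configuration relies on.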
A sample Alertmanager configuration demonstrates inheritance and specific routing for database and frontend alerts:

```yaml
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend
```

Notification Pipeline – after routing, alerts pass through a chain of stages (e.g., InhibitStage, SilenceStage, DedupStage, NotifySetStage). The pipeline uses a MultiStage implementation:
```go
// A Stage processes alerts under the constraints of the given context.
type Stage interface {
	Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error)
}

// A MultiStage executes a series of stages sequentially.
type MultiStage []Stage

func (ms MultiStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var err error
	for _, s := range ms {
		if len(alerts) == 0 {
			return ctx, nil, nil
		}
		ctx, alerts, err = s.Exec(ctx, l, alerts...)
		if err != nil {
			return ctx, nil, err
		}
	}
	return ctx, alerts, nil
}
```

The pipeline handles grouping, waiting (`group_wait`), periodic execution (`group_interval`), deduplication based on `repeat_interval`, silencing, and inhibition, ensuring high-quality alert delivery.
In conclusion, Alertmanager’s design focuses on alert governance: the Routing Tree classifies alerts and defines routing logic, while the Notification Pipeline applies suppression, silencing, and deduplication to improve alert quality, though it cannot solve every operational pain point.