Why Does Prometheus Sometimes Fail to Trigger Alerts?
This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.
Starting with the for Parameter
Prometheus evaluates alerts based on a rule definition. A simple example rule is shown below:
- alert: KubeAPILatencyHigh
  annotations:
    message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
  expr: |
    cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
  for: 10m
  labels:
    severity: critical

The rule fires when the kube-apiserver 99th-percentile response time exceeds 4 seconds for at least 10 minutes.
The for clause defines a *Pending Duration* that filters out short‑lived spikes, ensuring only sustained problems trigger alerts.
Consequently, if a metric crosses the threshold but does not stay above it for the full for period, the alert will not fire, as illustrated in the first diagram.
Why Doesn’t an Alert Fire?
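An alert may stay silent even when the Grafana chart shows the metric above the threshold for longer than the for period. The rule only sees the value at each evaluation point; if a single evaluation lands on a sample below the threshold, the alert leaves Pending and the for timer starts over.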
Why Does an Alert Fire?
Conversely, an alert may fire even if the metric does not appear continuously above the threshold in the Grafana chart, due to the way Prometheus samples data.
Sampling Interval
Prometheus scrapes metrics at the configured scrape_interval, storing them as (timestamp, value) pairs. Alert rules are also evaluated at a fixed interval (evaluation_interval), so a rule sees only a sparse series of points, and the for duration is measured across those evaluations.
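Both intervals are set in the Prometheus configuration. The fragment below is a minimal sketch, not the configuration of any particular cluster: the 15s and 40s values, the group name, and the metric and alert names inside the group are placeholders chosen for illustration.

```yaml
# prometheus.yml (fragment): global defaults
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # default interval at which rule groups are evaluated
---
# rules file (fragment): a group can override the evaluation interval
groups:
  - name: example-latency            # hypothetical group name
    interval: 40s                    # this group's rules are evaluated every 40 s
    rules:
      - alert: ExampleLatencyHigh    # hypothetical alert name
        expr: example_latency_seconds > 4   # hypothetical metric
        for: 2m                      # must hold at every evaluation for 2 minutes
```

With settings like these, a dip that falls between two evaluations is never seen by the rule, however clearly it shows up in the scraped data.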
Grafana issues a Range Query with a step parameter, which determines how often the query samples the data. The mismatch between Prometheus’s evaluation points and Grafana’s query steps can cause the chart to show “valleys” that the alert rule never sees, or hide valleys that the rule does see.
Take a rule that is evaluated every 40 seconds as an example:
40 s – first evaluation, below threshold.
80 s – second evaluation, above threshold, enters Pending.
120 s – third evaluation, still above; a low sample at 90 s is missed.
160 s – fourth evaluation, still above; Pending reaches 2 minutes, alert fires.
Continues above threshold until 360 s, when it finally drops and the alert resolves.
How to Deal with It
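Prometheus alerts are inherently approximate because they operate on sparse samples. To see what the rule actually saw, query the built-in ALERTS metric, for example ALERTS{alertname="KubeAPILatencyHigh"}: its alertstate label is pending or firing, so the series shows when each alert entered and left each state.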
If more precise values are needed, create a Recording Rule that computes the exact expression you care about, store it as a new metric, and alert on that metric. This technique is widely used in the kube‑prometheus stack.
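A recording rule paired with an alert could look roughly like the sketch below. The recorded metric name and the alert expression come from the example rule earlier in this article; the group name, the underlying histogram name (apiserver_request_latencies_bucket), the 5m rate window, and the assumption that the buckets are already in seconds are illustrative guesses, and a histogram reported in another unit would need a conversion.

```yaml
groups:
  - name: example-apiserver-latency   # hypothetical group name
    rules:
      # Recording rule: precompute the 99th percentile and store it as a new series.
      # Histogram name, rate window, and unit are assumptions, not taken from the source.
      - record: cluster_quantile:apiserver_request_latencies:histogram_quantile
        expr: |
          histogram_quantile(
            0.99,
            sum without (instance, pod) (
              rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])
            )
          )
        labels:
          quantile: "0.99"
      # Alerting rule: compare the recorded series against the threshold.
      - alert: KubeAPILatencyHigh
        expr: |
          cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
        for: 10m
        labels:
          severity: critical
```

Because the recorded series is stored like any other metric, plotting it in Grafana shows the values the alert rule compares against the threshold, rather than a value recomputed at a different step.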
Is That All?
Alert generation also involves Alertmanager, which performs grouping, inhibition, silencing, deduplication, and noise reduction before sending notifications. These additional steps can also affect whether you actually receive an alert.
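The Alertmanager settings that most often delay or suppress a notification can be sketched as follows. This is an illustrative fragment rather than a recommended setup: the receiver name, label values, and all durations are placeholders, and source_matchers/target_matchers assume Alertmanager 0.22 or later (older releases use source_match/target_match).

```yaml
# alertmanager.yml (fragment)
route:
  receiver: team-default          # hypothetical receiver name
  group_by: ['alertname', 'job']  # alerts sharing these labels are sent as one notification
  group_wait: 30s                 # wait before sending the first notification for a new group
  group_interval: 5m              # wait before notifying about new alerts added to a group
  repeat_interval: 4h             # wait before re-sending a notification that is still firing

inhibit_rules:
  # While a critical alert fires, mute warning alerts with the same alertname.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname']

receivers:
  - name: team-default            # no integrations configured; placeholder only
```

An alert that fires in Prometheus can therefore reach you late because of group_wait, or not at all because of an inhibition rule or a silence.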
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
