Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Programmer DD

Jan 15, 2021

Starting with the for Parameter

Prometheus evaluates alerts based on a rule definition. A simple example rule is shown below:

- alert: KubeAPILatencyHigh
  annotations:
    message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
  expr: |
    cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
  for: 10m
  labels:
    severity: critical

The rule fires when the kube‑apiserver 99th‑percentile response time exceeds 4 seconds for at least 10 minutes.

The for clause defines a *Pending Duration* that filters out short‑lived spikes, ensuring only sustained problems trigger alerts.

Consequently, if the metric crosses the threshold but drops back below it before the for period has elapsed, the alert does not fire.

Why Doesn’t an Alert Fire?

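Even when a chart suggests the metric stayed above the threshold, the alert may remain silent. Prometheus checks the expression only at each rule evaluation. If a single evaluation finds the value at or below the threshold, or finds no data at all, the Pending state is reset and the for timer starts over.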

Why Does an Alert Fire?

Conversely, an alert may fire even though the Grafana chart shows the metric dipping below the threshold during the for period, because none of the rule evaluations happened to land on the dip.

Sampling Interval

Prometheus scrapes metrics at the configured scrape_interval and stores them as (timestamp, value) pairs. Alert rules are evaluated at a fixed interval as well (evaluation_interval), so the for condition is checked only at those sparse evaluation points, not continuously.
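Both intervals are set in the Prometheus configuration. A minimal fragment follows; the values are assumptions chosen only for illustration and do not come from the article:

```yaml
# prometheus.yml (fragment; values are illustrative assumptions)
global:
  scrape_interval: 30s       # how often each target is scraped
  evaluation_interval: 40s   # how often alerting and recording rules are evaluated
```

A rule group can override the global evaluation interval with its own interval field.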

Grafana issues a Range Query with a step parameter, which determines how often the query samples the data. The mismatch between Prometheus’s evaluation points and Grafana’s query steps can cause the chart to show “valleys” that the alert rule never sees, or hide valleys that the rule does see.

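For example, with rule evaluations 40 s apart:

40 s – first evaluation, below threshold.

80 s – second evaluation, above threshold, enters Pending.

120 s – third evaluation, still above; a low sample at 90 s falls between evaluations and is missed.

160 s – fourth evaluation, still above; the alert has been Pending for 80 s (since the 80 s evaluation), which satisfies the for duration, so the alert fires.

The metric then stays above the threshold at every evaluation until 360 s, when it finally drops and the alert resolves.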
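The timeline above can be reproduced as a rule unit test. The sketch below is an illustration under stated assumptions, not the article's own setup: the metric demo_value, the alert name TimelineExample, both file names, the 10 s sample spacing, and for: 1m are all made up, with for: 1m chosen so that the alert fires at the fourth evaluation.

```yaml
# timeline_rules.yml (hypothetical file name)
groups:
  - name: timeline-example        # hypothetical group
    rules:
      - alert: TimelineExample    # hypothetical alert
        expr: demo_value > 4      # demo_value is a made-up metric
        for: 1m                   # assumed Pending duration
        labels:
          severity: critical
```

```yaml
# timeline_test.yml (hypothetical); run with: promtool test rules timeline_test.yml
rule_files:
  - timeline_rules.yml
evaluation_interval: 40s          # rules are evaluated at 0 s, 40 s, 80 s, ...
tests:
  - interval: 10s                 # spacing of the synthetic samples
    input_series:
      - series: 'demo_value{job="demo"}'
        # 0-70 s: 1 (below), 80 s: 5, 90 s: 1 (the dip),
        # 100-350 s: 5, 360-400 s: 1 (below again)
        values: '1+0x7 5 1 5+0x25 1+0x4'
    alert_rule_test:
      - eval_time: 120s           # Pending for only 40 s, so nothing is firing
        alertname: TimelineExample
      - eval_time: 160s           # Pending for 80 s >= 1m, so the alert fires
        alertname: TimelineExample
        exp_alerts:
          - exp_labels:
              severity: critical
              job: demo
      - eval_time: 360s           # value is back below the threshold, resolved
        alertname: TimelineExample
```

No evaluation lands on the 90 s dip, so the rule never observes it. Changing evaluation_interval to 30s in this sketch would place an evaluation at 90 s and reset the Pending state.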

How to Deal with It

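Prometheus alerts are inherently approximate because they operate on sparse samples. To see what the rule actually observed, query the built-in ALERTS series, for example ALERTS{alertname="KubeAPILatencyHigh"}. Its alertstate label is either pending or firing, so graphing it shows the state transitions of each alert.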

If more precise values are needed, create a Recording Rule that computes the exact expression you care about, store it as a new metric, and alert on that metric. This technique is widely used in the kube‑prometheus stack.
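A sketch of that pattern, loosely following the naming used in the rule at the top of the article. The group name, the underlying histogram metric apiserver_request_latencies_bucket, the 5m rate window, and the division that converts microseconds to seconds are assumptions for illustration; check them against the metrics your cluster actually exposes.

```yaml
groups:
  - name: apiserver-latency-example   # hypothetical group name
    rules:
      # Recording rule: evaluate the quantile once per interval and
      # store the result as a new series.
      - record: cluster_quantile:apiserver_request_latencies:histogram_quantile
        expr: |
          histogram_quantile(
            0.99,
            sum without (instance, pod) (
              rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])
            )
          ) / 1e+06
        labels:
          quantile: "0.99"
      # Alerting rule: compare the recorded series with the threshold.
      - alert: KubeAPILatencyHigh
        expr: |
          cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log"} > 4
        for: 10m
        labels:
          severity: critical
```

Rules in a group are evaluated in order, so the alerting rule is placed after the recording rule it depends on. Graphing the recorded series in Grafana should then show values close to the ones the alert compared against the threshold.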

Is That All?

Once a rule fires, Prometheus hands the alert to Alertmanager, which applies grouping, inhibition, silencing, and deduplication before sending notifications. Any of these steps can delay or suppress a notification, so a firing alert does not guarantee that you actually receive one.
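As a rough illustration of where those steps are configured, here is a minimal Alertmanager configuration sketch. The receiver name, the label names, and all timings are assumptions chosen for the example.

```yaml
# alertmanager.yml (fragment; names and timings are illustrative assumptions)
route:
  receiver: default-receiver
  group_by: ['alertname', 'cluster']  # alerts sharing these labels are batched together
  group_wait: 30s                     # wait before sending the first notification for a new group
  group_interval: 5m                  # wait before notifying about new alerts added to a group
  repeat_interval: 4h                 # wait before re-sending a notification that is still firing

inhibit_rules:
  # While a critical alert is firing, mute warning alerts with the same alertname.
  # The *_matchers fields need a reasonably recent Alertmanager (0.22 or later).
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname']

receivers:
  - name: default-receiver            # no integrations configured in this sketch
```

With settings like these, a notification for a newly firing alert can lag the rule by up to group_wait, and an inhibited or silenced alert produces no notification at all.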
