Why Prometheus Queries Slow Down and How Recording Rules Speed Them Up

The article examines performance bottlenecks in Prometheus‑Grafana monitoring dashboards caused by high metric cardinality, explains the internal query processing steps, demonstrates how to analyze and reduce cardinality with PromQL and recording rules, and shows concrete command‑line examples that dramatically improve query latency.

dbaplus Community

Aug 7, 2023

Background

According to Google's Site Reliability Engineering book, a monitoring system should support both white-box and black-box monitoring. The team built a Prometheus + Grafana platform on that basis, but recently observed that some dashboard charts loaded extremely slowly, or failed outright, when querying the last seven days of data, which severely hurt developer productivity.

Investigation

1. Initial check – Selecting a failing chart and inspecting the network tab revealed that the metric‑query API took up to 48 seconds even when the scrape interval was increased to 40 minutes, indicating the delay occurs inside Prometheus.

2. Prometheus query processing flow – Prometheus stores time‑series data in blocks (default 2‑hour ranges). Each block contains chunks (samples for a single series) and an index with two sub‑indexes: the postings index (label‑to‑series mapping) and the series index (series‑to‑chunk mapping). The query execution follows five steps:

Identify blocks covering the requested time range.

Use the postings index to find series matching the label selectors.

Use the series index to locate the corresponding chunks.

Retrieve sample data from those chunks.

If the query contains operators, aggregations, or functions, perform additional calculations on the retrieved samples.
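The five steps above can be sketched with a toy in-memory model. This is only an illustration of the lookup order, not Prometheus's actual TSDB code: the `Block` class, its field names, and the sample data are all assumptions made for the example, and only equality matchers are handled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a TSDB block (not the real on-disk format)."""
    min_t: int
    max_t: int
    postings: dict      # (label_name, label_value) -> set of series ids
    series_index: dict  # series id -> list of chunk ids
    chunks: dict        # chunk id -> list of (timestamp, value) samples


def select(blocks, matchers, start, end):
    """Return {series_id: [(t, v), ...]} for series matching all equality matchers."""
    result = {}
    for block in blocks:
        # Step 1: keep only blocks that overlap the requested time range.
        if block.max_t < start or block.min_t > end:
            continue
        # Step 2: intersect the postings lists of every label matcher.
        ids = None
        for name, value in matchers.items():
            found = block.postings.get((name, value), set())
            ids = found if ids is None else ids & found
        for sid in sorted(ids or ()):
            # Step 3: the series index maps each series to its chunks.
            for chunk_id in block.series_index[sid]:
                # Step 4: read the samples that fall inside the range.
                for t, v in block.chunks[chunk_id]:
                    if start <= t <= end:
                        result.setdefault(sid, []).append((t, v))
    return result


def sum_last_values(samples_by_series):
    # Step 5: an aggregation computed over the retrieved samples.
    return sum(samples[-1][1] for samples in samples_by_series.values())


block_a = Block(
    min_t=0, max_t=7199,
    postings={("job", "api"): {1, 2}, ("uri", "/orders"): {1}, ("uri", "/users"): {2}},
    series_index={1: ["c1"], 2: ["c2"]},
    chunks={"c1": [(60, 1.0), (120, 3.0)], "c2": [(60, 5.0)]},
)
block_b = Block(
    min_t=7200, max_t=14399,
    postings={("job", "api"): {1}, ("uri", "/orders"): {1}},
    series_index={1: ["c3"]},
    chunks={"c3": [(7260, 4.0)]},
)

first_block_only = select([block_a, block_b], {"job": "api"}, 0, 7199)
orders_all_time = select([block_a, block_b], {"job": "api", "uri": "/orders"}, 0, 14399)
```

Every series that matches in step 2 costs index lookups, chunk reads, and aggregation work in the later steps, which is why the number of matching series matters so much for latency.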

3. Detailed analysis – The team introduced the concept of cardinality: the number of distinct label-value combinations, and therefore of distinct time series, that a metric has. For the metric http_server_requests_seconds_count, if the job label has two values and the level label has five, the cardinality is 2 × 5 = 10. The following PromQL query counts the distinct values of a single label; label_name is a placeholder for the label being examined (for example uri):

count(count by (label_name) (http_server_requests_seconds_count))

The result showed that instance and uri labels each had very high cardinality (≈147 610). Comparable fast‑loading metrics had cardinalities below 10 000. The high cardinality, together with aggregation functions like sum and rate, was identified as the main cause of the long query times.
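To make the cardinality arithmetic concrete, here is a minimal sketch that counts distinct values per label and distinct series over a list of label sets. The label values (`api`, `worker`, `debug`, and so on) are made up for illustration; only the two-by-five shape comes from the example above.

```python
from collections import defaultdict


def label_cardinalities(series):
    """Count the distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def series_cardinality(series):
    """Count the distinct label-value combinations, i.e. distinct time series."""
    return len({tuple(sorted(labels.items())) for labels in series})


# Hypothetical label values: two jobs and five levels.
jobs = ["api", "worker"]
levels = ["debug", "info", "warn", "error", "fatal"]
series = [{"job": job, "level": level} for job in jobs for level in levels]

per_label = label_cardinalities(series)
total = series_cardinality(series)
```

Here `per_label` is the per-label count that the PromQL query returns one label at a time, and `total` is the number of series the query engine has to touch.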

Optimization

Prometheus offers recording rules to pre‑compute expensive expressions and store them as new time‑series. The original dashboard query was:

sum(rate(http_server_requests_seconds_count{application="$application", cluster=~"$cluster", uri!~"/actuator/.*|/\\*.*|root|/|/health/check"}[1m])) by (uri)

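A recording rule was created to materialise the aggregated data. The aggregation keeps the application and cluster labels so that the dashboard variables can still filter on them, and drops every other label, including the high-cardinality instance label. The entry goes under the rules: list of a rule group in the rules file:

- record: http_server_requests_seconds_count:rate:1m:acu
  expr: sum(rate(http_server_requests_seconds_count{uri!~"/actuator/.*|/\\*.*|root|/|/health/check"}[1m])) by (application, cluster, uri)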
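The effect of the rule on series count can be sketched as a group-by over label sets. This is a simplified model of what `sum(...) by (application, cluster, uri)` does to one evaluation's worth of per-series rates; the label values and rate numbers are invented for the example.

```python
from collections import defaultdict


def aggregate_by(rates, keep):
    """Sum per-series values, keeping only the labels named in `keep`."""
    out = defaultdict(float)
    for labels, value in rates:
        key = tuple((name, labels[name]) for name in keep)
        out[key] += value
    return dict(out)


# Hypothetical per-instance request rates (requests per second).
raw = [
    ({"application": "shop", "cluster": "c1", "uri": "/orders", "instance": "10.0.0.1:8080"}, 1.5),
    ({"application": "shop", "cluster": "c1", "uri": "/orders", "instance": "10.0.0.2:8080"}, 2.5),
    ({"application": "shop", "cluster": "c1", "uri": "/users", "instance": "10.0.0.1:8080"}, 0.5),
    ({"application": "shop", "cluster": "c1", "uri": "/users", "instance": "10.0.0.2:8080"}, 0.25),
]

recorded = aggregate_by(raw, ("application", "cluster", "uri"))
orders_key = (("application", "shop"), ("cluster", "c1"), ("uri", "/orders"))
users_key = (("application", "shop"), ("cluster", "c1"), ("uri", "/users"))
```

Four raw series collapse into two recorded ones because the instance label is dropped. With many instances per URI, the reduction grows accordingly, and the rate calculation is done once at rule-evaluation time instead of on every dashboard load.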

Since recording rules only contain data generated after their creation, historical data must be backfilled. From Prometheus v2.27 onward, the promtool tsdb create-blocks-from rules command can generate blocks for the missing period:

promtool tsdb create-blocks-from rules \
  --start 1680348042 \
  --end 1682421642 \
  --url http://mypromserver.com:9090 \
  rules.yaml
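The --start and --end flags take Unix timestamps in seconds. As a small convenience sketch, the values used above can be derived from UTC dates and assembled into an argument list. The helper names `unix_seconds` and `backfill_command` are hypothetical; the flags themselves are the ones shown in the command above.

```python
from datetime import datetime, timezone


def unix_seconds(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC wall-clock time to a Unix timestamp in seconds."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def backfill_command(start, end, url, rule_files):
    """Build the promtool argument list for backfilling recording rules."""
    return [
        "promtool", "tsdb", "create-blocks-from", "rules",
        "--start", str(start),
        "--end", str(end),
        "--url", url,
        *rule_files,
    ]


start = unix_seconds(2023, 4, 1, 11, 20, 42)
end = unix_seconds(2023, 4, 25, 11, 20, 42)
cmd = backfill_command(start, end, "http://mypromserver.com:9090", ["rules.yaml"])
```

The two timestamps in the article's command correspond to 2023-04-01 11:20:42 UTC and 2023-04-25 11:20:42 UTC, a window of exactly 24 days. The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where promtool is installed.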

After the generated blocks were moved into the data directory of the running Prometheus instance, the new recorded metric had a cardinality of only 4 878. The dashboard query was updated to use the recorded series:

sum(http_server_requests_seconds_count:rate:1m:acu{application="$application", cluster=~"$cluster"}) by (uri)

Performance tests showed that the longer the query time range, the more pronounced the speedup, confirming that reduced cardinality and pre‑computed functions greatly improve latency.

Conclusion

When Prometheus metrics have excessive cardinality, dashboard charts become slow or fail. Using recording rules to pre‑aggregate data and backfilling historic blocks can dramatically lower cardinality and query time. Additional best practices include increasing scrape intervals, deleting unused series, and designing metrics with limited label value sets (e.g., grouping HTTP status codes into 1XX‑5XX buckets) during the metric design phase.
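The status-code grouping mentioned above can be sketched as a small helper applied before a label value is attached to a metric. The function name and the `"other"` fallback for out-of-range codes are assumptions made for this example.

```python
def status_class(code):
    """Map an HTTP status code to a coarse class label such as "2XX"."""
    if not 100 <= code <= 599:
        # Hypothetical fallback so malformed codes cannot create new label values.
        return "other"
    return f"{code // 100}XX"


labels = sorted({status_class(code) for code in (200, 201, 204, 301, 404, 418, 500, 503)})
```

Eight distinct status codes produce only four label values here, so the status label can never contribute more than six values (1XX to 5XX plus the fallback) to a metric's cardinality.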
