Performance Troubleshooting and Optimization of Prometheus Monitoring Queries
The article explains that high metric cardinality in Prometheus leads to long query times and timeouts, and demonstrates how recording rules that pre-compute aggregates dramatically reduce both cardinality and latency. It also recommends scrape-interval tuning and metric-design best practices to keep charts responsive.
Background: The article discusses performance issues encountered with a Prometheus-Grafana monitoring platform. Queries over the most recent 7 days of data often time out, causing chart loading failures and hurting developer efficiency.
Initial Investigation: By inspecting the network request behind a failing chart, the author found that the query to Prometheus was timing out (≈48 seconds) even with the step increased to 40 minutes, indicating that the bottleneck lies in Prometheus query processing rather than in the frontend.
Prometheus Query Processing Flow:
Prometheus stores time-series data in blocks (each covering a 2-hour range by default). Each block contains chunks (the sample data for its series) and an index (metadata). The index includes two sub-indexes:
postings index: maps label/value pairs to the series that carry them.
series index: maps each series to the chunks that hold its samples.
The query processing consists of five steps:
Determine the blocks that cover the requested time range.
Use the postings index to find series matching the label selectors.
Use the series index to locate the relevant chunks.
Retrieve samples from those chunks.
If the query contains operators, aggregations, or functions, perform additional calculations on the retrieved samples.
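As a concrete illustration of the steps above, consider a query of the following shape (the metric and label names here are generic examples, not taken from the article):

```promql
# Step 1: select the blocks overlapping the query window (here, the last hour).
# Steps 2-3: resolve the matchers {job="api", status="500"} through the postings
#            index to series, then through the series index to chunks.
# Step 4: read the samples from those chunks.
# Step 5: evaluate rate() and sum by () over the retrieved samples.
sum by (instance) (rate(http_requests_total{job="api", status="500"}[1h]))
```

The more series the matchers select, the more work steps 2-5 must do, which is why cardinality dominates query latency.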
Detailed Investigation:
The author notes that larger time ranges, higher label cardinality, and the use of aggregations all increase query latency. Cardinality is the number of unique label value combinations for a metric. For the metric http_server_requests_seconds_count, cardinality reaches 147 610, far above the sub-10 000 range of metrics whose charts load quickly.
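To make "unique label value combinations" concrete, here is a minimal sketch in Python. The label sets are invented for illustration; the point is that cardinality is the product of the value counts, so one unbounded label (such as a URI with embedded IDs) multiplies the series count:

```python
from itertools import product

# Hypothetical label value sets (not from the article).
applications = ["app-a", "app-b"]                     # 2 values
uris = [f"/api/v1/item/{i}" for i in range(50)]       # 50 values: path labels explode cardinality
statuses = ["200", "404", "500"]                      # 3 values

# Each unique (application, uri, status) combination is a separate time series.
series = {combo for combo in product(applications, uris, statuses)}
print(len(series))  # 2 * 50 * 3 = 300 unique label combinations
```

Capping any one label's value set (for example, templating URIs or bucketing status codes) shrinks the product directly.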
PromQL used to compute per-label cardinality:

```promql
count(count by (label_name) (http_server_requests_seconds_count))
```

Overall metric cardinality query:

```promql
count({__name__="http_server_requests_seconds_count"})
```

High-cardinality metrics are the cause of the observed slow queries.
Optimization via Recording Rules:
Recording rules pre-compute expensive expressions and store the results as new time series. The original query:

```promql
sum(rate(http_server_requests_seconds_count{application="$application",cluster=~"$cluster",uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (uri)
```

is replaced by a recording rule:

```yaml
record: http_server_requests_seconds_count:rate:1m:acu
expr: sum(rate(http_server_requests_seconds_count{uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (application,cluster,uri)
```

After creating the rule, historical data can be back-filled using promtool tsdb create-blocks-from rules (example command shown in the article). The rule's cardinality dropped from ~147 000 to 4 878, and the optimized query becomes:

```promql
sum(http_server_requests_seconds_count:rate:1m:acu{application="$application",cluster=~"$cluster"}) by (uri)
```

Performance tests show a significant latency reduction, especially for longer time ranges.
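For context, a complete rules file of the shape Prometheus expects might look like the sketch below. The group name and evaluation interval are assumptions, not values from the article; the backfill command's time range and URL are placeholders:

```yaml
groups:
  - name: http_request_rates          # group name is an assumption
    interval: 1m                      # evaluate at the same resolution as the rate window
    rules:
      - record: http_server_requests_seconds_count:rate:1m:acu
        expr: sum(rate(http_server_requests_seconds_count{uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (application,cluster,uri)
```

```shell
# Backfill history for the new rule (times and URL are placeholders).
promtool tsdb create-blocks-from rules \
  --start 2023-01-01T00:00:00Z \
  --end   2023-01-08T00:00:00Z \
  --url   http://localhost:9090 \
  rules.yaml
# The generated blocks must then be moved into Prometheus's data directory.
```

Backfilling matters because a recording rule only produces data from the moment it is created; without it, dashboards switched to the new series would show gaps for older time ranges.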
Conclusion: Excessive metric cardinality leads to slow or failing Prometheus charts. Using recording rules, adjusting scrape intervals, and pruning unused series are effective mitigation strategies. Metric design should consider cardinality early, e.g., grouping HTTP status codes into classes (1XX-5XX) to keep cardinality low.
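The status-code grouping can also be done at query time with PromQL's label_replace, as a sketch. This assumes the metric carries a numeric status label named "status" (a label name not confirmed by the article); the cleaner fix is still to emit the class as a label at instrumentation or relabeling time:

```promql
# Collapse individual HTTP status codes (200, 404, ...) into classes (2XX, 4XX, ...)
# before aggregating. "([0-9]).." must fully match the 3-digit code; "${1}XX"
# substitutes the first digit captured by the regex.
sum by (status_class) (
  label_replace(
    rate(http_server_requests_seconds_count[1m]),
    "status_class", "${1}XX", "status", "([0-9]).."
  )
)
```

Note that label_replace only rewrites labels on the query result; the underlying stored series keep their full cardinality, so this helps dashboard readability but not storage or index size.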
iQIYI Technical Product Team