Performance Troubleshooting and Optimization of Prometheus Monitoring Queries
The article explains that high metric cardinality in Prometheus leads to long query times and timeouts, and demonstrates how recording rules that pre-compute aggregates dramatically reduce both cardinality and latency. It also recommends scrape-interval tuning and metric-design best practices to keep charts responsive.
Background: The article discusses performance issues encountered with a Prometheus-Grafana monitoring platform. Queries over the most recent 7 days of data often time out, causing chart loading failures and hurting developer efficiency.
Initial Investigation: By inspecting the network request behind a failing chart, the author found that the query to Prometheus was timing out (≈48 seconds) even with the step increased to 40 minutes, indicating that the bottleneck lies in Prometheus query processing rather than in the frontend.
Prometheus Query Processing Flow:
Prometheus stores time-series data in blocks (each covering a 2-hour range by default). Each block contains chunks (the sample data for its series) and an index (metadata). The index includes two sub-indexes:
postings index: maps label/value pairs to the series that carry them.
series index: maps each series to the chunks that hold its samples.
The query processing consists of five steps:
Determine the blocks that cover the requested time range.
Use the postings index to find series matching the label selectors.
Use the series index to locate the relevant chunks.
Retrieve samples from those chunks.
If the query contains operators, aggregations, or functions, perform additional calculations on the retrieved samples.
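As a concrete illustration of the steps above, consider a query of the following shape (the metric and label names here are generic examples, not taken from the article):

```promql
# Step 1: select the blocks overlapping the query window (here, the last hour).
# Steps 2-3: resolve the matchers {job="api", status="500"} through the postings
#            index to series, then through the series index to chunks.
# Step 4: read the samples from those chunks.
# Step 5: evaluate rate() and sum by () over the retrieved samples.
sum by (instance) (rate(http_requests_total{job="api", status="500"}[1h]))
```

The more series the matchers select, the more work steps 2-5 must do, which is why cardinality dominates query latency.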
Detailed Investigation:
The author notes that larger time ranges, higher label cardinality, and the use of aggregations all increase query latency. Cardinality is the number of unique label value combinations for a metric. For the metric http_server_requests_seconds_count, cardinality reaches 147 610, far above the sub-10 000 range of metrics whose charts load quickly.
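To make "unique label value combinations" concrete, here is a minimal sketch in Python. The label sets are invented for illustration; the point is that cardinality is the product of the value counts, so one unbounded label (such as a URI with embedded IDs) multiplies the series count:

```python
from itertools import product

# Hypothetical label value sets (not from the article).
applications = ["app-a", "app-b"]                     # 2 values
uris = [f"/api/v1/item/{i}" for i in range(50)]       # 50 values: path labels explode cardinality
statuses = ["200", "404", "500"]                      # 3 values

# Each unique (application, uri, status) combination is a separate time series.
series = {combo for combo in product(applications, uris, statuses)}
print(len(series))  # 2 * 50 * 3 = 300 unique label combinations
```

Capping any one label's value set (for example, templating URIs or bucketing status codes) shrinks the product directly.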
PromQL used to compute per-label cardinality:

```promql
count(count by (label_name) (http_server_requests_seconds_count))
```

Overall metric cardinality query:

```promql
count({__name__="http_server_requests_seconds_count"})
```

High-cardinality metrics are the cause of the observed slow queries.
Optimization via Recording Rules:
Recording rules pre-compute expensive expressions and store the results as new time series. The original query:

```promql
sum(rate(http_server_requests_seconds_count{application="$application",cluster=~"$cluster",uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (uri)
```

is replaced by a recording rule:

```yaml
record: http_server_requests_seconds_count:rate:1m:acu
expr: sum(rate(http_server_requests_seconds_count{uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (application,cluster,uri)
```

After creating the rule, historical data can be back-filled using promtool tsdb create-blocks-from rules (example command shown in the article). The rule's cardinality dropped from ~147 000 to 4 878, and the optimized query becomes:

```promql
sum(http_server_requests_seconds_count:rate:1m:acu{application="$application",cluster=~"$cluster"}) by (uri)
```

Performance tests show a significant latency reduction, especially for longer time ranges.
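For context, a complete rules file of the shape Prometheus expects might look like the sketch below. The group name and evaluation interval are assumptions, not values from the article; the backfill command's time range and URL are placeholders:

```yaml
groups:
  - name: http_request_rates          # group name is an assumption
    interval: 1m                      # evaluate at the same resolution as the rate window
    rules:
      - record: http_server_requests_seconds_count:rate:1m:acu
        expr: sum(rate(http_server_requests_seconds_count{uri!~"/actuator/.*|/\*.*|root|/|/health/check"}[1m])) by (application,cluster,uri)
```

```shell
# Backfill history for the new rule (times and URL are placeholders).
promtool tsdb create-blocks-from rules \
  --start 2023-01-01T00:00:00Z \
  --end   2023-01-08T00:00:00Z \
  --url   http://localhost:9090 \
  rules.yaml
# The generated blocks must then be moved into Prometheus's data directory.
```

Backfilling matters because a recording rule only produces data from the moment it is created; without it, dashboards switched to the new series would show gaps for older time ranges.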
Conclusion: Excessive metric cardinality leads to slow or failing Prometheus charts. Using recording rules, adjusting scrape intervals, and pruning unused series are effective mitigation strategies. Metric design should consider cardinality early, e.g., grouping HTTP status codes into classes (1XX-5XX) to keep cardinality low.
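The status-code grouping can also be done at query time with PromQL's label_replace, as a sketch. This assumes the metric carries a numeric status label named "status" (a label name not confirmed by the article); the cleaner fix is still to emit the class as a label at instrumentation or relabeling time:

```promql
# Collapse individual HTTP status codes (200, 404, ...) into classes (2XX, 4XX, ...)
# before aggregating. "([0-9]).." must fully match the 3-digit code; "${1}XX"
# substitutes the first digit captured by the regex.
sum by (status_class) (
  label_replace(
    rate(http_server_requests_seconds_count[1m]),
    "status_class", "${1}XX", "status", "([0-9]).."
  )
)
```

Note that label_replace only rewrites labels on the query result; the underlying stored series keep their full cardinality, so this helps dashboard readability but not storage or index size.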
iQIYI Technical Product Team