Avoid These 6 Common Prometheus Mistakes When Getting Started

This guide translates and condenses six frequent mistakes new Prometheus users make: adding high‑cardinality labels, losing valuable labels during aggregation, using bare selectors, omitting the for field in alert rules, choosing rate windows that are too short, and applying rate‑related functions to the wrong metric types. Each comes with a practical fix to improve monitoring reliability.

Efficient Ops

Dec 24, 2023

This article is translated from https://promlabs.com/blog/2022/12/11/avoid-these-6-mistakes-when-getting-started-with-prometheus. The author’s summary of common Prometheus pitfalls is presented here for review and self‑reflection.

Mistake 1: High‑Cardinality Explosion

Prometheus identifies each time series by its metric name and its set of labels. This is flexible, but it can cause severe performance problems (including OOM) when a label's set of values is unbounded. Adding a high‑cardinality label such as a unique user ID creates a separate series for each value.

Example of a low‑cardinality metric:

http_requests_total{method="POST"}
http_requests_total{method="GET"}
http_requests_total{method="PUT"}
http_requests_total{method="DELETE"}

Adding a user_id label creates many series:

http_requests_total{method="POST",user_id="1"}
http_requests_total{method="POST",user_id="2"}
... (many more) ...
http_requests_total{method="GET",user_id="16434313"}

When the number of distinct users is large, memory usage spikes and can lead to OOM. Avoid high‑cardinality values such as public IPs, email addresses, full HTTP request paths with dynamic IDs, and process IDs unless they form a limited set. Use placeholders (e.g., /api/users/{user_id}/posts/{post_id}) to reduce cardinality.
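The placeholder idea can be sketched in Python as a small normalization step applied before a path is used as a label value. This is a minimal sketch, not part of the original article: the function name normalize_path, the generic {id} placeholder, and the rule "numeric or UUID segments are dynamic" are all assumptions chosen for illustration.

```python
import re

# Assumption: path segments that are purely numeric or look like UUIDs are
# dynamic IDs. Real applications may need other patterns.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Replace dynamic path segments with a fixed placeholder.

    The result has bounded cardinality and is safer to use as a label value.
    """
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")  # hypothetical generic placeholder
        else:
            segments.append(segment)
    return "/".join(segments)


if __name__ == "__main__":
    print(normalize_path("/api/users/42/posts/7"))
```

Where the web framework exposes the matched route template, using that template directly is usually simpler than pattern matching on the raw path.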

Mistake 2: Losing Valuable Labels During Aggregation

When writing alert rules, aggregations like sum() drop all labels by default, which can remove useful routing information such as the job label. Preserve needed labels with sum by(job) or use sum without(instance, type) to exclude only unwanted labels.
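To make the difference concrete, the following Python sketch imitates how the three aggregation forms group series. It is only a toy model of the label handling, with hypothetical helper names sum_by and sum_without and made-up sample data; it is not how Prometheus is implemented.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Toy model of `sum by(...)`: keep only the listed labels."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


def sum_without(series, drop):
    """Toy model of `sum without(...)`: keep every label except the listed ones."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


# Made-up input: three series of the same metric.
series = [
    ({"job": "api", "instance": "a"}, 1.0),
    ({"job": "api", "instance": "b"}, 2.0),
    ({"job": "db", "instance": "c"}, 4.0),
]

if __name__ == "__main__":
    print(sum_by(series, set()))             # plain sum(): no labels survive
    print(sum_by(series, {"job"}))           # sum by(job): job is preserved
    print(sum_without(series, {"instance"})) # sum without(instance): same here
```

With a plain sum the result carries no labels at all, so an alert built on it cannot be routed by job. The by and without forms both keep job in this example; without has the advantage that labels added later are preserved automatically.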

Mistake 3: Using Bare Selectors

Writing PromQL queries without restricting the selector (e.g., rate(errors_total[5m]) > 10) may pull in data from unrelated jobs that happen to expose the same metric name, causing false alerts and unnecessary query load. Always scope queries with a label matcher, for example rate(errors_total{job="my-job"}[5m]) > 10.

Mistake 4: Omitting the for Field in Alert Rules

The for field defines how long a condition must persist before an alert fires, helping to filter out transient spikes. Example without for:

alert: InstanceDown
expr: up == 0

Improved rule with for:

alert: InstanceDown
expr: up == 0
for: 5m

Adding for to most alerts makes them more robust, though it may increase detection latency.
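The effect of for can be illustrated with a small Python simulation. This is a simplified sketch under assumptions: the function name firing_times and the evaluation data are hypothetical, and the model only captures the rule that the condition must hold at every evaluation for at least the for duration before the alert fires.

```python
def firing_times(evaluations, for_s):
    """Return the evaluation timestamps at which the alert would be firing.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    for_s: the `for` duration in seconds (0 models a rule without `for`).
    """
    firing = []
    active_since = None  # when the condition first became true, if it still is
    for ts, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            continue
        if active_since is None:
            active_since = ts
        if ts - active_since >= for_s:
            firing.append(ts)
    return firing


# Made-up evaluations every 60 s: a one-off blip at t=0, then a real outage.
evals = [(0, True), (60, False), (120, True), (180, True),
         (240, True), (300, True), (360, True), (420, True)]

if __name__ == "__main__":
    print(firing_times(evals, 0))    # no `for`: fires on the blip as well
    print(firing_times(evals, 300))  # for: 5m fires only on the sustained outage
```

The blip at t=0 triggers the rule without for but is filtered out with for: 5m, at the cost of the outage being reported five minutes after it starts.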

Mistake 5: Using Too‑Short Rate Windows

Rate functions need at least two samples within the window. If the window is not comfortably larger than the scrape interval, it may contain fewer than two samples and the function returns no data. Choose a window at least four times the scrape interval to tolerate occasional scrape failures and misalignment between scrape times and the window boundaries.

For example, with a 15 s scrape interval, a 20 s window often contains only one sample and yields no result, while a 4× window (60 s) normally contains four samples and still has two left after a couple of failed scrapes.
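The arithmetic behind the 4× guideline can be checked with a short Python sketch. It is a simplified model under stated assumptions: scrapes happen at perfectly regular times, the window is treated as (eval_time - window, eval_time], and the function names are hypothetical.

```python
def recommended_rate_window(scrape_interval_s, factor=4):
    """Rate window following the 'at least 4x the scrape interval' guideline."""
    return factor * scrape_interval_s


def samples_in_window(eval_time_s, window_s, scrape_interval_s,
                      first_scrape_s=0.0, failed_scrapes=()):
    """Count samples that fall inside (eval_time - window, eval_time].

    failed_scrapes holds the indexes (0, 1, 2, ...) of scrapes that produced
    no sample. Boundary handling is simplified compared to Prometheus.
    """
    count = 0
    k = 0
    while True:
        t = first_scrape_s + k * scrape_interval_s
        if t > eval_time_s:
            break
        if t > eval_time_s - window_s and k not in failed_scrapes:
            count += 1
        k += 1
    return count


if __name__ == "__main__":
    # 15 s scrape interval, query evaluated at t=100 s.
    print(samples_in_window(100, 20, 15))   # only the scrape at t=90 s
    print(samples_in_window(100, 60, 15))   # scrapes at 45, 60, 75 and 90 s
    print(samples_in_window(100, 60, 15, failed_scrapes=(5, 6)))  # 45 and 60 s remain
```

A rate function needs a count of at least two, so the 20 s window gives no result here, while the 60 s window survives two consecutive failed scrapes.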

Mistake 6: Applying Rate‑Related Functions to Wrong Metric Types

rate(), irate(), and increase() are designed for counter metrics, which only increase. Using them on gauges (e.g., memory usage) leads to incorrect results because decreases are interpreted as counter resets. deriv() works on gauges but should not be used on counters, as it lacks reset compensation and can produce negative values.
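The reset compensation can be imitated in a few lines of Python to show why gauges give misleading results. This is a rough sketch with a hypothetical function name and made-up samples; it only models "any decrease is a reset to zero" and leaves out the extrapolation to window boundaries that Prometheus also performs.

```python
def increase_with_resets(samples):
    """Sum the increase over consecutive samples, treating any drop as a reset.

    When a value decreases, the counter is assumed to have restarted from 0
    and climbed to the new value, so the whole new value counts as increase.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # assumed reset: 0 -> cur
    return total


if __name__ == "__main__":
    # Counter that restarts after 110: 10 before the reset, 5 + 10 after it.
    print(increase_with_resets([100, 110, 5, 15]))
    # Gauge (e.g. memory in MB) that really went from 500 down to 450.
    print(increase_with_resets([500, 400, 450]))
```

For the counter the result of 25 matches what actually happened. For the gauge the drop from 500 to 400 is mistaken for a reset, so the sketch reports an increase of 450 even though the value fell by 50 overall.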

To avoid these mistakes, verify metric types before applying functions, and consider tools like PromLens to help detect mismatches.

Conclusion

The six points above highlight frequent pitfalls for newcomers to Prometheus and provide practical tips to improve monitoring setups.
