Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability
This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.
Prometheus is an open‑source monitoring system that has become the de‑facto standard for metric monitoring in cloud‑native environments, with most Kubernetes core components exposing metrics in its format.
Having used Prometheus extensively at work, I find it easy to maintain, simple, and low‑cost, though it does have some pitfalls worth noting.
https://prometheus.io/docs/introduction/overview/
1. Balancing Accuracy and Reliability
Prometheus, as a metric‑based system, sacrifices some data accuracy for higher reliability, resulting in a simple architecture, straightforward data model, and easy operations. Compared with log‑based systems, metrics require far fewer resources.
2. Implement Self‑Monitoring
Who monitors Prometheus itself? The answer is another monitoring system—often a second Prometheus instance. Deploy at least two independent Prometheus servers that scrape each other’s metrics, and also monitor the Alertmanager.
Use a “dead‑man’s switch” alert that always fires; if it stops, the alerting chain is broken.
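A minimal sketch of such an always-firing rule, in the style used by kube-prometheus's "Watchdog" alert (group, alert name, and annotation text here are illustrative, not from the original article):

```yaml
groups:
  - name: meta-monitoring
    rules:
      # This alert fires unconditionally. Route it to a service that
      # expects a regular heartbeat; silence from that service means
      # the Prometheus -> Alertmanager -> receiver chain is broken.
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing heartbeat for the alerting pipeline."
```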
3. Avoid NFS for Storage
Prometheus does not support NFS storage (see issue https://github.com/prometheus/prometheus/issues/3534). Using NFS can lead to data loss, as we have experienced.
4. Eliminate High‑Cardinality Metrics Early
Over half of storage and 80% of CPU usage are often consumed by a few high‑cardinality metrics, which can cause OOM crashes. Identify such “bad” metrics with alert rules and drop the offending labels in the scrape configuration.
Example alert rule:

<code># Alert when a metric has more than 10,000 time series
count by (__name__)({__name__=~".+"}) > 10000</code>

After the alert fires, use <code>metric_relabel_configs</code> in the scrape configuration to drop the problematic labels.
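A sketch of such a scrape-time label drop (the job name, target, and the `request_id` label are hypothetical placeholders for whatever high-cardinality label you identified):

```yaml
scrape_configs:
  - job_name: app            # hypothetical job
    static_configs:
      - targets: ["app:9090"]
    metric_relabel_configs:
      # Remove the offending label before samples are ingested,
      # collapsing its many series into far fewer.
      - action: labeldrop
        regex: request_id    # hypothetical high-cardinality label
```

Note that `metric_relabel_configs` runs after the scrape but before storage, so dropped labels never reach the TSDB; dropping them later cannot reclaim cardinality already ingested.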
5. Beware of Rate() and Recording Rule Interactions
Applying <code>rate()</code> after aggregations like <code>sum()</code> can produce incorrect results, because aggregation hides counter resets: the summed series no longer behaves like a Counter, so <code>rate()</code> misinterprets any dip as a reset. Apply <code>rate()</code> first and aggregate afterwards, and prefer recording rules that compute the final value directly rather than storing intermediate aggregates that are later fed to <code>rate()</code>.
Example problematic recording rule:

<code>sum(old_metric) without (bad_label)</code>

When the resulting metric is later used with <code>rate()</code>, unexpected spikes appear because counter resets are mis-detected.
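A safer formulation applies <code>rate()</code> to the raw counter first and aggregates afterwards. A sketch as a recording rule, reusing the metric and label names from the example above (the record name follows the common `level:metric:operations` convention):

```yaml
groups:
  - name: recording
    rules:
      # rate() sees each raw counter individually, so resets are
      # handled correctly; only then are the series summed.
      - record: job:old_metric:rate5m
        expr: sum without (bad_label) (rate(old_metric[5m]))
```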
6. Alert and Graph Mismatch
Alert evaluation times and graph sampling times differ, especially when the graph’s scrape interval is large. This can make alerts appear out of sync with the displayed trend. Align intervals or use recording rules to simplify alerts.
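One way to keep alerts and dashboards in sync is to record the shared expression once and point both at the recorded series. A sketch, with hypothetical metric names:

```yaml
groups:
  - name: http-errors
    rules:
      # Both the Grafana panel and the alert rule query this
      # precomputed series, so they cannot diverge in how the
      # underlying expression is evaluated.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```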
7. Alertmanager’s group_interval Affects Resolved Notifications
The <code>group_interval</code> setting controls how long Alertmanager waits before sending an updated notification for an alert group that has already fired. Since firing and resolved notifications share the same group, a resolved notification may be delayed by up to <code>group_interval</code>, making truly immediate “resolved” notifications hard to achieve.
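The relevant knobs sit in the Alertmanager routing tree. A minimal sketch (receiver name and values are illustrative; shortening <code>group_interval</code> speeds up resolved notices but also increases notification churn for still-firing groups):

```yaml
route:
  receiver: default          # hypothetical receiver
  group_by: ["alertname"]
  group_wait: 30s            # delay before the first notification for a new group
  group_interval: 5m         # updates (including resolved) can lag by up to this long
  repeat_interval: 4h        # re-send cadence for still-firing alerts
```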
8. Remember the Purpose of Monitoring
The core goal of monitoring is to safeguard business stability and enable rapid iteration, not to chase arbitrary counts of metrics or alerts. Adopt an SRE mindset: focus on reliability and cost‑effectiveness rather than sheer coverage.
Original source: https://aleiwu.com/post/prometheus-bp/
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.