Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability
This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.
Prometheus is an open‑source monitoring system that has become the de‑facto standard for metric monitoring in cloud‑native environments, with most Kubernetes core components exposing metrics in its format.
Having used Prometheus extensively at work, I find it easy to maintain, simple, and low‑cost, though it does have some pitfalls worth noting.
https://prometheus.io/docs/introduction/overview/
1. Balancing Accuracy and Reliability
Prometheus, as a metric‑based system, sacrifices some data accuracy for higher reliability, resulting in a simple architecture, straightforward data model, and easy operations. Compared with log‑based systems, metrics require far fewer resources.
2. Implement Self‑Monitoring
Who monitors Prometheus itself? The answer is another monitoring system—often a second Prometheus instance. Deploy at least two independent Prometheus servers that scrape each other’s metrics, and also monitor the Alertmanager.
Use a “dead‑man’s switch” alert that always fires; if it stops, the alerting chain is broken.
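A minimal sketch of such an always-firing rule, in the style used by kube-prometheus's "Watchdog" alert (group, alert name, and annotation text here are illustrative, not from the original article):

```yaml
groups:
  - name: meta-monitoring
    rules:
      # This alert fires unconditionally. Route it to a service that
      # expects a regular heartbeat; silence from that service means
      # the Prometheus -> Alertmanager -> receiver chain is broken.
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing heartbeat for the alerting pipeline."
```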
3. Avoid NFS for Storage
Prometheus does not support NFS storage (see issue https://github.com/prometheus/prometheus/issues/3534). Using NFS can lead to data loss, as we have experienced.
4. Eliminate High‑Cardinality Metrics Early
Over half of storage and 80% of CPU usage are often consumed by a few high‑cardinality metrics, which can cause OOM crashes. Identify such “bad” metrics with alert rules and drop the offending labels in the scrape configuration.
Example alert rule:

<code># Alert when a metric has more than 10,000 time series
count by (__name__)({__name__=~".+"}) > 10000</code>

After the alert fires, use <code>metric_relabel_configs</code> in the scrape configuration to drop the problematic labels.
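A sketch of such a scrape-time label drop (the job name, target, and the `request_id` label are hypothetical placeholders for whatever high-cardinality label you identified):

```yaml
scrape_configs:
  - job_name: app            # hypothetical job
    static_configs:
      - targets: ["app:9090"]
    metric_relabel_configs:
      # Remove the offending label before samples are ingested,
      # collapsing its many series into far fewer.
      - action: labeldrop
        regex: request_id    # hypothetical high-cardinality label
```

Note that `metric_relabel_configs` runs after the scrape but before storage, so dropped labels never reach the TSDB; dropping them later cannot reclaim cardinality already ingested.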
5. Beware of Rate() and Recording Rule Interactions
Applying <code>rate()</code> after aggregations like <code>sum()</code> can produce incorrect results, because aggregation hides counter resets: the summed series no longer behaves like a Counter, so <code>rate()</code> misinterprets any dip as a reset. Apply <code>rate()</code> first and aggregate afterwards, and prefer recording rules that compute the final value directly rather than storing intermediate aggregates that are later fed to <code>rate()</code>.
Example problematic recording rule:

<code>sum(old_metric) without (bad_label)</code>

When the resulting metric is later used with <code>rate()</code>, unexpected spikes appear because counter resets are mis-detected.
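A safer formulation applies <code>rate()</code> to the raw counter first and aggregates afterwards. A sketch as a recording rule, reusing the metric and label names from the example above (the record name follows the common `level:metric:operations` convention):

```yaml
groups:
  - name: recording
    rules:
      # rate() sees each raw counter individually, so resets are
      # handled correctly; only then are the series summed.
      - record: job:old_metric:rate5m
        expr: sum without (bad_label) (rate(old_metric[5m]))
```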
6. Alert and Graph Mismatch
Alert evaluation times and graph sampling times differ, especially when the graph’s scrape interval is large. This can make alerts appear out of sync with the displayed trend. Align intervals or use recording rules to simplify alerts.
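One way to keep alerts and dashboards in sync is to record the shared expression once and point both at the recorded series. A sketch, with hypothetical metric names:

```yaml
groups:
  - name: http-errors
    rules:
      # Both the Grafana panel and the alert rule query this
      # precomputed series, so they cannot diverge in how the
      # underlying expression is evaluated.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```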
7. Alertmanager’s group_interval Affects Resolved Notifications
The <code>group_interval</code> setting controls how long Alertmanager waits before sending an updated notification for an alert group that has already fired. Since firing and resolved notifications share the same group, a resolved notification may be delayed by up to <code>group_interval</code>, making truly immediate “resolved” notifications hard to achieve.
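The relevant knobs sit in the Alertmanager routing tree. A minimal sketch (receiver name and values are illustrative; shortening <code>group_interval</code> speeds up resolved notices but also increases notification churn for still-firing groups):

```yaml
route:
  receiver: default          # hypothetical receiver
  group_by: ["alertname"]
  group_wait: 30s            # delay before the first notification for a new group
  group_interval: 5m         # updates (including resolved) can lag by up to this long
  repeat_interval: 4h        # re-send cadence for still-firing alerts
```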
8. Remember the Purpose of Monitoring
The core goal of monitoring is to safeguard business stability and enable rapid iteration, not to chase arbitrary counts of metrics or alerts. Adopt an SRE mindset: focus on reliability and cost‑effectiveness rather than sheer coverage.
Original source: https://aleiwu.com/post/prometheus-bp/
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.