
Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Efficient Ops

Prometheus is an open‑source monitoring system that has become the de‑facto standard for metric monitoring in cloud‑native environments, with most Kubernetes core components exposing metrics in its format.

Having used Prometheus extensively at work, I find it easy to maintain, simple, and low‑cost, though it does have some pitfalls worth noting.

https://prometheus.io/docs/introduction/overview/

1. Balancing Accuracy and Reliability

Prometheus, as a metric‑based system, sacrifices some data accuracy for higher reliability, resulting in a simple architecture, straightforward data model, and easy operations. Compared with log‑based systems, metrics require far fewer resources.

2. Implement Self‑Monitoring

Who monitors Prometheus itself? The answer is another monitoring system—often a second Prometheus instance. Deploy at least two independent Prometheus servers that scrape each other’s metrics, and also monitor the Alertmanager.

Use a “dead‑man’s switch” alert that always fires; if it stops, the alerting chain is broken.
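A minimal sketch of such a rule (the alert name, labels, and annotation text are illustrative, not from the original article); route it to an external service that pages you when the notification stream stops:

```yaml
# Prometheus alerting rule file: an alert that is always firing.
groups:
  - name: meta
    rules:
      - alert: DeadMansSwitch
        # vector(1) always returns a value, so the alert never resolves.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing alert that verifies the alerting pipeline is alive"
```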

3. Avoid NFS for Storage

Prometheus does not support NFS storage (see issue https://github.com/prometheus/prometheus/issues/3534). Using NFS can lead to data loss, as we have experienced.

4. Eliminate High‑Cardinality Metrics Early

Over half of storage and 80% of CPU usage are often consumed by a few high‑cardinality metrics, which can cause OOM crashes. Identify such “bad” metrics with alert rules and drop the offending labels in the scrape configuration.

Example alert rule:

```
# Alert when a metric has more than 10,000 time series
count by (__name__)({__name__=~".+"}) > 10000
```

After the alert fires, use metric_relabel_configs in the scrape configuration to drop the problematic labels.
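A sketch of what that scrape configuration might look like; the job name, target, and label/metric names here are hypothetical placeholders:

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ["app:9090"]
    metric_relabel_configs:
      # Drop a label that explodes cardinality (e.g. a per-request ID).
      - regex: request_id
        action: labeldrop
      # Or drop the offending metric entirely.
      - source_labels: [__name__]
        regex: huge_histogram_metric.*
        action: drop
```

Note that metric_relabel_configs runs after the scrape, so it protects storage but not scrape bandwidth; fixing the exporter is still the long-term cure.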

5. Beware of rate() and Recording-Rule Interactions

Applying rate() after aggregations like sum() can produce incorrect results, because the aggregated series no longer behaves like a counter: when any one of the underlying counters resets, the sum merely dips, and rate() mis-detects that dip as a full counter reset. Prefer recording rules that compute the final value directly (rate first, then aggregate), rather than storing intermediate aggregates that are later fed to rate().

Example of a problematic recording rule:

```
sum(old_metric) without (bad_label)
```

When the resulting metric is used with rate(), unexpected spikes appear because counter resets are mis-detected.
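The fix is to take the rate per raw series first and aggregate afterwards. A hedged sketch, reusing the metric and label names from the example above (the rule name and 5m window are illustrative choices):

```yaml
groups:
  - name: example
    rules:
      # rate() per raw series first, then aggregate -- not the other way round.
      - record: old_metric:rate5m
        expr: sum without (bad_label) (rate(old_metric[5m]))
```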

6. Alert and Graph Mismatch

Alert evaluation times and graph sampling times differ, especially when the graph’s scrape interval is large. This can make alerts appear out of sync with the displayed trend. Align intervals or use recording rules to simplify alerts.
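One sketch of the recording-rule approach (all names and thresholds here are illustrative): precompute the expression once, then point both the alert and the dashboard at the same recorded series, so they cannot drift apart.

```yaml
groups:
  - name: example
    rules:
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      - alert: HighErrorRate
        # Alerting on the recorded series keeps graphs and alerts consistent.
        expr: job:http_errors:rate5m > 0.1
        for: 10m
```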

7. Alertmanager’s group_interval Affects Resolved Notifications

The group_interval setting controls how frequently notifications for alerts in the same group are sent. Since firing and resolved notifications share the same group, a resolved notification can be delayed by up to group_interval, making truly immediate resolution notifications hard to achieve.
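For reference, the relevant Alertmanager route settings look roughly like this (receiver name and durations are illustrative); a shorter group_interval speeds up resolved notifications at the cost of noisier grouping:

```yaml
route:
  receiver: oncall
  group_by: [alertname, cluster]
  group_wait: 30s       # delay before the first notification of a new group
  group_interval: 5m    # minimum gap between notifications for the same group,
                        # which also delays "resolved" notifications
  repeat_interval: 4h   # re-send interval for still-firing alerts
```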

8. Remember the Purpose of Monitoring

The core goal of monitoring is to safeguard business stability and enable rapid iteration, not to chase arbitrary counts of metrics or alerts. Adopt an SRE mindset: focus on reliability and cost‑effectiveness rather than sheer coverage.

Original source: https://aleiwu.com/post/prometheus-bp/

Tags: monitoring, cloud native, operations, observability, alerting, best practices, Prometheus
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on the operations transformation and aim to accompany you, and grow with you, throughout your operations career.
