Operations 9 min read

Avoid Common Prometheus Pitfalls: Best Practices for Reliable Monitoring

This article shares practical Prometheus best‑practice tips, covering the accuracy‑reliability trade‑off, self‑monitoring setups, avoiding NFS storage, pruning high‑cardinality metrics, handling rate‑function traps, alert‑graph mismatches, group_interval effects, and the overarching goal of stable, cost‑effective observability.

dbaplus Community

Oct 28, 2019

Avoid Common Prometheus Pitfalls: Best Practices for Reliable Monitoring

Balancing Accuracy and Reliability

Prometheus trades some metric accuracy for simplicity and reliability; short‑term spikes may be missed and values like QPS, P95, P99 are estimates. Users should understand this trade‑off early and communicate it.

Self‑Monitoring Prometheus

Deploy at least two independent Prometheus instances that scrape each other’s metrics. Also ensure Alertmanager is highly available or use a “dead‑man’s switch” alert that always fires to detect alert‑chain failures.

Avoid NFS for Storage

Prometheus does not support NFS; using it can corrupt data and cause loss of historical metrics.

Eliminate High‑Cardinality Metrics Early

High‑cardinality labels consume >50% storage and >80% CPU. Identify them with an alert such as count by (__name__)({__name__=~".+"}) > 10000 and drop offending labels via metric_relabel_configs.

Rate Function and Recording Rule Pitfalls

Applying rate() after a sum() or other aggregation can produce spikes because the result is no longer a pure counter. Record the final value directly in a Recording Rule instead of creating an intermediate metric.

Alert vs. Graph Mismatch

Alert evaluation intervals differ from graph sampling intervals, so alerts may fire while graphs appear normal, or vice‑versa. Reduce graph sampling interval or create a Recording Rule for complex alerts to improve visibility.

group_interval Affects Resolved Notifications

Alertmanager’s group_interval applies to both firing and resolved alerts, causing resolved notifications to be delayed. Adjusting the interval or patching the source can change this behavior.

Final Reminder

Monitoring’s purpose is to keep the business stable, not to chase metric or alert count. Developers should adopt an SRE mindset, focusing on reliability and cost‑effective observability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Operations Alerting best practices Prometheus

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.