Mastering Prometheus: Essential Best Practices and Common Pitfalls

This article shares practical Prometheus monitoring tips, covering accuracy trade‑offs, self‑monitoring setups, storage choices, high‑cardinality metric handling, rate() pitfalls, alert‑graph mismatches, Alertmanager timing issues, and the core purpose of observability for stable business delivery.

Efficient Ops

Feb 25, 2019

Balancing Accuracy and Reliability

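Prometheus deliberately trades some accuracy for reliability. Samples are collected at scrape intervals and functions such as rate() return estimates, so it suits trend monitoring and alerting rather than use cases that need exact figures, such as per‑request billing. In return, the architecture is simpler and cheaper to operate than that of log‑based systems.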

Self‑Monitoring Is Essential

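Deploy at least two independent Prometheus instances that scrape each other’s metrics, so that each can alert when the other goes down. Make Alertmanager highly available as well, or add a “dead‑man’s switch”: an alert that always fires, whose absence at the receiving end signals that the alerting pipeline itself has failed.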
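As a minimal sketch of the idea above, the rule file below defines an always-firing heartbeat alert and a peer-down alert. The group name, alert names, severity values, and the job="prometheus" label are illustrative assumptions, not taken from the original article.

```yaml
# Rule file loaded by Prometheus through rule_files.
groups:
  - name: meta-monitoring            # hypothetical group name
    rules:
      # Always-firing alert: vector(1) returns a sample on every evaluation.
      - alert: DeadMansSwitch        # conventional name, not required by Prometheus
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Alerting pipeline heartbeat; this alert should always be firing.
      # Fires when the peer Prometheus instance cannot be scraped.
      - alert: PrometheusPeerDown
        expr: up{job="prometheus"} == 0   # assumes the peer is scraped under job="prometheus"
        for: 5m
        labels:
          severity: critical
```

The heartbeat only helps if something outside Prometheus and Alertmanager notices when it stops arriving, for example an external service that raises an incident after a period of silence. That receiver is not shown here.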

Avoid NFS for Storage

Prometheus’s local storage does not support NFS. Running the data directory on NFS can corrupt the data and lose historical metrics, so use a local disk instead.

Eliminate High‑Cardinality Metrics Early

High‑cardinality metrics, those with a very large number of label combinations, consume the majority of storage and CPU. Use alert rules to identify them early, then drop them with metric_relabel_configs as soon as possible.
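A sketch of both steps, written as two separate YAML documents: an alert rule that flags metric names with many series, and a scrape configuration that drops one such metric. The threshold, job name, target address, and metric name are hypothetical placeholders.

```yaml
# Rule file: flag metric names with an unusually large number of series.
groups:
  - name: cardinality                # hypothetical group name
    rules:
      - alert: HighCardinalityMetric
        # Counts series per metric name; 10000 is an arbitrary example threshold.
        # This query touches every series, so it can be expensive on large servers.
        expr: count by (__name__) ({__name__=~".+"}) > 10000
        for: 15m
        labels:
          severity: warning
---
# prometheus.yml fragment: drop an offending metric after the scrape, before storage.
scrape_configs:
  - job_name: app                    # hypothetical job
    static_configs:
      - targets: ["app.example.internal:8080"]   # hypothetical target
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: http_request_path_duration_seconds_bucket   # hypothetical metric name
        action: drop
```

Dropping in metric_relabel_configs saves storage but the target still exposes the metric, so fixing the instrumentation at the source remains the longer-term remedy.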

Beware of rate() with Recording Rules

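Applying rate() to a series that has already been summed or otherwise transformed can produce spikes, because the result is no longer a pure counter: when one underlying counter resets or disappears, the sum drops, and rate() treats that drop as a reset of the whole series. Take rate() on the raw counters first and aggregate afterwards, so the recording rule stores the final value directly.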
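The ordering can be sketched as a recording rule like the following. The metric http_requests_total, the job label, and the group name are assumptions chosen for illustration.

```yaml
groups:
  - name: http-rates                 # hypothetical group name
    rules:
      # Take rate() on the raw counters first, then aggregate.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Avoid the reverse order: recording sum(http_requests_total) and then
      # calling rate() on that recorded series hides per-instance counter resets.
```

Dashboards and alerts can then read job:http_requests:rate5m directly instead of applying rate() a second time.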

Alert and Graph Mismatch

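Graphs and alerts sample the data differently: a graph evaluates the query at each step of the displayed time range, while an alert rule is evaluated at its own interval. As a result, an alert can fire while the graph looks normal, or vice versa. Reduce the scrape interval, or create recording rules for complex alert expressions.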
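One way to apply the recording-rule suggestion, sketched under assumptions: precompute the complex expression once, then point both the alert and the dashboard at the recorded series. The metric name, threshold, interval, and group name below are hypothetical.

```yaml
groups:
  - name: api-latency                # hypothetical group name
    interval: 30s                    # evaluation interval for this group
    rules:
      # Precompute the expensive expression so alerts and graphs read the same series.
      - record: job:request_latency_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(request_latency_seconds_bucket[5m])))
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:p99_5m > 0.5   # 0.5s is an example threshold
        for: 5m
```

A graph of the recorded series can still skip short excursions if its step is much larger than the rule interval, so this narrows the mismatch rather than removing it.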

Alertmanager group_interval Affects Resolved Notifications

The group_interval setting also delays resolved notifications, because they follow the same group timing as firing alerts. Shortening the interval or patching the Alertmanager source can mitigate the delay.
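For orientation, a minimal Alertmanager routing fragment showing where the timing settings live. The receiver name, webhook URL, grouping labels, and the specific durations are illustrative assumptions rather than recommended values.

```yaml
route:
  receiver: default                  # hypothetical receiver name
  group_by: [alertname, cluster]
  group_wait: 30s        # wait before the first notification for a new group
  group_interval: 1m     # wait before notifying about changes to the group, including resolved alerts
  repeat_interval: 4h    # wait before re-sending an unchanged, still-firing group
receivers:
  - name: default
    webhook_configs:
      - url: http://alert-gateway.example.internal/hook   # hypothetical endpoint
        send_resolved: true
```

A shorter group_interval brings resolved notifications sooner, at the cost of more frequent notifications whenever a group changes.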

Never Forget the Why

Monitoring should serve business stability and rapid iteration, not the sheer number of metrics or alerts. Focus on meaningful coverage and maintain an SRE mindset.

Original article: https://aleiwu.com/post/prometheus-bp/
