Mastering Prometheus: Essential Best Practices and Common Pitfalls
This article shares practical Prometheus monitoring tips, covering accuracy trade‑offs, self‑monitoring setups, storage choices, high‑cardinality metric handling, rate() pitfalls, alert‑graph mismatches, Alertmanager timing issues, and the core purpose of observability for stable business delivery.
Balancing Accuracy and Reliability
Prometheus deliberately trades some metric accuracy for reliability: values are sampled at each scrape rather than recorded per event. In return, its architecture is simpler and its operational cost lower than those of log-based systems.
Self‑Monitoring Is Essential
Deploy at least two independent Prometheus instances that scrape each other’s metrics, and ensure Alertmanager is highly available or use a “dead‑man’s switch” alert to detect its failure.
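The receiving side of a dead-man's switch can be sketched as follows. This is a minimal illustration, not part of the original article: it assumes Prometheus has an always-firing alert (a common pattern uses the expression `vector(1)`) routed through Alertmanager to some external receiver, and that the receiver calls `heartbeat()` on each notification. The class name, the timeout, and the injectable clock are all hypothetical.

```python
import time


class DeadMansSwitch:
    """Trips when an always-firing heartbeat alert stops arriving."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def heartbeat(self):
        """Call whenever the heartbeat notification is received."""
        self.last_seen = self.clock()

    def is_tripped(self):
        """True if no heartbeat arrived within the timeout."""
        return self.clock() - self.last_seen > self.timeout_s


# Usage with a fake clock (hypothetical timings, in seconds).
now = [0.0]
switch = DeadMansSwitch(timeout_s=120, clock=lambda: now[0])
now[0] = 60.0
switch.heartbeat()
now[0] = 150.0
healthy_at_150 = not switch.is_tripped()   # 90 s since last heartbeat
now[0] = 181.0
tripped_at_181 = switch.is_tripped()       # 121 s since last heartbeat
```

The point of the design is that silence becomes the signal: if Prometheus or Alertmanager dies, the heartbeat stops, and a component outside the monitored stack notices.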
Avoid NFS for Storage
Prometheus does not support NFS; using it can corrupt data and cause loss of historical metrics.
Eliminate High‑Cardinality Metrics Early
High-cardinality metrics consume the majority of storage and CPU. Use alert rules to identify them, then drop them with metric_relabel_configs as early as possible.
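To make "identify" concrete, the sketch below counts distinct label sets per metric name and flags the names that exceed a budget. It is an offline illustration under assumed inputs: the metric names, the `user_id` label, and the budget of 100 series are hypothetical, and in a live setup the counting would be done by a query or alert rule rather than by a script.

```python
from collections import Counter


def series_per_metric(series):
    """Count distinct label sets (i.e. time series) per metric name."""
    unique = {(name, frozenset(labels.items())) for name, labels in series}
    return Counter(name for name, _ in unique)


def over_budget(series, max_series):
    """Return (metric name, series count) pairs above max_series, largest first."""
    counts = series_per_metric(series)
    return [(name, n) for name, n in counts.most_common() if n > max_series]


# Hypothetical scrape: a per-user label makes one metric explode.
scraped = [
    ("http_requests_total", {"method": "GET", "user_id": str(i)})
    for i in range(500)
]
scraped += [
    ("process_cpu_seconds_total", {"instance": "a"}),
    ("process_cpu_seconds_total", {"instance": "b"}),
]

offenders = over_budget(scraped, max_series=100)
```

Here the unbounded `user_id` label alone accounts for 500 series, which is the kind of metric worth dropping or relabeling before it reaches storage.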
Beware of rate() with Recording Rules
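Applying rate() to a series that has already been summed or otherwise transformed can produce spikes, because the result is no longer a pure counter: when one underlying counter resets, the aggregate drops only partway and rate() misreads the drop as a full reset. Apply rate() to the raw counters first and aggregate afterwards, so the recording rule stores the final value directly.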
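The effect is easy to reproduce with a small simulation. The sketch below uses a simplified rate() (evenly spaced samples, no extrapolation) that treats any decrease as a counter reset; the two instances and their sample values are hypothetical.

```python
def increase(samples):
    """Counter increase over samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A decrease means the counter restarted, so the new value is all growth.
        total += cur - prev if cur >= prev else cur
    return total


def rate(samples, interval_s):
    """Simplified per-second rate over evenly spaced samples (no extrapolation)."""
    return increase(samples) / (interval_s * (len(samples) - 1))


# Two hypothetical instances, each serving 1 request/s, scraped every 10 s.
instance_a = [1000, 1010, 1020, 1030, 1040]
instance_b = [5000, 5010, 0, 10, 20]  # restarts between the 2nd and 3rd scrape

# rate() per series, then sum: close to the true 2 requests/s.
sum_of_rates = rate(instance_a, 10) + rate(instance_b, 10)

# Sum first (as a recording rule might), then rate(): the partial drop
# from 6020 to 1020 is read as a reset, so 1020 is counted as new growth.
summed = [a + b for a, b in zip(instance_a, instance_b)]
rate_of_sum = rate(summed, 10)
```

With these numbers `sum_of_rates` is 1.75 and `rate_of_sum` is 27.0, a spike more than ten times the real traffic caused purely by the order of operations.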
Alert and Graph Mismatch
Differences in sampling intervals between time‑series graphs and alert evaluation can cause alerts to fire while graphs look normal, or vice‑versa. Reduce the scrape interval or create recording rules for complex alerts.
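A toy simulation shows how the two views can disagree. All values here are assumptions chosen for illustration: a per-second error ratio with a 20-second spike, an alert rule evaluated every 15 seconds, and a dashboard drawn with a 60-second step.

```python
def sample_at(series, start, end, step):
    """Pick the value every `step` seconds, as a graph or rule evaluation would."""
    return [series[t] for t in range(start, end, step)]


# Hypothetical error ratio, one value per second for 5 minutes, with a 20 s spike.
series = [0.01] * 300
for t in range(130, 150):
    series[t] = 0.50

THRESHOLD = 0.10

alert_points = sample_at(series, 0, 300, 15)  # rule evaluated every 15 s
graph_points = sample_at(series, 0, 300, 60)  # dashboard step of 60 s

alert_fires = any(v > THRESHOLD for v in alert_points)
graph_shows_spike = any(v > THRESHOLD for v in graph_points)
```

The evaluation at t=135 lands inside the spike, so the alert fires, while the graph samples at 120 and 180 step over it entirely and the dashboard looks flat.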
Alertmanager group_interval Affects Resolved Notifications
The group_interval setting delays resolved notifications because they share the same group timing as firing alerts; adjusting the interval or patching the source can mitigate the delay.
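The size of the delay can be estimated with a deliberately simplified model, which is an assumption rather than Alertmanager's actual implementation: the group sends its first notification after group_wait, then flushes every group_interval, and a resolved alert is only reported at the next flush. The function name and the timings are hypothetical.

```python
import math


def resolved_notification_time(first_alert_s, resolved_s, group_wait_s, group_interval_s):
    """Time of the first group flush at or after the alert resolves.

    Simplified model: the group flushes at first_alert + group_wait, then
    every group_interval after that. Assumes the alert resolves after the
    first notification has already been sent.
    """
    first_flush = first_alert_s + group_wait_s
    if resolved_s < first_flush:
        raise ValueError("model only covers alerts resolved after the first flush")
    ticks = math.ceil((resolved_s - first_flush) / group_interval_s)
    return first_flush + ticks * group_interval_s


# Alert fires at t=0 and resolves at t=100 (seconds).
with_5m_interval = resolved_notification_time(0, 100, group_wait_s=30, group_interval_s=300)
with_1m_interval = resolved_notification_time(0, 100, group_wait_s=30, group_interval_s=60)
```

Under this model a 5-minute group_interval holds the resolved notification until t=330, whereas a 1-minute interval sends it at t=150, at the cost of more frequent notifications for busy groups.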
Never Forget the Why
Monitoring should serve business stability and rapid iteration, not the sheer number of metrics or alerts. Focus on meaningful coverage and maintain an SRE mindset.
Original article: https://aleiwu.com/post/prometheus-bp/
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
