Mastering Prometheus in Kubernetes: Practical Tips, Exporter Guide, and Common Pitfalls
This article shares practical experiences with Prometheus in Kubernetes, covering core principles, limitations, common exporters, metric selection, capacity planning, high‑availability strategies, query optimization, and integration with Grafana, offering actionable guidance for building reliable, scalable monitoring solutions.
Prometheus is a modern open‑source monitoring system that has become the de‑facto standard in cloud‑native environments.
Key Principles
Monitoring is infrastructure; collect only the metrics you need so that resources are not wasted.
Only emit alerts that need to be handled.
A simple, reliable architecture is essential; the monitoring system must not fail when the business system does.
Limitations of Prometheus
Metric‑based only – not suitable for logs, events, or tracing.
Default pull model; plan network topology to avoid unnecessary forwarding.
No built‑in solution for horizontal scaling – choose between federation, Cortex, Thanos, etc.
Availability over consistency; occasional data loss is tolerated so that queries keep succeeding.
Functions like rate and histogram_quantile can produce unintuitive results; long‑range queries need down‑sampling.
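To illustrate why histogram_quantile can surprise, the sketch below re-implements its documented bucket interpolation in plain Python. It is a simplified approximation, not the Prometheus source: the function signature and the bucket values are assumptions made for illustration. The key point is that observations are assumed to be spread evenly inside each bucket, so the estimate depends on bucket boundaries rather than on the real values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch of PromQL-style linear interpolation; the input
    shape is an assumption chosen for this example.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket carries the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile lands in the +Inf bucket (or an empty one):
                # fall back to the last finite boundary.
                return prev_bound
            # Assume observations are uniformly spread inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets: 10 requests fall between 0.5s and 1.0s.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, example_buckets)
```

With these buckets the estimated 95th percentile is roughly 0.75 s, even if every one of the ten slow requests actually took 0.51 s. Narrower buckets around the latencies you care about reduce this error.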
Common Exporters in Kubernetes
cAdvisor (built into kubelet)
kubelet (ports 10255/10250)
apiserver (port 6443)
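scheduler (port 10251; newer Kubernetes releases serve metrics on secure port 10259)
controller‑manager (port 10252; newer Kubernetes releases serve metrics on secure port 10257)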
etcd (latency, storage metrics)
docker (experimental metrics‑addr)
kube‑proxy (port 10249)
kube‑state‑metrics (metadata of pods, deployments, etc.)
node‑exporter (CPU, memory, disk)
blackbox_exporter (network probes)
process‑exporter (process metrics)
nvidia‑exporter (GPU metrics)
node‑problem‑detector (node health)
Application exporters (MySQL, Nginx, MQ, …)
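Whatever they measure, all of the exporters above expose the same plain-text format on an HTTP endpoint that Prometheus scrapes. The sketch below renders that text format by hand using only the standard library. The metric name, label, and helper functions are hypothetical; a real exporter would normally use an official client library instead.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs; this shape is an
    assumption made for the example.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application metric: depth of an email queue.
page = render_metric(
    "queue_depth", "Jobs waiting in the queue", "gauge",
    [({"queue": "email"}, 3)],
)
```

The resulting string is what a scrape of a /metrics endpoint would return for this single gauge.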
Grafana Dashboards for Core K8s Components
Using the metrics from the exporters above, Grafana can render dashboards for kubelet, apiserver, scheduler, controller‑manager, etc.
All‑in‑One Collector
Exporters can be launched as child processes of a main binary, or Telegraf can be used to aggregate multiple inputs into a single exporter.
Selecting Golden Metrics
Follow Google SRE’s “four golden signals” (latency, traffic, errors, saturation). Use the USE method (Utilization, Saturation, Errors) for resource‑centric metrics and the RED method (Rate, Errors, Duration) for service‑centric metrics.
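As a rough illustration of the RED method, the sketch below derives rate, error ratio, and mean duration from two snapshots of monotonically increasing counters, which is the same arithmetic a rate() query performs over a window. The counter names and numbers are hypothetical, and counter resets are not handled.

```python
def red_summary(start, end, window_seconds):
    """Compute Rate, Errors, Duration from two counter snapshots.

    start and end are dicts of counter values taken window_seconds apart;
    the key names are assumptions for this example.
    """
    requests = end["requests_total"] - start["requests_total"]
    errors = end["errors_total"] - start["errors_total"]
    duration = end["duration_seconds_sum"] - start["duration_seconds_sum"]
    return {
        "rate_per_second": requests / window_seconds,
        "error_ratio": errors / requests if requests else 0.0,
        "mean_duration_seconds": duration / requests if requests else 0.0,
    }


# Hypothetical snapshots taken 60 seconds apart.
before = {"requests_total": 1000, "errors_total": 10, "duration_seconds_sum": 200.0}
after = {"requests_total": 1600, "errors_total": 16, "duration_seconds_sum": 320.0}
summary = red_summary(before, after, 60)
```

Here the service handled about 10 requests per second with a 1% error ratio and a mean duration of about 0.2 s.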
Version Compatibility
Prometheus 2.16 was the current stable release at the time of writing; the older 1.x versions are no longer recommended.
Memory and Storage Planning
Memory usage spikes roughly every two hours, when the in‑memory block is compacted and written to disk. Large query ranges and heavy aggregation add further memory pressure. To mitigate, reduce the series count, increase the scrape interval, or offload long‑term data to a remote storage solution such as Thanos or VictoriaMetrics.
The average number of bytes per sample can be measured on a running server with:
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
Disk usage can then be estimated as ingested_samples_per_second × bytes_per_sample × retention_seconds.
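The disk estimate is simple enough to check with a few lines of Python. This is a minimal sketch that treats the sample term as an ingestion rate in samples per second; the function name and the example figures (100,000 samples per second, 1.5 bytes per sample, 15 days of retention) are illustrative assumptions, not measurements from the article.

```python
def estimate_disk_bytes(samples_per_second, bytes_per_sample, retention_days):
    """Estimate TSDB disk usage as rate x bytes per sample x retention."""
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * bytes_per_sample * retention_seconds


# Hypothetical workload; replace with values measured on your own server.
needed = estimate_disk_bytes(100_000, 1.5, 15)
needed_gb = needed / 1e9
```

For this hypothetical workload the estimate comes to about 194 GB, before leaving headroom for the write-ahead log and compaction.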
High‑Availability Solutions
Basic HA: two identical Prometheus instances behind a load balancer.
HA + remote write: replicate data to an external TSDB.
Federation: shard data by function and aggregate with a global node.
Thanos or VictoriaMetrics: deduplicate and query across multiple replicas.
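The idea behind replica deduplication can be sketched in a few lines. The example below merges the samples of one series scraped by two replicas, preferring the first replica and filling its gaps from the second. It is a deliberately naive illustration under assumed inputs, not the algorithm Thanos or VictoriaMetrics actually use; the function name and the tolerance parameter are hypothetical.

```python
def merge_replicas(replica_a, replica_b, tolerance=1.0):
    """Merge two lists of (timestamp, value) samples for the same series.

    Samples from replica_b are kept only when replica_a has no sample
    within `tolerance` seconds, so near-duplicates collapse into one.
    """
    merged = list(replica_a)
    a_times = [timestamp for timestamp, _ in replica_a]
    for timestamp, value in replica_b:
        if not any(abs(timestamp - t) <= tolerance for t in a_times):
            merged.append((timestamp, value))
    return sorted(merged)


# Replica A missed the scrape around t=30 (for example during a restart).
a = [(0, 1.0), (15, 2.0), (45, 4.0)]
b = [(0.2, 1.0), (15.2, 2.0), (30.2, 3.0), (45.2, 4.0)]
series = merge_replicas(a, b)
```

The merged series contains four samples, with the gap in replica A filled by the sample replica B collected at t=30.2. A plain load balancer in front of two replicas cannot do this, which is why basic HA can show gaps that differ between queries.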
Alerting and Operator Wrappers
Alertmanager provides grouping, inhibition, and routing, but many teams build a UI‑driven wrapper to let non‑engineers configure alerts without writing PromQL. Grafana’s experimental alerting can be used for simple cases.
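Grouping is the Alertmanager feature that keeps one outage from producing dozens of notifications. The sketch below shows the underlying idea: alerts that share the same values for a chosen set of labels end up in one group, and each group would produce a single notification. The alert structure and label names are assumptions for illustration and do not reflect Alertmanager's internal data model.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts from two clusters.
firing = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "pod": "api-1"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "pod": "api-2"}},
    {"labels": {"alertname": "HighLatency", "cluster": "staging", "pod": "api-1"}},
]
grouped = group_alerts(firing, ["alertname", "cluster"])
```

Three alerts collapse into two groups here, one per cluster, because the pod label is not part of the grouping key.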
Logging and Events
Log collection is delegated to Fluentd/Fluent‑Bit/Filebeat and stored in Elasticsearch or object storage. Log‑to‑metric conversion can be done with mtail or grok. Kubernetes events should be persisted via tools like kube‑eventer or event‑exporter, optionally exposing them as Prometheus metrics.
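Log-to-metric conversion boils down to matching lines against a pattern and incrementing counters. The sketch below does this in plain Python for access-log lines in a common web server format. It only illustrates the idea behind tools such as mtail and does not use their syntax; the regular expression, function name, and sample log lines are assumptions.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it,
# e.g. "GET /health HTTP/1.1" 200
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) \S+ [^"]*" (?P<status>\d{3}) ')


def count_status_codes(lines):
    """Count requests per (method, status) pair; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("method"), match.group("status"))] += 1
    return counts


# Hypothetical access-log lines.
sample_lines = [
    '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /health HTTP/1.1" 200 12',
    '10.0.0.2 - - [01/Jan/2024:00:00:01 +0000] "GET /orders HTTP/1.1" 500 87',
    '10.0.0.1 - - [01/Jan/2024:00:00:02 +0000] "GET /health HTTP/1.1" 200 12',
    "not an access log line",
]
status_counts = count_status_codes(sample_lines)
```

Each (method, status) pair maps naturally onto a labelled counter that an exporter could expose, which is how error rates become queryable even when the application itself has no instrumentation.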
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
