
Mastering Prometheus on Kubernetes: Practical Tips, Exporter Guide, and Capacity Planning

This article explores the history and principles of Prometheus monitoring, offers guidance on version selection, highlights its limitations, details common Kubernetes exporters, shows Grafana dashboard setups, and provides in‑depth strategies for exporter aggregation, golden metrics, multi‑cluster scraping, GPU monitoring, timezone handling, memory optimization, capacity planning, and rate calculations.

Efficient Ops

Prometheus, a modern open‑source monitoring system, has become the de‑facto standard in cloud‑native environments, offering a mature solution for infrastructure observability.

Key principles include treating monitoring as infrastructure, emitting only actionable alerts, and keeping the architecture simple to avoid single points of failure.

Version selection

Use the latest stable 2.x release (2.16 at the time of writing) and avoid the older 1.x line. The experimental UI in 2.16 adds a TSDB status page that shows the top labels and metrics.

Prometheus limitations

Metric‑based monitoring does not cover logs, events, or tracing.

It uses a pull model; plan network topology accordingly.

For clustering and scaling, choose between Federate, Cortex, Thanos, etc.

Prometheus prioritizes availability over consistency, so occasional data loss has to be acceptable.

Statistical functions (rate, histogram_quantile) can produce unintuitive results, and long‑range queries may lose precision.
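For the federation option listed above, a minimal sketch of a global Prometheus pulling selected series from a per-cluster instance might look like the following. The target address source-prometheus:9090 and the match[] selectors are placeholders chosen for illustration, not values from this article.

```yaml
scrape_configs:
- job_name: federate
  scrape_interval: 30s
  honor_labels: true          # keep the labels exposed by the source Prometheus
  metrics_path: /federate
  params:
    'match[]':
    - '{job="kubernetes-nodes"}'   # hypothetical selector: raw series of one job
    - '{__name__=~"job:.*"}'       # hypothetical selector: pre-aggregated recording rules
  static_configs:
  - targets:
    - 'source-prometheus:9090'     # hypothetical address of the per-cluster Prometheus
```

Federating only aggregated series keeps the global instance small; pulling every raw series through /federate tends to reproduce the scaling problem it was meant to solve.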

Common exporters in Kubernetes

cAdvisor (built into kubelet)

kubelet (ports 10255/10250)

apiserver (port 6443)

scheduler (port 10251)

controller‑manager (port 10252)

etcd

docker (experimental metrics‑addr)

kube‑proxy (default 127.0.0.1:10249)

kube‑state‑metrics

node‑exporter

blackbox_exporter

process‑exporter

nvidia exporter (GPU metrics)

node‑problem‑detector (NPD)

application exporters (mysql, nginx, mq, etc.)

Custom exporters can be created to fill gaps, though managing many exporters adds operational overhead.

Kubernetes core component monitoring and Grafana panels

Metrics from the above exporters can be visualized in Grafana dashboards for components such as kubelet, apiserver, and others.

All‑in‑one collection approaches

Launch multiple exporter processes from a single main process, keeping them up‑to‑date with community releases.

Use Telegraf to aggregate various inputs into a single collector.

Node‑exporter does not monitor processes; a process‑exporter or Telegraf's procstat input can fill this gap.
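As an illustration of the process‑monitoring gap, the sketch below assumes the community ncabatoff/process-exporter and its process_names configuration format. The process names are examples only and should be adjusted to what actually runs on the node.

```yaml
# Config file for process-exporter (assumed: ncabatoff/process-exporter).
process_names:
# Group processes by their command name; each name listed here becomes one group.
- comm:
  - kubelet
  - dockerd
  - kube-proxy
```

Grouping by command name keeps cardinality bounded, since one series set is produced per group rather than per PID.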

Selecting golden metrics

Follow Google SRE's four golden signals (latency, traffic, errors, saturation) and apply the USE (Utilization, Saturation, Errors) or RED (Rate, Errors, Duration) method depending on service type: USE suits resources such as nodes and disks, while RED suits request‑driven services.
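A rough sketch of the RED method as Prometheus recording rules is shown below. It assumes the service exposes a counter http_requests_total with a code label and a histogram http_request_duration_seconds; both metric names are hypothetical and depend on how the application is instrumented.

```yaml
groups:
- name: red_method_example
  rules:
  # Rate: requests per second per job.
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
  # Errors: share of requests answered with a 5xx status code.
  - record: job:http_request_errors:ratio_rate5m
    expr: |
      sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))
  # Duration: 99th percentile latency estimated from histogram buckets.
  - record: job:http_request_duration_seconds:p99_5m
    expr: |
      histogram_quantile(0.99,
        sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```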

cAdvisor label compatibility in Kubernetes 1.16

Kubernetes 1.16 removed the pod_name and container_name labels from cAdvisor metrics in favor of pod and container. To keep older queries and dashboards working, re‑add the old labels with metric_relabel_configs:

metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace

Scraping external or multi‑cluster Kubernetes

When Prometheus runs outside a cluster, configure kubernetes_sd_configs with appropriate api_server, bearer_token_file, and TLS settings. Use __metrics_path__ rewrites to proxy through the apiserver or directly to kubelet ports.

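- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scheme: https
  # Credentials for the scrape itself; the copies under kubernetes_sd_configs
  # only cover service discovery.
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true   # skips certificate checks; prefer ca_file where possible
  kubernetes_sd_configs:
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  relabel_configs:
  # Send every scrape to the apiserver instead of the discovered node address.
  - target_label: __address__
    replacement: xx:6443
    action: replace
  # Rewrite the path so the apiserver proxies to each node's cAdvisor endpoint.
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  # Restore the pre-1.16 label names for older queries.
  - source_labels: [container]
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace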

GPU metrics

cAdvisor exposes GPU metrics such as container_accelerator_duty_cycle, container_accelerator_memory_total_bytes, and container_accelerator_memory_used_bytes. For richer data, install the DCGM exporter (requires Kubernetes 1.13+).
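The cAdvisor metrics named above can be combined into a memory utilization ratio. The rule file below is a sketch only: the rule names, the 0.9 threshold, and the 10m duration are illustrative assumptions, not recommendations from this article.

```yaml
groups:
- name: gpu_example
  rules:
  # Fraction of accelerator memory in use per container and device.
  - record: container:accelerator_memory:usage_ratio
    expr: |
      container_accelerator_memory_used_bytes
      /
      container_accelerator_memory_total_bytes
  # Hypothetical alert: GPU memory above 90% for 10 minutes.
  - alert: ContainerGpuMemoryNearlyFull
    expr: container:accelerator_memory:usage_ratio > 0.9
    for: 10m
    labels:
      severity: warning
```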

Changing Prometheus display timezone

Prometheus stores timestamps in UTC and does not support timezone configuration; visualization tools like Grafana handle timezone conversion, and the newer Web UI (2.16) offers a local timezone option.

Collecting metrics behind a LoadBalancer

Use sidecar proxies on the backend services or configure the LB to forward specific paths to each backend, allowing Prometheus to scrape the underlying pods.

Prometheus memory consumption

Memory usage spikes every two hours, when the in‑memory block is compacted to disk, and during large queries (for example, rate over a wide time range or group by aggregations across many series). Mitigation strategies include sharding, reducing series count, evaluating high‑cost metrics, limiting query ranges, and avoiding expensive aggregations.
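One way to reduce series count at scrape time is sketched below. The job name, target address, metric prefix, and limit are hypothetical; only the sample_limit and metric_relabel_configs mechanisms themselves are standard Prometheus configuration.

```yaml
scrape_configs:
- job_name: example-app              # hypothetical job
  # Fail the scrape if the target exposes more samples than this after relabeling,
  # which guards against an accidental cardinality explosion.
  sample_limit: 50000
  static_configs:
  - targets: ['example-app:8080']    # hypothetical target
  metric_relabel_configs:
  # Drop a family of high-cost metrics before they reach the TSDB.
  - source_labels: [__name__]
    regex: 'example_debug_.*'        # hypothetical metric prefix
    action: drop
```

Dropping at metric_relabel_configs still pays the cost of scraping and parsing, so removing the metric at the exporter is preferable when that is possible.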

Capacity planning

Estimate disk usage as:

retention_time_seconds × ingested_samples_per_second × bytes_per_sample

Reduce series count or increase scrape intervals to lower storage needs. For remote‑write or Thanos setups, local disk can be minimal.
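The inputs to this estimate can be measured from Prometheus's own metrics. The recording rules below are a sketch; the rule names and the 6h window are assumptions, and the figures in the comments are a made‑up worked example rather than measurements.

```yaml
groups:
- name: capacity_planning_example
  rules:
  # Samples ingested per second.
  - record: prometheus:tsdb_samples_appended:rate6h
    expr: rate(prometheus_tsdb_head_samples_appended_total[6h])
  # Average bytes per sample after compaction.
  - record: prometheus:tsdb_bytes_per_sample:ratio6h
    expr: |
      rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[6h])
      /
      rate(prometheus_tsdb_compaction_chunk_samples_sum[6h])

# Worked example with assumed values:
#   retention            = 15 days = 1,296,000 seconds
#   ingestion rate       = 100,000 samples per second
#   bytes per sample     = 2
#   disk = 1,296,000 × 100,000 × 2 = 259,200,000,000 bytes, roughly 259 GB
```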

Apiserver performance impact

With kubernetes_sd_configs, Prometheus discovers targets through the apiserver, and any scrape proxied through the apiserver adds to its load. In large clusters, use the apiserver for discovery only and scrape the nodes directly.
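A sketch of direct node scraping after discovery is shown below. It reuses the placeholder apiserver address and token path style from the earlier example and assumes the token is authorized to read kubelet metrics and that Prometheus can reach the kubelet port on each node; the job name is hypothetical.

```yaml
- job_name: kubelet-cadvisor-direct   # hypothetical job name
  scheme: https
  # Scrape the kubelet's own cAdvisor endpoint instead of the apiserver proxy path.
  metrics_path: /metrics/cadvisor
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  # The apiserver is used only to list and watch nodes.
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  relabel_configs:
  # Copy Kubernetes node labels onto the scraped series.
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
```

With role: node, the discovered target address already points at the node's kubelet port, so no __address__ rewrite is needed in this variant.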

Rate calculation logic

Counters are designed for rate functions; rate automatically handles counter resets. Use a range vector at least four times the scrape interval to ensure robustness against missing samples.
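As a sketch of the range‑vector guidance, the rule below assumes a 30s scrape interval, so four times the interval gives a 2m window. The rule name is illustrative; node_cpu_seconds_total is the node‑exporter CPU counter.

```yaml
groups:
- name: rate_window_example
  rules:
  # 30s scrape interval × 4 = 2m window, so rate() still has enough samples
  # if one or two scrapes are missed.
  - record: instance:node_cpu_seconds:rate2m
    expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))
```

Apply rate before sum, as here; summing counters first would hide individual counter resets from rate.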

Author: Xu Yason

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
