Practical Prometheus in Kubernetes: Tips, Limits, and Scaling
This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.
Monitoring systems have a long history; Prometheus, the current-generation open-source solution, has become the de-facto standard in cloud-native environments. This article collects practical issues and lessons from running Prometheus; the container-monitoring series is recommended as background reading.
Key principles:
Monitoring is infrastructure: collect only the metrics you actually need, to avoid wasting engineering effort and storage (B2B commercial monitoring products, where metrics are the product, are the exception).
Only fire alerts that can be acted upon.
Keep the architecture simple; the monitoring system must stay up even when the business systems fail. Avoid "magic" such as ML-based dynamic thresholds or automatic remediation.
1. Version selection
Use a recent Prometheus 2.x release (e.g., 2.16); the 1.x line is obsolete. Version 2.16 ships an experimental UI page for inspecting TSDB status, including the top labels and metrics by cardinality.
2. Limitations of Prometheus
Metric‑based monitoring; does not handle logs, events, or tracing.
Pull model by default; plan network topology to avoid unnecessary forwarding.
No silver‑bullet solution for clustering and horizontal scaling; choose between federation, Cortex, Thanos, etc.
Typically favors availability over consistency, tolerating some data loss.
Functions such as <code>rate</code> and <code>histogram_quantile</code> may produce unintuitive results, and long query ranges lose precision because points are down-sampled for display.
3. Common exporters in a K8s cluster
Prometheus, as a CNCF project, offers a rich ecosystem of exporters. Some frequently used exporters include:
cAdvisor (integrated in Kubelet)
Kubelet (port 10255 unauthenticated, 10250 authenticated)
apiserver (port 6443, metrics such as request count and latency)
scheduler (port 10251)
controller‑manager (port 10252)
etcd (write/read latency, storage capacity)
docker (requires the experimental flag and metrics-addr; exposes metrics such as container creation time)
kube‑proxy (default 127.0.0.1, port 10249; can expose 0.0.0.0 for external scraping)
kube‑state‑metrics (metadata of pods, deployments, etc.)
node‑exporter (CPU, memory, disk metrics)
blackbox_exporter (network probes: DNS, ping, HTTP)
process‑exporter (process metrics)
nvidia exporter (GPU metrics)
node‑problem‑detector (reports node health taints)
Application exporters (MySQL, Nginx, MQ, etc.)
Custom exporters can also be written for specific scenarios.
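As a minimal sketch of wiring one of these up, a scrape job that discovers node-exporter endpoints via Kubernetes service discovery (the job name and the assumption that the Service is named node-exporter are illustrative, not from the original article):
<code>- job_name: node-exporter          # hypothetical job name
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # keep only endpoints belonging to the node-exporter Service
  - source_labels: [__meta_kubernetes_endpoints_name]
    regex: node-exporter
    action: keep</code>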
4. Monitoring core K8s components with Grafana dashboards
Using the exporters above, Grafana can render dashboards for components such as kubelet and apiserver.
Templates can be based on dashboards-for-kubernetes-administrators and adjusted as needed. Grafana supports templated dropdown variables but currently cannot use template variables in alert rules; one user comment captures the limitation:
<code>It would be grate to add templates support in alerts. Otherwise the feature looks useless a bit.</code>
5. All-in-One collection component
Exporters are independent, increasing operational overhead. Two approaches to combine them:
Launch a main process that starts multiple exporter processes, still following community version updates.
Use Telegraf to handle various input types, consolidating N exporters into one.
Node-exporter does not monitor processes; a process-exporter, or Telegraf with the procstat input, can fill this gap.
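A minimal Telegraf configuration sketch for the procstat input (the process pattern and listen port are examples, not recommendations):
<code>[[inputs.procstat]]
  # match processes whose name matches this regex; "nginx" is an example
  pattern = "nginx"

[[outputs.prometheus_client]]
  # expose the collected metrics on an endpoint Prometheus can scrape
  listen = ":9273"</code>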
6. Choosing golden metrics
Google's SRE book defines four golden signals: latency, traffic, errors, and saturation. In practice, apply the USE method to resources and the RED method to services:
USE method: Utilization, Saturation, Errors (e.g., cAdvisor resource data).
RED method: Rate, Errors, Duration (e.g., apiserver request metrics).
Service categories:
Online services – focus on request rate, latency, and error rate (RED).
Offline services – monitor queue length, in-flight count, processing speed, and errors (USE).
Batch jobs – monitor duration and error count; short-lived jobs usually push results through the Pushgateway.
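For example, RED-style queries against the apiserver (metric names as exposed by K8s 1.14+; older clusters expose apiserver_request_count and apiserver_request_latencies instead):
<code># Rate: requests per second
sum(rate(apiserver_request_total[5m]))

# Errors: ratio of 5xx responses
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# Duration: 90th-percentile request latency
histogram_quantile(0.9,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))</code>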
7. Cadvisor label compatibility in K8s 1.16
K8s 1.16 removed the pod_name and container_name labels from cAdvisor metrics, replacing them with pod and container. Adjust queries and Grafana panels accordingly, or use metric relabeling to restore the original names:
<code>metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace</code>
8. Scraping external or multiple K8s clusters
When Prometheus runs outside a cluster, certificates and tokens are required. Example job for scraping cadvisor via the apiserver proxy:
<code>- job_name: cluster-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: https://xx:6443
    role: node
    bearer_token_file: token/cluster.token
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: xx:6443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  metric_relabel_configs:
  - source_labels: [container]
    separator: ;
    regex: (.+)
    target_label: container_name
    replacement: $1
    action: replace
  - source_labels: [pod]
    separator: ;
    regex: (.+)
    target_label: pod_name
    replacement: $1
    action: replace</code>
Note that bearer_token_file and tls_config appear twice: once inside the kubernetes_sd_configs entry (authenticating service discovery against the remote apiserver) and once at the job level (authenticating the scrape requests themselves, which also pass through the apiserver proxy). For endpoint-type services (e.g., kube-state-metrics), adjust __metrics_path__ accordingly.
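As an illustration of the endpoint case, kube-state-metrics can be scraped through the apiserver service proxy with a fixed path (the namespace kube-system, service name, and port name http-metrics are assumptions about a typical deployment):
<code>- job_name: cluster-ksm            # hypothetical job name
  scheme: https
  bearer_token_file: token/cluster.token
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: ['xx:6443']           # the remote apiserver, as in the job above
  # proxy path format: /api/v1/namespaces/<ns>/services/<name>:<port>/proxy/metrics
  metrics_path: /api/v1/namespaces/kube-system/services/kube-state-metrics:http-metrics/proxy/metrics</code>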
9. Collecting GPU metrics
nvidia-smi shows GPU resources on the node; cAdvisor exposes container-level metrics such as:
<code>container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes</code>
For richer GPU data, install the dcgm-exporter (requires K8s 1.13+).
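A quick sketch of querying the cAdvisor GPU metrics above, e.g., average utilization per pod (the grouping label is pod_name or pod depending on your K8s version, as discussed in section 7):
<code># average GPU duty cycle per pod
avg(container_accelerator_duty_cycle) by (pod_name)</code>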
10. Changing Prometheus display timezone
Prometheus stores timestamps as Unix time (UTC) and does not support timezone configuration.
Grafana can perform timezone conversion for visualisation.
The UI can show timestamps in local timezone starting from version 2.16.
Modifying Prometheus code is possible but not recommended.
11. Scraping metrics behind a Load Balancer
Add a sidecar proxy to the backend service or deploy a proxy on the node to allow Prometheus access.
Configure the LB to forward specific paths (e.g., /backend1, /backend2) to the backends, then scrape the LB.
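A sketch of the second approach, assuming the LB routes /backend1 and /backend2 to the two backends (the hostname and paths are placeholders):
<code>- job_name: behind-lb-backend1      # hypothetical job name
  metrics_path: /backend1/metrics   # LB forwards this to backend1's /metrics
  static_configs:
  - targets: ['lb.example.com:80']  # the load balancer address (example)
- job_name: behind-lb-backend2
  metrics_path: /backend2/metrics
  static_configs:
  - targets: ['lb.example.com:80']</code>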
12. Prometheus large‑memory issues
Memory consumption grows with the ingestion rate because samples are kept in memory for the 2-hour head block before being flushed to disk. Large query ranges and expensive operations (e.g., group, rate over wide windows) also increase memory usage.
Optimization suggestions:
Shard when series exceed ~2 million; use Thanos, VictoriaMetrics, etc., for aggregation.
Identify and drop high‑cost metrics/labels.
Avoid broad queries; keep the time range and step ratio reasonable; limit use of group.
Prefer relabeling over joins for related data.
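To find the expensive metrics, the 2.16 TSDB status page is the easiest route; a query-based sketch for the top series counts (note this query is itself heavy, so run it sparingly):
<code># top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))</code>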
13. Capacity planning
Memory: depends on ingestion rate and block size; reduce series count or increase scrape interval.
Disk: calculate as retention_time_seconds × samples_per_second × bytes_per_sample (roughly 1–2 bytes per sample after compression). For example, 15 days × 100,000 samples/s × 2 bytes ≈ 259 GB. Reduce the series count or sample rate to lower disk usage.
For single‑node Prometheus, estimate local disk usage; for remote‑write or Thanos, consider object‑storage size.
Example PromQL to monitor the sample ingestion rate:
<code>rate(prometheus_tsdb_head_samples_appended_total[1h])</code>
14. Impact on apiserver performance
When using kubernetes_sd_configs, Prometheus watches the apiserver for targets, which can add noticeable CPU load at large scale. Scraping nodes directly (static or file-based discovery) reduces apiserver pressure.
15. Rate calculation logic
Prometheus
rateworks on counter metrics, handling resets automatically. Because scrape intervals vary, rate values are approximate. Recommended to set the rate window at least four times the scrape interval to ensure enough samples.
When data gaps occur,
rateextrapolates based on trends, which may produce misleading results.
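For example, with a 30s scrape interval a 2m window holds only about four samples, so missing one or two scrapes can distort the result; a wider window is safer (http_requests_total here is a placeholder metric name):
<code># 30s scrape interval -> window of at least 2m; 5m gives more headroom
rate(http_requests_total[5m])</code>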
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.