Mastering Prometheus: Principles, Pitfalls, and Scaling Strategies

This article explores Prometheus as a cloud‑native monitoring solution, covering core principles, limitations, metric selection, exporter consolidation, Kubernetes deployment nuances, memory and storage planning, high‑availability designs, and advanced features like rate calculations, cardinality management, and predictive alerts.

Programmer DD

Prometheus is a modern open‑source monitoring system that has become the de‑facto standard in cloud‑native environments. This article shares practical experiences, principles, limitations, and best‑practice configurations when using Prometheus in production.

Principles

Monitoring is infrastructure whose purpose is to solve problems. Do not collect metrics you will never use, since that wastes resources (B2B products are an exception).

Only emit alerts that can be acted upon.

Keep the architecture simple; the monitoring system must stay up even when the business system fails. Avoid magic systems such as AI‑driven thresholds or auto‑remediation unless absolutely needed.

Prometheus Limitations

Metric‑based monitoring; not suitable for logs, events, or tracing.

Pull model by default; plan network reachability carefully and avoid the Pushgateway when possible.

No silver bullet for clustering or horizontal scaling; choose between federation, Cortex, or Thanos.

Monitoring generally prioritizes availability over consistency; see the high-availability sections below for details.

Choosing Golden Metrics

Google’s SRE book defines four golden signals: latency, traffic, errors, and saturation. In practice you can use the USE method for resources and the RED method for services:

USE: Utilization, Saturation, Errors

RED: Rate, Errors, Duration

All‑in‑One Exporter Collection

Exporters are usually separate processes (node‑exporter, NVIDIA exporter, etc.). To reduce operational overhead you can combine them in two ways:

Launch multiple exporter processes from a single main process; each exporter can still be updated independently.

Use Telegraf to handle many input types, consolidating them into a single agent.

Node‑exporter does not expose process metrics; you can add process‑exporter or use Telegraf as described.

Kubernetes 1.16 cAdvisor Label Changes

In Kubernetes 1.16, cAdvisor metrics dropped the pod_name and container_name labels in favor of pod and container. Adjust your queries and Grafana dashboards accordingly, or use the following relabel config to restore the original _name labels:

metric_relabel_configs:
- source_labels: [container]
  regex: (.+)
  target_label: container_name
  replacement: $1
  action: replace
- source_labels: [pod]
  regex: (.+)
  target_label: pod_name
  replacement: $1
  action: replace

Note: use metric_relabel_configs, not relabel_configs, for post‑scrape replacement.

Deploying Prometheus Inside vs. Outside the Cluster

Running Prometheus inside a Kubernetes cluster is straightforward with the official YAML. When deployed outside (due to permission or network constraints) you run the binary on dedicated servers and configure service discovery manually.

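In-cluster mode needs no manual certificate setup, because Prometheus uses the pod's service account credentials. Outside the cluster you must provide a bearer token and TLS settings. The example below sets insecure_skip_verify: true, which skips verification of the API server certificate; prefer supplying a CA file in production: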

kubernetes_sd_configs:
- api_server: https://xx:6443
  role: node
  bearer_token_file: token/xx.token
  tls_config:
    insecure_skip_verify: true
relabel_configs:
- separator: ;
  regex: __meta_kubernetes_node_label_(.+)
  replacement: $1
  action: labelmap
- separator: ;
  regex: .*
  target_label: __address__
  replacement: xx:6443
  action: replace
- source_labels: [__meta_kubernetes_node_name]
  separator: ;
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  action: replace

GPU Metrics Collection

Basic GPU information can be obtained with nvidia-smi. cAdvisor also exposes container-level GPU metrics such as:

container_accelerator_duty_cycle
container_accelerator_memory_total_bytes
container_accelerator_memory_used_bytes

For more detailed GPU data install the dcgm‑exporter (supported from Kubernetes 1.13 onward).

Changing Prometheus Timezone Display

Prometheus stores timestamps as Unix time in UTC and does not allow configuring a timezone in the config file or reading the host /etc/timezone. Work‑arounds:

Grafana can convert UTC to the desired timezone.

The Prometheus web UI in 2.16 adds an option to display timestamps in local time.

For custom UI changes, modify the source code as described in the referenced article.

Collecting Metrics Behind a Load Balancer

If Prometheus can reach only the load balancer (LB) and not the real servers (RS) behind it, consider:

Deploy a sidecar proxy alongside the RS service, or add a proxy on the RS host, so Prometheus can scrape the RS directly.

Add dedicated routes on the LB that each forward to a single backend, and let Prometheus scrape every backend through the LB.

Version

The latest stable version at the time of writing is 2.16. Use the newest release; older 1.x versions are no longer maintained.

Version 2.16 includes an experimental UI that shows TSDB status, top‑10 labels, and metrics.

Prometheus Memory Issues

As scale grows, CPU and memory usage increase, and memory is usually the first bottleneck. The main cause is that Prometheus persists a block to disk only every 2 hours; until then, all recent samples are held in memory.

Loading historic data from disk also consumes memory; large queries exacerbate this.

Inefficient queries (e.g., large group or wide‑range rate) increase memory pressure.

Use the calculator from Robust Perception to estimate required RAM based on series count and scrape interval.

Example: one Prometheus server retaining 2 hours of data locally with about 950 k series (the memory chart from the original article is not reproduced here).

Optimization suggestions:

Once a single instance exceeds roughly 2 million series, shard it and merge the shards with Thanos, VictoriaMetrics, or a similar tool.

Identify high‑cardinality metrics and drop unnecessary ones (2.14+ provides TSDB status).

Avoid large‑range queries; keep step and time range proportional.

Prefer label‑based filtering over joins; time‑series databases are not relational.
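To make the advice about keeping step and time range proportional concrete, here is a minimal Python sketch. It assumes the range-query limit of 11,000 points per series that Prometheus 2.x enforces; the helper names and the `max_points` default are illustrative and not part of any Prometheus client.

```python
import math

def points_per_series(range_seconds: int, step_seconds: int) -> int:
    """Number of evaluation points a range query returns for each series."""
    return range_seconds // step_seconds + 1

def min_step_seconds(range_seconds: int, max_points: int = 11_000) -> int:
    """Smallest whole-second step that keeps a query within max_points per series."""
    return max(1, math.ceil(range_seconds / (max_points - 1)))

one_week = 7 * 24 * 3600
print(points_per_series(one_week, 15))   # a 15 s step over a week is far too fine
print(min_step_seconds(one_week))        # coarsest-resolution floor for a one-week range
```

A dashboard panel covering a week therefore needs a step of roughly a minute or more; a 15 s step over the same range asks the server for several times the allowed number of points.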

Capacity Planning

Beyond memory, plan disk storage based on retention time, ingest rate, and sample size. For a single-node Prometheus, calculate local disk usage; for remote-write setups, size the remote storage backend; for Thanos, local disk holds only the most recent blocks, so size the object storage instead.

Prometheus compresses in‑memory data into blocks every 2 hours. Rough storage estimate:

Disk size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

To reduce disk usage without changing retention or sample size, lower the ingest rate.

Current ingest rate can be observed with:

rate(prometheus_tsdb_head_samples_appended_total[1h])

Two main ways to lower ingest volume:

Reduce the number of time series.

Increase the scrape interval.

Example: 30 s scrape interval, 1 000 nodes, 6 000 series per node → 1 000 × 6 000 × 2 × 60 × 24 ≈ 17 billion samples per day, or roughly 30 GB of disk per day.
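The sizing formula can be checked against the example with a few lines of Python. This is a rough sketch under two assumptions that are not stated in the article: the 6 000 "metric types" are read as 6 000 series per node, and each sample takes 1 to 2 bytes on disk (the range the Prometheus storage documentation gives as typical). Under those assumptions the ~30 GB figure corresponds to about one day of data.

```python
def disk_bytes(retention_seconds: float, samples_per_second: float,
               bytes_per_sample: float) -> float:
    """Disk size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample."""
    return retention_seconds * samples_per_second * bytes_per_sample

nodes = 1_000
series_per_node = 6_000        # assumption: "6 000 metric types" means series per node
scrape_interval_s = 30
samples_per_second = nodes * series_per_node / scrape_interval_s

one_day = 24 * 3600
low = disk_bytes(one_day, samples_per_second, 1.0)    # assumed 1 byte per sample
high = disk_bytes(one_day, samples_per_second, 2.0)   # assumed 2 bytes per sample
print(f"{samples_per_second:,.0f} samples/s")
print(f"{low / 1e9:.1f} GB to {high / 1e9:.1f} GB per day")
```

Multiply the daily figure by the retention period in days to size a single node, and compare `samples_per_second` with the value reported by the ingest-rate query above.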

Impact on the API Server

When using kubernetes_sd_configs, Prometheus queries the API server for service discovery. At large scale this can increase API server CPU usage, especially when scrapes routed through the API server proxy fail. Monitoring the API server process is advisable.

Rate Calculation Logic

Counters exist primarily for rate() calculations. Counters reset on restart; rate() automatically handles resets. However, rate values are approximate due to scrape timing variance and possible missing samples.

Best practice: set the rate range to at least four times the scrape interval (e.g., 4 min for a 1 min scrape), so the window still holds at least two samples even if a scrape fails.
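The reset handling described above can be sketched in Python. This is a simplified illustration, not the actual PromQL implementation: the real rate() also extrapolates to the edges of the range window, which is omitted here, and the function name and sample values are made up for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)

def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase over the window, treating any decrease as a counter reset."""
    if len(samples) < 2:
        # With fewer than two samples in the range there is nothing to compute.
        raise ValueError("need at least two samples in the range")
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted, so count up from zero.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 1 min scrape interval, 4 min range; the counter resets before the third scrape.
window = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_rate(window))  # (60 + 20 + 60) / 180, about 0.78 per second
```

If two of the four scrapes had failed, the window would still contain two samples, which is why a range of four scrape intervals leaves some headroom.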

Counter‑Intuitive P95 Statistics

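Histogram quantiles (e.g., P95) represent the value below which a given percentage of observations fall. Because a quantile depends on the shape of the distribution, P95 can be lower or higher than the average, especially with skewed data. In Prometheus, histogram_quantile() also estimates the value by interpolating linearly inside a bucket, so the result is only as precise as the bucket boundaries.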
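A small Python sketch shows why a bucket-based P95 can look surprising. It assumes cumulative buckets like those of a Prometheus histogram and linear interpolation inside the bucket that contains the target rank; the function name and bucket values are hypothetical, and edge cases handled by the real histogram_quantile() are left out.

```python
import math
from typing import List, Tuple

def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(bucket_quantile(0.95, latency_buckets))  # interpolated inside the (0.5, 1.0] bucket
```

The estimate lands halfway through the (0.5, 1.0] bucket even if every request in that bucket actually took 0.51 s, so finer buckets around the latencies you care about give more trustworthy quantiles.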

Slow Query Issues

Prometheus provides the prometheus_engine_query_duration_seconds metric to monitor query latency. Slow queries usually stem from:

Heavy joins or excessive label additions.

Wide time ranges with small step values.

Improper use of rate() without sufficient range.

Missing data causing partial results.

Use recording rules to pre‑aggregate expensive queries.

High Cardinality Problems

High cardinality (many unique label values) inflates the number of time series and index size, degrading performance. Avoid using unbounded labels such as user IDs or IP addresses as metric labels. Instead, log such data or use separate systems.

Prometheus documentation warns against high‑cardinality labels.
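To see how quickly an unbounded label inflates the series count, consider this back-of-the-envelope Python sketch. The label names and value counts are invented for illustration; the product is a worst-case upper bound, since not every combination necessarily occurs.

```python
from math import prod
from typing import Dict

def worst_case_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

bounded = worst_case_series({"method": 5, "status": 6, "instance": 100})
with_user_id = worst_case_series(
    {"method": 5, "status": 6, "instance": 100, "user_id": 10_000}
)
print(bounded)        # 3000
print(with_user_id)   # 30000000
```

A single label with ten thousand values turns a few thousand series into tens of millions, which is well beyond what the memory section suggests one instance should hold.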

Current label distribution can be inspected with the tsdb analyze tool or via the TSDB status UI (2.14+).

[work@xxx bin]$ ./tsdb analyze ../data/prometheus/
Block ID: 01E41588AJNGM31SPGHYA3XSXG
Duration: 2h0m0s
Series: 955372
Label names: 301
Postings (unique label pairs): 30757
Postings entries (total label pairs): 10842822
...

Top‑10 high‑cardinality metrics and labels are listed in the source.

Finding the Largest Metric or Job

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))
# Example output:
apiserver_request_latencies_bucket{}  62544
apiserver_response_sizes_bucket{}   44600

# Top 10 jobs by series count
topk(10, count by (job)({__name__=~".+"}))
# Example output:
{job="master-scrape"} 525667
{job="xxx-kubernetes-cadvisor"} 50817
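The same check can be scripted against the Prometheus HTTP API. The following Python sketch assumes a server reachable at a base URL you supply (for example a hypothetical `http://localhost:9090`) and the standard instant-query response format of `/api/v1/query`; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

TOP_METRICS_QUERY = 'topk(10, count by (__name__)({__name__=~".+"}))'

def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def parse_vector(payload: Dict) -> List[Tuple[str, int]]:
    """Turn an instant-vector response into (metric name, series count) rows."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    rows = []
    for item in payload["data"]["result"]:
        name = item["metric"].get("__name__", "")
        # Sample values arrive as strings in the second element of "value".
        rows.append((name, int(float(item["value"][1]))))
    return sorted(rows, key=lambda row: row[1], reverse=True)

def top_metrics(base_url: str) -> List[Tuple[str, int]]:
    """Run the top-10 query against a live server (performs a network call)."""
    url = build_query_url(base_url, TOP_METRICS_QUERY)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_vector(json.load(resp))
```

Running `top_metrics` periodically and diffing the output is one simple way to notice a metric whose series count is growing unexpectedly. Note that this query touches every series, so it is itself expensive on a large server.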

Kubernetes Component Metrics

Key component ports to scrape:

10250 – kubelet (authenticated) – /metrics, /stats/summary, /metrics/cadvisor.

10251 – kube‑scheduler metrics (no auth for localhost).

10252 – kube‑controller‑manager metrics (no auth for localhost).

6443 – apiserver (requires client certificates).

2379 – etcd (requires client certificates).

Docker metrics require enabling the experimental metrics‑addr in /etc/docker/daemon.json:

{
  "metrics-addr": "127.0.0.1:9323",
  "experimental": true
}

Kube‑proxy exposes metrics on port 10249.

Prometheus Restart Slowness

During restart Prometheus loads the WAL into memory; larger WAL files and longer retention increase restart time. Reloading configuration is preferred over full restarts. Version 2.6 introduced WAL load optimizations aiming for sub‑minute restarts.
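A configuration reload can be triggered over HTTP instead of restarting the process. The Python sketch below assumes the server was started with the `--web.enable-lifecycle` flag, which enables the `/-/reload` endpoint; the base URL is a placeholder and the helper names are illustrative. Sending SIGHUP to the Prometheus process is an alternative that needs no flag.

```python
import urllib.request

def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the request that asks Prometheus to reload its config."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/-/reload", method="POST")

def reload_config(base_url: str) -> int:
    """Send the reload request and return the HTTP status code (performs a network call)."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=30) as resp:
        return resp.status
```

A reload re-reads the configuration and rule files without replaying the WAL, which is why it avoids the slow startup described above.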

How Many Metrics Should an Application Expose?

Brian Brazil recommends keeping the metric count modest: roughly 120 metrics for a simple service, about 700 for Prometheus itself, and no more than about 10 000 for a large application. Control label cardinality carefully.

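relabel_configs vs metric_relabel_configs

relabel_configs runs before scraping, on the targets returned by service discovery; metric_relabel_configs runs after the samples are collected and before they are stored. Example metric_relabel_configs entry that drops the instance label: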

metric_relabel_configs:
- separator: ;
  regex: instance
  replacement: $1
  action: labeldrop

Example service‑discovery relabel:

- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_endpoints_name, __meta_kubernetes_service_annotation_prometheus_io_port]
  separator: ;
  regex: (.+);(.+);(.*)
  target_label: __metrics_path__
  replacement: /api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics
  action: replace
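How the replace action combines several source labels is easier to see in code. The following Python sketch emulates it under simplifying assumptions: source label values are joined with the separator, the regex must match the whole joined string, and `${1}` style references are expanded. It uses Python's `re` module rather than the RE2 syntax Prometheus uses, and the label values are made up.

```python
import re
from typing import Dict, List

def relabel_replace(labels: Dict[str, str], source_labels: List[str], regex: str,
                    target_label: str, replacement: str,
                    separator: str = ";") -> Dict[str, str]:
    """Simplified emulation of a relabel rule with action: replace."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate Go-style ${1} / $1 references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result

target = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_endpoints_name": "web",
    "__meta_kubernetes_service_annotation_prometheus_io_port": "8080",
}
rewritten = relabel_replace(
    target,
    ["__meta_kubernetes_namespace", "__meta_kubernetes_endpoints_name",
     "__meta_kubernetes_service_annotation_prometheus_io_port"],
    r"(.+);(.+);(.*)",
    "__metrics_path__",
    "/api/v1/namespaces/${1}/services/${2}:${3}/proxy/metrics",
)
print(rewritten["__metrics_path__"])
```

Because the first two groups use `(.+)`, a target with an empty namespace or endpoints name does not match and keeps its default metrics path, while `(.*)` lets the port annotation be empty.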

Prometheus Predictive Capabilities

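Use deriv() to compute the rate of change of a gauge and predict_linear() to forecast future values. Example: match when free memory, extrapolated from the last hour of samples, is predicted to fall below 10 MiB two hours from now (assuming mem_free is reported in bytes).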

predict_linear(mem_free{instanceIP="100.75.155.55"}[1h], 2*3600)/1024/1024 < 10
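The idea behind predict_linear() is an ordinary least-squares line through the samples in the range, evaluated at a future time. The Python sketch below illustrates that idea only; it is not the PromQL implementation, and the function name and sample values are hypothetical.

```python
from typing import List, Tuple

def predict_linear_sketch(samples: List[Tuple[float, float]],
                          eval_time: float, horizon_seconds: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and
    evaluate it horizon_seconds after eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var  # per-second rate of change, the quantity deriv() reports
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)

# Free memory in MiB, sampled every 10 minutes and falling by 10 MiB each time.
free_mib = [(0, 100.0), (600, 90.0), (1200, 80.0), (1800, 70.0)]
forecast = predict_linear_sketch(free_mib, eval_time=1800, horizon_seconds=3600)
print(round(forecast, 3))  # 10.0
```

A linear fit only makes sense while the trend is roughly linear; a sawtooth pattern such as periodic garbage collection needs a range long enough to average out the cycle.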

For more advanced forecasting, see the forecast‑prometheus project.

Incorrect High‑Availability Design

Some architectures push metrics to a message queue (e.g., Kafka) and have an exporter pull from the queue. This adds extra components, introduces latency, breaks service‑discovery semantics, and can become a bottleneck.

Proper High‑Availability Solutions

Recommended HA approaches:

Run two identical Prometheus instances behind a load balancer.

Combine HA with remote‑write to a durable backend.

Use federation to shard data and a global node for aggregation.

Adopt Thanos or VictoriaMetrics for global query, deduplication, and long‑term storage.

Even with federation, the global node can be a single point of failure; consider duplicating shards as well. Remote‑write adapters can provide leader election to ensure only one instance pushes data.

Log Monitoring

Container logs are usually collected with Fluentd, Fluent Bit, or Filebeat and sent to Elasticsearch. To turn logs into metrics, use a tool such as grok_exporter or mtail, which exposes the extracted values for Prometheus to scrape.

References

[1] Container monitoring practice – K8s common metrics analysis.

[2] dcgm exporter – https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm

[3] Prometheus 2.16 UI timezone support.

[4] Article on changing Prometheus timezone.

[5] GitHub issue about timezone handling.

[6] TSDB status documentation.

[7] Article on Prometheus storage mechanism.

[8] Video on histogram_quantile behavior.

[9] PromQL basics article.

[10] Step parameter guidance.

[11] Staleness handling in PromQL.

[12] Optimising Prometheus 2.6 startup time.

[13] How many metrics should an application return.

[14] forecast‑prometheus project.

[15] Staleness handling logic.

[16] scrape_limit usage.

[17] HA with PostgreSQL remote write.

[18] Thanos multi‑region monitoring article.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
