Build a High‑Availability Prometheus Monitoring System from Scratch: Pitfalls & Performance Tuning
This guide walks you through building a production-grade, highly available Prometheus monitoring stack. It covers architecture choices, sharding strategies, and common pitfalls such as memory bloat, query latency and storage growth, and provides concrete tuning steps, Kubernetes deployment examples, and advanced optimisation techniques.
Why Choose Prometheus?
In cloud-native environments, the pull model, the multi-dimensional data model and the expressive PromQL query language make Prometheus the de facto monitoring solution for microservice architectures.
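As a quick illustration of the multi-dimensional model, a single PromQL expression can slice the same counter by any label; a hedged example (the metric and label names are illustrative, not from this setup):

# Per-instance rate of 5xx responses for the api-gateway job over the last 5 minutes
sum(rate(http_requests_total{job="api-gateway", status=~"5.."}[5m])) by (instance)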
Architecture Design: Foundations of High Availability
Core Architecture Principle
Federated cluster mode is recommended for production. Example configuration:
# Federated configuration example
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'

Sharding Strategy
Infrastructure sharding: physical machines and network devices.
Application sharding: split by business line.
Middleware sharding: databases, caches and message queues (a per-shard configuration sketch follows).
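As a rough sketch of one shard's scrape scope (the shard name, exporter targets and ports below are assumptions for illustration), each shard carries an external label so federation and remote storage can tell the shards apart:

# prometheus.yml for the middleware shard (illustrative targets)
global:
  external_labels:
    shard: middleware                        # assumption: identifies this shard downstream
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']     # assumed exporter address
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']     # assumed exporter address

The federation job shown above then scrapes each shard's /federate endpoint to build the global view.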
Production Pitfalls
Pitfall 1: Uncontrolled Memory Usage
Symptom: Prometheus memory keeps growing until the process is OOM-killed.
Root cause: high-cardinality labels explode the time-series count.
# Count the number of distinct metric names
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq '.data[]' | wc -l
# Check the number of series currently held in memory
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'
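For a more direct view of the worst offenders, the TSDB status endpoint breaks series counts down by metric name; a hedged sketch (the jq path assumes the usual shape of the /api/v1/status/tsdb response):

# Top metric names by series count
curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName'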
Solution: limit label cardinality with metric relabeling.
# Drop high-cardinality metrics and mask high-cardinality label values
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'high_cardinality_metric.*'
    action: drop
  # Mask user_id so each user no longer creates a separate series
  - source_labels: [user_id]
    regex: '.+'                # only rewrite series that actually carry a user_id
    target_label: user_id
    replacement: 'masked'
    action: replace

Pitfall 2: Query Performance Issues
Symptom: complex queries time out and Grafana panels load slowly.
Root cause: the query time range is too large and the aggregation is inefficient.
# Bad: rate over a huge time range forces a large aggregation at query time
rate(http_requests_total[1d])
# Good: query a pre-computed recording rule (defined below) instead
job:http_requests:rate5m

Pitfall 3: Storage Growth
Production workloads often generate far more data than planned, and disk usage quickly outgrows initial estimates.
# Storage optimisation -- in stock Prometheus these are start-up flags, not prometheus.yml settings
--storage.tsdb.retention.time=30d        # keep at most 30 days of data
--storage.tsdb.retention.size=100GB      # cap total block size; whichever limit is hit first wins
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=36h

Performance Tuning in Practice
Memory Optimisation
Adjust system parameters and Prometheus start‑up flags.
# System-level tuning
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
echo 'fs.file-max=65536' >> /etc/sysctl.conf
sysctl -p    # apply the new kernel parameters
# Prometheus start-up
./prometheus \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB \
  --query.max-concurrency=20 \
  --query.max-samples=50000000

Recording Rules Optimisation
Pre‑compute complex queries to improve latency.
groups:
  - name: http_requests
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_requests_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_requests_error_rate
        expr: job:http_requests_errors:rate5m / job:http_requests:rate5m

Storage Layer Optimisation
Remote storage (e.g., Thanos) handles long-term retention beyond the local TSDB.
# Remote write configuration
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 10000
      batch_send_deadline: 5s
      max_shards: 200

High-Availability Deployment Practices
Multi‑Replica Deployment
# Kubernetes StatefulSet for Prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus        # headless Service required by StatefulSets
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus          # must match the selector above
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          args:
            - '--storage.tsdb.path=/prometheus'
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          resources:
            requests:
              memory: "4Gi"
              cpu: "1000m"
            limits:
              memory: "8Gi"
              cpu: "2000m"

Data Consistency with Thanos
# Thanos sidecar
- name: thanos-sidecar
  image: thanosio/thanos:v0.31.0
  args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/objstore.yml

Key Metrics Monitoring
Prometheus Self‑Monitoring
# TSDB metrics
prometheus_tsdb_head_series                    # number of in-memory series
prometheus_tsdb_head_samples_appended_total    # ingestion volume (combine with rate())
# Configuration status
prometheus_config_last_reload_successful       # 1 if the last config reload succeeded
# Query performance metrics
prometheus_engine_query_duration_seconds       # query latency
prometheus_engine_queries_concurrent_max       # configured query concurrency limit

Alerting Rules Design
groups:
  - name: prometheus.rules
    rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed"
      - alert: PrometheusQueryHigh
        expr: rate(prometheus_engine_query_duration_seconds_sum[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus query latency high"

Troubleshooting Techniques
Common Commands
# Validate configuration
./promtool check config prometheus.yml
# Validate rules
./promtool check rules /etc/prometheus/rules/*.yml
# Inspect TSDB status
curl localhost:9090/api/v1/status/tsdb
# Analyse query performance
curl 'localhost:9090/api/v1/query?query=up&stats=all'

Performance Analysis Tools
Use Go pprof to profile Prometheus.
# CPU profile
go tool pprof http://localhost:9090/debug/pprof/profile
# Heap profile
go tool pprof http://localhost:9090/debug/pprof/heap

Best-Practice Summary
Label Design Principles
Control cardinality: keep the number of distinct values for any single label below roughly 100k (a quick check is sketched below).
Clear semantics: label names and values must be meaningful.
Reasonable hierarchy: avoid overly deep nesting.
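A quick way to see how many distinct values a label currently has, as a hedged PromQL sketch (the label name path is purely illustrative):

# Number of distinct `path` values across all series that carry the label
count(count by (path) ({path!=""}))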
Query Optimisation Strategies
Use recording rules to pre‑compute heavy metrics.
Limit query time range to avoid large‑scale aggregation.
Prefer efficient functions: e.g., rate() over a short window rather than increase() over a long range.
Storage Planning Advice
SSD storage: the TSDB is I/O-intensive, so use SSDs.
Reserve space: keep at least 50% of the data volume free (see the alert sketch below).
Regular cleanup: configure appropriate retention policies.
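A hedged alert sketch for the 50% headroom rule, assuming node_exporter is scraped and the Prometheus data volume is mounted at /data/prometheus (both are assumptions about the environment):

- alert: PrometheusDiskHeadroomLow
  # Fires when less than half of the data volume is free
  expr: |
    node_filesystem_avail_bytes{mountpoint="/data/prometheus"}
      / node_filesystem_size_bytes{mountpoint="/data/prometheus"} < 0.5
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus data volume has less than 50% free space"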
Advanced Optimisation Directions
Automatic Scaling
Scale Prometheus clusters based on query load and storage utilisation.
Intelligent Routing
Route queries to the most suitable Prometheus instance according to query patterns.
Machine‑Learning‑Driven Optimisation
Predict resource demand with ML models to perform proactive capacity planning.
Conclusion
Building a highly available Prometheus monitoring system requires careful architecture, performance tuning, and robust troubleshooting. The practical tips and pitfalls shared here aim to help you deploy a stable, reliable monitoring solution that delivers actionable insights when they matter most.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.