Build a High‑Availability Prometheus Monitoring System from Scratch: Pitfalls & Performance Tuning
This guide walks you through building a production-grade, highly available Prometheus monitoring stack. It covers architecture choices, sharding strategies, and common pitfalls such as memory bloat, query latency and storage growth, and provides concrete tuning steps, Kubernetes deployment examples, and advanced optimisation techniques.
Why Choose Prometheus?
In cloud-native environments, the pull model, the multi-dimensional data model and the expressive PromQL query language make Prometheus the de facto monitoring solution for microservice architectures.
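As a quick illustration of the multi-dimensional model, a single PromQL expression can slice the same counter by any label; a hedged example (the metric and label names are illustrative, not from this setup):

# Per-instance rate of 5xx responses for the api-gateway job over the last 5 minutes
sum(rate(http_requests_total{job="api-gateway", status=~"5.."}[5m])) by (instance)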
Architecture Design: Foundations of High Availability
Core Architecture Principle
Federated cluster mode is recommended for production. Example configuration:
# Federated configuration example
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'

Sharding Strategy
Infrastructure sharding: physical machines and network devices.
Application sharding: split by business line.
Middleware sharding: databases, caches and message queues (a per-shard configuration sketch follows).
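As a rough sketch of one shard's scrape scope (the shard name, exporter targets and ports below are assumptions for illustration), each shard carries an external label so federation and remote storage can tell the shards apart:

# prometheus.yml for the middleware shard (illustrative targets)
global:
  external_labels:
    shard: middleware                        # assumption: identifies this shard downstream
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']     # assumed exporter address
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']     # assumed exporter address

The federation job shown above then scrapes each shard's /federate endpoint to build the global view.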
Production Pitfalls
Pitfall 1: Uncontrolled Memory Usage
Symptom: Prometheus memory keeps growing until the process is OOM-killed.
Root cause: high-cardinality labels explode the time-series count.
# Count the number of distinct metric names
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq '.data[]' | wc -l
# Check the number of series currently held in memory
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'
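For a more direct view of the worst offenders, the TSDB status endpoint breaks series counts down by metric name; a hedged sketch (the jq path assumes the usual shape of the /api/v1/status/tsdb response):

# Top metric names by series count
curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName'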
Solution: limit label cardinality with metric relabeling.
# Drop high-cardinality metrics and mask high-cardinality label values
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'high_cardinality_metric.*'
    action: drop
  # Mask user_id so each user no longer creates a separate series
  - source_labels: [user_id]
    regex: '.+'                # only rewrite series that actually carry a user_id
    target_label: user_id
    replacement: 'masked'
    action: replace

Pitfall 2: Query Performance Issues
Symptom: complex queries time out and Grafana panels load slowly.
Root cause: the query time range is too large and the aggregation is inefficient.
# Bad: rate over a huge time range forces a large aggregation at query time
rate(http_requests_total[1d])
# Good: query a pre-computed recording rule (defined below) instead
job:http_requests:rate5m

Pitfall 3: Storage Growth
Production workloads often generate far more data than planned, and disk usage quickly outgrows initial estimates.
# Storage optimisation -- in stock Prometheus these are start-up flags, not prometheus.yml settings
--storage.tsdb.retention.time=30d        # keep at most 30 days of data
--storage.tsdb.retention.size=100GB      # cap total block size; whichever limit is hit first wins
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=36h

Performance Tuning in Practice
Memory Optimisation
Adjust system parameters and Prometheus start‑up flags.
# System-level tuning
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
echo 'fs.file-max=65536' >> /etc/sysctl.conf
sysctl -p    # apply the new kernel parameters
# Prometheus start-up
./prometheus \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB \
  --query.max-concurrency=20 \
  --query.max-samples=50000000

Recording Rules Optimisation
Pre‑compute complex queries to improve latency.
groups:
  - name: http_requests
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_requests_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - record: job:http_requests_error_rate
        expr: job:http_requests_errors:rate5m / job:http_requests:rate5m

Storage Layer Optimisation
Remote storage (e.g., Thanos) handles long-term retention beyond the local TSDB.
# Remote write configuration
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 10000
      batch_send_deadline: 5s
      max_shards: 200

High-Availability Deployment Practices
Multi‑Replica Deployment
# Kubernetes StatefulSet for Prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus        # headless Service required by StatefulSets
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus          # must match the selector above
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          args:
            - '--storage.tsdb.path=/prometheus'
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          resources:
            requests:
              memory: "4Gi"
              cpu: "1000m"
            limits:
              memory: "8Gi"
              cpu: "2000m"

Data Consistency with Thanos
# Thanos sidecar
- name: thanos-sidecar
  image: thanosio/thanos:v0.31.0
  args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/objstore.yml

Key Metrics Monitoring
Prometheus Self‑Monitoring
# TSDB metrics
prometheus_tsdb_head_series                    # number of in-memory series
prometheus_tsdb_head_samples_appended_total    # ingestion volume (combine with rate())
# Configuration status
prometheus_config_last_reload_successful       # 1 if the last config reload succeeded
# Query performance metrics
prometheus_engine_query_duration_seconds       # query latency
prometheus_engine_queries_concurrent_max       # configured query concurrency limit

Alerting Rules Design
groups:
  - name: prometheus.rules
    rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed"
      - alert: PrometheusQueryHigh
        expr: rate(prometheus_engine_query_duration_seconds_sum[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus query latency high"

Troubleshooting Techniques
Common Commands
# Validate configuration
./promtool check config prometheus.yml
# Validate rules
./promtool check rules /etc/prometheus/rules/*.yml
# Inspect TSDB status
curl localhost:9090/api/v1/status/tsdb
# Analyse query performance
curl 'localhost:9090/api/v1/query?query=up&stats=all'

Performance Analysis Tools
Use Go pprof to profile Prometheus.
# CPU profile
go tool pprof http://localhost:9090/debug/pprof/profile
# Heap profile
go tool pprof http://localhost:9090/debug/pprof/heap

Best-Practice Summary
Label Design Principles
Control cardinality: keep the number of distinct values for any single label below roughly 100k (a quick check is sketched below).
Clear semantics: label names and values must be meaningful.
Reasonable hierarchy: avoid overly deep nesting.
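A quick way to see how many distinct values a label currently has, as a hedged PromQL sketch (the label name path is purely illustrative):

# Number of distinct `path` values across all series that carry the label
count(count by (path) ({path!=""}))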
Query Optimisation Strategies
Use recording rules to pre‑compute heavy metrics.
Limit query time range to avoid large‑scale aggregation.
Prefer efficient functions: e.g., rate() over a short window rather than increase() over a long range.
Storage Planning Advice
SSD storage: the TSDB is I/O-intensive, so use SSDs.
Reserve space: keep at least 50% of the data volume free (see the alert sketch below).
Regular cleanup: configure appropriate retention policies.
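A hedged alert sketch for the 50% headroom rule, assuming node_exporter is scraped and the Prometheus data volume is mounted at /data/prometheus (both are assumptions about the environment):

- alert: PrometheusDiskHeadroomLow
  # Fires when less than half of the data volume is free
  expr: |
    node_filesystem_avail_bytes{mountpoint="/data/prometheus"}
      / node_filesystem_size_bytes{mountpoint="/data/prometheus"} < 0.5
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus data volume has less than 50% free space"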
Advanced Optimisation Directions
Automatic Scaling
Scale Prometheus clusters based on query load and storage utilisation.
Intelligent Routing
Route queries to the most suitable Prometheus instance according to query patterns.
Machine‑Learning‑Driven Optimisation
Predict resource demand with ML models to perform proactive capacity planning.
Conclusion
Building a highly available Prometheus monitoring system requires careful architecture, performance tuning, and robust troubleshooting. The practical tips and pitfalls shared here aim to help you deploy a stable, reliable monitoring solution that delivers actionable insights when they matter most.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.