Build a Production‑Ready Prometheus HA Architecture with Federation and Remote Storage
Learn how to design and implement a robust, production‑grade Prometheus high‑availability solution using a federated global cluster, multiple business‑level instances, remote storage with Thanos or VictoriaMetrics, Docker‑Compose deployment, health‑check scripts, performance metrics, alerting rules, and best‑practice operational guidelines.
Introduction
When the monitoring system itself goes down, you lose visibility at exactly the moment you need it most. This article explains why a single-node Prometheus deployment is a single point of failure and walks through building a production-grade high-availability (HA) solution around federation and remote storage.
HA Architecture Design
Problems of a single Prometheus instance
# Traditional single-node problems
problems:
- Single point of failure: server down = monitoring completely blind
- Storage limits: local disk capacity is finite
- Query performance: queries over large amounts of historical data are slow
- Scaling difficulty: no horizontal scale-out

Core HA principles
Data not lost + Service uninterrupted + High‑performance queries
Application layer HA: multiple Prometheus instances behind a load balancer
Data layer HA: remote storage with data replication
Query layer HA: federation cluster + query sharding
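The query-layer idea can be sketched from the client side as well: if two identical replicas scrape the same targets, a caller can fall back between them. The replica URLs below are hypothetical placeholders, not part of the deployment described in this article:

```shell
#!/bin/bash
# query_ha.sh - illustrative query-layer failover: try each identical
# Prometheus replica in turn and return the first successful answer.
REPLICAS=("http://prometheus-replica-1:9090" "http://prometheus-replica-2:9090")

query_ha() {
  local promql=$1 base result
  for base in "${REPLICAS[@]}"; do
    # -f makes curl fail on HTTP errors, so we fall through to the next replica
    if result=$(curl -sf --max-time 5 "${base}/api/v1/query" \
        --data-urlencode "query=${promql}"); then
      echo "${result}"
      return 0
    fi
  done
  echo "all replicas unreachable" >&2
  return 1
}
```

In practice this is exactly what a load balancer or Thanos Query does for you; the function just makes the fallback logic explicit.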
Federation Cluster in Practice
Architecture diagram
          ┌───────────────────┐
          │ Global Prometheus │
          │   (Federation)    │
          └─────────┬─────────┘
                    │
   ┌────────────────┼────────────────┐
   │                │                │
┌──▼───────────┐ ┌──▼───────────┐ ┌──▼───────────┐
│ Prometheus-1 │ │ Prometheus-2 │ │ Prometheus-N │
│ (Business A) │ │ (Business B) │ │   (Infra)    │
└──────────────┘ └──────────────┘ └──────────────┘

Global Prometheus configuration (prometheus-global.yml)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'global'
    replica: '1'

rule_files:
  - "global_rules.yml"

scrape_configs:
  # Federation scrape
  - job_name: 'federate-business-a'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-a-.*"}'
        - 'up{job=~"business-a-.*"}'
        - 'http_requests_total{job=~"business-a-.*"}'
        - 'mysql_up{job=~"business-a-.*"}'
    static_configs:
      - targets: ['prometheus-business-a:9090']

  - job_name: 'federate-business-b'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-b-.*"}'
        - 'up{job=~"business-b-.*"}'
        - 'redis_up{job=~"business-b-.*"}'
    static_configs:
      - targets: ['prometheus-business-b:9090']

# Remote write to Thanos
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 10000
      max_shards: 200

Business Prometheus configuration example
# prometheus-business-a.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'business-a'
    replica: 'a1'

scrape_configs:
  - job_name: 'business-a-web'
    static_configs:
      - targets: ['web1:8080', 'web2:8080']
  - job_name: 'business-a-mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']

# Note: TSDB retention is not a prometheus.yml setting; pass it as startup flags:
#   --storage.tsdb.retention.time=7d --storage.tsdb.retention.size=50GB

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"

Docker-Compose deployment file
version: '3.8'

services:
  # Global Prometheus
  prometheus-global:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-global
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus-global.yml:/etc/prometheus/prometheus.yml
      - prometheus-global-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'

  # Business A Prometheus
  prometheus-business-a:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-a
    ports:
      - "9091:9090"
    volumes:
      - ./config/prometheus-business-a.yml:/etc/prometheus/prometheus.yml
      - prometheus-a-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'

  # Business B Prometheus
  prometheus-business-b:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-b
    ports:
      - "9092:9090"
    volumes:
      - ./config/prometheus-business-b.yml:/etc/prometheus/prometheus.yml
      - prometheus-b-data:/prometheus

volumes:
  prometheus-global-data:
  prometheus-a-data:
  prometheus-b-data:

Remote Storage Options and Practical Deployment
Comparison of storage solutions
Thanos – open source, S3‑compatible, high compression; downside: complex configuration, higher ops cost; best for large‑scale, cost‑sensitive scenarios.
VictoriaMetrics – excellent performance, high compression, good compatibility; downside: newer, smaller community; best for high‑performance needs.
Cortex – full feature set, multi‑tenant support; downside: complex architecture, high resource consumption; best for enterprise‑grade multi‑tenant environments.
Recommended combo: VictoriaMetrics (single‑node) for moderate scale plus Thanos for massive scale.
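If you take the VictoriaMetrics route, the Prometheus instances point remote_write at its single-node write endpoint instead of thanos-receive. A sketch — the service name and port 8428 are VictoriaMetrics defaults and assume a Compose network; adjust to your deployment:

```yaml
# Hypothetical fragment for a business instance (e.g. prometheus-business-a.yml)
remote_write:
  - url: "http://victoria-metrics:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 1000
      capacity: 10000
```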
VictoriaMetrics deployment
# docker-compose-victoria.yml
version: '3.8'

services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.93.4
    container_name: victoria-metrics
    ports:
      - "8428:8428"
    volumes:
      - victoria-data:/victoria-metrics-data
    command:
      - '--storageDataPath=/victoria-metrics-data'
      - '--retentionPeriod=1y'
      - '--memory.allowedPercent=80'
      - '--search.maxQueryDuration=60s'
      - '--search.maxQueryLen=16384'

  vmagent: # optional scrape agent
    image: victoriametrics/vmagent:v1.93.4
    ports:
      - "8429:8429"
    volumes:
      - ./vmagent.yml:/etc/vmagent/vmagent.yml
    command:
      - '-promscrape.config=/etc/vmagent/vmagent.yml'
      - '-remoteWrite.url=http://victoria-metrics:8428/api/v1/write'

volumes:
  victoria-data:

Thanos full deployment
# docker-compose-thanos.yml
version: '3.8'

services:
  # Thanos Sidecar for the global Prometheus
  thanos-sidecar-global:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-sidecar-global
    command:
      - sidecar
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-global:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus-global-data:/prometheus
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
    ports:
      - "10901:10901"
      - "10902:10902"

  # Thanos Store Gateway
  thanos-store:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-store
    command:
      - store
      - '--data-dir=/var/thanos/store'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-store-data:/var/thanos/store
    ports:
      - "10911:10901" # different host ports to avoid clashing with the sidecar
      - "10912:10902"

  # Thanos Querier
  thanos-query:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-query
    command:
      - query
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:9090'
      - '--store=thanos-sidecar-global:10901' # container-internal ports
      - '--store=thanos-store:10901'
      - '--query.replica-label=replica'
    ports:
      - "9099:9090"

  # Thanos Compactor
  thanos-compact:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-compact
    command:
      - compact
      - '--data-dir=/var/thanos/compact'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--retention.resolution-raw=7d'
      - '--retention.resolution-5m=30d'
      - '--retention.resolution-1h=1y'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-compact-data:/var/thanos/compact

volumes:
  prometheus-global-data: # same volume the Prometheus compose uses; mark it external if the files are deployed separately
  thanos-store-data:
  thanos-compact-data:

S3 storage configuration for Thanos
# thanos/bucket.yml
type: S3
config:
  bucket: "prometheus-thanos"
  endpoint: "s3.amazonaws.com" # or MinIO: "minio:9000"
  region: "us-east-1"
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
  insecure: false
  signature_version2: false
  encrypt_sse: false
  put_user_metadata:
    "X-Amz-Acl": "bucket-owner-full-control"
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m

HA Validation and Failure Drills
Service health‑check script
#!/bin/bash
# health_check.sh

check_prometheus() {
  local name=$1
  local url=$2
  if curl -s "${url}/api/v1/query?query=up" | grep -q "success"; then
    echo "✅ ${name} is healthy"
    return 0
  else
    echo "❌ ${name} is down"
    return 1
  fi
}

echo "=== Prometheus cluster health check ==="
check_prometheus "Global Prometheus" "http://localhost:9090"
check_prometheus "Business-A Prometheus" "http://localhost:9091"
check_prometheus "Business-B Prometheus" "http://localhost:9092"
check_prometheus "Thanos Query" "http://localhost:9099"

echo "=== Storage backend check ==="
if curl -s "http://localhost:8428/metrics" | grep -q "vm_"; then
  echo "✅ VictoriaMetrics is healthy"
else
  echo "❌ VictoriaMetrics is down"
fi

Failure-switch test
# Simulate a Prometheus instance failure
docker stop prometheus-business-a

# Verify federation still serves the last scraped data
# (-g disables curl's brace globbing so the PromQL selector passes through)
curl -g "http://localhost:9090/api/v1/query?query=up{job=~'business-a-.*'}"

# Verify the query path via Thanos
curl -g "http://localhost:9099/api/v1/query?query=up{job=~'business-a-.*'}"

# Restore the instance
docker start prometheus-business-a

Data consistency verification script
#!/bin/bash
# data_consistency_check.sh

QUERY="up"
PROM_URL="http://localhost:9090"
THANOS_URL="http://localhost:9099"

echo "Checking data consistency..."
PROM_RESULT=$(curl -s "${PROM_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
THANOS_RESULT=$(curl -s "${THANOS_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')

echo "Prometheus result count: ${PROM_RESULT}"
echo "Thanos result count: ${THANOS_RESULT}"

if [ "${PROM_RESULT}" -eq "${THANOS_RESULT}" ]; then
  echo "✅ Data consistency check passed"
else
  echo "⚠️ Data inconsistency detected, investigate the configuration"
fi

Performance Metrics and Alert Rules
Key metrics to monitor Prometheus itself
key_metrics:
  - prometheus_tsdb_head_samples_appended_total             # write rate
  - prometheus_tsdb_compactions_total                       # compaction ops
  - prometheus_rule_evaluation_duration_seconds             # rule eval time
  - prometheus_config_last_reload_success_timestamp_seconds # last successful reload
  - go_memstats_alloc_bytes                                 # memory usage

Alerting rules (prometheus_alerts.yml)
groups:
  - name: prometheus-ha
    rules:
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus instance has been down for more than 1 minute"

      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed"

      - alert: ThanosQueryDown
        expr: up{job="thanos-query"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Query service unavailable"

Production Best Practices
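Before any sizing or hardening, one cheap habit prevents most self-inflicted outages: lint every configuration and rule file before reloading. A sketch — the paths follow the ./config layout used in this article, and promtool ships with every Prometheus release:

```shell
#!/bin/bash
# validate_configs.sh - lint Prometheus configs and rule files with promtool
# before triggering a reload. Paths assume the ./config layout above.
validate_configs() {
  local dir=${1:-./config} cfg
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping lint" >&2
    return 0
  fi
  for cfg in "${dir}"/prometheus-*.yml; do
    [ -e "${cfg}" ] || continue            # glob matched nothing
    promtool check config "${cfg}" || return 1
  done
  if [ -e "${dir}/global_rules.yml" ]; then
    promtool check rules "${dir}/global_rules.yml" || return 1
  fi
  echo "configs valid"
}
```

Hooked into CI or a maintenance script, this ensures a broken file never reaches the /-/reload endpoint enabled by --web.enable-lifecycle.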
Resource planning
production_specs:
  prometheus_global:
    cpu: "2 cores"
    memory: "8GB"
    disk: "200GB SSD"
  prometheus_business:
    cpu: "1 core"
    memory: "4GB"
    disk: "100GB SSD"
  victoria_metrics:
    cpu: "4 cores"
    memory: "16GB"
    disk: "1TB SSD"
  thanos_components:
    cpu: "1 core each"
    memory: "2GB each"
    disk: "50GB each"

Security hardening checklist
✅ Enable HTTPS for transport encryption
✅ Configure access authentication (Basic Auth / OAuth)
✅ Restrict network access with firewall rules
✅ Regularly update component versions
✅ Monitor abnormal access logs
✅ Backup critical configuration files
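For the first two checklist items, Prometheus (v2.24+) can terminate TLS and enforce basic auth itself via a web config file passed with --web.config.file. A minimal sketch — the certificate paths and bcrypt hash are placeholders you must generate yourself:

```yaml
# web.yml - enable TLS and basic auth on the Prometheus HTTP endpoints
tls_server_config:
  cert_file: /etc/prometheus/certs/prometheus.crt
  key_file: /etc/prometheus/certs/prometheus.key
basic_auth_users:
  # bcrypt hash, e.g. generated with: htpasswd -nBC 10 "" | tr -d ':\n'
  admin: "$2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH"
```

Start Prometheus with --web.config.file=/etc/prometheus/web.yml; the same mechanism works for most official exporters built on the exporter-toolkit.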
Operational automation scripts
#!/bin/bash
# prometheus_maintenance.sh

# Backup configuration
backup_config() {
  DATE=$(date +%Y%m%d_%H%M%S)
  tar -czf "/backup/prometheus_config_${DATE}.tar.gz" ./config/
  echo "Configuration backed up to: /backup/prometheus_config_${DATE}.tar.gz"
}

# Rolling restart of services
rolling_restart() {
  services=("prometheus-business-a" "prometheus-business-b" "prometheus-global")
  for service in "${services[@]}"; do
    echo "Restarting ${service}..."
    docker restart "${service}"
    sleep 30 # wait for the service to stabilize
    if ! docker ps | grep -q "${service}"; then
      echo "❌ ${service} restart failed"
      exit 1
    fi
    echo "✅ ${service} restart succeeded"
  done
}

backup_config
rolling_restart
echo "✅ Maintenance completed"

Conclusion
Combining a federation layer for query aggregation with remote storage for long-term retention yields an enterprise-grade Prometheus HA solution: layered monitoring, data durability, failover, and manageable day-to-day operations. In the author's production environment this setup has run stably for over 18 months, handling more than 100k time series at 99.9% availability.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.