Build a Production-Ready Prometheus HA Architecture with Federation & Remote Storage
This guide walks through designing and implementing a robust, enterprise‑grade Prometheus high‑availability solution using federation clusters, remote storage back‑ends, Docker‑Compose deployments, health‑check scripts, and best‑practice recommendations for monitoring, security, and performance.
Prometheus High Availability Solution: Federation Cluster and Remote Storage in Practice
Preface: Why Have 99% of Ops Engineers Hit Pitfalls with Prometheus HA?
When your monitoring system crashes and the boss asks "What happened?", you may wonder: if Prometheus itself is a single point of failure, what monitors Prometheus?
This is not a joke; it's a hard‑learned lesson. Today we share a production‑validated Prometheus HA solution covering federation architecture design and remote storage best practices.
TL;DR: This article walks you through building an enterprise-grade Prometheus HA stack from scratch, including full configuration files and failover drills. Estimated reading time: 15 minutes; bookmark it for a closer read later.
1. High‑Availability Architecture Design Ideas
1.1 Fatal Flaws of a Single‑Node Prometheus
# Traditional single-node deployment issues
problems:
  - Single point of failure: server down = monitoring blind
  - Storage limits: local disk space limited
  - Query performance: large historical queries slow
  - Scaling difficulty: cannot scale horizontally
1.2 Core Principles of HA Architecture
No data loss + No service interruption + High query performance
Our solution is based on three layers:
Application-level HA: multiple Prometheus instances + load balancer
Data-level HA: remote storage + data replication
Query-level HA: federation cluster + query sharding
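The application-level layer can be illustrated with a small client-side failover helper: given several replica URLs, use the first one that answers its health endpoint. This is only a sketch, since a real deployment would normally put a load balancer in front of the replicas. The function names and replica URLs are hypothetical; `/-/healthy` is the health endpoint that Prometheus exposes.

```shell
#!/usr/bin/env bash

# Probe one Prometheus replica via its health endpoint.
# Succeeds only on an HTTP 2xx answer within 3 seconds.
probe_http() {
  curl -fsS --max-time 3 "$1/-/healthy" >/dev/null 2>&1
}

# Print the first URL for which the probe command succeeds.
# Usage: first_healthy <probe-command> <url>...
first_healthy() {
  local probe=$1
  shift
  local url
  for url in "$@"; do
    if "$probe" "$url"; then
      printf '%s\n' "$url"
      return 0
    fi
  done
  return 1  # no replica answered
}
```

A caller could run `first_healthy probe_http http://prom-1:9090 http://prom-2:9090` (hypothetical hosts) and send its queries to whichever URL is printed.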
2. Federation Cluster Practical Implementation
2.1 Architecture Diagram
                  ┌───────────────────┐
                  │ Global Prometheus │
                  │   (Federation)    │
                  └─────────┬─────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
   ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
   │Prometheus-1 │   │Prometheus-2 │   │Prometheus-N │
   │  (Biz A)    │   │  (Biz B)    │   │  (Infra)    │
   └─────────────┘   └─────────────┘   └─────────────┘
2.2 Federation Configuration Details
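Federation means the global server scrapes each business server's `/federate` endpoint, passing one or more `match[]` selectors that choose which series to pull. Before wiring this into a scrape job, it can help to issue the same request by hand. The helper below is a sketch: the function name and the `CURL` override (used only for dry runs) are hypothetical, and the example URL assumes the host port mapping from the Docker Compose file in section 2.3.

```shell
#!/usr/bin/env bash

# Fetch federated series from one Prometheus server.
# Usage: federate_get <base-url> <selector>...
# Each selector is sent as a URL-encoded match[] query parameter.
federate_get() {
  local base_url=$1
  shift
  local args=(-sG "${base_url}/federate")
  local selector
  for selector in "$@"; do
    args+=(--data-urlencode "match[]=${selector}")
  done
  # CURL is a hypothetical override so the command can be inspected without network access.
  "${CURL:-curl}" "${args[@]}"
}
```

For example, `federate_get http://localhost:9091 '{job=~"business-a-.*"}'` should return the series that the `federate-business-a` job would scrape, in the Prometheus text exposition format.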
Global Prometheus configuration (prometheus‑global.yml)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'global'
    replica: '1'
rule_files:
  - "global_rules.yml"
scrape_configs:
  # Federation scrape config
  - job_name: 'federate-business-a'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-a-.*"}'
        - 'up{job=~"business-a-.*"}'
        - 'http_requests_total{job=~"business-a-.*"}'
        - 'mysql_up{job=~"business-a-.*"}'
    static_configs:
      - targets:
          - 'prometheus-business-a:9090'
  - job_name: 'federate-business-b'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-b-.*"}'
        - 'up{job=~"business-b-.*"}'
        - 'redis_up{job=~"business-b-.*"}'
    static_configs:
      - targets:
          - 'prometheus-business-b:9090'
# Remote write configuration
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 10000
      max_shards: 200
Business Prometheus configuration example
# prometheus-business-a.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'business-a'
    replica: 'a1'
scrape_configs:
  - job_name: 'business-a-web'
    static_configs:
      - targets: ['web1:8080', 'web2:8080']
  - job_name: 'business-a-mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
# Local retention (7d / 50GB) is not a config-file setting in Prometheus v2.45.
# Set it with the --storage.tsdb.retention.time and --storage.tsdb.retention.size
# command-line flags instead.
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
2.3 Docker-Compose Deployment File
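Before starting the stack, each configuration file can be checked with `promtool`, which ships in the `prom/prometheus` image. The wrapper below is a sketch: the function name is hypothetical, the `./config` layout is taken from the volume mounts in this section, and `DOCKER` is a hypothetical override used only to dry-run the command.

```shell
#!/usr/bin/env bash

# Validate a Prometheus config file with promtool inside the official image.
# Usage: check_prom_config <absolute-config-dir> <config-file-name>
# The whole directory is mounted so that relative rule_files entries resolve too.
check_prom_config() {
  local config_dir=$1
  local config_name=$2
  "${DOCKER:-docker}" run --rm \
    -v "${config_dir}:/work:ro" \
    --entrypoint /bin/promtool \
    prom/prometheus:v2.45.0 \
    check config "/work/${config_name}"
}
```

Running `check_prom_config "$PWD/config" prometheus-global.yml` exits non-zero when the file, or a rule file it references, does not load.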
version: '3.8'
services:
  # Global Prometheus
  prometheus-global:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-global
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus-global.yml:/etc/prometheus/prometheus.yml
      - prometheus-global-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
  # Business A Prometheus
  prometheus-business-a:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-a
    ports:
      - "9091:9090"
    volumes:
      - ./config/prometheus-business-a.yml:/etc/prometheus/prometheus.yml
      - prometheus-a-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'
  # Business B Prometheus
  prometheus-business-b:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-b
    ports:
      - "9092:9090"
    volumes:
      - ./config/prometheus-business-b.yml:/etc/prometheus/prometheus.yml
      - prometheus-b-data:/prometheus
volumes:
  prometheus-global-data:
  prometheus-a-data:
  prometheus-b-data:
3. Remote Storage Solution Selection and Practice
3.1 Storage Solution Comparison
Recommended solution: VictoriaMetrics for single-node setups, Thanos for large-scale deployments
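When weighing these options, a rough disk estimate helps. The Prometheus documentation gives the rule of thumb `needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample for its local TSDB. The sketch below applies that formula; the function name and example inputs are hypothetical, the 2-byte default is a conservative assumption, and other back-ends compress differently, so treat the result as an order-of-magnitude figure.

```shell
#!/usr/bin/env bash

# Estimate disk usage in GiB from retention and ingestion rate.
# Usage: estimate_disk_gib <retention-days> <samples-per-second> [bytes-per-sample]
estimate_disk_gib() {
  local retention_days=$1
  local samples_per_second=$2
  local bytes_per_sample=${3:-2}  # assumption: upper end of the 1-2 byte range
  awk -v d="$retention_days" -v s="$samples_per_second" -v b="$bytes_per_sample" \
    'BEGIN { printf "%.1f\n", d * 86400 * s * b / (1024 * 1024 * 1024) }'
}
```

The ingestion rate can be read from a running server with `rate(prometheus_tsdb_head_samples_appended_total[5m])`; for instance, `estimate_disk_gib 30 10000` models 30 days of retention at 10,000 samples per second.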
3.2 VictoriaMetrics Deployment
# docker-compose-victoria.yml
version: '3.8'
services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.93.4
    container_name: victoria-metrics
    ports:
      - "8428:8428"
    volumes:
      - victoria-data:/victoria-metrics-data
    command:
      - '--storageDataPath=/victoria-metrics-data'
      - '--retentionPeriod=1y'
      - '--memory.allowedPercent=80'
      - '--search.maxQueryDuration=60s'
      # The flag is named maxQueryLen; an unknown flag stops the process at startup
      - '--search.maxQueryLen=16384'
  vmagent:
    image: victoriametrics/vmagent:v1.93.4
    ports:
      - "8429:8429"
    volumes:
      - ./vmagent.yml:/etc/vmagent/vmagent.yml
    command:
      - '-promscrape.config=/etc/vmagent/vmagent.yml'
      - '-remoteWrite.url=http://victoria-metrics:8428/api/v1/write'
volumes:
  victoria-data:
3.3 Thanos Full Deployment
# docker-compose-thanos.yml
version: '3.8'
services:
  # Thanos Sidecar
  thanos-sidecar-global:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-sidecar-global
    command:
      - sidecar
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-global:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus-global-data:/prometheus
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
    ports:
      - "10901:10901"
      - "10902:10902"
  # Thanos Store Gateway
  thanos-store:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-store
    command:
      - store
      - '--data-dir=/var/thanos/store'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-store-data:/var/thanos/store
    ports:
      # Host ports differ from the sidecar's; both cannot bind 10901/10902 on one host
      - "10911:10901"
      - "10912:10902"
  # Thanos Querier
  thanos-query:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-query
    command:
      - query
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:9090'
      - '--store=thanos-sidecar-global:10901'
      - '--store=thanos-store:10901'
      - '--query.replica-label=replica'
    ports:
      - "9099:9090"
  # Thanos Compactor
  thanos-compact:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-compact
    command:
      - compact
      # Keep running and compact continuously instead of exiting after one pass
      - '--wait'
      - '--data-dir=/var/thanos/compact'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--retention.resolution-raw=7d'
      - '--retention.resolution-5m=30d'
      - '--retention.resolution-1h=1y'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-compact-data:/var/thanos/compact
volumes:
  prometheus-global-data:
  thanos-store-data:
  thanos-compact-data:
3.4 S3 Storage Configuration
# thanos/bucket.yml
type: S3
config:
  bucket: "prometheus-thanos"
  endpoint: "s3.amazonaws.com"
  region: "us-east-1"
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
  insecure: false
  signature_version2: false
  encrypt_sse: false
  put_user_metadata:
    "X-Amz-Acl": "bucket-owner-full-control"
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m
4. HA Verification and Failure Drills
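Checks run immediately after the containers start, or right after a restart, can fail simply because a service is still loading. A small retry helper makes the drills in this section less flaky. This is a sketch: the function names are hypothetical, and the probe assumes the `/-/ready` endpoint that Prometheus exposes.

```shell
#!/usr/bin/env bash

# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts=$1
  local delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Readiness probe: succeeds once the server reports it can serve traffic.
prometheus_ready() {
  curl -fsS --max-time 3 "$1/-/ready" >/dev/null 2>&1
}
```

For example, `wait_until 10 3 prometheus_ready http://localhost:9090` waits up to roughly 30 seconds for the global instance before the health checks begin.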
4.1 Service Health‑Check Script
#!/bin/bash
# health_check.sh
check_prometheus() {
  local name=$1
  local url=$2
  if curl -s "${url}/api/v1/query?query=up" | grep -q "success"; then
    echo "✅ ${name} is healthy"
    return 0
  else
    echo "❌ ${name} is down"
    return 1
  fi
}
echo "=== Prometheus cluster health check ==="
check_prometheus "Global Prometheus" "http://localhost:9090"
check_prometheus "Business-A Prometheus" "http://localhost:9091"
check_prometheus "Business-B Prometheus" "http://localhost:9092"
check_prometheus "Thanos Query" "http://localhost:9099"
echo "=== Storage backend check ==="
if curl -s "http://localhost:8428/metrics" | grep -q "vm_"; then
  echo "✅ VictoriaMetrics is healthy"
else
  echo "❌ VictoriaMetrics is down"
fi
4.2 Failure Switch Test
# Simulate Prometheus instance failure
docker stop prometheus-business-a
# Verify federation layer still works.
# -G with --data-urlencode URL-encodes the selector, so curl does not
# treat its braces as URL globbing.
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'
# Verify query via Thanos
curl -sG "http://localhost:9099/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'
# Restore instance
docker start prometheus-business-a
4.3 Data Consistency Validation
#!/bin/bash
QUERY="up"
PROM_URL="http://localhost:9090"
THANOS_URL="http://localhost:9099"
echo "Checking data consistency..."
PROM_RESULT=$(curl -s "${PROM_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
THANOS_RESULT=$(curl -s "${THANOS_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
echo "Prometheus result count: ${PROM_RESULT}"
echo "Thanos result count: ${THANOS_RESULT}"
if [ "${PROM_RESULT}" -eq "${THANOS_RESULT}" ]; then
  echo "✅ Data consistency check passed"
else
  echo "⚠️ Data inconsistency detected, check configuration"
fi
5. Performance Optimization and Monitoring
5.1 Key Performance Indicators
# Important Prometheus metrics to monitor
key_metrics:
  - prometheus_tsdb_head_samples_appended_total              # write rate
  - prometheus_tsdb_compactions_total                        # compaction ops
  - prometheus_rule_evaluation_duration_seconds              # rule eval time
  - prometheus_config_last_reload_success_timestamp_seconds  # config reload
  - go_memstats_alloc_bytes                                  # memory usage
5.2 Alert Rules Configuration
# prometheus_alerts.yml
groups:
  - name: prometheus-ha
    rules:
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus instance has been down for over 1 minute"
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed"
      - alert: ThanosQueryDown
        expr: up{job="thanos-query"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Query service unavailable"
5.3 Grafana Dashboard (JSON snippet omitted for brevity)
6. Production Best Practices
6.1 Resource Planning Recommendations
# Production resource specs
production_specs:
  prometheus_global:
    cpu: "2 cores"
    memory: "8GB"
    disk: "200GB SSD"
  prometheus_business:
    cpu: "1 core"
    memory: "4GB"
    disk: "100GB SSD"
  victoria_metrics:
    cpu: "4 cores"
    memory: "16GB"
    disk: "1TB SSD"
  thanos_components:
    cpu: "1 core each"
    memory: "2GB each"
    disk: "50GB each"
6.2 Security Hardening Measures
# Security checklist
- ✅ Enable HTTPS for transport encryption
- ✅ Configure access authentication (Basic Auth/OAuth)
- ✅ Restrict network access via firewall rules
- ✅ Regularly update component versions
- ✅ Monitor abnormal access logs
- ✅ Backup critical configuration files
6.3 Operations Automation Scripts
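For configuration-only changes, a reload is lighter than a restart. Prometheus re-reads its configuration on an HTTP POST to `/-/reload` when started with `--web.enable-lifecycle`; in the Compose file from section 2.3 only `prometheus-global` sets that flag, so the business instances would need it added first. A sketch, with a hypothetical function name and a hypothetical `CURL` override for dry runs:

```shell
#!/usr/bin/env bash

# Ask a Prometheus server to reload its configuration without restarting.
# Usage: reload_prometheus <base-url>
# Requires the server to run with --web.enable-lifecycle.
reload_prometheus() {
  local base_url=$1
  # -f makes curl exit non-zero on an HTTP error, e.g. when the lifecycle API is disabled.
  "${CURL:-curl}" -fsS -X POST "${base_url}/-/reload"
}
```

After `reload_prometheus http://localhost:9090`, the `prometheus_config_last_reload_successful` metric used in the alert rules of section 5.2 shows whether the new configuration was accepted.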
#!/bin/bash
# prometheus_maintenance.sh
# Backup configuration
backup_config() {
  DATE=$(date +%Y%m%d_%H%M%S)
  tar -czf "/backup/prometheus_config_${DATE}.tar.gz" ./config/
  echo "Configuration backed up to: /backup/prometheus_config_${DATE}.tar.gz"
}
# Rolling restart
rolling_restart() {
  services=("prometheus-business-a" "prometheus-business-b" "prometheus-global")
  for service in "${services[@]}"; do
    echo "Restarting ${service}..."
    docker restart "${service}"
    sleep 30
    if ! docker ps | grep -q "${service}"; then
      echo "❌ ${service} restart failed"
      exit 1
    fi
    echo "✅ ${service} restart succeeded"
  done
}
backup_config
rolling_restart
echo "✅ Maintenance completed"
Conclusion
By combining federation clusters with remote storage, we built an enterprise‑grade Prometheus HA architecture. Core takeaways:
Architecture design: layered monitoring, global federation, business isolation
Storage strategy: short-term local + long-term remote, balancing performance and cost
HA guarantees: multiple instances, automatic failover
Ops friendliness: self-monitoring, timely alerts, simplified operations
This solution has run stably in production for over 18 months, handling more than 100k time series with 99.9% availability.
MaGe Linux Operations