Build a Production‑Ready Prometheus HA Architecture with Federation and Remote Storage

Learn how to design and implement a robust, production‑grade Prometheus high‑availability solution using a federated global cluster, multiple business‑level instances, remote storage with Thanos or VictoriaMetrics, Docker‑Compose deployment, health‑check scripts, performance metrics, alerting rules, and best‑practice operational guidelines.

Raymond Ops
Introduction

When the monitoring system itself crashes, every other failure goes unseen. A single‑node Prometheus deployment is a single point of failure, so production monitoring needs an HA design.

HA Architecture Design

Problems of a single Prometheus instance

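# Traditional single-node problems
problems:
- Single point of failure: server down = monitoring goes completely blind
- Storage limits: local disk space is finite
- Query performance: queries over large volumes of historical data are slow
- Hard to scale: no horizontal scale-out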

Core HA principles

Data not lost + Service uninterrupted + High‑performance queries

Application layer HA: multiple Prometheus instances behind a load balancer

Data layer HA: remote storage with data replication

Query layer HA: federation cluster + query sharding

Federation Cluster in Practice

Architecture diagram

                  ┌─────────────────────┐
                  │  Global Prometheus  │
                  │    (Federation)     │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐
   │ Prometheus-1 │   │ Prometheus-2 │   │ Prometheus-N │
   │ (Business A) │   │ (Business B) │   │   (Infra)    │
   └──────────────┘   └──────────────┘   └──────────────┘

Global Prometheus configuration (prometheus‑global.yml)

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'global'
    replica: '1'

rule_files:
  - "global_rules.yml"

scrape_configs:
  # Federation scrape
  - job_name: 'federate-business-a'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
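      'match[]':
        # The first selector already matches every series from the business-a jobs,
        # so the narrower selectors after it are redundant (the same holds for
        # federate-business-b). In production, consider federating only aggregated
        # or alert-relevant series to keep the global instance light.
        - '{job=~"business-a-.*"}'
        - 'up{job=~"business-a-.*"}'
        - 'http_requests_total{job=~"business-a-.*"}'
        - 'mysql_up{job=~"business-a-.*"}'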
    static_configs:
      - targets: ['prometheus-business-a:9090']

  - job_name: 'federate-business-b'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-b-.*"}'
        - 'up{job=~"business-b-.*"}'
        - 'redis_up{job=~"business-b-.*"}'
    static_configs:
      - targets: ['prometheus-business-b:9090']

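# Remote write to Thanos (top-level key, not an entry under scrape_configs).
# Assumes a Thanos Receive service reachable as thanos-receive; the Compose
# files in this article do not define one.
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 10000
      max_shards: 200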
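
To check by hand what a federation job will pull, it can help to build the same /federate request that the global instance sends. The sketch below is one way to do that in shell. The helper names urlencode and federate_url are illustrative and not part of Prometheus, and the encoder assumes ASCII selectors.

```shell
#!/bin/bash
# federate_url.sh (hypothetical helper, not shipped with Prometheus)

# Percent-encode every character except the RFC 3986 unreserved set.
# Assumes ASCII input, which covers typical PromQL selectors.
urlencode() {
  local s="$1" out="" c rest
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character itself
    s="$rest"
    case "$c" in
      [a-zA-Z0-9.~_-]) out="${out}${c}" ;;
      *) out="${out}$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s' "$out"
}

# Build a /federate URL with one match[] parameter per selector.
# Usage: federate_url BASE_URL SELECTOR [SELECTOR...]
federate_url() {
  local base="$1" sep="?" url sel
  shift
  url="${base}/federate"
  for sel in "$@"; do
    url="${url}${sep}match%5B%5D=$(urlencode "$sel")"
    sep="&"
  done
  printf '%s\n' "$url"
}
```

With the port mapping from the Compose file, something like curl -s "$(federate_url http://localhost:9091 '{job=~"business-a-.*"}')" should print the series that the federate-business-a job would ingest.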

Business Prometheus configuration example

# prometheus-business-a.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'business-a'
    replica: 'a1'

scrape_configs:
  - job_name: 'business-a-web'
    static_configs:
      - targets: ['web1:8080', 'web2:8080']

  - job_name: 'business-a-mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']

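# Retention is not a configuration-file setting in Prometheus v2.45.
# Set it with command-line flags instead (see the Compose file):
#   --storage.tsdb.retention.time=7d
#   --storage.tsdb.retention.size=50GB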

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"

Docker‑Compose deployment file

version: '3.8'
services:
  # Global Prometheus
  prometheus-global:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-global
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus-global.yml:/etc/prometheus/prometheus.yml
      - prometheus-global-data:/prometheus
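    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      # Equal min/max block durations disable local compaction, which the
      # Thanos sidecar expects when it uploads blocks to object storage
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      - '--web.enable-lifecycle'
      # Exposes destructive admin endpoints; keep this off untrusted networks
      - '--web.enable-admin-api'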

  # Business A Prometheus
  prometheus-business-a:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-a
    ports:
      - "9091:9090"
    volumes:
      - ./config/prometheus-business-a.yml:/etc/prometheus/prometheus.yml
      - prometheus-a-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'

  # Business B Prometheus
  prometheus-business-b:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-b
    ports:
      - "9092:9090"
    volumes:
      - ./config/prometheus-business-b.yml:/etc/prometheus/prometheus.yml
      - prometheus-b-data:/prometheus

volumes:
  prometheus-global-data:
  prometheus-a-data:
  prometheus-b-data:

Remote Storage Options and Practical Deployment

Comparison of storage solutions

Thanos – open source, S3‑compatible, high compression; downside: complex configuration, higher ops cost; best for large‑scale, cost‑sensitive scenarios.

VictoriaMetrics – excellent performance, high compression, good compatibility; downside: newer, smaller community; best for high‑performance needs.

Cortex – full feature set, multi‑tenant support; downside: complex architecture, high resource consumption; best for enterprise‑grade multi‑tenant environments.

Recommendation: use single‑node VictoriaMetrics at moderate scale and move to Thanos at massive scale.

VictoriaMetrics deployment

# docker-compose-victoria.yml
version: '3.8'
services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.93.4
    container_name: victoria-metrics
    ports:
      - "8428:8428"
    volumes:
      - victoria-data:/victoria-metrics-data
    command:
      - '--storageDataPath=/victoria-metrics-data'
      - '--retentionPeriod=1y'
      - '--memory.allowedPercent=80'
      - '--search.maxQueryDuration=60s'
      - '--search.maxQueryLen=16384'

  vmagent:  # optional sidecar
    image: victoriametrics/vmagent:v1.93.4
    ports:
      - "8429:8429"
    volumes:
      - ./vmagent.yml:/etc/vmagent/vmagent.yml
    command:
      - '-promscrape.config=/etc/vmagent/vmagent.yml'
      - '-remoteWrite.url=http://victoria-metrics:8428/api/v1/write'

volumes:
  victoria-data:

Thanos full deployment

# docker-compose-thanos.yml
version: '3.8'
services:
  # Thanos Sidecar for the global Prometheus
  thanos-sidecar-global:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-sidecar-global
    command:
      - sidecar
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-global:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus-global-data:/prometheus
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
    ports:
      - "10901:10901"
      - "10902:10902"

  # Thanos Store Gateway
  thanos-store:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-store
    command:
      - store
      - '--data-dir=/var/thanos/store'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-store-data:/var/thanos/store
    ports:
      # Host ports differ from the sidecar's 10901/10902 to avoid a bind conflict
      - "10903:10901"
      - "10904:10902"

  # Thanos Querier
  thanos-query:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-query
    command:
      - query
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:9090'
      - '--store=thanos-sidecar-global:10901'
      - '--store=thanos-store:10901'
      - '--query.replica-label=replica'
    ports:
      - "9099:9090"

  # Thanos Compactor
  thanos-compact:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-compact
    command:
      - compact
      # Keep running and compact continuously instead of exiting after one pass
      - '--wait'
      - '--data-dir=/var/thanos/compact'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--retention.resolution-raw=7d'
      - '--retention.resolution-5m=30d'
      - '--retention.resolution-1h=1y'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-compact-data:/var/thanos/compact

volumes:
  prometheus-global-data:
  thanos-store-data:
  thanos-compact-data:

S3 storage configuration for Thanos

# thanos/bucket.yml
type: S3
config:
  bucket: "prometheus-thanos"
  endpoint: "s3.amazonaws.com"   # or MinIO: "minio:9000"
  region: "us-east-1"
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
  insecure: false
  signature_version2: false
  # Server-side encryption, if needed, is configured through sse_config
  put_user_metadata:
    "X-Amz-Acl": "bucket-owner-full-control"
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m

HA Validation and Failure Drills

Service health‑check script

#!/bin/bash
# health_check.sh
# Exits non-zero if any component fails its check.
failures=0

check_prometheus() {
  local name="$1"
  local url="$2"
  # -f fails on HTTP errors; --max-time keeps a hung instance from blocking the script
  if curl -fsS --max-time 5 "${url}/api/v1/query?query=up" 2>/dev/null | grep -q '"status":"success"'; then
    echo "✅ ${name} is healthy"
    return 0
  else
    echo "❌ ${name} is down"
    failures=$((failures + 1))
    return 1
  fi
}

echo "=== Prometheus cluster health check ==="
check_prometheus "Global Prometheus" "http://localhost:9090"
check_prometheus "Business‑A Prometheus" "http://localhost:9091"
check_prometheus "Business‑B Prometheus" "http://localhost:9092"
check_prometheus "Thanos Query" "http://localhost:9099"

echo "=== Storage backend check ==="
if curl -fsS --max-time 5 "http://localhost:8428/metrics" 2>/dev/null | grep -q "vm_"; then
  echo "✅ VictoriaMetrics is healthy"
else
  echo "❌ VictoriaMetrics is down"
  failures=$((failures + 1))
fi

exit "${failures}"

Failure‑switch test

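# Simulate a Prometheus instance failure
docker stop prometheus-business-a

# Verify that the global instance still answers queries.
# -G with --data-urlencode encodes the PromQL braces and quotes, which curl
# would otherwise interpret as URL globbing.
# With a single Prometheus per business, expect the business-a series to go
# stale here until the instance is restored.
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

# Verify the same query through Thanos
curl -sG "http://localhost:9099/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

# Restore the instance
docker start prometheus-business-a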

Data consistency verification script

#!/bin/bash
# data_consistency_check.sh
QUERY="up"
PROM_URL="http://localhost:9090"
THANOS_URL="http://localhost:9099"

echo "Checking data consistency..."
PROM_RESULT=$(curl -s "${PROM_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
THANOS_RESULT=$(curl -s "${THANOS_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')

echo "Prometheus result count: ${PROM_RESULT}"
echo "Thanos result count: ${THANOS_RESULT}"

if [ "${PROM_RESULT}" -eq "${THANOS_RESULT}" ]; then
  echo "✅ Data consistency check passed"
else
  echo "⚠️ Data inconsistency detected, investigate configuration"
fi
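
A strict equality test can raise false alarms when a scrape lands between the two queries or when Thanos deduplicates replicas. One possible refinement is to accept a small relative difference. The sketch below assumes non-negative integer counts; the function name within_tolerance and the 5 percent threshold are illustrative choices, not values from the original setup.

```shell
#!/bin/bash
# Hypothetical helper for data_consistency_check.sh.

# Succeeds when the two counts differ by at most PCT percent of the larger one.
# Usage: within_tolerance COUNT_A COUNT_B PCT
within_tolerance() {
  local a="$1" b="$2" pct="$3" diff max
  if [ "$a" -ge "$b" ]; then
    diff=$((a - b)); max="$a"
  else
    diff=$((b - a)); max="$b"
  fi
  # diff / max <= pct / 100, rearranged to stay in integer arithmetic
  [ $((diff * 100)) -le $((pct * max)) ]
}
```

In the script above, the comparison could then read: if within_tolerance "${PROM_RESULT}" "${THANOS_RESULT}" 5; then ... fi.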

Performance Metrics and Alert Rules

Key metrics to monitor Prometheus itself

key_metrics:
- prometheus_tsdb_head_samples_appended_total   # write rate
- prometheus_tsdb_compactions_total           # compaction ops
- prometheus_rule_evaluation_duration_seconds # rule eval time
- prometheus_config_last_reload_success_timestamp_seconds # reload success
- go_memstats_alloc_bytes                     # memory usage

Alerting rules (prometheus_alerts.yml)

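# Assumes scrape jobs named "prometheus" and "thanos-query" exist. The scrape
# configurations shown earlier do not define them, so add those jobs or adjust
# the job labels in the expressions.
groups:
- name: prometheus-ha
  rules: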
  - alert: PrometheusDown
    expr: up{job="prometheus"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus instance {{ $labels.instance }} is down"
      description: "Prometheus instance has been down for more than 1 minute"

  - alert: PrometheusConfigReloadFailed
    expr: prometheus_config_last_reload_successful != 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus configuration reload failed"

  - alert: ThanosQueryDown
    expr: up{job="thanos-query"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Thanos Query service unavailable"

Production Best Practices

Resource planning

production_specs:
  prometheus_global:
    cpu: "2 cores"
    memory: "8GB"
    disk: "200GB SSD"

  prometheus_business:
    cpu: "1 core"
    memory: "4GB"
    disk: "100GB SSD"

  victoria_metrics:
    cpu: "4 cores"
    memory: "16GB"
    disk: "1TB SSD"

  thanos_components:
    cpu: "1 core each"
    memory: "2GB each"
    disk: "50GB each"
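
Disk figures like these depend on ingestion rate and retention. The Prometheus storage documentation gives a rule of thumb: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample. The sketch below applies that formula; the function name estimate_disk_gib is hypothetical, the default of 2 bytes per sample is the upper end of that range, and the result excludes the WAL and any safety headroom.

```shell
#!/bin/bash
# Hypothetical sizing helper based on the Prometheus rule of thumb:
#   disk = retention_seconds * samples_per_second * bytes_per_sample

# Usage: estimate_disk_gib ACTIVE_SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
estimate_disk_gib() {
  local series="$1" interval_s="$2" retention_days="$3" bytes_per_sample="${4:-2}"
  local bytes
  # samples_per_second = series / interval_s; divide last to limit integer rounding
  bytes=$((retention_days * 86400 * series * bytes_per_sample / interval_s))
  # Round up to whole GiB
  echo $(((bytes + 1073741823) / 1073741824))
}

estimate_disk_gib 100000 15 7    # prints 8
```

For 100k series scraped every 15s and kept for 7 days, the estimate is about 8 GiB of block data, so the disk sizes listed above leave considerable headroom for growth.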

Security hardening checklist

✅ Enable HTTPS for transport encryption

✅ Configure access authentication (Basic Auth / OAuth)

✅ Restrict network access with firewall rules

✅ Regularly update component versions

✅ Monitor abnormal access logs

✅ Backup critical configuration files

Operational automation scripts

#!/bin/bash
# prometheus_maintenance.sh

# Backup configuration
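backup_config() {
  local date_stamp backup_file
  date_stamp=$(date +%Y%m%d_%H%M%S)
  backup_file="/backup/prometheus_config_${date_stamp}.tar.gz"
  mkdir -p /backup   # make sure the target directory exists
  tar -czf "${backup_file}" ./config/
  echo "Configuration backed up to: ${backup_file}"
}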

# Rolling restart of services
rolling_restart() {
  services=("prometheus-business-a" "prometheus-business-b" "prometheus-global")
  for service in "${services[@]}"; do
    echo "Restarting ${service}..."
    docker restart "${service}"
    sleep 30   # wait for stability
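    # Ask Docker for the container state instead of grepping docker ps output
    if [ "$(docker inspect -f '{{.State.Running}}' "${service}" 2>/dev/null)" != "true" ]; then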
      echo "❌ ${service} restart failed"
      exit 1
    fi
    echo "✅ ${service} restart succeeded"
  done
}

backup_config
rolling_restart
echo "✅ Maintenance completed"
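
A fixed 30-second sleep is either longer than needed or too short when WAL replay is slow. A possible alternative is to poll until the instance reports ready. The sketch below is a generic retry loop; wait_until and prometheus_ready are hypothetical names, and the probe relies on the /-/ready endpoint that Prometheus serves.

```shell
#!/bin/bash
# Hypothetical helpers for prometheus_maintenance.sh.

# Run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
# Usage: wait_until ATTEMPTS DELAY COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Probe: succeeds once Prometheus at BASE_URL answers /-/ready with HTTP 200.
prometheus_ready() {
  curl -fsS --max-time 5 "$1/-/ready" > /dev/null 2>&1
}
```

Inside rolling_restart, the sleep could then be replaced with a call such as wait_until 30 2 prometheus_ready "http://localhost:9090", using the host port that belongs to the restarted service.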

Conclusion

Combining a federation layer for query aggregation with remote storage for long‑term retention yields an enterprise‑grade Prometheus HA solution that provides layered monitoring, data durability, automatic failover, and operational simplicity. In production, this setup has run stably for over 18 months, handling more than 100k time series with 99.9% availability.
