Build a Production-Ready Prometheus HA Architecture with Federation & Remote Storage

This guide walks through designing and implementing a robust, enterprise‑grade Prometheus high‑availability solution using federation clusters, remote storage back‑ends, Docker‑Compose deployments, health‑check scripts, and best‑practice recommendations for monitoring, security, and performance.

MaGe Linux Operations

Prometheus High Availability Solution: Federation Cluster and Remote Storage in Practice

Preface: Why Have 99% of Ops Engineers Hit Pitfalls with Prometheus HA?

When your monitoring system crashes and the boss asks "What happened?", an awkward question follows: if Prometheus itself is a single point of failure, what monitors Prometheus?

This is not a joke; it is a hard-learned lesson. In this article we share a production-validated Prometheus HA solution covering federation architecture design and remote storage best practices.

TL;DR: This article walks you through building an enterprise-grade Prometheus HA stack from scratch, including full configuration files and failover drills. Estimated reading time: 15 minutes. Bookmark it for a detailed read.

1. High‑Availability Architecture Design Ideas

1.1 Fatal Flaws of a Single‑Node Prometheus

# Traditional single‑node deployment issues
problems:
- Single point of failure: server down = monitoring blind
- Storage limits: local disk space limited
- Query performance: large historical queries slow
- Scaling difficulty: cannot scale horizontally

1.2 Core Principles of HA Architecture

No data loss + No service interruption + High query performance

Our solution is based on three layers:

Application-level HA: multiple Prometheus instances + load balancer

Data-level HA: remote storage + data replication

Query-level HA: federation cluster + query sharding

2. Federation Cluster Practical Implementation

2.1 Architecture Diagram

                 ┌───────────────────┐
                 │ Global Prometheus │
                 │   (Federation)    │
                 └─────────┬─────────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
  ┌──────▼───────┐  ┌──────▼───────┐  ┌──────▼───────┐
  │ Prometheus-1 │  │ Prometheus-2 │  │ Prometheus-N │
  │   (Biz A)    │  │   (Biz B)    │  │   (Infra)    │
  └──────────────┘  └──────────────┘  └──────────────┘

2.2 Federation Configuration Details

Global Prometheus configuration (prometheus‑global.yml)

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'global'
    replica: '1'

rule_files:
  - "global_rules.yml"

scrape_configs:
  # Federation scrape config
  - job_name: 'federate-business-a'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
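    params:
      'match[]':
        # The first selector already matches every series from these jobs;
        # the narrower ones only matter if you drop it to federate less data.
        - '{job=~"business-a-.*"}'
        - 'up{job=~"business-a-.*"}'
        - 'http_requests_total{job=~"business-a-.*"}'
        - 'mysql_up{job=~"business-a-.*"}'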
    static_configs:
      - targets:
        - 'prometheus-business-a:9090'

  - job_name: 'federate-business-b'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"business-b-.*"}'
        - 'up{job=~"business-b-.*"}'
        - 'redis_up{job=~"business-b-.*"}'
    static_configs:
      - targets:
        - 'prometheus-business-b:9090'

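# Remote write configuration (a top-level key, not an entry under scrape_configs).
# The thanos-receive endpoint is assumed to be deployed separately; the Compose
# files in this article do not define it.
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 10000
      max_shards: 200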
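
When debugging what a federation job actually pulls, it can help to request the same /federate URL by hand. The following is a minimal sketch rather than part of the original setup: urlencode and federate_url are hypothetical helper names, the selector is copied from the config above, and ASCII-only selectors are assumed.

```shell
#!/bin/bash
# federate_url.sh: build a /federate URL with percent-encoded match[] selectors.
# Helper names are illustrative; only ASCII selectors are handled.

# Percent-encode every character outside the URL "unreserved" set.
urlencode() {
  local rest="$1" out="" c
  while [ -n "$rest" ]; do
    c="${rest%"${rest#?}"}"   # first character of the remaining string
    rest="${rest#?}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out="${out}${c}" ;;
      *) out="${out}$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Usage: federate_url <base-url> <selector> [<selector> ...]
federate_url() {
  local base="$1" query="" sel
  shift
  for sel in "$@"; do
    # match[] is itself encoded as match%5B%5D
    query="${query}${query:+&}match%5B%5D=$(urlencode "$sel")"
  done
  printf '%s/federate?%s\n' "$base" "$query"
}
```

For example, curl -s "$(federate_url http://localhost:9091 '{job=~"business-a-.*"}')" should return the selected series in the text exposition format, assuming the Business A instance is published on host port 9091 as in the Compose file in section 2.3.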

Business Prometheus configuration example

# prometheus-business-a.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'business-a'
    replica: 'a1'

scrape_configs:
  - job_name: 'business-a-web'
    static_configs:
      - targets: ['web1:8080', 'web2:8080']

  - job_name: 'business-a-mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']

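# Retention is not a prometheus.yml setting in Prometheus v2.45; an unknown
# "storage.tsdb.retention" block makes the config fail to load.
# Set retention with command-line flags instead:
#   --storage.tsdb.retention.time=7d
#   --storage.tsdb.retention.size=50GB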

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"

2.3 Docker‑Compose Deployment File

version: '3.8'
services:
  # Global Prometheus
  prometheus-global:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-global
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus-global.yml:/etc/prometheus/prometheus.yml
      - prometheus-global-data:/prometheus
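    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      # Thanos Sidecar uploads (section 3.3) expect local compaction to be disabled,
      # which means equal min and max block durations
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      - '--web.enable-lifecycle'
      # The admin API can delete data; expose it only on a trusted network
      - '--web.enable-admin-api'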

  # Business A Prometheus
  prometheus-business-a:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-a
    ports:
      - "9091:9090"
    volumes:
      - ./config/prometheus-business-a.yml:/etc/prometheus/prometheus.yml
      - prometheus-a-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'

  # Business B Prometheus
  prometheus-business-b:
    image: prom/prometheus:v2.45.0
    container_name: prometheus-business-b
    ports:
      - "9092:9090"
    volumes:
      - ./config/prometheus-business-b.yml:/etc/prometheus/prometheus.yml
      - prometheus-b-data:/prometheus

volumes:
  prometheus-global-data:
  prometheus-a-data:
  prometheus-b-data:
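
After docker compose up -d, the containers may be running before Prometheus has finished loading its config and replaying its write-ahead log. A small retry helper can gate later steps on readiness. This is a sketch under assumptions: wait_until and prometheus_ready are hypothetical names, and the probe relies on the /-/ready endpoint that Prometheus serves.

```shell
#!/bin/bash
# wait_ready.sh: retry a probe until it succeeds or the attempts run out.
# Function names are illustrative, not part of the original deployment.

# Usage: wait_until <tries> <delay-seconds> <command> [args...]
wait_until() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds once the instance at <base-url> reports ready.
prometheus_ready() {
  curl -sf --max-time 3 "$1/-/ready"
}
```

A typical call would be wait_until 30 2 prometheus_ready http://localhost:9090, repeated for ports 9091 and 9092 to cover the two business instances.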

3. Remote Storage Solution Selection and Practice

3.1 Storage Solution Comparison

Recommended solution: VictoriaMetrics (single-node deployments) + Thanos (large-scale deployments)

3.2 VictoriaMetrics Deployment

# docker-compose-victoria.yml
version: '3.8'
services:
  victoria-metrics:
    image: victoriametrics/victoria-metrics:v1.93.4
    container_name: victoria-metrics
    ports:
      - "8428:8428"
    volumes:
      - victoria-data:/victoria-metrics-data
    command:
      - '--storageDataPath=/victoria-metrics-data'
      - '--retentionPeriod=1y'
      - '--memory.allowedPercent=80'
      - '--search.maxQueryDuration=60s'
      - '--search.maxQueryLen=16384'

  vmagent:
    image: victoriametrics/vmagent:v1.93.4
    ports:
      - "8429:8429"
    volumes:
      - ./vmagent.yml:/etc/vmagent/vmagent.yml
    command:
      - '-promscrape.config=/etc/vmagent/vmagent.yml'
      - '-remoteWrite.url=http://victoria-metrics:8428/api/v1/write'

volumes:
  victoria-data:

3.3 Thanos Full Deployment

# docker-compose-thanos.yml
version: '3.8'
services:
  # Thanos Sidecar
  thanos-sidecar-global:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-sidecar-global
    command:
      - sidecar
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-global:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus-global-data:/prometheus
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
    ports:
      - "10901:10901"
      - "10902:10902"

  # Thanos Store Gateway
  thanos-store:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-store
    command:
      - store
      - '--data-dir=/var/thanos/store'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-store-data:/var/thanos/store
    ports:
      # Host ports differ from the sidecar's (10901/10902) to avoid a bind conflict
      - "10903:10901"
      - "10904:10902"

  # Thanos Querier
  thanos-query:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-query
    command:
      - query
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:9090'
      - '--store=thanos-sidecar-global:10901'
      - '--store=thanos-store:10901'
      - '--query.replica-label=replica'
    ports:
      - "9099:9090"

  # Thanos Compactor
  thanos-compact:
    image: thanosio/thanos:v0.32.2
    container_name: thanos-compact
    command:
      - compact
      # Keep running and compact continuously instead of exiting after one pass
      - '--wait'
      - '--data-dir=/var/thanos/compact'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--retention.resolution-raw=7d'
      - '--retention.resolution-5m=30d'
      - '--retention.resolution-1h=1y'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml
      - thanos-compact-data:/var/thanos/compact

volumes:
  prometheus-global-data:
  thanos-store-data:
  thanos-compact-data:
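
The --query.replica-label=replica flag tells Thanos Query to treat series that differ only in the replica label as copies of the same series. The sketch below illustrates that idea on plain text samples; it is not how Thanos is implemented, since Thanos merges replica samples over time rather than keeping the first line it sees. dedup_replicas is a hypothetical name, and label values are assumed to contain no spaces.

```shell
#!/bin/bash
# dedup_replicas.sh: conceptual illustration of replica-label deduplication.
# Reads lines of the form  metric{label="value",...} <sample>  on stdin.

# Usage: dedup_replicas [<replica-label>]   (defaults to "replica")
dedup_replicas() {
  local label="${1:-replica}"
  # 1. Drop the replica label wherever it sits in the label set.
  # 2. Keep only the first sample seen for each remaining series identity.
  sed -E "s/([{,])${label}=\"[^\"]*\",?/\\1/; s/,[}]/}/" | awk '!seen[$1]++'
}
```

Fed two lines that differ only in replica="a1" versus replica="a2" (a second replica that the configs in this article do not define), the function prints a single series without the replica label. That is the view a dashboard gets from Thanos Query when deduplication is enabled.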

3.4 S3 Storage Configuration

# thanos/bucket.yml
type: S3
config:
  bucket: "prometheus-thanos"
  endpoint: "s3.amazonaws.com"
  region: "us-east-1"
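  # Placeholders only: keep real credentials out of version control
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
  insecure: false
  signature_version2: false
  # Server-side encryption, if needed, is configured through sse_config in current Thanos releases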
  put_user_metadata:
    "X-Amz-Acl": "bucket-owner-full-control"
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m

4. HA Verification and Failure Drills

4.1 Service Health‑Check Script

#!/bin/bash
# health_check.sh: exits non-zero if any component fails its check
failed=0

check_prometheus() {
  local name=$1
  local url=$2
  # --max-time keeps a hung endpoint from blocking the whole script
  if curl -s --max-time 5 "${url}/api/v1/query?query=up" | grep -q '"status":"success"'; then
    echo "✅ ${name} is healthy"
    return 0
  else
    echo "❌ ${name} is down"
    failed=1
    return 1
  fi
}

echo "=== Prometheus cluster health check ==="
check_prometheus "Global Prometheus" "http://localhost:9090"
check_prometheus "Business-A Prometheus" "http://localhost:9091"
check_prometheus "Business-B Prometheus" "http://localhost:9092"
check_prometheus "Thanos Query" "http://localhost:9099"

echo "=== Storage backend check ==="
if curl -s --max-time 5 "http://localhost:8428/metrics" | grep -q "vm_"; then
  echo "✅ VictoriaMetrics is healthy"
else
  echo "❌ VictoriaMetrics is down"
  failed=1
fi

exit "${failed}"

4.2 Failure Switch Test

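# Simulate Prometheus instance failure
docker stop prometheus-business-a

# Verify the federation layer still answers queries.
# -G with --data-urlencode encodes the selector; in a bare URL, curl treats {} as a glob.
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

# The global instance's own scrape of the stopped instance should now report 0
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=up{job="federate-business-a"}'

# Verify query via Thanos
curl -sG "http://localhost:9099/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

# Restore instance
docker start prometheus-business-a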

4.3 Data Consistency Validation

#!/bin/bash
QUERY="up"
PROM_URL="http://localhost:9090"
THANOS_URL="http://localhost:9099"

echo "Checking data consistency..."
PROM_RESULT=$(curl -s "${PROM_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
THANOS_RESULT=$(curl -s "${THANOS_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')

echo "Prometheus result count: ${PROM_RESULT}"
echo "Thanos result count: ${THANOS_RESULT}"

if [ "${PROM_RESULT}" -eq "${THANOS_RESULT}" ]; then
  echo "✅ Data consistency check passed"
else
  echo "⚠️ Data inconsistency detected, check configuration"
fi
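
Equal result counts do not prove that both endpoints return the same series. A stricter check compares the series identities themselves. The sketch below assumes each input file holds one series per line, for example produced with jq -c -S '.data.result[].metric' from each API response; series_diff is a hypothetical name. Depending on how each endpoint attaches external labels, labels such as replica may need to be dropped before comparing.

```shell
#!/bin/bash
# series_diff.sh: list series that appear in only one of two files.
# Each file is expected to contain one series identity per line.

# Usage: series_diff <file-a> <file-b>   (prints nothing when the sets match)
series_diff() {
  local left right
  left=$(mktemp)
  right=$(mktemp)
  # comm requires sorted input; -u also removes duplicate lines
  sort -u "$1" > "$left"
  sort -u "$2" > "$right"
  comm -23 "$left" "$right" | sed 's/^/only in first: /'
  comm -13 "$left" "$right" | sed 's/^/only in second: /'
  rm -f "$left" "$right"
}
```

Empty output means the two endpoints agree on the set of series; any printed line points at the series to investigate.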

5. Performance Optimization and Monitoring

5.1 Key Performance Indicators

# Important Prometheus metrics to monitor
key_metrics:
  - prometheus_tsdb_head_samples_appended_total   # write rate
  - prometheus_tsdb_compactions_total            # compaction ops
  - prometheus_rule_evaluation_duration_seconds # rule eval time
  - prometheus_config_last_reload_success_timestamp_seconds # config reload
  - go_memstats_alloc_bytes                     # memory usage
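
Counters such as prometheus_tsdb_head_samples_appended_total only become a write rate once two readings are compared. The arithmetic can be sketched as follows; counter_rate is a hypothetical helper, and treating a decreasing value as a restart is a simplification of how counter resets are usually handled.

```shell
#!/bin/bash
# counter_rate.sh: per-second rate between two readings of a counter.

# Usage: counter_rate <value1> <time1> <value2> <time2>   (times in seconds)
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1) {
      print "error: timestamps must increase" > "/dev/stderr"
      exit 1
    }
    # A counter that went backwards is assumed to have restarted from zero
    delta = (v2 >= v1) ? v2 - v1 : v2
    printf "%.1f\n", delta / (t2 - t1)
  }'
}
```

With readings of 1000 and 2500 taken 15 seconds apart, counter_rate 1000 0 2500 15 prints 100.0, meaning 100 samples appended per second.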

5.2 Alert Rules Configuration

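# prometheus_alerts.yml
# Assumes scrape jobs named "prometheus" and "thanos-query" exist (the scrape_configs
# in section 2.2 do not define them) and that this file is listed under rule_files.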
groups:
  - name: prometheus-ha
    rules:
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
          description: "Prometheus instance has been down for over 1 minute"

      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed"

      - alert: ThanosQueryDown
        expr: up{job="thanos-query"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Query service unavailable"

5.3 Grafana Dashboard (JSON snippet omitted for brevity)

6. Production Best Practices

6.1 Resource Planning Recommendations

# Production resource specs
production_specs:
  prometheus_global:
    cpu: "2 cores"
    memory: "8GB"
    disk: "200GB SSD"

  prometheus_business:
    cpu: "1 core"
    memory: "4GB"
    disk: "100GB SSD"

  victoria_metrics:
    cpu: "4 cores"
    memory: "16GB"
    disk: "1TB SSD"

  thanos_components:
    cpu: "1 core each"
    memory: "2GB each"
    disk: "50GB each"
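
The disk figures above can be sanity-checked with the sizing rule from the Prometheus storage documentation: needed disk space = retention time in seconds × ingested samples per second × bytes per sample. The sketch below applies that rule; estimate_disk_bytes is a hypothetical name, and 2 bytes per sample is an assumed upper-end average, so treat the result as a rough order of magnitude.

```shell
#!/bin/bash
# estimate_disk.sh: rough local TSDB sizing from series count and retention.

# Usage: estimate_disk_bytes <active-series> <scrape-interval-s> <retention-s> <bytes-per-sample>
estimate_disk_bytes() {
  local series="$1" interval="$2" retention="$3" bytes_per_sample="$4"
  # samples per second = series / interval; multiply first to keep integer precision
  echo $(( retention * series * bytes_per_sample / interval ))
}
```

For the roughly 100k series this article reports in its conclusion, scraped every 15 seconds and kept for 7 days (604800 seconds), estimate_disk_bytes 100000 15 604800 2 prints 8064000000, or about 8 GB. The 100GB recommended for a business instance leaves ample headroom for growth, the write-ahead log, and compaction.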

6.2 Security Hardening Measures

# Security checklist
- ✅ Enable HTTPS for transport encryption
- ✅ Configure access authentication (Basic Auth/OAuth)
- ✅ Restrict network access via firewall rules
- ✅ Regularly update component versions
- ✅ Monitor abnormal access logs
- ✅ Backup critical configuration files

6.3 Operations Automation Scripts

#!/bin/bash
# prometheus_maintenance.sh
set -euo pipefail

# Backup configuration
backup_config() {
  local timestamp archive
  timestamp=$(date +%Y%m%d_%H%M%S)
  archive="/backup/prometheus_config_${timestamp}.tar.gz"
  mkdir -p /backup
  tar -czf "${archive}" ./config/
  echo "Configuration backed up to: ${archive}"
}

# Rolling restart: one instance at a time, global instance last
rolling_restart() {
  local services=("prometheus-business-a" "prometheus-business-b" "prometheus-global")
  local service
  for service in "${services[@]}"; do
    echo "Restarting ${service}..."
    docker restart "${service}"
    sleep 30
    # Ask Docker for the container state instead of grepping docker ps output
    if [ "$(docker inspect -f '{{.State.Running}}' "${service}")" != "true" ]; then
      echo "❌ ${service} restart failed"
      exit 1
    fi
    echo "✅ ${service} restart succeeded"
  done
}

backup_config
rolling_restart
echo "✅ Maintenance completed"

Conclusion

By combining federation clusters with remote storage, we built an enterprise‑grade Prometheus HA architecture. Core takeaways:

Architecture design: layered monitoring, global federation, business isolation

Storage strategy: short-term local + long-term remote, balancing performance and cost

HA guarantees: multiple instances, automatic failover

Ops friendliness: self-monitoring, timely alerts, simplified operations

This solution has run stably in production for over 18 months, handling more than 100k time series with 99.9% availability.
