
How to Build a High‑Availability Prometheus Setup Using Federation and Multi‑Remote‑Read

This article examines common misuse of Prometheus federation, explains its limitations, and presents a pure‑Prometheus solution using multi_remote_read to achieve high‑availability monitoring, including configuration examples, code analysis, and best‑practice recommendations for proper data aggregation and query merging.

Programmer DD

Introduction

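Many users misuse Prometheus federation, treating it as a way to pull all data from multiple scrapers into one place without understanding what it is designed for. This article analyzes the problems with that approach and proposes a solution built entirely on Prometheus itself, using multiple remote_read endpoints (referred to here as multi_remote_read).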

Architecture Diagram

[Figure: architecture diagram]

Federation Problems

Federation documentation: https://prometheus.io/docs/prometheus/latest/federation/

Federation Usage Example

Federation is essentially a cascading scrape: Prometheus instance A scrapes the /federate endpoint of instances B, C, and D.

The match[] parameter selects which series are pulled.

Official example configuration:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'
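
As a rough illustration of what the configuration above turns into on the wire, the Go sketch below builds the /federate request URL for one target, repeating match[] once per selector. FederateURL and Selectors are hypothetical helper names chosen for this example, not Prometheus APIs; only the Go standard library is used.

```go
package main

import (
	"fmt"
	"net/url"
)

// FederateURL builds the request URL for one federation target,
// adding one match[] query parameter per selector.
func FederateURL(target string, selectors []string) string {
	q := url.Values{}
	for _, s := range selectors {
		q.Add("match[]", s)
	}
	u := url.URL{Scheme: "http", Host: target, Path: "/federate", RawQuery: q.Encode()}
	return u.String()
}

// Selectors parses a federation URL and returns its decoded match[] values.
func Selectors(raw string) ([]string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query()["match[]"], nil
}

func main() {
	raw := FederateURL("source-prometheus-1:9090", []string{
		`{job="prometheus"}`,
		`{__name__=~"job:.*"}`,
	})
	fmt.Println(raw)

	sel, err := Selectors(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(sel), "selectors decoded from the URL")
}
```

Each selector is evaluated independently on the source instance, so a broad selector such as one matching every metric name pulls the whole data set.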

Analysis of the Example

The scrape path is /federate. The handler is registered in web.go:

// web.go's federate Handler
router.Get("/federate", readyf(httputil.CompressionHandler{Handler: http.HandlerFunc(h.federation)}.ServeHTTP))

The federation handler reads data from local storage and processes it.

The core federation function merges series from local storage and encodes them:

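func (h *Handler) federation(w http.ResponseWriter, req *http.Request) {
    // Abridged excerpt: parsing of match[] and the computation of mint, maxt, t, and v are omitted.
    q, err := h.localStorage.Querier(req.Context(), mint, maxt)
    if err != nil {
        // Check the error before deferring Close; q may be nil when err is non-nil.
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer q.Close()
    vec := make(promql.Vector, 0, 8000)
    hints := &storage.SelectHints{Start: mint, End: maxt}
    var sets []storage.SeriesSet
    // Omitted: one series set per match[] selector is selected from q (using hints) and appended to sets.
    set := storage.NewMergeSeriesSet(sets, storage.ChainedSeriesMerge)
    for set.Next() {
        s := set.At()
        // Omitted: the sample (t, v) exposed for series s is determined here.
        vec = append(vec, promql.Sample{Metric: s.Labels(), Point: promql.Point{T: t, V: v}})
    }
    // encode and write response ...
}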

If match[] is broad enough to select every series, federation merely copies the data of all shards into one instance, which defeats the purpose of sharding and becomes impractical when data volume is large.

Correct Federation Practices

Use match[] to filter metrics, separating them into two categories:

Data that needs further aggregation – collected via federation.

Data that can stay on the local scraper.

Perform pre‑aggregation and alerting on the federated side to improve query speed.

Prometheus Does Not Support Down-sampling by Default

Increasing scrape_interval on the federation job only approximates down-sampling: fewer samples are stored, but each one is still a raw point-in-time value.

True down-sampling requires aggregation over each window (e.g., 5-minute average, max, min) rather than merely reducing scrape frequency.
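
The difference can be sketched in a few lines of Go. This is a toy model, not Prometheus code: Sample, Bucket, Thin, and Downsample are hypothetical names, timestamps are assumed to be non-negative seconds in ascending order, and Thin only approximates what a longer scrape interval yields by keeping one raw sample per window.

```go
package main

import "fmt"

// Sample is one raw data point; T is a Unix timestamp in seconds.
type Sample struct {
	T int64
	V float64
}

// Bucket holds the aggregates kept for one down-sampling window.
type Bucket struct {
	Start         int64
	Avg, Min, Max float64
}

// Thin keeps only the first sample of each window, discarding the rest.
func Thin(samples []Sample, window int64) []Sample {
	var out []Sample
	last := int64(-1)
	for _, s := range samples {
		start := s.T - s.T%window
		if start != last {
			out = append(out, s)
			last = start
		}
	}
	return out
}

// Downsample aggregates every window into average, minimum, and maximum.
func Downsample(samples []Sample, window int64) []Bucket {
	var out []Bucket
	var sum float64
	var n int
	last := int64(-1)
	for _, s := range samples {
		start := s.T - s.T%window
		if start != last {
			out = append(out, Bucket{Start: start, Min: s.V, Max: s.V})
			sum, n, last = 0, 0, start
		}
		b := &out[len(out)-1]
		sum += s.V
		n++
		b.Avg = sum / float64(n)
		if s.V < b.Min {
			b.Min = s.V
		}
		if s.V > b.Max {
			b.Max = s.V
		}
	}
	return out
}

func main() {
	raw := []Sample{{0, 2}, {60, 2}, {120, 10}, {180, 2}, {240, 4}, {300, 3}}
	fmt.Println("thinned:    ", Thin(raw, 300))
	fmt.Println("downsampled:", Downsample(raw, 300))
}
```

With the sample data in main, the spike to 10 at T=120 disappears from the thinned output but survives in the Max field of the first bucket, which is the information true down-sampling is meant to preserve.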

Unified Query Implementation

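What Is remote_read?

remote_read lets Prometheus fetch samples from a remote endpoint at query time. It is typically used to read from external long-term storage, because Prometheus's local storage is neither clustered nor replicated and therefore is not highly available on its own.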

Configuration documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read

Supported Read/Write Storages

AWS Timestream

Azure Data Explorer

Cortex

CrateDB

Google BigQuery

Google Cloud Spanner

InfluxDB

IRONdb

M3DB

PostgreSQL/TimescaleDB

QuasarDB

Splunk

Thanos

TiKV

multi_remote_read

Configuring multiple remote_read endpoints enables concurrent reads from several back‑ends and merges the results:

remote_read:
  - url: "http://172.20.70.205:9090/api/v1/read"
    read_recent: true
  - url: "http://172.20.70.215:9090/api/v1/read"
    read_recent: true

The merge allows PromQL queries and alert rules to ignore the physical location of data.

Prometheus Can Remote‑Read Itself

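Every Prometheus server already serves its local storage over the remote read protocol at /api/v1/read, so one Prometheus instance can be the remote_read target of another, and no external storage is needed for remote reads. The --enable-feature=remote-write-receiver flag (superseded by --web.enable-remote-write-receiver in later releases) is only required if the instance must also accept remote writes.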

High‑Availability Solution

Combine multiple Prometheus scrapers with stateless Prometheus query nodes to achieve HA:

Monitoring data resides on several local Prometheus instances (bare‑metal or Kubernetes StatefulSets).

Query nodes list the /api/v1/read endpoint of every scraper under remote_read.

Handling Duplicate Data

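When the same series (identical label set) is returned by several backends, query merging combines it into a single series and drops duplicate samples, so the same job can be scraped by multiple Prometheus instances for redundancy. This relies on the replicas exposing identical label sets; a label that differs per replica would produce separate series instead.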
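
The idea behind the merge can be sketched as follows. This is a simplified model for illustration, not the merge code Prometheus actually runs: Series, Sample, and Merge are hypothetical names, labels are reduced to a single canonical string, and when two backends report the same timestamp the first value seen is kept.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one data point; T is a Unix timestamp in seconds.
type Sample struct {
	T int64
	V float64
}

// Series is identified by its canonical label string.
type Series struct {
	Labels  string
	Samples []Sample
}

// Merge combines the responses of several backends. Series with the same
// labels are merged, and samples with the same timestamp are kept once.
func Merge(backends ...[]Series) []Series {
	byLabels := map[string]map[int64]float64{}
	for _, resp := range backends {
		for _, s := range resp {
			if byLabels[s.Labels] == nil {
				byLabels[s.Labels] = map[int64]float64{}
			}
			for _, p := range s.Samples {
				if _, seen := byLabels[s.Labels][p.T]; !seen {
					byLabels[s.Labels][p.T] = p.V
				}
			}
		}
	}

	// Emit series sorted by labels and samples sorted by time.
	names := make([]string, 0, len(byLabels))
	for l := range byLabels {
		names = append(names, l)
	}
	sort.Strings(names)

	out := make([]Series, 0, len(names))
	for _, l := range names {
		pts := make([]Sample, 0, len(byLabels[l]))
		for t, v := range byLabels[l] {
			pts = append(pts, Sample{T: t, V: v})
		}
		sort.Slice(pts, func(i, j int) bool { return pts[i].T < pts[j].T })
		out = append(out, Series{Labels: l, Samples: pts})
	}
	return out
}

func main() {
	// Replica A missed the scrape at T=30; replica B missed the one at T=0.
	a := []Series{{Labels: `up{job="node"}`, Samples: []Sample{{0, 1}, {15, 1}}}}
	b := []Series{{Labels: `up{job="node"}`, Samples: []Sample{{15, 1}, {30, 1}}}}
	fmt.Println(Merge(a, b))
}
```

In this model two replicas with partial gaps still yield one continuous series, which is why scraping the same job twice gives redundancy rather than duplicated results.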

Drawbacks

Concurrent queries must wait for the slowest backend, increasing latency.

Uncontrolled heavy queries can overload scrapers.

All queries are sent to every scraper, causing unnecessary load on nodes that do not hold the requested data.

Bloom filters in back‑ends like M3DB can mitigate unnecessary lookups.
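
The first and third drawbacks follow directly from the fan-out pattern, which the toy Go model below tries to make concrete. It is a sketch under stated assumptions, not Prometheus code: Backend, Query, Calls, and FanOut are hypothetical names, and each backend is reduced to an in-memory map.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a toy stand-in for one scraper; Series maps a series name to a value.
type Backend struct {
	Name   string
	Series map[string]float64

	mu    sync.Mutex
	calls int
}

// Query looks up one series and counts how often the backend was asked.
func (b *Backend) Query(name string) (float64, bool) {
	b.mu.Lock()
	b.calls++
	b.mu.Unlock()
	v, ok := b.Series[name]
	return v, ok
}

// Calls reports how many queries this backend has served.
func (b *Backend) Calls() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.calls
}

// FanOut queries every backend concurrently and returns only after all of
// them have answered, so the slowest backend determines the latency.
func FanOut(backends []*Backend, name string) []float64 {
	results := make([]*float64, len(backends))
	var wg sync.WaitGroup
	for i, b := range backends {
		wg.Add(1)
		go func(i int, b *Backend) {
			defer wg.Done()
			if v, ok := b.Query(name); ok {
				results[i] = &v // each goroutine writes its own slot
			}
		}(i, b)
	}
	wg.Wait()

	var out []float64
	for _, r := range results {
		if r != nil {
			out = append(out, *r)
		}
	}
	return out
}

func main() {
	a := &Backend{Name: "scraper-a", Series: map[string]float64{"node_load1": 0.5}}
	b := &Backend{Name: "scraper-b", Series: map[string]float64{}}
	fmt.Println(FanOut([]*Backend{a, b}, "node_load1"))
	fmt.Println("scraper-b was queried", b.Calls(), "time(s) although it holds no matching data")
}
```

Every backend is hit once per query regardless of whether it holds the series, and wg.Wait() cannot return before the slowest one finishes.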

Routing Optimization (Optional)

For precise query routing, refer to the open-source project prome-route, which uses reverse proxy rules based on feature labels to shard Prometheus data.
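
The general idea of label-based routing can be sketched as below. This is a hypothetical illustration only and does not reflect how prome-route is actually implemented: Router, Targets, the "cluster" shard label, and the route table are all assumptions made for the example, and the backend addresses are borrowed from the remote_read configuration shown earlier.

```go
package main

import "fmt"

// Router maps values of one shard label to the backends that hold that shard.
type Router struct {
	ShardLabel string              // label used to pick a shard (assumed: "cluster")
	Routes     map[string][]string // label value -> backend addresses
	All        []string            // fallback: every backend
}

// Targets returns the backends to query for a set of equality matchers.
// Queries without a known shard label value fall back to full fan-out.
func (r Router) Targets(matchers map[string]string) []string {
	if v, ok := matchers[r.ShardLabel]; ok {
		if t, ok := r.Routes[v]; ok {
			return t
		}
	}
	return r.All
}

func main() {
	r := Router{
		ShardLabel: "cluster",
		Routes: map[string][]string{
			"east": {"172.20.70.205:9090"},
			"west": {"172.20.70.215:9090"},
		},
		All: []string{"172.20.70.205:9090", "172.20.70.215:9090"},
	}
	fmt.Println(r.Targets(map[string]string{"job": "node", "cluster": "east"}))
	fmt.Println(r.Targets(map[string]string{"job": "node"}))
}
```

A query that names its shard touches one backend, while a query without the shard label still fans out to all of them, so routing reduces load only for queries that carry the routing label.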

Footnotes

[1] m3db resource overhead, aggregation, down‑sampling, query limits: https://zhuanlan.zhihu.com/p/359551116

[2] m3db‑node OOM tracing and memory allocator code: https://zhuanlan.zhihu.com/p/183815841

[3] Federation documentation: https://prometheus.io/docs/prometheus/latest/federation/

[4] Remote read configuration: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read

[5] InfluxDB Prometheus support: https://docs.influxdata.com/influxdb/v1.8/supported_protocols/prometheus/

[6] M3DB integration: https://m3db.io/docs/integrations/prometheus/

[7] prome‑route project: https://zhuanlan.zhihu.com/p/231914857

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
