How to Build a Global View for Multiple Prometheus Instances – Community and Alibaba Cloud Solutions
This article explains why a global view is needed when Prometheus metrics are scattered across many instances, compares community approaches such as Federation, Thanos, and Remote Write, and details Alibaba Cloud's Global Aggregation Instance and Remote Write solutions with configuration examples and a real‑world case study.
Introduction
Prometheus is the de‑facto standard for cloud‑native monitoring, but large enterprises often run dozens or hundreds of independent Prometheus instances. When metrics are isolated, operators lose the ability to view a unified dashboard, perform cross‑instance queries, or compute aggregates such as sum or rate across all data sources.
Challenges of Multiple Instances
Grafana data‑source explosion: One data source per Prometheus instance leads to hundreds of panels and duplicated PromQL queries.
Cross-instance calculations: Aggregating the same metric across instances is impossible while each instance can only be queried on its own; copying all data into a single instance defeats isolation and incurs high storage cost.
Community Solutions
1. Prometheus Federation
Each edge Prometheus exposes a /federate endpoint. A global Prometheus scrapes these endpoints and stores the selected series. Example prometheus.yml for the global node:
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="Prometheus"}'
        - '{job="node"}'
    static_configs:
      - targets:
          - 'prometheus-follower-1:9090'
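          - 'prometheus-follower-2:9090'
Federation offers a single place to query and needs nothing beyond Prometheus itself. Aggregation happens at scrape time, though: the global node keeps its own copy of every series selected by match[], so storage stays small only when the selectors are narrow. The global node is also a single point of failure, has limited historic retention, and can become a performance bottleneck.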
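One common way to ease the load on the global node is to pre-aggregate on each edge instance with recording rules and federate only the aggregated series. The rule file below is a minimal sketch under that assumption; the group name, the rule name, and the choice of the node_exporter metric node_cpu_seconds_total are illustrative and not taken from the original setup.

```yaml
# edge-rules.yml (hypothetical file name), loaded on each edge instance
# through the rule_files section of its prometheus.yml.
groups:
  - name: edge_aggregates            # assumed group name
    rules:
      # Pre-compute one CPU usage series per job instead of one per core and mode.
      - record: job:node_cpu_seconds:rate5m
        expr: sum by (job) (rate(node_cpu_seconds_total[5m]))
```

With rules like this in place, the global node's match[] list could select '{__name__=~"job:.*"}' rather than whole jobs, which keeps both the scrape payload and the global node's storage small.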
2. Thanos
Thanos adds sidecar, query, store‑gateway, compact, ruler and receiver components to Prometheus. The sidecar exposes a Store API; the Thanos Query component merges results from all stores and presents a Prometheus‑compatible API. Core components:
Sidecar: Connects a Prometheus instance to Thanos and optionally uploads data to object storage.
Query: Implements the Prometheus HTTP API and aggregates data from multiple stores.
Store Gateway: Serves data from object storage to Query.
Compact: Compacts and down-samples historic data in object storage.
Ruler: Evaluates recording and alerting rules and writes the derived metrics.
Receiver: Accepts remote‑write streams.
Thanos provides long-term durability and down-sampling, but it introduces additional components to operate and requires an external object store for historic data.
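The object-store dependency is typically expressed as a small YAML file that is handed to the Sidecar, Store Gateway, and Compact components through the --objstore.config-file flag. The following is a sketch for an S3-compatible bucket; the bucket name, endpoint, and credentials are placeholders, not values from the original article.

```yaml
# objstore.yml (hypothetical file name), shared by Sidecar, Store Gateway and Compact.
type: S3
config:
  bucket: "thanos-metrics"       # placeholder bucket name
  endpoint: "s3.example.com"     # placeholder S3-compatible endpoint
  access_key: "<ACCESS_KEY>"     # placeholder, inject from a secret store
  secret_key: "<SECRET_KEY>"     # placeholder, inject from a secret store
```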
3. Prometheus Remote Write
Each Prometheus instance streams samples to a central endpoint (another Prometheus or third‑party storage) via the Remote Write API. Example snippet:
remote_write:
  - url: "http://central-instance:9090/api/v1/write"
This approach mirrors Federation’s goal but uses a push model, reducing query load on edge instances.
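Because every edge instance pushes into the same central store, it helps to tag samples with their origin so that series from different instances stay distinguishable. The sketch below assumes the central endpoint is itself a Prometheus started with --web.enable-remote-write-receiver, which a Prometheus server needs in order to accept remote-write traffic; the label names and values are hypothetical.

```yaml
# prometheus.yml on an edge instance
global:
  external_labels:
    cluster: edge-1        # assumed label, use a unique value per instance
    region: cn-hangzhou    # assumed label, useful for grouping in global queries
remote_write:
  # Placeholder host reused from the snippet above.
  - url: "http://central-instance:9090/api/v1/write"
```

A central query such as sum by (cluster) (up) can then break results down per source instance.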
Alibaba Cloud Solutions
1. Global Aggregation Instance
Alibaba Cloud offers a “Prometheus Global Aggregation Instance” that performs query‑time metric aggregation across managed Prometheus instances. Each instance keeps its own TSDB; the Global View queries all instances in parallel, merges results, and returns a unified response without copying data.
Zero extra storage – data stays in the original instances.
Isolation – failure of one instance does not affect others.
Supports cross‑region and cross‑account aggregation.
2. Remote Write on Alibaba Cloud
The cloud service also supports Remote Write. Users configure a Remote Write URL, grant a RAM user the AliyunARMSFullAccess policy, and optionally add label‑filter expressions (e.g., __name__=rpc.*). The central Prometheus stores only the filtered series, enabling a single‑source query.
remote_write:
  - url: "http://central-instance:9090/api/v1/write"
Network mode can be public or VPC; region selection determines the endpoint domain.
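For a self-managed Prometheus, the closest open-source counterpart to such a label filter is write_relabel_configs, which drops series before they leave the instance. The sketch below mirrors the rpc.* example under that assumption; it reuses the placeholder URL and does not show any authentication settings the managed endpoint may require.

```yaml
remote_write:
  - url: "http://central-instance:9090/api/v1/write"   # placeholder endpoint
    write_relabel_configs:
      # Forward only series whose metric name starts with "rpc".
      # Prometheus anchors the regex, so "rpc.*" must match the whole name.
      - source_labels: [__name__]
        regex: "rpc.*"
        action: keep
```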
Comparison
Prometheus Federation: Scrape-time aggregation, the global node stores a copy of the federated series, queries go to the global node only, but that node is a single-point bottleneck with limited historic retention.
Thanos: Store‑API aggregation, requires object storage, provides single‑instance query via Thanos Query, adds operational complexity.
Alibaba Cloud Global Aggregation: Query‑time aggregation, no extra storage, multi‑instance query, fully managed.
Alibaba Cloud Remote Write: Write‑time aggregation, extra storage for aggregated data, single‑instance query, lower latency.
Case Study: Global Monitoring for a Multi‑Region Platform
A customer with hundreds of Kubernetes clusters across continents needed a single Grafana dashboard. Using the Global Aggregation Instance required adding each cluster’s Prometheus to the aggregation group and selecting the Global View as the Grafana data source. However, cross‑region latency caused query timeouts.
Switching to Remote Write solved the latency issue: each Prometheus streamed filtered metrics to a central instance in Hangzhou. The central instance stored only the needed series, allowing Grafana to query a single source with sub‑second response times, even for global queries.
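Once a central instance holds the filtered series, Grafana needs only one data source. A minimal sketch using Grafana's file-based data source provisioning follows; the data source name is hypothetical and the host is the same placeholder used in the earlier snippets, not the customer's actual endpoint.

```yaml
# Placed under Grafana's provisioning/datasources directory.
apiVersion: 1
datasources:
  - name: Global Prometheus            # hypothetical display name
    type: prometheus
    access: proxy                      # Grafana's backend issues the queries
    url: http://central-instance:9090  # placeholder central instance
    isDefault: true
```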
Conclusion
Choosing a global view for Prometheus depends on storage constraints, query latency, and operational complexity:
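Federation: Needs only Prometheus itself, but the global node stores a copy of the federated series and can become a bottleneck.
Alibaba Cloud Global Aggregation: No extra storage, but higher query latency because every query fans out to the member instances.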
Thanos: Adds durability and down‑sampling at the cost of extra components and storage.
Remote Write (open‑source or Alibaba Cloud): Consumes additional storage but provides fast, reliable single‑instance queries, making it suitable for large‑scale, latency‑sensitive environments.
