Building Highly Available Prometheus Monitoring with Thanos: A Practical Guide
This article explains why native Prometheus HA solutions fall short for large, multi‑region clusters and shows how to use Thanos components—including sidecar, query, store gateway, and compactor—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive integration with existing Prometheus deployments.
Background
In the "High‑availability Prometheus: FAQ" article we briefly mentioned HA solutions for Prometheus. After trying federation and Remote Write, we chose Thanos as the monitoring suite to manage a global view of more than 300 clusters across multiple regions.
Official Prometheus HA Options
HA: two identical Prometheus instances behind a load balancer.
HA + remote storage: write to a remote store for persistence.
Federation: shard data by function and aggregate it in a global node.
Even with the official multi‑replica and federation approaches, problems remain: Prometheus local storage has no data synchronization, so consistency is hard to guarantee.
Replica A may lose data during a crash, causing gaps when the load balancer routes requests to it.
Different start times or clocks produce mismatched timestamps across replicas.
Federation still has single‑point‑of‑failure nodes; edge and global nodes may become bottlenecks.
Sensitive alerts should avoid triggering from the global node due to potential latency.
Most Prometheus clustering solutions ensure data consistency from storage and query perspectives:
Storage side: use an adapter with Remote Write so only one replica pushes data to the TSDB.
Storage side: write to two TSDBs and sync them.
Query side: solutions like Thanos or VictoriaMetrics keep two copies of data but deduplicate and join them at query time. Thanos stores data in object storage via a sidecar, while VictoriaMetrics uses its own server.
Actual Requirements
Long‑term storage of ~1 month of data, adding tens of gigabytes per day, with low maintenance cost, disaster recovery, and migration capability.
Unlimited scaling: >300 clusters, thousands of nodes, and many services, requiring sharding by function or tenant.
Global view: a single Grafana dashboard showing metrics from all regions, clusters, and pods.
Non‑intrusive: no modifications to Prometheus code; the solution must stay compatible with rapid Prometheus releases.
After evaluating open‑source (Cortex, Thanos, VictoriaMetrics, StackDriver) and commercial products, we selected Thanos because it satisfies long‑term storage, unlimited scaling, global view, and non‑intrusiveness.
Thanos Architecture
Thanos's default deployment mode is the sidecar approach.
Besides sidecar, Thanos also offers a less common receive mode.
Thanos consists of the following components:
Bucket
Check
Compactor
Query
Rule
Sidecar
Store
Receive (optional)
Downsample (optional)
All components are built from a single binary; different subcommands enable different functionality (e.g., <code>./thanos query</code>, <code>./thanos sidecar</code>).
Components and Configuration
Step 1: Prepare Prometheus
Deploy a Prometheus instance (pod or host). Example launch command:
<code>./prometheus \
--config.file=prometheus.yml \
--log.level=info \
--storage.tsdb.path=data/prometheus \
--web.listen-address='0.0.0.0:9090' \
--storage.tsdb.max-block-duration=2h \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.wal-compression \
--storage.tsdb.retention.time=2h \
--web.enable-lifecycle</code>
Key points:
Enable <code>web.enable-lifecycle</code> for hot‑reloading the configuration.
With 2 h blocks, Thanos uploads each completed block to object storage, so only a short local retention is needed.
Prometheus <code>prometheus.yml</code> (excerpt):
<code>global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    region: 'A'
    replica: 0
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['0.0.0.0:9090']
  - job_name: 'demo-scrape'
    metrics_path: '/metrics'
    params:
      ...</code>
Declare <code>external_labels</code> to identify the region and replica of each Prometheus instance.
Step 2: Deploy Sidecar
The sidecar runs in the same pod as Prometheus and provides two functions:
Exposes Prometheus data via Thanos Store API using Remote Read.
Optionally uploads each TSDB block to an object store.
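The upload side of the sidecar can be pictured as a loop that watches the TSDB directory and ships any completed block it has not yet seen. A minimal sketch, with a plain dict standing in for the object‑store client (all names here are hypothetical, not the sidecar's real internals):

```python
import os

def sync_blocks(tsdb_path, objstore, uploaded):
    """Upload completed TSDB block directories not yet in the object store.

    tsdb_path: local Prometheus data directory
    objstore:  dict acting as a stand-in for a bucket client
    uploaded:  set of block names already shipped
    """
    for block in sorted(os.listdir(tsdb_path)):
        block_dir = os.path.join(tsdb_path, block)
        # A completed block is a directory containing meta.json;
        # the WAL and in-progress data are skipped.
        if block in uploaded or not os.path.isfile(os.path.join(block_dir, "meta.json")):
            continue
        objstore[block] = open(os.path.join(block_dir, "meta.json")).read()
        uploaded.add(block)
    return uploaded
```

This is why the 2 h block settings above matter: the sidecar only ships whole, finished blocks.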
Sidecar command example:
<code>./thanos sidecar \
--prometheus.url="http://localhost:9090" \
--objstore.config-file=./conf/bos.yaml \
--tsdb.path=/home/work/opdir/monitor/Prometheus/data/Prometheus/</code>
Configure the object store (e.g., GCS):
<code>type: GCS
config:
  bucket: ""
  service_account: ""</code>
Step 3: Deploy Query
The query component implements the Prometheus HTTP v1 API and aggregates data from multiple Store APIs (sidecars and store‑gateway).
<code>./thanos query \
--http-address="0.0.0.0:8090" \
--query.replica-label=replica \
--store=replica0:10901 \
--store=replica1:10901 \
--store=replica2:10901 \
--store=127.0.0.1:19914</code>
Two important UI checkboxes:
deduplication: removes duplicate series across replicas.
partial response: returns data from the available replicas when some are down, instead of failing the whole query.
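The partial‑response behaviour can be illustrated with a small sketch: the querier fans out to every store, and when partial responses are allowed it returns whatever data is available plus a warning instead of failing the whole query (the store interface here is a hypothetical simplification):

```python
def fan_out(stores, query, partial_response=True):
    """Query every store; tolerate failures only when partial responses are allowed."""
    results, warnings = [], []
    for name, store in stores.items():
        try:
            results.extend(store(query))
        except ConnectionError as err:
            if not partial_response:
                raise  # strict mode: one dead store fails the whole query
            warnings.append(f"store {name} unavailable: {err}")
    return results, warnings
```

Strict mode trades availability for completeness; the checkbox lets dashboards keep rendering while a replica is down.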
Step 4: Deploy Store‑Gateway
Store‑gateway reads persisted blocks from object storage and serves them via the Store API for historical queries.
<code>./thanos store \
--data-dir=./thanos-store-gateway/tmp/store \
--objstore.config-file=./thanos-store-gateway/conf/bos.yaml \
--http-address=0.0.0.0:19904 \
--grpc-address=0.0.0.0:19914 \
--index-cache-size=250MB \
--sync-block-duration=5m \
--min-time=-2w \
--max-time=-1h</code>
Store‑gateway can be horizontally scaled, but each instance may consume significant CPU and memory.
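The <code>--min-time</code>/<code>--max-time</code> flags above accept relative durations (they also accept absolute RFC3339 timestamps). A rough sketch of how a value like <code>-2w</code> maps to an absolute bound, with the duration grammar simplified to single‑unit values:

```python
from datetime import datetime, timedelta

# Simplified unit table: minutes, hours, days, weeks.
UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

def relative_time(spec, now=None):
    """Turn a single-unit relative duration such as '-2w' or '-1h'
    into an absolute timestamp relative to `now`."""
    now = now or datetime.utcnow()
    sign = -1 if spec.startswith("-") else 1
    value, unit = int(spec.strip("-+")[:-1]), spec[-1]
    return now + sign * timedelta(**{UNITS[unit]: value})
```

With the flags shown above, this instance would serve blocks between roughly two weeks ago and one hour ago, so recent data is left to the sidecars.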
Step 5: Visualize Data
With all components running, add the Thanos query endpoint to Grafana to obtain a unified view of metrics across regions, clusters, and pods.
Receive Mode
In receive mode, each Prometheus remote‑writes its data directly to the Thanos receive component, which holds the most recent ~2 h of data instead of the sidecar relying on local Prometheus storage. It is useful when network policies prevent the query component from reaching in‑cluster Prometheus, or when a clear separation between tenant and control planes is required.
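On the Prometheus side, receive mode only needs a standard <code>remote_write</code> section pointing at the receive component (the endpoint address below is a placeholder, not a value from this deployment):

```yaml
remote_write:
  - url: "http://thanos-receive.example.com:19291/api/v1/receive"
```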
Some Issues
Prometheus Compression
When using the sidecar, set <code>--storage.tsdb.min-block-duration</code> and <code>--storage.tsdb.max-block-duration</code> to the same value (2 h) to disable Prometheus's internal compaction; otherwise a block could be rewritten by compaction while the sidecar is uploading it, causing upload failures.
Store‑Gateway Resource Usage
Store‑gateway’s index cache and block sync can consume large memory; parameters like
--index-cache-size,
--sync-block-duration,
--min-time, and
--max-timecan be tuned to control cache size and query windows.
Compactor Component
Compactor merges old blocks and performs down‑sampling. While down‑sampling reduces query latency for long‑range queries, it does not reduce disk usage; in fact it increases it, because the down‑sampled aggregates are stored alongside the raw data.
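Why down‑sampling grows rather than shrinks storage becomes clear from what it writes: for every coarse window it keeps several aggregates alongside the raw samples. A rough sketch of the idea (window handling simplified, not the compactor's actual code):

```python
def downsample(samples, resolution):
    """Aggregate (timestamp, value) samples into windows of `resolution`
    seconds, keeping count/sum/min/max per window. The raw samples are
    retained separately, so total stored data grows."""
    windows = {}
    for ts, value in samples:
        bucket = ts - ts % resolution  # start of the window this sample falls in
        w = windows.setdefault(bucket, {"count": 0, "sum": 0.0, "min": value, "max": value})
        w["count"] += 1
        w["sum"] += value
        w["min"] = min(w["min"], value)
        w["max"] = max(w["max"], value)
    return windows
```

Keeping several aggregates per window is what lets long‑range queries read far fewer points while still answering avg/min/max correctly.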
Query Deduplication
The query component deduplicates series based on the <code>query.replica-label</code> flag. When multiple replicas return differing values, Thanos selects samples from the most stable replica using an internal penalty‑based algorithm.
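The replica‑label deduplication described above can be sketched as: strip the replica label so series from different replicas collapse into one identity, then emit a single sample per timestamp. In this simplified version the first replica seen wins; the real Thanos choice is penalty‑based:

```python
def deduplicate(series, replica_label="replica"):
    """Merge series that differ only in the replica label.

    series: list of (labels_dict, [(timestamp, value), ...]) pairs.
    Returns {series_identity: {timestamp: value}}.
    """
    merged = {}
    for labels, samples in series:
        # Identity of a series with the replica label removed.
        identity = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        out = merged.setdefault(identity, {})
        for ts, value in samples:
            out.setdefault(ts, value)  # keep one sample per timestamp
    return merged
```

This is also why <code>external_labels</code> must carry a distinct <code>replica</code> value per instance: without it, the querier cannot tell which series are duplicates.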
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career.