How to Build a Scalable, Highly‑Available Prometheus Monitoring Stack with Thanos
This article explains why standard Prometheus HA solutions fall short for large, multi‑region deployments, and walks through using Thanos—its components, configuration, and best‑practice tips—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive monitoring across 300+ clusters.
Background
In the "High‑Availability Prometheus: FAQ" article we briefly mentioned Prometheus HA solutions. After trying federation and Remote Write, we chose Thanos as the monitoring companion, using its global view to manage monitoring data from multiple regions and over 300 clusters. This article introduces Thanos components and our experience.
Prometheus Official HA Options
HA: two identical Prometheus instances behind a load balancer.
HA + Remote storage: multiple Prometheus replicas write to remote storage via Remote Write.
Federation: shards collect different data and a global node stores the unified view.
Even the official multi‑replica + federation approach runs into problems: Prometheus local storage has no data synchronization between replicas, so consistency is hard to guarantee. Typical issues include:
Replica A may lose data during a crash, causing gaps when load‑balanced requests hit A.
Different start times or clocks cause mismatched timestamps across replicas.
Federation still has a single‑point Global node; each layer may need double‑replication.
Sensitive alerts should avoid triggering from the Global node due to potential transmission delays.
Current Practices
Most Prometheus clustering solutions ensure data consistency from storage and query perspectives:
Storage side: use Remote Write with an adapter that elects a leader so only one replica pushes data to the TSDB, guaranteeing no data loss and a single shared remote store.
Storage side (alternative): each replica writes to its own TSDB and synchronizes the two stores.
Query side: solutions like Thanos or VictoriaMetrics keep two copies of data but de‑duplicate and join results at query time. Thanos stores data in object storage via Sidecar; VictoriaMetrics uses its own server.
Actual Requirements
Our cluster fleet keeps growing, bringing more monitoring types and volumes: master/node monitoring, process monitoring, core component performance, pod resources, kube-state-metrics, K8s events, plugin monitoring, etc. Beyond HA, we need a global view with the following requirements:
Long‑term storage: about one month of data, tens of gigabytes per day, low maintenance cost, disaster recovery, preferably on cloud TSDB or object storage.
Unlimited scaling: 300+ clusters, thousands of nodes, tens of thousands of services. Sharding by function or tenant is required.
Global view: a single Grafana dashboard showing all regions, clusters, and pods.
Non‑intrusive: no modifications to existing Prometheus instances; the solution should be a thin wrapper that follows upstream releases.
After evaluating open‑source options (Cortex, Thanos, VictoriaMetrics, StackDriver) and commercial products, we selected Thanos because it meets long‑term storage, unlimited scaling, global view, and non‑intrusive requirements.
Thanos Architecture
Thanos runs in Sidecar mode by default; it also offers a less common Receive mode, covered later in this article.
Thanos consists of the following components (as listed on the official site):
Bucket
Check
Compactor
Query
Rule
Sidecar
Store
Receive (optional)
Downsample (optional)
All components ship in a single thanos binary; each function is enabled via its subcommand.
Components and Configuration
The following steps show how to combine Thanos components for a quick HA Prometheus setup (based on the Quick Start, with recommended configurations as of January 2020).
Step 1: Verify Existing Prometheus
Deploy a single‑node Prometheus (pod or host). Example launch command:
<code>./prometheus \
--config.file=prometheus.yml \
--log.level=info \
--storage.tsdb.path=data/prometheus \
--web.listen-address='0.0.0.0:9090' \
--storage.tsdb.max-block-duration=2h \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.wal-compression \
--storage.tsdb.retention.time=2h \
--web.enable-lifecycle</code>
Key points:
Enable --web.enable-lifecycle for hot‑reloading of the configuration.
Set retention to 2 hours; Prometheus will generate a block every 2 hours, which Thanos uploads to object storage.
The Prometheus configuration (prometheus.yml) must declare external_labels to identify the region and replica:
<code>global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    region: 'A'
    replica: '0'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['0.0.0.0:9090']
  - job_name: 'demo-scrape'
    metrics_path: '/metrics'
    # ... other configs ...
</code>
Requirements for Prometheus:
Version 2.2.1 or higher.
Declare external_labels.
Enable --web.enable-admin-api and --web.enable-lifecycle.
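With --web.enable-lifecycle set, configuration changes can be applied without a restart; a minimal example, assuming Prometheus listens on 0.0.0.0:9090 as above:
<code># edit prometheus.yml, then tell the running instance to reload it
curl -X POST http://localhost:9090/-/reload</code>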
Step 2: Deploy Sidecar
The Sidecar runs in the same pod as Prometheus and provides two functions:
Exposes Prometheus Remote Read as Thanos Store API, allowing Query to fetch data without direct Prometheus API calls.
Optionally uploads each TSDB block (every 2 hours) to object storage, enabling long‑term retention.
Sidecar launch command:
<code>./thanos sidecar \
--prometheus.url="http://localhost:9090" \
--objstore.config-file=./conf/bos.yaml \
--tsdb.path=/home/work/opdir/monitor/Prometheus/data/Prometheus/</code>
If using object storage (e.g., GCS, AWS S3), provide the bucket configuration:
<code>type: GCS
config:
  bucket: ""
  service_account: ""
</code>
Deploy a Sidecar for each Prometheus replica (A, B, C).
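For reference, an S3‑compatible bucket uses the same file shape; the values below are placeholders, not part of the original setup:
<code>type: S3
config:
  bucket: "thanos-blocks"    # placeholder bucket name
  endpoint: "s3.example.com" # placeholder S3 endpoint
  access_key: ""
  secret_key: ""
</code>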
Step 3: Deploy Query Component
Query provides the Prometheus HTTP v1 API and can query across multiple Store APIs (Sidecars and Store Gateways).
<code>./thanos query \
--http-address="0.0.0.0:8090" \
--store=replica0:10901 \
--store=replica1:10901 \
--store=replica2:10901 \
--store=127.0.0.1:19914
</code>
The --store flags point to the Sidecar instances (default gRPC port 10901) and optionally a Store Gateway (here on port 19914).
The Query UI looks similar to Prometheus, allowing you to hide the underlying Prometheus instances.
Two important checkboxes:
Deduplication: removes duplicate series from multiple replicas.
Partial response: enables returning data from available replicas when some are down, trading consistency for availability.
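For the Deduplication checkbox to work, Query must be told which external label distinguishes replicas; a sketch assuming the replica label declared in Step 1:
<code>./thanos query \
--http-address="0.0.0.0:8090" \
--query.replica-label=replica \
--store=replica0:10901 \
--store=replica1:10901 \
--store=replica2:10901</code>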
Step 4: Deploy Store Gateway
The Store Gateway reads persisted blocks from object storage and serves them via the Store API, enabling queries of historic data beyond the recent 2‑hour window kept locally.
<code>./thanos store \
--data-dir=./thanos-store-gateway/tmp/store \
--objstore.config-file=./thanos-store-gateway/conf/bos.yaml \
--http-address=0.0.0.0:19904 \
--grpc-address=0.0.0.0:19914 \
--index-cache-size=250MB \
--sync-block-duration=5m \
--min-time=-2w \
--max-time=-1h
</code>
The Store Gateway can be scaled horizontally; each instance can fetch the same bucket data.
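The same --min-time/--max-time flags also allow time‑based sharding, with one gateway serving recent blocks and another serving older history; a sketch with illustrative ports and windows:
<code># gateway for recent blocks (last two weeks)
./thanos store --objstore.config-file=./conf/bos.yaml \
  --grpc-address=0.0.0.0:19914 --min-time=-2w
# gateway for older blocks
./thanos store --objstore.config-file=./conf/bos.yaml \
  --grpc-address=0.0.0.0:19915 --max-time=-2w</code>
Both gateways are then added to Query via additional --store flags.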
Step 5: Visualize Data with Grafana
With multi‑region, multi‑replica data aggregated, you can create a single Grafana dashboard showing metrics such as ETCD performance per region, node/exporter stats, pod resource usage, and various Kubernetes components.
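Because Query exposes the standard Prometheus HTTP API, Grafana can point at it as an ordinary Prometheus datasource; a minimal provisioning sketch, assuming the Query address from Step 3 (the hostname thanos-query is illustrative):
<code>apiVersion: 1
datasources:
  - name: Thanos-Query
    type: prometheus              # Query is API-compatible with Prometheus
    url: http://thanos-query:8090 # illustrative hostname for the Query component
    access: proxy
    isDefault: true
</code>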
Receive Mode (Optional)
Receive mode uses Remote Write directly, avoiding the 2‑hour window limitation of the sidecar. It is useful when network policies prevent sidecar‑to‑Prometheus communication or when a fully external data path is required.
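A minimal sketch of the Receive data path (addresses are illustrative, not from the original setup): Prometheus remote‑writes samples to the receiver, which stores them and uploads blocks to object storage itself.
<code># thanos receive: accepts Remote Write and uploads blocks to object storage
./thanos receive \
  --tsdb.path=./receive-data \
  --grpc-address=0.0.0.0:10907 \
  --remote-write.address=0.0.0.0:19291 \
  --objstore.config-file=./conf/bos.yaml \
  --label='replica="0"'

# prometheus.yml: ship samples to the receiver instead of using a sidecar
remote_write:
  - url: http://localhost:19291/api/v1/receive
</code>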
Additional Topics
Prometheus Block Compaction
When using Sidecar, set --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to the same value (2h) to disable Prometheus's internal compaction, preventing upload failures.
Store‑Gateway Resource Consumption
The Store Gateway can be memory‑intensive due to index caching. Configuration options such as --index-cache-size, --sync-block-duration, --min-time, and --max-time allow tuning of cache size and query windows.
Compactor Component
Compactor merges old blocks and performs down‑sampling for long‑range queries. It does not reduce disk usage; instead, it creates higher‑level aggregates.
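A minimal launch sketch; the retention values are illustrative, not from the original setup, and only one Compactor instance should run against a given bucket:
<code># retention values below are illustrative; run one compactor per bucket
./thanos compact \
  --data-dir=./thanos-compact/tmp \
  --objstore.config-file=./conf/bos.yaml \
  --http-address=0.0.0.0:19905 \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=180d \
  --wait</code>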
Query De‑duplication Logic
Query de‑duplicates based on the label named by --query.replica-label. When multiple replicas return differing values for the same series, Thanos selects the more complete replica using an internal penalty algorithm.
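For illustration, with --query.replica-label=replica, series that differ only in that label collapse into one (hypothetical series):
<code># without deduplication, both replicas' series are returned:
up{region="A", replica="0"}  1
up{region="A", replica="1"}  1
# with deduplication, the replica label is dropped and results merged:
up{region="A"}  1</code>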
References
Thanos official site
Percona performance analysis
Design introduction
GitHub issue
Katacoda tutorial
Comparison with VictoriaMetrics
Video overview