How to Scale Prometheus to Thousands of Nodes with Thanos: A Deep Dive
This article examines the storage, query performance, high‑availability, and high‑cardinality challenges of running Prometheus on a thousand‑node Kubernetes cluster and presents a complete, step‑by‑step Thanos‑based architecture, capacity‑planning models, configuration examples, and operational best practices for reliable horizontal scaling.
Background and Challenges
In 2026, large‑scale Kubernetes clusters of 3,000‑5,000 nodes are common, generating tens of thousands of Pods and massive metric volumes. Prometheus, designed as a single‑node time‑series database, hits four critical bottlenecks at this scale:
Storage capacity: With 5,000 series scraped every 15 seconds at roughly 200 bytes per sample, a single Prometheus instance writes about 5,000 × (86,400 / 15) × 200 B ≈ 5.76 GB per day. Real‑world workloads easily exceed 100 GB per day, exhausting local disks within weeks.
Query performance: As data grows, PromQL functions such as rate() and sum() must scan many blocks. Once a node stores more than 500 GB, 90th‑percentile query latency degrades from milliseconds to several seconds.
High availability and data durability: A single Prometheus instance is a single point of failure; losing the host discards all historical data.
High‑cardinality ingestion: Sidecar proxies in service meshes multiply per‑Pod metric counts by more than 10×, causing Prometheus OOM crashes.
To address these issues, a distributed monitoring stack based on Thanos is required.
Architecture Evolution: From Single‑Node to Federated Clusters
Three common scaling patterns are compared:
1. Federation
┌─────────────────────────────────────┐
│          Federation Gateway         │
│ /federate?match[]={job="kubernetes"}│
└────────┬──────────────┬─────────────┘
         │              │
  ┌──────▼─────┐  ┌─────▼─────────┐
  │Prometheus-1│  │Prometheus-2   │
  │(Kubernetes)│  │(Nginx/Gateway)│
  └──────┬─────┘  └─────┬─────────┘
         │              │
  Collect k8s       Collect nginx
  components        metrics

Federation is simple and works for <500 nodes, but the central Prometheus still bears all query load and cannot perform cross‑instance historical queries.
2. Thanos
┌─────────────────────────────────────────────────┐
│                  Thanos Query                   │
│             (global query entry, gRPC)          │
└────────┬────────────────┬────────────────┬──────┘
         │                │                │
   ┌─────▼────┐     ┌─────▼────┐     ┌─────▼────┐
   │Thanos    │     │Thanos    │     │Thanos    │
   │Sidecar-1 │     │Sidecar-2 │     │Store     │
   │Prom-1    │     │Prom-2    │     │Gateway   │
   └──────────┘     └──────────┘     └─────┬────┘
                                           │
                                           └────► Object storage (S3/GCS)

Thanos adds a distributed layer on top of Prometheus, providing global query, long‑term storage, HA, and down‑sampling without modifying Prometheus binaries.
3. Cortex / Mimir
Cortex and its successor Grafana Mimir offer multi‑tenant, horizontally scalable storage, but they bring many more moving parts (distributors, ingesters, compactors, and, in older Cortex versions, an index store such as Cassandra or DynamoDB). For thousand‑node clusters, Thanos remains the most cost‑effective choice.
Core Thanos Design Deep Dive
3.1 Data Ingestion Path
Prometheus writes raw samples to its local TSDB: first into the write‑ahead log (WAL), then into immutable 2‑hour blocks. The Thanos Sidecar watches the Prometheus data directory and uploads each completed block to object storage as a ULID‑named directory. The Thanos Compactor later merges and down‑samples these blocks and marks superseded ones for deletion, and the Store Gateway serves the resulting blocks at query time.
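A minimal Sidecar invocation might look like the sketch below; the paths and the bucket config file name are illustrative assumptions, not values taken from this article.

# thanos-sidecar.sh -- illustrative sketch; paths and addresses are assumptions
# --tsdb.path must match the Prometheus --storage.tsdb.path data directory;
# --objstore.config-file points at the bucket config shown in section 4.3.
thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/s3.yaml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902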
3.2 Query Path Parallelisation
A PromQL query (e.g., sum(rate(http_requests_total{job="api"}[5m])) by (service)) is received by Thanos Query, which discovers metadata from all Sidecars and Store Gateways, plans the query, executes it in parallel across all back‑ends, deduplicates results using the replica label, and performs final aggregation.
1. Query Planning
Parse PromQL, fetch block lists from each Store/Sidecar
2. Parallel Query Execution
Send data requests to all back‑ends concurrently
3. Deduplication
Use "replica" label to keep the newest sample
4. Aggregation
Perform final aggregation (sum, avg, …)

Multiple Thanos Query replicas can be deployed behind a DNS or Service load balancer to achieve linear query‑throughput scaling.
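The same PromQL can be issued against Thanos Query's Prometheus‑compatible HTTP API with deduplication enabled. The Service name below is an assumption; 10902 is the default --http-address port.

# Query through Thanos Query with deduplication enabled (service name is an assumption)
curl -sG 'http://thanos-query.monitoring.svc:10902/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="api"}[5m])) by (service)' \
  --data-urlencode 'dedup=true'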
3.3 Storage Format and Down‑sampling
The Compactor merges 2‑hour raw blocks into larger blocks (compaction) and creates down‑sampled versions (5 min and 1 hour) for older data. Down‑sampling reduces query latency dramatically: a 30‑day raw query (~170 k points) drops from ~30 s to ~300 ms when using 1‑hour down‑sampled data.
Raw: retain 0‑30 days, 15 s resolution
5 min down‑sample: retain 30‑90 days
1 h down‑sample: retain >90 days
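This tiered retention maps directly onto the Compactor's retention flags. The sketch below mirrors the tiers above; 0d for the 1 h tier means "keep indefinitely", matching ">90 days".

# Compactor retention per resolution (values mirror the tiers above)
thanos compact \
  --objstore.config-file=/etc/thanos/s3.yaml \
  --data-dir=/var/lib/thanos/compact \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=0d \
  --wait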
3.4 Block File Structure
01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ/
├── chunks        # compressed sample data
│   └── 000001
├── index         # label index
├── meta.json     # ULID, time range, down-sample level
└── tombstones    # deleted series

The meta.json file records the ULID, time range, source labels, and down‑sampling status, which is essential when troubleshooting missing data.
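An abridged, illustrative meta.json might look like the following; the exact field set varies by Thanos version, and the timestamps and label values here are assumptions.

{
  "ulid": "01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ",
  "minTime": 1767225600000,
  "maxTime": 1767232800000,
  "version": 1,
  "compaction": { "level": 1, "sources": ["01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ"] },
  "thanos": {
    "labels": { "cluster": "prod-cluster", "prometheus_replica": "prometheus-0" },
    "downsample": { "resolution": 0 },
    "source": "sidecar"
  }
}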
Storage Layer Horizontal Scaling: Object Store Selection and Configuration
4.1 Object Store Selection
Thanos supports S3‑compatible storage, GCS, Azure Blob, and Alibaba Cloud OSS. For a thousand‑node deployment, S3‑compatible storage (e.g., MinIO or AWS S3) offers the best balance of performance and multi‑cloud flexibility; within China, Alibaba Cloud OSS is a pragmatic alternative.
4.2 MinIO Cluster Deployment
# MinIO distributed cluster (8 nodes, 4 disks each = 32 drives)
# Erasure coding per 8-drive set: 6 data + 2 parity (EC:2)
# Usable capacity ≈ 75% of raw
# minio-start.sh
#!/bin/bash
export MINIO_ROOT_USER=prometheus
export MINIO_ROOT_PASSWORD=ThanosSecure2026!
export MINIO_REGION=us-east-1
/opt/minio/bin/minio server \
  http://minio{1...8}.cluster.local:9000/data{1...4} \
  --console-address ":9001" \
  --certs-dir /opt/minio/certs \
  --address ":9000"

An eight‑node MinIO cluster with 4 TB per node can sustain roughly 5 GB/s aggregate write throughput, comfortably handling 100‑200 GB daily ingest peaks.
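Before pointing Thanos at the cluster, the bucket has to exist. A minimal sketch with the MinIO client follows; the alias name is an assumption, the bucket and credentials match the setup above and the config in section 4.3.

# Create the thanos-data bucket used in section 4.3
mc alias set thanos-minio http://minio1.cluster.local:9000 prometheus 'ThanosSecure2026!'
mc mb thanos-minio/thanos-data
mc version enable thanos-minio/thanos-data   # optional: protects against accidental overwrites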
4.3 Thanos Object Store Config
# thanos-storage.yaml
type: S3
config:
  bucket: thanos-data
  endpoint: minio.cluster.local:9000
  region: us-east-1
  access_key: thanos_uploader
  secret_key: Than0sSecret2026!
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 120s
  part_size: 5242880        # 5 MiB multipart upload
  signature_version2: false

Key tuning parameters: http_config.idle_conn_timeout defaults to 90 s; increase it for long‑haul transfers. part_size: MinIO works well with 5 MiB parts. response_header_timeout: raise it under high load to avoid premature timeouts.
Query Layer Sharding: Receiver and Store Gateway
5.1 Introducing Thanos Receiver
With the standard Sidecar model, data only reaches object storage when Prometheus cuts a new TSDB block, roughly every 2 hours, so the most recent samples exist only on the local node until then. Thanos Receiver closes this gap: it accepts remote_write streams, writes them to its own local TSDB, and makes the data immediately queryable via the Store API while uploading blocks to object storage asynchronously.
# Data flow A (Sidecar):
Prometheus → Sidecar → Object Store → Store Gateway
# Data flow B (Receiver):
Prometheus → Remote Write → Thanos Receiver → Query (real‑time)
↓
Object Store (asynchronous)

5.2 Receiver Cluster Deployment
# thanos-receive.sh -- Receiver launch flags (the remote_write queue tuning shown
# in the sketch after this block belongs in the Prometheus config, not here)
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --remote-write.address=0.0.0.0:19291 \
  --receive.replication-factor=2 \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.local-endpoint="$(hostname -f):10901" \
  --label='cluster="prod-cluster"' \
  --tsdb.path=/var/lib/thanos/receive \
  --objstore.config-file=/etc/thanos/s3.yaml

Deploy three Receiver instances with --receive.replication-factor=2 so every series is stored on two nodes. Assuming roughly 1 M samples per second and batched, snappy‑compressed remote write on the order of 13 bytes per sample on the wire, the cluster ingests about 13 MB/s; three Receivers give roughly 50 MB/s of headroom.
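On the Prometheus side, the corresponding remote_write block carries the queue tuning. This is a sketch; the Receiver Service name is an assumption, while the queue values mirror those discussed above.

# prometheus.yml (fragment) -- remote_write to the Receiver cluster
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive   # Service name is an assumption
    remote_timeout: 30s
    queue_config:
      capacity: 100000
      min_shards: 10
      max_shards: 50
      max_samples_per_send: 10000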
5.3 Store Gateway Cache Optimisation
Store Gateway caches metadata (≈4 GB) and data blocks (16‑64 GB) to mitigate object‑store latency.
# Store Gateway launch flags
thanos store \
  --data-dir=/var/lib/thanos/store \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --objstore.config-file=s3.yaml \
  --index-cache-size=4GB \
  --chunk-pool-size=16GB

Sizing --index-cache-size and --chunk-pool-size to the figures above keeps most index lookups and chunk reads in memory, reducing object‑store round trips and memory spikes during large queries.
High‑Availability and Failure Recovery
6.1 Prometheus HA Deployment
Run two Prometheus replicas per target, both feeding the same object store. Thanos Query deduplicates using the replica label:
thanos query \
  --query.replica-label=prometheus \
  --query.replica-label=prometheus_replica

Combine this with a PodDisruptionBudget and a preStop hook to ensure graceful shutdown.
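For deduplication to work, each replica must expose the replica label as an external label. A minimal Prometheus fragment follows; the label values and the pod‑name substitution (typically injected via the Downward API or the Prometheus Operator) are assumptions.

# prometheus.yml (fragment) -- external labels used for Thanos deduplication
global:
  external_labels:
    cluster: prod-cluster
    prometheus: k8s-main
    prometheus_replica: $(POD_NAME)   # assumed to be substituted at deploy time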
6.2 Store Gateway HA Design
Deploy at least two Store Gateways with PodAntiAffinity so they run on different nodes. If all Store Gateways fail, Query can still fetch recent data via Sidecar, but historical queries become unavailable.
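A minimal anti‑affinity stanza for the Store Gateway Deployment might look like this; the label selector values are assumptions.

# Store Gateway pod template fragment -- spread replicas across nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: thanos-store
        topologyKey: kubernetes.io/hostname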
6.3 Compactor Failure and Data Repair
Only one Compactor should run against a bucket at a time (Thanos does not coordinate concurrent compactors). If it crashes mid‑run, partially written or overlapping blocks may be left in the bucket. Repair steps:
# Inspect inconsistent blocks
thanos tools bucket inspect \
--objstore.config-file=s3.yaml \
--min-time=2026-01-01T00:00:00Z \
--max-time=2026-03-30T00:00:00Z
# Verify block integrity
thanos tools bucket verify \
--objstore.config-file=s3.yaml \
--id=01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ
# Manually trigger compaction if needed
thanos compact \
  --objstore.config-file=s3.yaml \
  --data-dir=/var/lib/thanos/compact \
  --consistency-delay=5m \
  --debug.accept-malformed-index

Run the Compactor on an hourly cycle; if a single pass exceeds 45 minutes, consider sharding the bucket by label or prefix.
Capacity Planning and Resource Allocation
7.1 Resource Evaluation Model
Assuming 1,000 nodes, 10 Pods per node, and 80 % sidecar coverage, the total active series count approaches 5 million, i.e. roughly 29 billion samples per day at a 15 s scrape interval. After TSDB compression, 30 days of raw blocks come to roughly 2.5 TB in the bucket. With three‑way replication (or comparable erasure‑coding overhead) plus compaction and down‑sampling headroom, plan for roughly 10‑12 TB of physical object‑store capacity.
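The same model expressed as a small script, so the assumptions can be adjusted per environment; the 3‑bytes‑per‑sample figure is an assumed post‑compression on‑disk size, not a measured value.

#!/bin/bash
# capacity-estimate.sh -- rough object-store sizing for the model above
SERIES=5000000              # active series
SCRAPE_INTERVAL=15          # seconds
BYTES_PER_SAMPLE=3          # assumed on-disk size after TSDB compression
RETENTION_DAYS=30           # raw retention
REPLICATION=3               # object-store copies (or erasure-coding overhead)

SAMPLES_PER_DAY=$(( SERIES * 86400 / SCRAPE_INTERVAL ))
RAW_GB=$(( SAMPLES_PER_DAY * BYTES_PER_SAMPLE * RETENTION_DAYS / 1024 / 1024 / 1024 ))
echo "compressed raw blocks: ${RAW_GB} GiB for ${RETENTION_DAYS} days"
echo "physical capacity:     $(( RAW_GB * REPLICATION )) GiB with ${REPLICATION}x replication"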
7.2 Kubernetes Resource Quotas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.36.0
          args:
            - query
            - --query.replica-label=prometheus
            - --query.replica-label=prometheus_replica
            - --query.timeout=2m
            - --query.max-concurrent=20
            - --http-grace-period=5m
          resources:
            requests:
              cpu: "8"
              memory: "16Gi"
            limits:
              cpu: "16"
              memory: "32Gi"
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 10902
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 10902
            initialDelaySeconds: 10
            periodSeconds: 5

Key tuning flags: --query.max-concurrent=20 (increase to 50 under heavy load, at the cost of memory) and --http-grace-period=5m, which lets in‑flight queries finish during termination.
Best‑Practice Checklist
Architecture Design
Partition data by cluster, namespace, and job before introducing Thanos.
Separate write and read paths with distinct Services.
Enable Remote Write rate‑limiting (capacity, max_shards) to apply back‑pressure.
Operational Practices
Configure cross‑region replication for object‑store buckets.
Deploy a minimal Prometheus instance to monitor Thanos components (GRPC latency, bucket operation duration, queue depth).
Adopt a three‑tier retention policy: raw (0‑15 days), 5 min down‑sample (15‑90 days), 1 h down‑sample (>90 days).
Performance Optimisation
Use Recording Rules in Thanos Ruler to pre‑compute heavy aggregations (see the sketch after this list).
Limit external label cardinality (≤5 labels, each ≤100 unique values).
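A minimal recording‑rule group for Thanos Ruler might look like the following; the group name and rule name are illustrative, and the expression reuses the example query from section 3.2.

# rules.yaml -- pre-computed aggregation evaluated by Thanos Ruler
groups:
  - name: api-aggregations
    interval: 1m
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total{job="api"}[5m])) by (service)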