
How to Scale Prometheus to Thousands of Nodes with Thanos: A Deep Dive

This article examines the storage, query performance, high‑availability, and high‑cardinality challenges of running Prometheus on a thousand‑node Kubernetes cluster and presents a complete, step‑by‑step Thanos‑based architecture, capacity‑planning models, configuration examples, and operational best practices for reliable horizontal scaling.

MaGe Linux Operations

Background and Challenges

In 2026, large‑scale Kubernetes clusters of 3,000‑5,000 nodes are common, generating tens of thousands of Pods and massive metric volumes. Prometheus, originally a single‑node time‑series database, hits four critical bottlenecks at this scale:

Storage capacity : With 5,000 series scraped every 15 seconds at roughly 200 bytes per sample, a single Prometheus instance writes about 5,000 × (86,400 / 15) × 200 B ≈ 5.76 GB per day. Real‑world workloads easily exceed 100 GB per day, exhausting local disks within weeks.

Query performance : As data grows, PromQL functions such as rate() and sum() must scan many blocks. Once a node stores more than 500 GB, 90th‑percentile query latency degrades from milliseconds to several seconds.

High‑availability and data durability : A single Prometheus instance is a single point of failure; losing the host discards all historical data.

High‑cardinality ingestion : Sidecar proxies in service meshes multiply per‑Pod metric counts by more than 10×, which can drive Prometheus into OOM crashes.

To address these issues, a distributed monitoring stack based on Thanos is required.

Architecture Evolution: From Single‑Node to Federated Clusters

Three common scaling patterns are compared:

1. Federation

┌──────────────────────────────────────┐
│          Federation Gateway          │
│ /federate?match[]={job="kubernetes"} │
└────────┬──────────────────┬──────────┘
         │                  │
┌────────▼─────┐   ┌────────▼────────┐
│ Prometheus-1 │   │ Prometheus-2    │
│ (Kubernetes) │   │ (Nginx/Gateway) │
└────────┬─────┘   └────────┬────────┘
         │                  │
  Collect k8s          Collect nginx
  components           metrics

Federation is simple and works for < 500 nodes, but the central Prometheus still bears all query load and cannot perform cross‑instance historical queries.
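For reference, a minimal federation scrape job on the central Prometheus looks like this (the target addresses and the match[] selector are illustrative):

# prometheus.yml on the federation gateway (illustrative targets)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes"}'
    static_configs:
      - targets:
        - 'prometheus-1:9090'
        - 'prometheus-2:9090'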

2. Thanos

┌───────────────────────────────────────────────┐
│                 Thanos Query                  │
│          (global query entry, gRPC)           │
└──────┬───────────────┬───────────────┬────────┘
       │               │               │
┌──────▼─────┐  ┌──────▼─────┐  ┌──────▼─────┐
│ Thanos     │  │ Thanos     │  │ Thanos     │
│ Sidecar-1  │  │ Sidecar-2  │  │ Store      │
│ (Prom-1)   │  │ (Prom-2)   │  │ Gateway    │
└────────────┘  └────────────┘  └──────┬─────┘
                                       │
                                       ▼
                         Object storage (S3/GCS)

Thanos adds a distributed layer on top of Prometheus, providing global query, long‑term storage, HA, and down‑sampling without modifying Prometheus binaries.

3. Cortex / Mimir

Cortex (succeeded by Grafana Mimir) offers multi‑tenant, horizontally scalable storage but historically required complex backing stores such as Cassandra or DynamoDB. For thousand‑node clusters, Thanos remains the most cost‑effective choice.

Core Thanos Design Deep Dive

3.1 Data Ingestion Path

Prometheus writes raw samples to its local TSDB, cutting an immutable block roughly every two hours. The Thanos Sidecar watches the data directory and uploads each completed block to object storage as a ULID‑named directory. The Thanos Compactor then merges and down‑samples the uploaded blocks, which the Store Gateway serves at query time.

3.2 Query Path Parallelisation

A PromQL query (e.g., sum(rate(http_requests_total{job="api"}[5m])) by (service)) is received by Thanos Query, which discovers metadata from all Sidecars and Store Gateways, plans the query, executes it in parallel across all back‑ends, deduplicates results using the replica label, and performs final aggregation.

1. Query Planning
   Parse PromQL, fetch block lists from each Store/Sidecar
2. Parallel Query Execution
   Send data requests to all back‑ends concurrently
3. Deduplication
   Use the replica label to collapse identical series from HA pairs into one result
4. Aggregation
   Perform final aggregation (sum, avg, …)

Multiple Thanos Query replicas can be deployed behind a DNS or Service load balancer to achieve linear query‑throughput scaling.
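As a sketch, a Query replica wired to the back‑ends in the diagram above might be launched as follows (the endpoint addresses are assumptions about service naming):

# Thanos Query launch flags (illustrative endpoints)
thanos query \
  --http-address=0.0.0.0:10902 \
  --grpc-address=0.0.0.0:10901 \
  --endpoint=thanos-sidecar-1:10901 \
  --endpoint=thanos-sidecar-2:10901 \
  --endpoint=thanos-store:10901 \
  --query.replica-labels=prometheus_replica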

3.3 Storage Format and Down‑sampling

The Compactor merges 2‑hour raw blocks into larger blocks (compaction) and creates down‑sampled versions (5 min and 1 hour) for older data. Down‑sampling reduces query latency dramatically: a 30‑day raw query (~170 k points) drops from ~30 s to ~300 ms when using 1‑hour down‑sampled data.

Raw: retain 0‑30 days, 15 s resolution

5 min down‑sample: retain 30‑90 days

1 h down‑sample: retain >90 days
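This tiering maps directly onto the Compactor's retention flags; a minimal sketch (0d disables deletion, i.e. the 1‑hour tier is kept indefinitely):

# Compactor retention matching the tiers above
thanos compact \
  --objstore.config-file=s3.yaml \
  --data-dir=/var/lib/thanos/compact \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=0d \
  --wait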

3.4 Block File Structure

01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ/
├── chunks       # compressed sample data in segment files
│   └── 000001
├── index        # series and label index
├── meta.json    # ULID, time range, down‑sample level
└── tombstones   # deletion markers for series

The meta.json file records ULID, time range, source instance ID, and down‑sampling status, which is essential for troubleshooting missing data.
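An illustrative meta.json for a raw (resolution 0) block uploaded by a Sidecar; the timestamps and label values are placeholders:

{
  "ulid": "01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ",
  "minTime": 1767225600000,
  "maxTime": 1767232800000,
  "version": 1,
  "thanos": {
    "labels": { "cluster": "prod-cluster", "prometheus_replica": "replica-0" },
    "downsample": { "resolution": 0 },
    "source": "sidecar"
  }
}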

Storage Layer Horizontal Scaling: Object Store Selection and Configuration

Thanos supports S3‑compatible storage, GCS, Azure Blob, and Alibaba OSS. For a thousand‑node deployment, S3‑compatible storage (e.g., MinIO or AWS S3) offers the best balance of performance and multi‑cloud flexibility. In China, OSS is a pragmatic alternative.

4.2 MinIO Cluster Deployment

# MinIO distributed cluster (8 nodes, 4 disks each)
# Erasure coding EC:2 per 8-drive set (6 data + 2 parity)
# ≈75 % of raw capacity usable

#!/bin/bash
# minio-start.sh
export MINIO_ROOT_USER=prometheus
export MINIO_ROOT_PASSWORD=ThanosSecure2026!
export MINIO_REGION=us-east-1

/opt/minio/bin/minio server \
  http://minio{1...8}.cluster.local:9000/data{1...4} \
  --console-address ":9001" \
  --certs-dir /opt/minio/certs \
  --address ":9000"

Eight‑node MinIO with 4 TB per node can sustain ~5 GB/s aggregate write throughput, comfortably handling 100‑200 GB daily ingest peaks.
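Before pointing Thanos at the cluster, the bucket and a dedicated upload user can be created with the MinIO client; a sketch, where the alias name and policy choice are assumptions:

# Create the Thanos bucket and an upload user with mc
mc alias set thanos-minio http://minio1.cluster.local:9000 prometheus 'ThanosSecure2026!'
mc mb thanos-minio/thanos-data
mc admin user add thanos-minio thanos_uploader 'Than0sSecret2026!'
mc admin policy attach thanos-minio readwrite --user thanos_uploader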

4.3 Thanos Object Store Config

# thanos-storage.yaml
type: S3
config:
  bucket: thanos-data
  endpoint: minio.cluster.local:9000
  region: us-east-1
  access_key: thanos_uploader
  secret_key: Than0sSecret2026!
  bucket_lookup_type: path   # path-style addressing for MinIO
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 120s
  part_size: 5242880   # 5 MB multipart upload
  signature_version2: false

Key tuning parameters:

http_config.idle_conn_timeout : default 90 s; increase to 120 s for long‑haul transfers.

part_size : MinIO works best with 5 MB multipart parts.

http_config.response_header_timeout : raise under high load to avoid premature timeouts.

Query Layer Sharding: Receiver and Store Gateway

5.1 Introducing Thanos Receiver

Standard Sidecar upload introduces up to a two‑hour visibility gap in object storage, because samples sit in the Prometheus head block until a TSDB block is cut and uploaded. Thanos Receiver instead accepts remote_write streams, writes them to a local TSDB, and makes the data instantly queryable via the Store API.

# Data flow A (Sidecar):
Prometheus → Sidecar → Object Store → Store Gateway

# Data flow B (Receiver):
Prometheus → Remote Write → Thanos Receiver → Query (real‑time)
                                  ↓
                       Object Store (asynchronous)

5.2 Receiver Cluster Deployment

# thanos receive launch flags
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --remote-write.address=0.0.0.0:19291 \
  --receive.replication-factor=2 \
  --receive.hashrings-file=/etc/thanos/hashrings.json \  # hashring listing the three receivers
  --tsdb.path=/var/lib/thanos/receive \
  --objstore.config-file=s3.yaml \
  --label='cluster="prod-cluster"'

# prometheus.yml: queue tuning on the sending side (service address is environment-specific)
remote_write:
  - url: http://thanos-receive.monitoring:19291/api/v1/receive
    remote_timeout: 30s
    queue_config:
      capacity: 100000
      max_shards: 50
      min_shards: 10
      max_samples_per_send: 10000

Deploy three Receiver instances with --receive.replication-factor=2 to keep two copies of every sample. Assuming 1 M active series scraped every 15 s (~67 k samples/s) at ~200 B per sample, ingest is roughly 13 MB/s before replication; three nodes provide ~50 MB/s of headroom.

5.3 Store Gateway Cache Optimisation

Store Gateway caches metadata (≈4 GB) and data blocks (16‑64 GB) to mitigate object‑store latency.

# Store Gateway launch flags
thanos store \
  --data-dir=/var/lib/thanos/store \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --objstore.config-file=s3.yaml \
  --index-cache-size=4GB \
  --chunk-pool-size=16GB \
  --store.enable-index-header-lazy-reader

Enable --store.enable-index-header-lazy-reader so block index headers are loaded only when a block is actually queried, shortening startup and reducing memory peaks; size --chunk-pool-size to match the data‑block cache budget above.

High‑Availability and Failure Recovery

6.1 Prometheus HA Deployment

Run two Prometheus replicas scraping the same targets, both feeding the same object store. Thanos Query deduplicates using the replica label:

thanos query \
  --query.replica-labels=prometheus \
  --query.replica-labels=prometheus_replica

Combine with a PodDisruptionBudget and preStop hook to ensure graceful shutdown.
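Deduplication only works if each replica advertises the labels named above; a minimal sketch of the per‑replica external labels (values are illustrative):

# prometheus.yml on each HA replica
global:
  external_labels:
    prometheus: k8s-prod
    prometheus_replica: replica-0   # replica-1 on the second instance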

6.2 Store Gateway HA Design

Deploy at least two Store Gateways with PodAntiAffinity so they run on different nodes. If all Store Gateways fail, Query can still fetch recent data via Sidecar, but historical queries become unavailable.
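A minimal PodAntiAffinity stanza for the Store Gateway pod template, assuming the pods carry an app: thanos-store label:

# Pod template snippet: force Store Gateways onto different nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: thanos-store
      topologyKey: kubernetes.io/hostname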

6.3 Compactor Failure and Data Repair

Only one Compactor may run against a bucket at a time, so deploy it as a singleton. If it crashes mid‑run, partially compacted or overlapping blocks may be left behind. Repair steps:

# Inspect inconsistent blocks
thanos tools bucket inspect \
  --objstore.config-file=s3.yaml \
  --min-time=2026-01-01T00:00:00Z \
  --max-time=2026-03-30T00:00:00Z

# Verify block integrity
thanos tools bucket verify \
  --objstore.config-file=s3.yaml \
  --id=01H2B5GXYZ123D2ZZZZZZZZZZZZZZZZ

# Manually trigger compaction if needed
thanos compact \
  --objstore.config-file=s3.yaml \
  --data-dir=/var/lib/thanos/compact \
  --consistency-delay=5m \
  --debug.accept-malformed-index

Run compaction hourly; if a cycle exceeds 45 minutes, consider running multiple Compactors sharded by external labels, as sketched below.
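Each Compactor shard then owns a disjoint slice of the bucket via a relabel selector; a sketch, assuming blocks carry the cluster external label:

# Shard 1 of 2: only compact blocks whose cluster label matches
thanos compact \
  --objstore.config-file=s3.yaml \
  --data-dir=/var/lib/thanos/compact-shard1 \
  --selector.relabel-config-file=/etc/thanos/shard1.yaml \
  --wait

# /etc/thanos/shard1.yaml
- action: keep
  source_labels: [cluster]
  regex: prod-cluster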

Capacity Planning and Resource Allocation

7.1 Resource Evaluation Model

Assuming 1,000 nodes, 10 Pods per node, and 80 % sidecar coverage, the total series count approaches 5 million (~320 k samples/s at a 15 s scrape interval). That is roughly 1.7 TB per day in flight before compression; after TSDB compression, 30 days of raw blocks occupy ≈2.5 TB. With three‑replica storage plus down‑sampled copies and index overhead, the physical requirement reaches ~12 TB.
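A back‑of‑the‑envelope sketch of this model; the per‑Pod series count, per‑sample byte sizes, and the 1.6× overhead factor are assumptions, not measured values:

# capacity_model.py - rough Thanos sizing (all inputs are assumptions)
NODES = 1000
PODS_PER_NODE = 10
SERIES_PER_POD = 400            # assumed average, incl. service-mesh sidecar metrics
SIDECAR_COVERAGE = 0.8
SCRAPE_INTERVAL_S = 15
BYTES_IN_FLIGHT = 60            # assumed uncompressed bytes per sample
BYTES_ON_DISK = 3.0             # assumed bytes per sample after TSDB compression

series = NODES * PODS_PER_NODE * SERIES_PER_POD * (1 + 0.25 * SIDECAR_COVERAGE)
samples_per_sec = series / SCRAPE_INTERVAL_S

daily_inflight_tb = samples_per_sec * 86400 * BYTES_IN_FLIGHT / 1e12
raw_30d_tb = samples_per_sec * 86400 * 30 * BYTES_ON_DISK / 1e12
physical_tb = raw_30d_tb * 3 * 1.6   # 3 replicas + down-samples/index overhead

print(f"series: {series / 1e6:.1f} M  ingest: {samples_per_sec / 1e3:.0f} k samples/s")
print(f"daily in-flight: {daily_inflight_tb:.1f} TB  30-day raw: {raw_30d_tb:.1f} TB  physical: ~{physical_tb:.0f} TB")

Running it reproduces the numbers above: ~4.8 M series, ~320 k samples/s, ~1.7 TB/day in flight, ~2.5 TB of 30‑day raw blocks, and ~12 TB physical.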

7.2 Kubernetes Resource Quotas

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:v0.36.0
        args:
        - query
        - --query.replica-labels=prometheus
        - --query.replica-labels=prometheus_replica
        - --query.timeout=2m
        - --query.max-concurrent=20
        - --http-grace-period=5m
        resources:
          requests:
            cpu: "8"
            memory: "16Gi"
          limits:
            cpu: "16"
            memory: "32Gi"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10
          periodSeconds: 5

Key tuning flags: --query.max-concurrent=20 (increase to 50 under heavy load, at the cost of memory). --http-grace-period=5m to allow in‑flight queries to finish during termination.

Best‑Practice Checklist

Architecture Design

Partition data by cluster, namespace, and job before introducing Thanos.

Separate write and read paths with distinct Services.

Enable Remote Write rate‑limiting (capacity, max_shards) to apply back‑pressure.

Operational Practices

Configure cross‑region replication for object‑store buckets.

Deploy a minimal Prometheus instance to monitor Thanos components (GRPC latency, bucket operation duration, queue depth).

Adopt a three‑tier retention policy: raw (0‑30 days), 5 min down‑sample (30‑90 days), 1 h down‑sample (>90 days).

Performance Optimisation

Use Recording Rules in Thanos Ruler to pre‑compute heavy aggregations (see the rule sketch after this list).

Limit external label cardinality (≤5 labels, each ≤100 unique values).
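As an example of the Recording Rules item above, the heavy aggregation from section 3.2 can be pre‑computed by the Ruler; the group and recorded metric names are illustrative:

# thanos-rules.yaml, evaluated by Thanos Ruler
groups:
  - name: precomputed-aggregations
    interval: 1m
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total{job="api"}[5m])) by (service)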

Tags: Monitoring, Observability, Kubernetes, Prometheus, Scaling, Thanos