
How to Build a Scalable, Highly‑Available Monitoring Stack with Thanos, Prometheus & Grafana

Learn how to design a resilient, scalable monitoring solution for multi‑cluster Kubernetes environments using Thanos, Prometheus, and Grafana, covering architecture, data ingestion, querying, long‑term storage on S3, cost savings, and practical deployment tips.

MaGe Linux Operations

Systems that scale elastically and must stay highly available produce large volumes of metric data that have to be collected and stored. This article explains how to build a monitoring solution for such systems using Thanos, Prometheus, and Grafana.

The previous enterprise‑grade monitoring solution was easy to integrate with Kubernetes but suffered from query limits, high alert‑generation costs, and difficulty handling many clusters and metrics. After evaluating alternatives such as OpenTSDB, single‑node Prometheus, and TimescaleDB, the author chose Thanos because it provides long‑term retention, replication, high availability, and a global view across clusters.

Architecture

Because the clusters have no persistent storage (all services are stateless), the standard Prometheus + Thanos sidecar pattern cannot be used; metric storage must reside outside the clusters. The clusters are also isolated from one another, so the Thanos components cannot be tied to any particular set of clusters, and the clusters must be monitored from the outside.

The final architecture is a multi‑data‑center design with the following components per center:

Grafana + Query servers

Storage servers

Three Receive servers (half the number of clusters)

Grafana keeps its database on a small AWS RDS instance, so the team does not have to manage a MySQL server itself.

Four Thanos components are deployed:

Receive: handles TSDB ingestion and replicates data to S3.

Query: serves queries against the data held by Receive.

Store: reads long-term metrics from S3 that are no longer in Receive.

Compactor: downsamples and compacts the TSDB blocks stored in S3.

Data Ingestion

Each cluster runs a dedicated Prometheus pod that scrapes metrics from the control plane, etcd, and all pods with scraping annotations (including kube-proxy, kubelet, node-exporter, kube-state-metrics, metrics-server, etc.). The pod forwards the data to one of the Receive servers through its remote-write configuration.
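The remote-write hop described above might be configured roughly as follows. This is a minimal sketch, not the author's actual configuration: the hostname `receive.example.internal`, the `cluster` external label, and the helper name are hypothetical. Port 19291 and the `/api/v1/receive` path are the Thanos Receive defaults, which a real deployment may override.

```python
def render_remote_write_config(cluster_name: str, receive_host: str) -> str:
    """Render a minimal prometheus.yml fragment as YAML text.

    Assumptions (not from the article): a `cluster` external label is used
    to tell clusters apart, and Receive listens on its default
    remote-write port 19291.
    """
    return (
        "global:\n"
        "  external_labels:\n"
        f"    cluster: {cluster_name}\n"
        "remote_write:\n"
        f"  - url: http://{receive_host}:19291/api/v1/receive\n"
    )


if __name__ == "__main__":
    # Hypothetical cluster name and GSLB hostname, for illustration only.
    print(render_remote_write_config("cluster-a", "receive.example.internal"))
```

Pointing every cluster at one DNS name rather than at individual servers keeps the Prometheus side unaware of which Receive server is currently answering.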

All data is first sent to a single Receive instance, which then replicates to the others. DNS‑based GSLB balances load across healthy Receive servers. It is crucial that metrics are sent to only one Receive instance; sending the same metric to multiple instances causes replication failures.
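The "exactly one healthy Receive instance per sender" rule can be illustrated with a toy selector. This is a sketch under stated assumptions, not the GSLB's real algorithm: it uses rendezvous hashing purely as a stand-in so that each cluster gets one stable target and fails over only when that target is unhealthy. All names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def pick_receive_endpoint(cluster: str, endpoints: List[str],
                          healthy: Iterable[str]) -> str:
    """Return exactly one healthy endpoint for a cluster.

    The choice is deterministic: the same cluster keeps the same target
    until that target drops out of the healthy set.
    """
    healthy_set = set(healthy)
    candidates = [e for e in endpoints if e in healthy_set]
    if not candidates:
        raise RuntimeError("no healthy Receive endpoint available")

    def score(endpoint: str) -> str:
        # Highest hash wins (rendezvous hashing); purely illustrative.
        return hashlib.sha256(f"{cluster}|{endpoint}".encode()).hexdigest()

    return max(candidates, key=score)


if __name__ == "__main__":
    servers = ["receive-1", "receive-2", "receive-3"]
    print(pick_receive_endpoint("cluster-a", servers, servers))
```

The point of the sketch is the invariant, not the hashing: a sender never writes the same series to two Receive instances at once, and replication between instances is left to Receive itself.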

Metrics are also uploaded to an S3 bucket for long‑term retention. Receive uploads a TSDB block every two hours (when a block is closed). Local data is retained for 30 days for troubleshooting and faster queries; data older than 30 days remains only in S3 for up to one year.
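To make the two-hour cadence concrete, the sketch below computes which two-hour window a sample falls into, assuming windows are aligned to the Unix epoch in UTC as in the Prometheus TSDB. It is a simplification: the real cut and upload happen some time after the window ends, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

BLOCK_RANGE = timedelta(hours=2)  # block length stated in the article
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def block_window(sample_time: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) two-hour window containing sample_time."""
    elapsed = sample_time - EPOCH
    start = EPOCH + (elapsed // BLOCK_RANGE) * BLOCK_RANGE
    return start, start + BLOCK_RANGE


if __name__ == "__main__":
    sample = datetime(2024, 1, 1, 13, 37, tzinfo=timezone.utc)
    start, end = block_window(sample)
    # The block holding this sample can only be uploaded after `end`.
    print(start.isoformat(), end.isoformat())
```

A sample written at 13:37 UTC therefore belongs to the 12:00 to 14:00 window, so in the worst case roughly two hours of data exist only on the Receive servers before reaching S3.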

Data Query

Collected data is stored in the Receive layer and made queryable via the Query component, which is deployed in each data center for high availability.

Each server runs Grafana and Query. If a server fails, the load balancer can route traffic to the remaining instances. Grafana’s data source points to localhost, ensuring it always uses the local Query service.
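Because Query exposes a Prometheus-compatible HTTP API, the local endpoint that Grafana talks to can also be exercised by hand. The sketch below only builds the request URL. It assumes Query listens on its default HTTP port 10902 on the same host and uses the `dedup` query parameter of the Thanos Query API; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: Thanos Query runs on the same host with its default HTTP port.
LOCAL_QUERY_BASE = "http://localhost:10902"


def instant_query_url(expr: str, dedup: bool = True,
                      base: str = LOCAL_QUERY_BASE) -> str:
    """Build an instant-query URL against the local Query instance."""
    params = {"query": expr, "dedup": "true" if dedup else "false"}
    return f"{base}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    print(instant_query_url("up"))
```

Fetching such a URL from the server itself is a quick way to check that the Query process behind a given Grafana instance is answering before blaming the load balancer.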

The Query component knows all servers that store metrics (Receive and Store) and performs replica deduplication using the flag --query.replica-label=QUERY.REPLICA-LABEL, where QUERY.REPLICA-LABEL is a placeholder for the name of the label that distinguishes replicas. This configuration lets Query identify and discard duplicate metrics, so that only a single data point per metric is used.
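The idea behind replica deduplication can be shown with a deliberately simplified model: series that differ only in the replica label are treated as the same series, and overlapping timestamps are collapsed to one point. Thanos' real algorithm is more involved (it picks a replica and switches on gaps), so this is an illustration of the concept only; the `replica` label name and all data are hypothetical.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]          # (timestamp in ms, value)
SeriesKey = Tuple[Tuple[str, str], ...]


def deduplicate(series: List[Tuple[Labels, List[Sample]]],
                replica_label: str) -> Dict[SeriesKey, List[Sample]]:
    """Merge series that are identical once replica_label is dropped."""
    merged: Dict[SeriesKey, Dict[int, float]] = {}
    for labels, samples in series:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            # The first replica seen wins for a given timestamp.
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    replicas = [
        ({"__name__": "up", "cluster": "a", "replica": "r0"},
         [(1000, 1.0), (2000, 1.0)]),
        ({"__name__": "up", "cluster": "a", "replica": "r1"},
         [(2000, 1.0), (3000, 0.0)]),
    ]
    for key, samples in deduplicate(replicas, "replica").items():
        print(dict(key), samples)
```

In this toy data the two replicas collapse into one `up` series with three points, which is the effect a dashboard user sees when deduplication is enabled.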

Long‑Term Data

Local retention is limited to 30 days; older data resides only in S3, which reduces storage costs on the Receive side. The Store component keeps a local copy of each TSDB block index from S3, enabling it to locate and download the data it needs when a query reaches back more than 30 days.
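The two retention tiers can be pictured as a split of a query's time range. The sketch below is a mental model only: in practice Query fans out to Receive and Store and each returns whatever it holds, and recent blocks may exist in both places. The 30-day and one-year limits come from the text; the function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple

LOCAL_RETENTION = timedelta(days=30)    # kept on the Receive servers
BUCKET_RETENTION = timedelta(days=365)  # kept in S3

Range = Tuple[datetime, datetime]


def split_query_range(start: datetime, end: datetime,
                      now: datetime) -> Dict[str, Optional[Range]]:
    """Split [start, end) into the part served from S3 and from Receive."""
    local_cutoff = now - LOCAL_RETENTION
    bucket_cutoff = now - BUCKET_RETENTION
    start = max(start, bucket_cutoff)  # anything older has expired
    plan: Dict[str, Optional[Range]] = {"store": None, "receive": None}
    if start < min(end, local_cutoff):
        plan["store"] = (start, min(end, local_cutoff))
    if max(start, local_cutoff) < end:
        plan["receive"] = (max(start, local_cutoff), end)
    return plan


if __name__ == "__main__":
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    print(split_query_range(now - timedelta(days=45), now, now))
```

A 45-day dashboard range therefore touches S3 only for its oldest 15 days, which is why short troubleshooting queries stay fast.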

Monitoring Statistics

Monitored 6 Kubernetes clusters

Collected metrics from 670 services

Monitored 246 servers with Node Exporter

Ingested ~270k metrics per minute

~7.3 GB ingested per day (~226 GB per month)

Created 40 dedicated dashboards for Kubernetes components

Configured 116 alerts in Grafana
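The ingestion figures in the list above can be sanity-checked with a few lines of arithmetic. The inputs are the article's rounded numbers; treating "per month" as a 31-day month is an assumption made here because it is the reading that reproduces the ~226 GB figure.

```python
METRICS_PER_MINUTE = 270_000   # "~270k metrics per minute"
GB_PER_DAY = 7.3               # "~7.3 GB ingested per day"

per_second = METRICS_PER_MINUTE / 60
per_day = METRICS_PER_MINUTE * 60 * 24
gb_per_31_day_month = GB_PER_DAY * 31   # assumption: 31-day month
gb_per_30_day_month = GB_PER_DAY * 30

if __name__ == "__main__":
    print(f"{per_second:.0f} metrics/s, {per_day:,} metrics/day")
    print(f"{gb_per_31_day_month:.1f} GB per 31-day month")
    print(f"{gb_per_30_day_month:.1f} GB per 30-day month")
```

At these rates the stack handles about 4,500 metrics per second, and a 30-day month would come to about 219 GB rather than 226 GB.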

Because most components run locally, monthly costs dropped by 90.61%, from $38,421.25 to $3,608.99 (including AWS service fees).
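The quoted saving can be reproduced directly from the two monthly figures. The sketch below uses only the numbers stated above; the function name and the annualized figure are additions for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from before to after, in percent."""
    return (before - after) / before * 100


BEFORE_USD = 38_421.25  # previous monthly cost, from the article
AFTER_USD = 3_608.99    # new monthly cost, from the article

monthly_saving = BEFORE_USD - AFTER_USD
reduction = percent_reduction(BEFORE_USD, AFTER_USD)

if __name__ == "__main__":
    print(f"reduction: {reduction:.2f}%")
    print(f"saving: ${monthly_saving:,.2f}/month, "
          f"${monthly_saving * 12:,.2f}/year")
```

The result rounds to 90.61%, matching the stated figure, and corresponds to a saving of $34,812.26 per month.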

Summary

Setting up the architecture took about a month, including evaluating alternatives, validating the design, implementing, enabling collection on clusters, and building dashboards.

Within the first week the benefits were clear: monitoring became easier, dashboards could be built and customized quickly, and metric collection was almost plug‑and‑play. Integration of Grafana with LDAP provided fine‑grained team permission control, allowing developers and SREs to access a wealth of dashboards covering namespaces, ingress, and other metrics.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

