Boost Kubernetes Monitoring: Why Switch from Prometheus to Thanos
This article examines the limitations of a traditional Prometheus monitoring stack on Kubernetes, explains how adopting a Thanos‑based architecture improves metric retention and reduces infrastructure costs, and provides a detailed multi‑cluster deployment guide with Terraform, code snippets, and visualizations.
Introduction
In this article we explore the limitations of the Prometheus monitoring stack and why moving to a Thanos‑based stack can increase metric retention while lowering overall infrastructure costs.
Kubernetes Prometheus Stack
When we deploy Kubernetes for customers, we typically install a monitoring stack on each cluster. The stack usually consists of:
Prometheus – collects metrics
Alertmanager – sends alerts based on metric queries
Grafana – visualizes dashboards
However, this architecture has scalability issues as the number of clusters grows, and storing metric data on local disks becomes expensive, especially when long‑term retention or high‑availability replication is required.
Solution
Several approaches can mitigate these problems:
Expose Prometheus endpoints externally and add them as multiple Grafana data sources (requires TLS or basic auth).
Prometheus federation – useful when scraping a limited set of metrics.
Remote write – not covered in depth here, but useful for pushing metrics to remote storage.
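To make the federation option more concrete, here is a minimal Python sketch that builds the URL a central Prometheus would scrape. The /federate endpoint and its match[] parameter are part of Prometheus; the host name and the selectors are placeholders chosen for illustration, not values from this setup.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a Prometheus /federate URL that pulls only the matching series."""
    if not selectors:
        raise ValueError("federation needs at least one match[] selector")
    # Each selector becomes its own match[] parameter, URL-encoded.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical endpoint and selectors, for illustration only.
url = federate_url(
    "https://prometheus.observee.example.com",
    ['{job="kubelet"}', "up"],
)
print(url)
```

Keeping the selector list short is the point: federation works best when the central server pulls a limited, pre-selected set of series rather than everything.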
Thanos, It’s Here
Thanos is an open‑source, highly available Prometheus system with long‑term storage capabilities. Its main features include unlimited storage via object stores (e.g., S3, MinIO) and a set of components that communicate over gRPC:
Thanos Sidecar – runs alongside Prometheus, uploads each completed TSDB block to object storage (Prometheus cuts a block every two hours by default), and makes the local data queryable.
Thanos Store – acts as a gateway that reads from object storage and serves data to queries.
Thanos Compactor – a singleton that compacts and down‑samples the blocks in object storage to save space and speed up long‑range queries.
Thanos Query – the central query component exposing a PromQL‑compatible endpoint and routing queries to stores.
Thanos Query Frontend – splits large queries into smaller ones and caches results.
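As a rough illustration of the splitting done by the Query Frontend, the sketch below cuts a long range query into sub-ranges aligned to a fixed interval. It assumes a 24-hour interval (the default of the --query-range.split-interval flag) and ignores step alignment and result caching, so it is a simplified model rather than Thanos's actual implementation. The function name is hypothetical.

```python
DAY_SECONDS = 24 * 60 * 60


def split_range(start: int, end: int, interval: int = DAY_SECONDS) -> list[tuple[int, int]]:
    """Split the inclusive range [start, end] (unix seconds) at multiples of interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if end < start:
        raise ValueError("end must not be before start")
    chunks = []
    cursor = start
    while cursor <= end:
        # First timestamp of the next aligned window after the cursor.
        boundary = (cursor // interval + 1) * interval
        chunk_end = min(boundary - 1, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + 1
    return chunks


# A three-day query becomes three day-sized sub-queries.
for sub_start, sub_end in split_range(0, 3 * DAY_SECONDS - 1):
    print(sub_start, sub_end)
```

Aligning sub-ranges to fixed boundaries means two overlapping dashboard queries produce identical sub-queries for the days they share, which is what makes caching the results worthwhile.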
Multi‑Cluster Architecture
We deploy two clusters on AWS using tEKS:
Observer cluster – the central cluster that queries other clusters.
Observee cluster – a minimal Kubernetes cluster with Prometheus/Thanos sidecar that is queried by the observer.
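The observer/observee relationship can be pictured with a small in-memory model: the observer's Thanos Query asks every store for a metric and merges series that differ only by a replica label. This is a sketch under stated assumptions. The label names cluster and replica, the helper names, and the "first value wins" merge rule are all illustrative; Thanos's real deduplication logic is more involved.

```python
from typing import Callable

Labels = tuple[tuple[str, str], ...]  # e.g. (("cluster", "observee"), ("replica", "0"))
Sample = tuple[int, float]            # (unix timestamp, value)
Series = dict[Labels, list[Sample]]
Store = Callable[[str], Series]       # hypothetical stand-in for one StoreAPI endpoint


def static_store(series_by_metric: dict[str, Series]) -> Store:
    """Fake store serving canned series, standing in for a sidecar or store gateway."""
    return lambda metric: series_by_metric.get(metric, {})


def fan_out(metric: str, stores: list[Store], replica_label: str = "replica") -> Series:
    """Ask every store for a metric, then merge series that differ only by replica."""
    merged: dict[Labels, dict[int, float]] = {}
    for store in stores:
        for labels, samples in store(metric).items():
            # Dropping the replica label collapses HA replicas into one series.
            key = tuple(pair for pair in labels if pair[0] != replica_label)
            points = merged.setdefault(key, {})
            for timestamp, value in samples:
                # Simplification: the first store to report a timestamp wins.
                points.setdefault(timestamp, value)
    return {labels: sorted(points.items()) for labels, points in merged.items()}


observer = static_store({"up": {(("cluster", "observer"), ("replica", "0")): [(0, 1.0), (30, 1.0)]}})
observee_a = static_store({"up": {(("cluster", "observee"), ("replica", "0")): [(0, 1.0)]}})
observee_b = static_store({"up": {(("cluster", "observee"), ("replica", "1")): [(0, 1.0), (30, 1.0)]}})

result = fan_out("up", [observer, observee_a, observee_b])
for labels, samples in result.items():
    print(dict(labels), samples)
```

In this toy run the gap in the first observee replica is filled by the second one, which is the practical benefit of running Prometheus in pairs behind a deduplicating querier.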
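The infrastructure is defined with Terraform modules from the terraform‑kubernetes‑addons repository and wired together with Terragrunt (hence the terragrunt.hcl files), which keeps configurations DRY across accounts, regions, and clusters. The repository layout looks like this: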
.
├── env_tags.yaml
├── eu-west-1
│   └── clusters
│       └── observer
│           ├── eks
│           │   ├── kubeconfig
│           │   └── terragrunt.hcl
│           ├── eks-addons
│           │   └── terragrunt.hcl
│           └── vpc
│               └── terragrunt.hcl
└── eu-west-3
    └── clusters
        └── observee
            ├── cluster_values.yaml
            ├── eks
            │   ├── kubeconfig
            │   └── terragrunt.hcl
            ├── eks-addons
            │   └── terragrunt.hcl
            └── vpc
                └── terragrunt.hcl
Deep Dive: Running Components
Listing pods in the monitoring namespace shows all Thanos components (sidecar, query, query‑frontend, store‑gateway, compactor) as well as Prometheus, Grafana, and Alertmanager. Ingress resources expose Grafana and Thanos sidecar endpoints via TLS.
kubectl -n monitoring get pods
NAME ...
... (list omitted for brevity)
kubectl -n monitoring get ingress
NAME ...
... (list omitted for brevity)
Port‑forwarding the Thanos TLS querier allows direct access to the aggregated metrics store.
kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
Grafana Visualization
The default Kubernetes dashboard in Grafana works across multiple clusters once the Thanos query endpoint is added as a data source.
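The query endpoint that Grafana uses as a data source can also be called directly, since Thanos Query exposes the Prometheus HTTP API. The Python sketch below assumes the querier is reachable at http://localhost:10902 through a port-forward and serves plain HTTP on that port; the dedup parameter is a Thanos extension to the Prometheus API, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build the URL for an instant query against a Prometheus-compatible API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run an instant query and return the result list from the JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    return body["data"]["result"]


# Assumed local address of the port-forwarded querier; no request is sent here.
print(instant_query_url("http://localhost:10902", "up"))
```

With the port-forward active, calling instant_query("http://localhost:10902", "up") should return one entry per scrape target visible to the querier, across every cluster it is connected to.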
Conclusion
Thanos is a complex system with many moving parts, but the provided tEKS repository abstracts most of the complexity (especially mTLS) and offers extensive customization. Future work will add support for other cloud providers. For questions, reach out via the GitHub repositories or [email protected].
