Boost Kubernetes Monitoring: Why Switch from Prometheus to Thanos

This article examines the limitations of a traditional Prometheus monitoring stack on Kubernetes, explains how adopting a Thanos‑based architecture improves metric retention and reduces infrastructure costs, and provides a detailed multi‑cluster deployment guide with Terraform, code snippets, and visualizations.

MaGe Linux Operations

Introduction

In this article we explore the limitations of the Prometheus monitoring stack and why moving to a Thanos‑based stack can increase metric retention while lowering overall infrastructure costs.

Kubernetes Prometheus Stack

When deploying Kubernetes for customers, a monitoring stack is typically installed on each cluster. The stack usually consists of:

Prometheus – collects metrics

Alertmanager – sends alerts based on metric queries

Grafana – visualizes dashboards

However, this architecture has scalability issues as the number of clusters grows, and storing metric data on local disks becomes expensive, especially when long‑term retention or high‑availability replication is required.
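To make the disk cost concrete, the Prometheus storage documentation gives a rough sizing rule: needed disk space is retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). The sketch below only applies that arithmetic. The ingestion rate, retention periods, and replica count are hypothetical values chosen for illustration, not measurements from the clusters described here.

```python
def prometheus_disk_bytes(retention_days, samples_per_second,
                          bytes_per_sample=2.0, replicas=1):
    """Rough local-disk estimate for one Prometheus setup.

    Follows the sizing rule from the Prometheus storage docs:
    retention_seconds * samples_per_second * bytes_per_sample.
    `replicas` multiplies the result for HA pairs, which each keep a full copy.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample * replicas


# Hypothetical cluster ingesting 100,000 samples per second.
short_retention = prometheus_disk_bytes(15, 100_000)
long_retention_ha = prometheus_disk_bytes(365, 100_000, replicas=2)

print(f"15 days, 1 replica:   {short_retention / 1e9:.1f} GB")    # 259.2 GB
print(f"365 days, 2 replicas: {long_retention_ha / 1e12:.1f} TB")  # 12.6 TB
```

Under these assumed numbers, a year of replicated retention needs roughly fifty times the disk of the short-retention setup, and that is per cluster.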

Solution

Several approaches can mitigate these problems:

Expose Prometheus endpoints externally and add them as multiple Grafana data sources (requires TLS or basic auth).

Prometheus federation – useful when scraping a limited set of metrics.

Remote write – not covered in depth here, but useful for pushing metrics to remote storage.
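As an illustration of the federation option, a central Prometheus scrapes the `/federate` endpoint of each cluster's Prometheus and passes one `match[]` parameter per series selector. The sketch below only builds such a URL. The hostname and the selectors are hypothetical, and in practice the same values would normally go into a scrape job's configuration rather than a hand-built URL.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a Prometheus /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical endpoint and selectors, for illustration only.
url = federate_url(
    "https://prometheus.observee.example.com",
    ['{job="kubelet"}', '{__name__=~"node_.*"}'],
)
print(url)
```

Keeping the selector list short is the point of this approach: federation works best when only a limited set of series is pulled upward.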

Enter Thanos

Thanos is an open-source project that extends Prometheus with high availability, a global query view, and long-term storage. Metric blocks are kept in an object store (e.g., S3, MinIO), so retention is no longer bounded by local disk size. It is made up of a set of components that communicate over gRPC:

Thanos Sidecar – runs alongside Prometheus, uploads each completed TSDB block to object storage (Prometheus cuts a block about every two hours), and makes Prometheus's recent data queryable by Thanos Query.

Thanos Store – the store gateway: reads blocks from object storage and serves them to queries.

Thanos Compactor – a singleton that compacts and downsamples the blocks in object storage to save space and speed up long-range queries.

Thanos Query – the central query component exposing a PromQL‑compatible endpoint and routing queries to stores.

Thanos Query Frontend – splits large queries into smaller ones and caches results.
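The splitting done by Query Frontend can be sketched as follows. This is a simplified illustration, not Thanos's implementation: it only cuts a time range at multiples of a split interval (Thanos exposes the interval as `--query-range.split-interval`) and ignores step alignment and caching. The function name `split_range`, the one-day interval, and the timestamps are assumptions made for the example.

```python
def split_range(start, end, interval):
    """Split the half-open range [start, end) at multiples of `interval`.

    All values are in seconds. Each returned (sub_start, sub_end) pair can be
    run, and cached, as an independent range query.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ranges = []
    cursor = start
    while cursor < end:
        # Next multiple of `interval` strictly after the cursor.
        boundary = (cursor // interval + 1) * interval
        stop = min(boundary, end)
        ranges.append((cursor, stop))
        cursor = stop
    return ranges


DAY = 24 * 60 * 60  # assumed split interval for this sketch

# A two-day query that starts at noon is cut at each midnight boundary.
parts = split_range(12 * 60 * 60, 12 * 60 * 60 + 2 * DAY, DAY)
print(parts)  # [(43200, 86400), (86400, 172800), (172800, 216000)]
```

Because the cut points are aligned to fixed boundaries rather than to the query's start time, two dashboards asking for overlapping ranges produce identical sub-queries, which is what makes result caching effective.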

Thanos architecture diagram

Multi‑Cluster Architecture

We deploy two clusters on AWS using tEKS:

Observer cluster – the central cluster that queries other clusters.

Observee cluster – a minimal Kubernetes cluster running Prometheus with a Thanos sidecar, which the observer queries.

The infrastructure is defined with Terraform modules from the terraform‑kubernetes‑addons repository, enabling DRY configurations across accounts, regions, and clusters.

.
├── env_tags.yaml
├── eu-west-1
│   └── clusters
│       └── observer
│           ├── eks
│           │   ├── kubeconfig
│           │   └── terragrunt.hcl
│           ├── eks-addons
│           │   └── terragrunt.hcl
│           └── vpc
│               └── terragrunt.hcl
└── eu-west-3
    └── clusters
        └── observee
            ├── cluster_values.yaml
            ├── eks
            │   ├── kubeconfig
            │   └── terragrunt.hcl
            ├── eks-addons
            │   └── terragrunt.hcl
            └── vpc
                └── terragrunt.hcl

Deep Dive: Running Components

Listing pods in the monitoring namespace shows all Thanos components (sidecar, query, query‑frontend, store‑gateway, compactor) as well as Prometheus, Grafana, and Alertmanager. Ingress resources expose Grafana and Thanos sidecar endpoints via TLS.

kubectl -n monitoring get pods
NAME ...
... (list omitted for brevity)
kubectl -n monitoring get ingress
NAME ...
... (list omitted for brevity)

Port-forwarding the Thanos TLS querier pod gives direct access to its query endpoint, which aggregates metrics from the stores it is connected to.

kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
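With the port-forward running, the querier can be queried like a Prometheus server, since Thanos Query serves the Prometheus HTTP API (`/api/v1/query`). The sketch below is one way to do that from Python using only the standard library. It assumes the forwarded port answers plain HTTP on `localhost:10902`, and the helper names `instant_query_url`, `parse_vector`, and `run_query` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, promql):
    """Build the URL for a PromQL instant query against the HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(base_url, promql, timeout=10):
    """Send the query over HTTP and return the parsed samples."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

For example, `run_query("http://localhost:10902", "up")` would return one entry per scraped target visible to the querier, with the target's labels and a value of 1.0 or 0.0.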

Grafana Visualization

The default Kubernetes dashboard in Grafana works across multiple clusters once the Thanos query endpoint is added as a data source.
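Because Thanos Query speaks the Prometheus API, Grafana can treat it as an ordinary Prometheus data source. As a rough sketch, the JSON body for Grafana's data source API (`POST /api/datasources`) could be built as below. The data source name and the in-cluster URL are hypothetical, and the tEKS setup described here may provision the data source differently.

```python
import json


def thanos_datasource_payload(name, query_url, is_default=True):
    """Build the JSON body that registers Thanos Query as a Prometheus data source."""
    return json.dumps(
        {
            "name": name,
            "type": "prometheus",  # Thanos Query is PromQL-compatible
            "url": query_url,
            "access": "proxy",     # Grafana's backend issues the requests
            "isDefault": is_default,
        },
        indent=2,
    )


# Hypothetical service URL inside the observer cluster.
body = thanos_datasource_payload(
    "Thanos", "http://thanos-query-frontend.monitoring.svc:9090"
)
print(body)
```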

Grafana multi‑cluster dashboard

Conclusion

Thanos is a complex system with many moving parts, but the tEKS repository abstracts most of that complexity (especially mTLS) while still allowing extensive customization. Future work will add support for other cloud providers. For questions, reach out via the GitHub repositories.
