Boost Kubernetes Monitoring: Migrate from Prometheus to Thanos for Scalable Low‑Cost Metrics
This article examines the limitations of a standard Prometheus‑based monitoring stack on Kubernetes, explains how adopting Thanos improves metric retention and reduces infrastructure costs, and provides a detailed multi‑cluster deployment guide with Terraform, TLS configuration, and Grafana visualization.
Introduction
In this article we discuss the limitations of the Prometheus monitoring stack and why moving to a Thanos‑based stack can improve metric retention while lowering overall infrastructure costs.
The code used in this article is available in the following repositories:
https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos
https://github.com/particuleio/terraform-kubernetes-addons/tree/main/modules/aws
Kubernetes Prometheus Stack
When deploying Kubernetes infrastructure for customers, a monitoring stack is installed on each cluster. The typical stack consists of:
Prometheus – collects metrics
Alertmanager – sends alerts based on metric queries
Grafana – visualizes dashboards
Simplified architecture:
Notes
The architecture does not scale well when the number of clusters from which you want to scrape metrics increases.
Multiple Grafana instances
Each cluster having its own Grafana and dashboards makes maintenance cumbersome.
Metric storage is expensive
Prometheus stores metrics on disk, forcing a trade‑off between storage size and retention period. Long‑term storage on cloud block storage can become costly, and replication or sharding can multiply storage requirements.
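The storage/retention trade-off can be made concrete with the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time multiplied by the ingestion rate and by the bytes used per sample (typically 1 to 2 bytes). The sketch below is only an estimate under stated assumptions; the ingestion rate of 50,000 samples per second is hypothetical and the function names are illustrative.

```python
# Rough Prometheus disk estimate, following the capacity-planning rule of thumb:
#   disk = retention_seconds * ingested_samples_per_second * bytes_per_sample

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Return the approximate TSDB size in bytes for a given retention."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def to_gib(size_bytes: float) -> float:
    """Convert bytes to GiB."""
    return size_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical cluster ingesting 50,000 samples per second (assumption).
    for days in (2, 15, 90, 365):
        size = to_gib(estimate_disk_bytes(days, 50_000))
        print(f"{days:>3} days of retention: ~{size:,.0f} GiB per replica")
```

Every HA replica or shard needs its own volume, so the per-replica figure is multiplied again when Prometheus is replicated.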
Solutions
Multiple Grafana data sources
Expose the Prometheus endpoints externally and add them as data sources to a single Grafana, securing the endpoints with TLS (and optionally basic auth). The drawback is that a query cannot compute across different data sources, so metrics from two clusters cannot be combined in one expression.
Prometheus federation
Federation allows one Prometheus server to scrape selected metrics from another. It is suitable when the total volume of scraped data is low; if a scrape takes longer than the scrape interval, serious performance issues may arise.
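In a federation setup, the federating server pulls series from the `/federate` endpoint of the source server and selects them with one or more `match[]` parameters. As a minimal sketch of what such a request looks like, the helper below builds that URL; the host name and the matcher are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: Iterable[str]) -> str:
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical source Prometheus and selector (assumptions for illustration).
    print(federate_url("http://prometheus.cluster-a.example:9090",
                       ['{job="kubelet"}']))
```

The narrower the selectors, the less data each federation scrape has to transfer, which is what keeps the scrape duration below the scrape interval.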
Prometheus remote write
Remote write lets Prometheus push its metrics to a remote endpoint; on the Thanos side this is implemented by the Thanos receiver. The push-based approach is not covered in this article.
Enter Thanos
Thanos is an open-source project that provides a highly available Prometheus setup with long-term storage capabilities. It is a CNCF incubating project.
A key feature is “unlimited” storage by using object storage (e.g., S3) provided by most cloud providers.
How does it work?
Thanos runs alongside Prometheus rather than replacing it; it is common to start with a Prometheus-only setup and upgrade to a Thanos-based one later.
Thanos is composed of several components that communicate via gRPC.
Thanos Sidecar
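The sidecar runs alongside Prometheus and uploads its metrics to an object store every two hours, which makes Prometheus almost stateless. Prometheus still keeps the most recent two hours of metrics in memory, so an outage can lose up to two hours of data; that risk should be handled by the Prometheus setup itself (HA or sharding), not by Thanos.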
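To illustrate the two-hour window, the sketch below computes which two-hour block a timestamp falls into and how much recent data exists only on the Prometheus side at that moment. It is a deliberately simplified model that assumes blocks are aligned to two-hour boundaries in UTC and are uploaded as soon as they close; real cut and upload timing adds some delay. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

BLOCK_RANGE = timedelta(hours=2)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def block_window(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) two-hour window that contains ts."""
    index = (ts - EPOCH) // BLOCK_RANGE  # number of whole blocks since the epoch
    start = EPOCH + index * BLOCK_RANGE
    return start, start + BLOCK_RANGE


def local_only_span(now: datetime) -> timedelta:
    """Span of data not yet in object storage in this simplified model."""
    start, _ = block_window(now)
    return now - start


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    start, end = block_window(now)
    print(f"current block: {start:%H:%M} to {end:%H:%M} UTC")
    print(f"data held only by Prometheus: {local_only_span(now)}")
```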
Thanos Store
The store acts as a gateway that translates queries to remote object storage and can cache data locally.
Thanos Compactor
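The compactor is a singleton (it does not scale horizontally) that compacts and down-samples the metric blocks stored in object storage. Down-sampling trades granularity on older data for faster long-range queries, while compaction and the compactor's retention settings help keep storage size and cost under control.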
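Down-sampling means replacing raw samples with per-window aggregates. The sketch below shows the idea for a 5-minute resolution, keeping count, sum, minimum, and maximum per window. It is not Thanos code, only an illustration of the technique; the `Aggregate` type and `downsample` function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Aggregate:
    window_start: int  # seconds since the epoch, aligned to the resolution
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples: Iterable[Tuple[int, float]],
               resolution_s: int = 300) -> List[Aggregate]:
    """Group (timestamp, value) samples into fixed windows and aggregate them."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % resolution_s].append(value)
    return [
        Aggregate(start, len(values), sum(values), min(values), max(values))
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # One sample every 30 seconds for 10 minutes (synthetic data).
    raw = [(t, float(t % 7)) for t in range(0, 600, 30)]
    for aggregate in downsample(raw):
        print(aggregate)
```

With these aggregates an average over a long range can still be derived as `total / count`, while the minimum and maximum preserve spikes that plain averaging would hide.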
Thanos Query
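The query component is the central point that receives PromQL queries. It exposes a Prometheus-compatible endpoint and dispatches each query to all of its configured stores. A store can be any Thanos component that serves metrics: a Thanos Store, a Thanos Sidecar, or another Thanos Query (queriers can be stacked). It also deduplicates identical metrics that come from multiple sources, such as replicated Prometheus instances.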
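Deduplication relies on a replica label: series that differ only by that label are treated as the same series (Thanos Query takes the label name through its `--query.replica-label` flag). The sketch below shows the idea with a naive merge; the real Thanos algorithm is more elaborate, and the `prometheus_replica` label name and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]          # (timestamp, value)
Series = Tuple[Labels, List[Sample]]


def deduplicate(series_list: List[Series],
                replica_label: str = "prometheus_replica") -> List[Series]:
    """Merge series that are identical once the replica label is dropped."""
    merged: Dict[tuple, Dict[int, float]] = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_timestamp = merged.setdefault(key, {})
        for ts, value in samples:
            # Keep the first value seen for a timestamp; later replicas fill gaps.
            by_timestamp.setdefault(ts, value)
    return [(dict(key), sorted(merged[key].items())) for key in sorted(merged)]


if __name__ == "__main__":
    replica_a = ({"__name__": "up", "cluster": "observee",
                  "prometheus_replica": "prometheus-0"}, [(0, 1.0), (30, 1.0)])
    replica_b = ({"__name__": "up", "cluster": "observee",
                  "prometheus_replica": "prometheus-1"}, [(30, 1.0), (60, 1.0)])
    print(deduplicate([replica_a, replica_b]))
```

In this example each replica missed one scrape, and the merged series contains all three timestamps.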
Thanos Query Frontend
The frontend splits large queries into smaller ones and caches results.
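The splitting step can be pictured as cutting a long time range at fixed, aligned boundaries so that each piece can be executed and cached independently. The sketch below assumes a 24-hour split interval and half-open ranges; it illustrates the technique and is not the query frontend's actual code.

```python
from typing import List, Tuple


def split_range(start_s: int, end_s: int,
                interval_s: int = 86_400) -> List[Tuple[int, int]]:
    """Split [start_s, end_s) into sub-ranges cut at multiples of interval_s."""
    if end_s < start_s:
        raise ValueError("end_s must not be before start_s")
    parts: List[Tuple[int, int]] = []
    cursor = start_s
    while cursor < end_s:
        # Next aligned boundary strictly after the cursor.
        boundary = cursor - cursor % interval_s + interval_s
        parts.append((cursor, min(boundary, end_s)))
        cursor = boundary
    return parts


if __name__ == "__main__":
    hour = 3600
    # A 48-hour query starting at 10:00 on day 0 becomes three sub-queries.
    for sub_start, sub_end in split_range(10 * hour, 58 * hour):
        print(sub_start // hour, "h ->", sub_end // hour, "h")
```

Because the inner boundaries are aligned, a later query over an overlapping range produces the same full-day sub-range, which is what makes its cached result reusable.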
Multi‑Cluster Architecture
Multiple deployment patterns exist for these components across clusters; the example uses two EKS clusters on AWS: an observer cluster and an observed cluster.
The observer cluster runs the full stack (Prometheus, Grafana, Thanos components) and queries the observed cluster, which runs a minimal Prometheus/Thanos installation.
├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    ├── clusters
    │   └── observee
    │       ├── cluster_values.yaml
    │       ├── eks
    │       │   ├── kubeconfig
    │       │   └── terragrunt.hcl
    │       ├── eks-addons
    │       │   └── terragrunt.hcl
    │       └── vpc
    │           └── terragrunt.hcl
    └── region_values.yaml

The observer cluster generates a CA certificate trusted by the sidecars in the observed clusters and TLS certificates for Thanos querier components.
Deployment of Thanos components includes sidecars, store gateways, query frontends, and TLS‑enabled queriers.
thanos-tls-querier = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    # gRPC endpoint of the observed cluster's sidecar, exposed through its ingress
    stores                  = ["thanos-sidecar.${local.default_domain_suffix}:443"]
  }
}

thanos-storegateway = {
  "observee" = {
    enabled                 = true
    default_global_requests = true
    default_global_limits   = false
    # Bucket that holds the observed cluster's metrics
    bucket                  = "thanos-store-pio-thanos-observee"
    region                  = "eu-west-3"
  }
}

The observed cluster runs a minimal Prometheus/Thanos stack: Grafana is disabled, and the Thanos sidecar uploads metrics to the cluster's dedicated bucket and exposes its gRPC endpoint through an ingress that requires TLS client authentication.
kube-prometheus-stack = {
  enabled                     = true
  allowed_cidrs               = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled      = true
  thanos_bucket_force_destroy = true
  extra_values                = <<-EXTRA_VALUES
    grafana:
      enabled: false
    prometheus:
      thanosIngress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: "letsencrypt"
          nginx.ingress.kubernetes.io/ssl-redirect: "true"
          nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
          nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
          nginx.ingress.kubernetes.io/auth-tls-secret: "monitoring/thanos-ca"
        hosts:
        - thanos-sidecar.${local.default_domain_suffix}
        paths:
        - /
        tls:
        - secretName: thanos-sidecar.${local.default_domain_suffix}
          hosts:
          - thanos-sidecar.${local.default_domain_suffix}
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "6GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
    EXTRA_VALUES
}

Further inspection of pods and ingresses confirms that the TLS querier can query metrics from the observed clusters.
kubectl -n monitoring get pods
NAME ...

Port‑forwarding commands demonstrate access to Thanos query components.

kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
kubectl -n monitoring port-forward thanos-query-7c74db546c-d7bp8 10902

Grafana visualizes the default Kubernetes dashboards, which are compatible with the multi‑cluster setup.
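With one of the port-forwards above active, the querier can also be checked through its Prometheus-compatible HTTP API instead of the web UI. The sketch below builds an instant query against `/api/v1/query` and counts the returned series per cluster. It assumes the querier is reachable on `http://localhost:10902` and that series carry a `cluster` label; both are assumptions about the setup, and the helper names are illustrative.

```python
import json
from collections import Counter
from typing import Dict
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build an instant-query URL; dedup toggles Thanos replica deduplication."""
    params = urlencode({"query": promql, "dedup": "true" if dedup else "false"})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def series_per_label(payload: dict, label: str = "cluster") -> Dict[str, int]:
    """Count result series per value of the given label."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    return dict(Counter(item["metric"].get(label, "<none>") for item in results))


def run_query(base_url: str, promql: str) -> dict:
    """Send the query and decode the JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires an active port-forward to a querier on local port 10902.
    print(series_per_label(run_query("http://localhost:10902", "up")))
```

If metrics from the observed cluster are reachable, its cluster name should appear among the keys of the returned mapping.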
Conclusion
Thanos is a complex system with many moving parts; this article provides a high‑level overview and a practical deployment example without delving into every custom configuration detail.
MaGe Linux Operations
