Boost Kubernetes Monitoring: Migrate from Prometheus to Thanos for Scalable Low‑Cost Metrics

This article examines the limitations of a standard Prometheus‑based monitoring stack on Kubernetes, explains how adopting Thanos improves metric retention and reduces infrastructure costs, and provides a detailed multi‑cluster deployment guide with Terraform, TLS configuration, and Grafana visualization.

MaGe Linux Operations

Jan 22, 2022

Introduction

In this article we discuss the limitations of the Prometheus monitoring stack and why moving to a Thanos‑based stack can improve metric retention while lowering overall infrastructure costs.

The code referenced in this article is available in the following repositories:

https://github.com/particuleio/teks/tree/main/terragrunt/live/thanos

https://github.com/particuleio/terraform-kubernetes-addons/tree/main/modules/aws

Kubernetes Prometheus Stack

When deploying Kubernetes infrastructure for customers, a monitoring stack is installed on each cluster. The typical stack consists of:

Prometheus – collects metrics

Alertmanager – sends alerts based on metric queries

Grafana – visualizes dashboards

Simplified architecture:

Limitations

This architecture does not scale well as the number of clusters you want to scrape metrics from grows.

Multiple Grafana instances

Each cluster having its own Grafana and dashboards makes maintenance cumbersome.

Metric storage is expensive

Prometheus stores metrics on disk, forcing a trade‑off between storage size and retention period. Long‑term storage on cloud block storage can become costly, and replication or sharding can multiply storage requirements.
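
The trade-off between retention and disk size can be made concrete with the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below applies it; the ingestion rate of 100,000 samples per second is a made-up figure for illustration, not a number from this deployment.

```python
def prometheus_disk_bytes(retention_days: float,
                          samples_per_second: float,
                          bytes_per_sample: float = 2.0) -> float:
    """Rough estimate: retention_seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster ingesting 100,000 samples per second.
for days in (2, 15, 365):
    gib = prometheus_disk_bytes(days, 100_000) / 2**30
    print(f"{days:>3} days of retention: ~{gib:,.0f} GiB")
```

The estimate covers a single Prometheus instance; every HA replica or additional shard needs its own volume, which is where block storage costs start to add up.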

Solutions

Multiple Grafana data sources

Expose each Prometheus endpoint externally and add it as a data source in a single Grafana instance, securing the endpoints with TLS (and optionally basic auth). The drawback is that queries cannot be computed across different data sources.

Prometheus federation

Federation lets one Prometheus server scrape metrics from another. It works well when the total volume of scraped data is low; if scrape durations start to exceed the scrape interval, serious performance problems can follow.

Prometheus remote write

Remote write lets Prometheus push metrics to a remote endpoint; in Thanos, the receiving side is implemented by the Thanos receiver. This article does not cover the push-based approach.

Enter Thanos

Thanos is an open-source, highly available Prometheus setup with long-term storage capabilities, and a CNCF incubating project.

Its key feature is effectively “unlimited” storage, achieved by keeping metrics in object storage (such as S3), which most cloud providers offer.

How does it work?

Thanos runs alongside Prometheus rather than replacing it, so it is common to start with a Prometheus-only setup and add Thanos later.

Thanos is composed of several components that communicate via gRPC.

Thanos Sidecar

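The sidecar runs next to each Prometheus instance and uploads its metric blocks to object storage every two hours, which makes Prometheus almost stateless. Prometheus still holds roughly the most recent two hours of data locally, and that data can be lost in an outage; covering that risk is the job of the Prometheus setup itself (HA or sharding), not of Thanos.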
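
To get a feel for the two-hour cycle, the sketch below computes which two-hour window a timestamp falls into and how much data in that window has not yet been shipped. It assumes windows aligned to the Unix epoch and is a simplified model: the function names are illustrative, and since Prometheus may hold data somewhat longer before cutting a block, the result is best read as a lower bound.

```python
from datetime import datetime, timezone
from typing import Tuple

BLOCK_SECONDS = 2 * 60 * 60  # two-hour blocks, as described above


def block_window(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) two-hour window containing ts (epoch-aligned)."""
    epoch = int(ts.timestamp())
    start = epoch - (epoch % BLOCK_SECONDS)
    return (datetime.fromtimestamp(start, tz=timezone.utc),
            datetime.fromtimestamp(start + BLOCK_SECONDS, tz=timezone.utc))


def seconds_not_yet_uploaded(ts: datetime) -> int:
    """Seconds of data in the current window that exist only in Prometheus."""
    start, _ = block_window(ts)
    return int(ts.timestamp()) - int(start.timestamp())


now = datetime(2022, 1, 22, 13, 30, tzinfo=timezone.utc)
start, end = block_window(now)
print(f"window: {start:%H:%M} to {end:%H:%M} UTC")
print(f"at risk if Prometheus is lost now: {seconds_not_yet_uploaded(now)} s")
```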

Thanos Store

The store acts as a gateway that translates queries into requests against remote object storage, and it can cache data locally.

Thanos Compactor

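The compactor runs as a singleton. It compacts and downsamples the metrics stored in object storage: downsampling lowers the resolution of older data, and together with retention settings this helps keep storage size and cost under control.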
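
Downsampling can be pictured as replacing the raw samples in each time window with a handful of aggregates. The sketch below does this for 5-minute windows, one of the resolutions Thanos produces (the other being 1 hour). It is only an illustration of the idea: the function name and the in-memory data layout are assumptions and bear no relation to the compactor's actual block format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

FIVE_MINUTES_MS = 5 * 60 * 1000


def downsample(samples: Iterable[Tuple[int, float]],
               window_ms: int = FIVE_MINUTES_MS) -> Dict[int, Dict[str, float]]:
    """Collapse (timestamp_ms, value) samples into one aggregate per window."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_ms].append(value)  # key = window start
    return {
        start: {"count": len(vals), "sum": sum(vals),
                "min": min(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }


# Hypothetical gauge scraped every 30 s for 10 minutes: 20 raw samples.
raw = [(i * 30_000, float(i)) for i in range(20)]
for start, agg in downsample(raw).items():
    print(start, agg)
```

Twenty raw samples become two aggregated points, and an average over the window can still be derived as `sum / count`.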

Thanos Query

The query component receives PromQL queries, exposing a Prometheus‑compatible endpoint and dispatching queries to all configured stores (Sidecar, Store, Query, etc.). It also performs deduplication of identical metrics from multiple sources.
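
Deduplication relies on a replica label (set on Thanos Query with `--query.replica-label`): series that differ only in that label are treated as the same series. The sketch below shows the idea with a plain union of timestamps, so a gap in one replica is filled from the other. This is a deliberately simplified stand-in, not the algorithm Thanos uses, and the label names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]
Sample = Tuple[int, float]


def deduplicate(series: Dict[Labels, List[Sample]],
                replica_label: str = "replica") -> Dict[Labels, List[Sample]]:
    """Merge series that are identical once the replica label is dropped."""
    merged: Dict[Labels, Dict[int, float]] = {}
    for labels, samples in series.items():
        key = tuple(kv for kv in labels if kv[0] != replica_label)
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            by_ts.setdefault(ts, value)  # first replica seen wins per timestamp
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


# Two HA replicas scraping the same target; replica "a" missed the scrape at t=60.
series = {
    (("__name__", "up"), ("cluster", "observee"), ("replica", "a")):
        [(0, 1.0), (30, 1.0), (90, 1.0)],
    (("__name__", "up"), ("cluster", "observee"), ("replica", "b")):
        [(0, 1.0), (30, 1.0), (60, 1.0), (90, 1.0)],
}
for labels, samples in deduplicate(series).items():
    print(dict(labels), samples)
```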

Thanos Query Frontend

The frontend splits large queries into smaller ones and caches results.
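
Splitting can be sketched as cutting a query's time range at fixed interval boundaries, so that the same sub-range is requested again by later queries and can be served from cache. The example below assumes a 24-hour split interval and half-open ranges in seconds; the function name is illustrative and the real frontend handles more cases (query step alignment, for instance).

```python
from typing import List, Tuple

DAY_SECONDS = 24 * 60 * 60


def split_range(start_s: int, end_s: int,
                interval_s: int = DAY_SECONDS) -> List[Tuple[int, int]]:
    """Split [start_s, end_s) into sub-ranges cut at multiples of interval_s."""
    if interval_s <= 0 or end_s < start_s:
        raise ValueError("need interval_s > 0 and start_s <= end_s")
    parts = []
    cursor = start_s
    while cursor < end_s:
        next_boundary = (cursor // interval_s + 1) * interval_s
        parts.append((cursor, min(next_boundary, end_s)))
        cursor = next_boundary
    return parts


# A 48-hour query starting at 06:00 on day 0 becomes three sub-queries.
for part in split_range(6 * 3600, 54 * 3600):
    print(part)
```

Only the first and last sub-ranges depend on the exact query window; the full-day range in the middle is identical for every query that spans it, which is what makes caching effective.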

Multi‑Cluster Architecture

Multiple deployment patterns exist for these components across clusters; the example uses two EKS clusters on AWS: an observer cluster and an observed cluster.

The observer cluster runs the full stack (Prometheus, Grafana, Thanos components) and queries the observed cluster, which runs a minimal Prometheus/Thanos installation.

├── env_tags.yaml
├── eu-west-1
│   ├── clusters
│   │   └── observer
│   │       ├── eks
│   │       │   ├── kubeconfig
│   │       │   └── terragrunt.hcl
│   │       ├── eks-addons
│   │       │   └── terragrunt.hcl
│   │       └── vpc
│   │           └── terragrunt.hcl
│   └── region_values.yaml
└── eu-west-3
    ├── clusters
    │   └── observee
    │       ├── cluster_values.yaml
    │       ├── eks
    │       │   ├── kubeconfig
    │       │   └── terragrunt.hcl
    │       ├── eks-addons
    │       │   └── terragrunt.hcl
    │       └── vpc
    │           └── terragrunt.hcl
    └── region_values.yaml

The observer cluster generates a CA certificate trusted by the sidecars in the observed clusters and TLS certificates for Thanos querier components.

Deployment of Thanos components includes sidecars, store gateways, query frontends, and TLS‑enabled queriers.

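# One TLS-enabled querier per observed cluster.
thanos-tls-querier = {
  "observee" = {
    enabled = true
    default_global_requests = true
    default_global_limits = false
    # gRPC endpoint of the observed cluster's sidecar, exposed through its ingress on port 443.
    stores = ["thanos-sidecar.${local.default_domain_suffix}:443"]
  }
}

# One store gateway per observed cluster.
thanos-storegateway = {
  "observee" = {
    enabled = true
    default_global_requests = true
    default_global_limits = false
    # Bucket (and its region) that the store gateway reads metric blocks from.
    bucket = "thanos-store-pio-thanos-observee"
    region = "eu-west-3"
  }
}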

The observed cluster runs a minimal Prometheus/Thanos stack, with sidecars uploading to the observer’s bucket and TLS authentication.

kube-prometheus-stack = {
  enabled = true
  allowed_cidrs = dependency.vpc.outputs.private_subnets_cidr_blocks
  thanos_sidecar_enabled = true
  thanos_bucket_force_destroy = true
  extra_values = <<-EXTRA_VALUES
    grafana:
      enabled: false
    prometheus:
      thanosIngress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: "letsencrypt"
          nginx.ingress.kubernetes.io/ssl-redirect: "true"
          nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
          nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
          nginx.ingress.kubernetes.io/auth-tls-secret: "monitoring/thanos-ca"
        hosts:
          - thanos-sidecar.${local.default_domain_suffix}
        paths:
          - /
        tls:
          - secretName: thanos-sidecar.${local.default_domain_suffix}
            hosts:
              - thanos-sidecar.${local.default_domain_suffix}
      prometheusSpec:
        replicas: 1
        retention: 2d
        retentionSize: "6GB"
        ruleSelectorNilUsesHelmValues: false
        serviceMonitorSelectorNilUsesHelmValues: false
        podMonitorSelectorNilUsesHelmValues: false
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ebs-sc
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
  EXTRA_VALUES
}

Further inspection of pods and ingresses confirms that the TLS querier can query metrics from the observed clusters.

kubectl -n monitoring get pods
NAME ...

Port‑forwarding commands demonstrate access to Thanos query components.

kubectl -n monitoring port-forward thanos-tls-querier-observee-query-687dd88ff5-nzpdh 10902
kubectl -n monitoring port-forward thanos-query-7c74db546c-d7bp8 10902
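
With a port-forward in place, the querier answers on the Prometheus-compatible HTTP API at `localhost:10902`. The sketch below only builds the URL for an instant query, so it runs without a cluster; the PromQL expression and the `cluster` label are hypothetical, and `dedup` is the query parameter Thanos Query accepts to toggle deduplication.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build a /api/v1/query URL for a Thanos Query (or Prometheus) endpoint."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Assumes one of the port-forwards above is running on local port 10902.
url = instant_query_url("http://localhost:10902", "sum(up) by (cluster)")
print(url)
```

The resulting URL can be fetched with `curl` or any HTTP client to check that metrics from the observed cluster are returned.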

Grafana visualizes the default Kubernetes dashboards, which are compatible with the multi‑cluster setup.

Conclusion

Thanos is a complex system with many moving parts; this article provides a high‑level overview and a practical deployment example without delving into every custom configuration detail.
