Building a Scalable Prometheus Monitoring Stack with Thanos on Kubernetes

This article explains how to design and deploy a robust monitoring solution using Prometheus, Thanos, Pushgateway, and Alertmanager on Kubernetes, covering metric collection, naming conventions, query language, high‑availability strategies, and practical YAML configurations for a production‑grade observability platform.

Open Source Linux

Monitoring is a fundamental part of infrastructure that ensures service stability in production; it helps discover, locate, and resolve issues, sometimes even enabling self‑healing.

Typical white‑box metrics include request count per unit time, success/failure rate, and average request latency. While white‑box monitoring shows internal state, black‑box probes complement it by checking external availability, such as DNS failures.

After evaluating the options, we chose Prometheus for its flexible PromQL query language, single‑binary deployment, easy integration with Go services, built‑in Web UI, and rich ecosystem (Alertmanager, Pushgateway, Exporters).

Prometheus architecture: it discovers targets via service discovery, scrapes metrics from HTTP endpoints, stores them in a local TSDB, and evaluates alerting rules with PromQL, sending alerts to Alertmanager.
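The pieces named above map onto sections of the Prometheus configuration file. The following is a minimal sketch, not the configuration used by the original deployment: the job name, the rules directory, and the Alertmanager address are assumptions made for illustration.

```yaml
# prometheus.yml (illustrative sketch; names and paths are assumptions)
global:
  scrape_interval: 30s       # how often targets are scraped
  evaluation_interval: 30s   # how often recording and alerting rules are evaluated

rule_files:
  - /etc/prometheus-shared/rules/*.yml   # hypothetical rules directory

alerting:
  alertmanagers:
    - path_prefix: /alertmanager         # needed when Alertmanager runs with --web.route-prefix=/alertmanager
      static_configs:
        - targets: ["alertmanager:9093"] # assumed Service name and port

scrape_configs:
  - job_name: kubernetes-endpoints       # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints                  # targets are discovered from the Kubernetes API
```

Each scrape writes samples into the local TSDB, and the rules loaded through rule_files are evaluated against that data at every evaluation_interval.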

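Metric names may contain ASCII letters, digits, underscores, and colons, and must not start with a digit; colons are reserved for recording rules. By convention, names use base units (seconds rather than milliseconds), carry a namespace prefix (e.g., process_cpu_seconds_total, where process is the prefix), and end with a descriptive suffix such as _total for counters.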

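Prometheus provides four core metric types:

Counter: a monotonically increasing value that only resets on restart (e.g., request count).

Gauge: a value that can go up or down (e.g., CPU usage).

Histogram: observations (e.g., latency or response size) counted into configurable buckets, exposed together with their sum and count.

Summary: observations tracked as quantiles calculated on the client side, exposed together with their sum and count.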

A time series is a sequence of (timestamp, value) samples. An instant vector contains one sample per series at a single timestamp, while a range vector contains all samples per series within a time window. Functions such as rate() and increase() operate on range vectors, and histogram_quantile() estimates quantiles from histogram buckets.
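To make these query forms concrete, here is a sketch of a Prometheus rule file that wraps each one in a recording rule. The metric names (http_requests_total, http_request_duration_seconds_bucket), the job label value, and the group name are hypothetical; only the PromQL functions themselves come from Prometheus.

```yaml
groups:
  - name: example-http-rules   # hypothetical group name
    rules:
      # Instant vector: one sample per matching series at evaluation time.
      - record: job:up:sum
        expr: sum by (job) (up{job="api"})

      # rate() takes a range vector (5 minutes of samples per series)
      # and returns the per-second average rate of increase.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # increase() returns the total increase over the window.
      - record: job:http_requests:increase1h
        expr: sum by (job) (increase(http_requests_total[1h]))

      # histogram_quantile() estimates a quantile from bucket rates;
      # the le label must be preserved in the aggregation.
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The record names follow the level:metric:operations pattern commonly used for recording rules, which is where colons in metric names are meant to appear.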

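Prometheus keeps the most recent samples in an in‑memory head block and periodically compacts them into blocks on disk. Avoid high‑cardinality labels such as user IDs, request IDs, or full URLs: every distinct combination of label values creates a new time series, so the series count grows multiplicatively with each such label. Lengthening the scrape interval and tuning the --storage.tsdb.min-block-duration flag can also help control memory usage.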
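One way to keep cardinality in check is to drop offending labels or metrics at scrape time. The fragment below is a sketch under assumed names: the job, the target address, the request_id label, and the http_request_debug_ metric prefix are all hypothetical.

```yaml
global:
  scrape_interval: 60s   # a longer interval means fewer samples held in memory per series

scrape_configs:
  - job_name: api                       # hypothetical job
    static_configs:
      - targets: ["api:8080"]           # hypothetical target
    metric_relabel_configs:
      # Remove a per-request label before the sample is ingested.
      - action: labeldrop
        regex: request_id
      # Drop entire metrics that are not worth storing.
      - action: drop
        source_labels: [__name__]
        regex: http_request_debug_.*
```

Use labeldrop with care: after the label is removed, the remaining labels must still identify each series uniquely, otherwise the scrape produces duplicate series.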

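For high availability, Prometheus supports federation, in which a higher‑level server scrapes selected series from lower‑level servers. The top‑level server remains a single point of failure, however, and the same data ends up stored at more than one level. Thanos instead provides a global query view, aggregating data from multiple Prometheus instances through sidecar components.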

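Thanos architecture: the Querier receives PromQL requests, fans them out to the Sidecars (and any other stores it knows about), merges the returned series, and evaluates the query over the combined data. Series from replicated Prometheus instances can be deduplicated so that the result does not contain overlapping data.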
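Deduplication relies on each Prometheus replica identifying itself through external labels. The sketch below shows the two sides of that arrangement; the label names and values (cluster, replica, prod-a, prometheus-0) are assumptions, and the source does not show how its own PROM_ID value is turned into a label.

```yaml
# prometheus.yml on one replica (label names and values are hypothetical)
global:
  external_labels:
    cluster: prod-a         # identical on all replicas of the same cluster
    replica: prometheus-0   # must differ on every replica
---
# Fragment of the Thanos Querier container spec
args:
  - query
  - --store=dnssrv+thanos-store-gateway.default.svc
  - --query.replica-label=replica   # series that differ only in this label are treated as duplicates
```

With this in place, a query through the Querier returns a single series per metric even though every replica scraped the same targets.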

Prometheus remote read/write enables storing data in external systems like M3DB, InfluxDB, or OpenTSDB for durable, scalable storage.
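Remote storage is wired in through the remote_write and remote_read sections of prometheus.yml. This is a sketch only: the adapter hostname, port, and paths are hypothetical and depend on the storage backend and its adapter.

```yaml
remote_write:
  - url: http://remote-storage-adapter:9201/write   # hypothetical adapter endpoint
    queue_config:
      max_samples_per_send: 1000   # batch size per request to the remote endpoint

remote_read:
  - url: http://remote-storage-adapter:9201/read    # hypothetical adapter endpoint
    read_recent: false   # serve recent data from the local TSDB instead of the remote store
```

Not every backend supports both directions, so check whether the chosen system offers remote read before relying on it for queries.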

Service discovery replaces static target lists, allowing dynamic target management via built‑in discovery mechanisms or file‑based discovery.
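The two discovery styles mentioned above can be sketched as follows. The job names, the prometheus.io/scrape annotation convention, and the targets directory are assumptions for illustration; the discovery mechanisms and the __meta_kubernetes_* labels are provided by Prometheus.

```yaml
scrape_configs:
  # Built-in discovery: ask the Kubernetes API for pods.
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy discovery metadata into regular labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # File-based discovery: targets are read from files that another tool maintains.
  - job_name: file-targets
    file_sd_configs:
      - files:
          - /etc/prometheus-shared/targets/*.json   # hypothetical path
        refresh_interval: 1m
```

Pod discovery needs the RBAC permissions on pods described later in the article, while file-based discovery only needs read access to the target files.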

Pushgateway handles short‑lived batch jobs by accepting metrics pushes, which Prometheus later scrapes; however, it can retain stale metrics and cause duplication if not managed carefully.
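On the Prometheus side, the Pushgateway is scraped like any other target. A minimal sketch, assuming the Pushgateway is reachable under the Service name pushgateway on port 9091:

```yaml
scrape_configs:
  - job_name: pushgateway
    # Keep the job and instance labels that the batch jobs pushed,
    # instead of overwriting them with the Pushgateway's own.
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]   # assumed Service name and port
```

Pushed metrics stay in the Pushgateway until they are deleted, so batch jobs that no longer run should have their metric groups removed through the Pushgateway's HTTP API (a DELETE request on the group's /metrics/job/... path).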

Alertmanager, a separate component, receives alerts from Prometheus, deduplicates, groups, silences, and forwards them to notification channels (e.g., WeChat, DingTalk).
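Grouping and routing are defined in the Alertmanager configuration file. The following sketch uses generic webhook receivers, which is one way to reach chat tools through a relay service; the receiver names, URLs, and label names are hypothetical.

```yaml
# config.yml for Alertmanager (illustrative sketch)
route:
  receiver: default-webhook
  group_by: [alertname, cluster]   # alerts sharing these labels are sent as one notification
  group_wait: 30s                  # wait before sending the first notification for a new group
  group_interval: 5m               # wait before notifying about new alerts in an existing group
  repeat_interval: 4h              # resend interval for alerts that are still firing
  routes:
    - matchers:
        - severity="critical"      # older Alertmanager versions use the match: field instead
      receiver: oncall-webhook

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://alert-relay.default.svc:8080/default   # hypothetical relay
  - name: oncall-webhook
    webhook_configs:
      - url: http://alert-relay.default.svc:8080/oncall    # hypothetical relay
```

Silences are not part of this file; they are created at runtime through the Alertmanager UI or API.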

Deploying Prometheus on Kubernetes involves a StatefulSet with three containers (Prometheus, Thanos Sidecar, watch sidecar) and appropriate RBAC permissions.

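apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: "prometheus"
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  # (the label app: prometheus is an assumption).
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.11.1
        args:
          - --config.file=/etc/prometheus-shared/prometheus.yml
          # Enables the POST /-/reload endpoint called by the watch container.
          - --web.enable-lifecycle
          - --storage.tsdb.path=/data/prometheus
          # Replaces the deprecated --storage.tsdb.retention flag.
          - --storage.tsdb.retention.time=2w
      - name: watch
        image: watch
        # Watches /etc/prometheus-shared and triggers a configuration reload via curl.
        args: ["-v", "-t", "-p=/etc/prometheus-shared", "curl", "-X", "POST", "http://localhost:9090/-/reload"]
      - name: thanos
        image: improbable/thanos:v0.6.0
        # A shell is needed to evaluate the PROM_ID assignment.
        command: ["/bin/sh", "-c"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        args:
          # PROM_ID is the pod ordinal; the trailing rev restores ordinals of 10 and above.
          - >-
            PROM_ID=`echo $POD_NAME | rev | cut -d '-' -f1 | rev`
            /bin/thanos sidecar
            --prometheus.url=http://localhost:9090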

RBAC configuration grants Prometheus read access to services, pods, nodes, and endpoints, and write access to configmaps.

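apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["services", "pods", "nodes", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "get", "update", "delete"]
---
# The role takes effect only once it is bound to the ServiceAccount.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default   # assumption: use the namespace of the ServiceAccount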

Thanos Querier deployment connects to store gateways via DNS discovery.

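apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  # (the label app: thanos-query is an assumption).
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: improbable/thanos:v0.6.0
        args:
          - query
          # dnssrv+ resolves the store addresses through DNS SRV records.
          - --store=dnssrv+thanos-store-gateway.default.svc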

Pushgateway and Alertmanager are also deployed as Kubernetes Deployments with Services for external access.

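apiVersion: apps/v1
kind: Deployment
metadata:
  name: pushgateway
spec:
  # Note: each replica keeps its own pushed metrics, so pushes for one job
  # can land on different pods when several replicas sit behind one Service.
  replicas: 15
  selector:
    matchLabels:
      app: pushgateway   # label is an assumption; apps/v1 requires a selector
  template:
    metadata:
      labels:
        app: pushgateway
    spec:
      containers:
      - name: pushgateway
        image: prom/pushgateway:v1.0.0
        ports:
        - containerPort: 9091
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager   # label is an assumption; apps/v1 requires a selector
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:latest   # pin a specific version in production
        args:
          - --web.route-prefix=/alertmanager
          - --config.file=/etc/alertmanager/config.yml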
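The Services that accompany these Deployments are not shown in the original. A minimal sketch for the Pushgateway follows, assuming the Pushgateway pods carry the label app: pushgateway; the Alertmanager Service would look the same with port 9093.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pushgateway
spec:
  selector:
    app: pushgateway     # assumed pod label
  ports:
    - name: http
      port: 9091         # port exposed by the Service
      targetPort: 9091   # containerPort of the Pushgateway container
```

The Service name is what other manifests refer to, for example in an Ingress backend or in a Prometheus scrape target.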

Ingress resources expose Pushgateway, Prometheus, Thanos Query, Alertmanager, and Grafana via HTTP paths.

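apiVersion: networking.k8s.io/v1   # extensions/v1beta1 was removed in Kubernetes 1.22
kind: Ingress
metadata:
  name: prometheus-ingress
spec:
  rules:
  - host: $(DOMAIN)
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: thanos-query
            port:
              # Assumes the Service exposes the Thanos HTTP port (default 10902);
              # 10901 is the default gRPC port and does not serve the UI.
              number: 10902
      - path: /alertmanager
        pathType: Prefix
        backend:
          service:
            name: alertmanager
            port:
              number: 9093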

After deployment, the Targets page of the Prometheus UI should list the monitored targets as healthy.

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
