How to Build a Scalable Prometheus Monitoring Stack on Kubernetes with Thanos

This article explains why monitoring is essential for production stability, introduces Prometheus fundamentals (metric naming conventions, metric types, and PromQL queries), compares high‑availability options such as federation and Thanos, then walks through a complete Kubernetes deployment including StatefulSets, RBAC, Pushgateway, Alertmanager, and Ingress configuration.

dbaplus Community

Why Monitoring Matters

Monitoring is a core part of infrastructure that enables developers and operators to detect, locate, and resolve service anomalies quickly, improving overall system reliability.

Prometheus Overview

Prometheus was chosen for its flexible PromQL query language, single‑binary deployment, Go‑based integration, built‑in Web UI, and rich ecosystem (Alertmanager, Pushgateway, Exporters).

Metric Naming and Types

Metric names may contain only ASCII letters, digits, underscores, and colons, and may not start with a digit.

Best practice: prefix with the application namespace (e.g., process_cpu_seconds_total, http_request_duration_seconds).

Use units as suffixes (e.g., *_seconds, *_bytes).
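The naming rules above can be checked mechanically. The following is a minimal sketch, not part of Prometheus itself: the regular expression is the classic metric-name pattern from the Prometheus data model, while has_unit_suffix and its default unit list are hypothetical helpers invented here for illustration.

```python
import re

# Classic Prometheus metric-name pattern: the first character may not be a digit.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if `name` is a syntactically valid metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def has_unit_suffix(name: str, units=("seconds", "bytes")) -> bool:
    """Hypothetical lint check: does the name end in one of the given units?"""
    parts = name.split("_")
    # Counters conventionally end in _total, so look at the part before it.
    if len(parts) > 1 and parts[-1] == "total":
        parts = parts[:-1]
    return parts[-1] in units


if __name__ == "__main__":
    for candidate in ("http_request_duration_seconds", "5xx_errors", "http-requests"):
        print(candidate, is_valid_metric_name(candidate))
```

A check like this could run in CI against the metric names an application exposes, so that naming problems are caught before dashboards and alerts depend on them.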

Prometheus supports four basic metric types:

Counter : monotonically increasing values (e.g., request count).

Gauge : values that can go up or down (e.g., CPU usage).

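Histogram and Summary : sampled observations for latency or size distributions. A Histogram counts observations into configurable buckets, while a Summary computes quantiles on the client side.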
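To make the difference between the types concrete, here is a toy sketch in plain Python. These classes are illustrative stand-ins written for this article, not the official client library, and they ignore labels and thread safety. The point to notice is that histogram buckets are cumulative: an observation is counted in every bucket whose upper bound is at least as large as the value.

```python
import math


class Counter:
    """Monotonically increasing value, e.g. a request count."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. CPU usage."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Cumulative buckets plus a running sum and count of observations."""

    def __init__(self, upper_bounds) -> None:
        # The +Inf bucket always exists and ends up equal to the total count.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1


if __name__ == "__main__":
    latency = Histogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.3, 2.0):
        latency.observe(seconds)
    print(list(zip(latency.bounds, latency.bucket_counts)))
```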

Time‑Series Fundamentals

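Each sample is a (timestamp, value) pair, and a series is identified by its metric name plus a set of labels. Labels make the data multi‑dimensional: a query at a single instant returns one sample per matching series (a vector), while a query over a time range returns a list of samples per series (a matrix).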
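The data model can be sketched as a small in-memory store. This is a deliberate simplification for illustration: TinySeriesStore and its method names are made up here, and real Prometheus adds details such as a lookback limit for instant queries and an on-disk block format.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]               # (unix timestamp in seconds, value)
SeriesKey = Tuple[Tuple[str, str], ...]  # sorted (label name, label value) pairs


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """A series is identified by its metric name plus its full label set."""
    return tuple(sorted({"__name__": name, **labels}.items()))


class TinySeriesStore:
    """Toy store; samples must be appended in timestamp order."""

    def __init__(self) -> None:
        self.series: Dict[SeriesKey, List[Sample]] = {}

    def append(self, name: str, labels: Dict[str, str], ts: int, value: float) -> None:
        self.series.setdefault(series_key(name, labels), []).append((ts, value))

    def instant(self, at: int) -> Dict[SeriesKey, Sample]:
        """One sample per series: the latest at or before `at` (a vector)."""
        out = {}
        for key, samples in self.series.items():
            seen = [s for s in samples if s[0] <= at]
            if seen:
                out[key] = seen[-1]
        return out

    def window(self, start: int, end: int) -> Dict[SeriesKey, List[Sample]]:
        """All samples per series with start < timestamp <= end (a matrix)."""
        out = {}
        for key, samples in self.series.items():
            inside = [s for s in samples if start < s[0] <= end]
            if inside:
                out[key] = inside
        return out
```

Each distinct label combination produces its own series, which is why high-cardinality labels (user IDs, request IDs) multiply storage and query cost.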

PromQL Query Examples

Instant vector query:

http_requests{host="host1",service="web",code="200",env="test"}

Range vector query (last 5 minutes):

http_requests{host="host1",service="web",code="200",env="test"}[5m]

Calculate request rate:

rate(http_requests{host="host1",service="web",code="200",env="test"}[5m])

Calculate increase over 5 minutes:

increase(http_requests{host="host1",service="web",code="200",env="test"}[5m])
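To see what the rate and increase queries above compute, here is a simplified sketch over the samples of one range vector. It handles counter resets but, as an explicit simplification, it does not extrapolate to the window boundaries the way Prometheus does, so its results will differ slightly from the real functions. The names simple_increase and simple_rate are made up for this example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples, tolerating resets.

    Returns 0.0 for fewer than two samples; Prometheus itself returns
    no result at all in that case.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero, so the whole
        # new value counts as increase.
        total += value if value < prev else value - prev
        prev = value
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # The process restarts between t=60 and t=120, resetting the counter.
    window = [(0, 100.0), (60, 160.0), (120, 10.0), (300, 40.0)]
    print(simple_increase(window), simple_rate(window))
```

The reset handling is the reason rate and increase should only be applied to counters: on a gauge, every downward movement would be misread as a restart.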

90th percentile over 10 minutes:

histogram_quantile(0.9, rate(employee_age_bucket_bucket[10m]))
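The percentile query works on cumulative bucket counts and estimates the quantile by linear interpolation inside the bucket where it falls. The sketch below follows that documented idea for classic histograms, under a few stated assumptions: bucket bounds are positive, the lowest bucket starts at 0, the +Inf bucket is present, and q is restricted to the range (0, 1]. It is a teaching aid, not a replacement for the real function.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count or rate)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the +Inf bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return math.nan  # not reached: the +Inf bucket always satisfies the test


if __name__ == "__main__":
    observed = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
    print(histogram_quantile(0.5, observed))
```

Because the answer is interpolated, its accuracy depends on how well the bucket boundaries match the real distribution; buckets should be dense around the latencies that matter for alerting.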

High‑Availability with Thanos

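Federation alone cannot eliminate single‑point failures. Thanos provides a global query view: a Sidecar runs next to each Prometheus instance and exposes its data, and a Store Gateway serves historical data from object storage, so metrics from multiple instances and clusters can be queried together.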

Querier receives requests, forwards them to Sidecars, merges results, and executes PromQL.

Ruler runs rules on the global view and sends alerts.
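The merge step in the Querier can be pictured as follows. When several Prometheus replicas scrape the same targets, their series differ only in a replica label; dropping that label lets the results collapse into one series. This sketch is a strong simplification: the label name replica is an assumption, timestamps from real replicas rarely line up exactly, and Thanos uses a more elaborate deduplication algorithm than "first value wins".

```python
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]
Sample = Tuple[int, float]
Response = Dict[Labels, List[Sample]]


def merge_replicas(responses: List[Response], replica_label: str = "replica") -> Response:
    """Merge per-replica responses into one deduplicated view."""
    merged: Dict[Labels, Dict[int, float]] = {}
    for response in responses:
        for labels, samples in response.items():
            # Series that differ only in the replica label share a key.
            key = tuple(sorted((k, v) for k, v in labels if k != replica_label))
            points = merged.setdefault(key, {})
            for ts, value in samples:
                # Keep the first value seen for a timestamp; later
                # replicas only fill gaps.
                points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    replica_a = {(("__name__", "up"), ("replica", "a")): [(10, 1.0), (20, 1.0)]}
    # Replica b was down at t=10 but has the sample at t=30 that a missed.
    replica_b = {(("__name__", "up"), ("replica", "b")): [(20, 1.0), (30, 1.0)]}
    print(merge_replicas([replica_a, replica_b]))
```

The gap-filling behaviour is what gives the replicated setup its value: a query still returns a continuous series when any single Prometheus replica was restarting.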

Storage Options

Prometheus stores data locally, which limits elasticity. Remote Read/Write enables writing to external TSDBs such as M3DB, InfluxDB, or OpenTSDB for high‑availability storage.

Data Collection Modes

Prometheus primarily uses pull (scrape) mode. Static configs work for a few targets, but service discovery (Consul, file‑based, etc.) scales to dynamic environments. Pushgateway handles short‑lived batch jobs by allowing them to push metrics before Prometheus scrapes.

Pushgateway drawbacks include stale data retention and lack of health probing for pushed metrics.
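A batch job pushes to Pushgateway with a plain HTTP request: the URL path carries the job name and grouping labels, and the body is the text exposition format. The sketch below only builds the request using the standard library, so it can be inspected without a running gateway. The host name, job name, and metric are hypothetical, and the helper assumes that job and label values contain no slashes.

```python
import urllib.request
from typing import Dict, Tuple
from urllib.parse import quote


def build_push_request(
    base_url: str,
    job: str,
    grouping: Dict[str, str],
    metrics: Dict[str, Tuple[str, float]],
    replace_group: bool = False,
) -> urllib.request.Request:
    """Build a push request for one grouping key.

    `metrics` maps a metric name to (type, value), e.g. ("gauge", 42.5).
    POST replaces only metrics with the same names in the group;
    PUT replaces every metric in the group.
    """
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in grouping.items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    lines = []
    for name, (metric_type, value) in metrics.items():
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    method = "PUT" if replace_group else "POST"
    return urllib.request.Request(base_url.rstrip("/") + path, data=body, method=method)


if __name__ == "__main__":
    request = build_push_request(
        "http://pushgateway.example:9091",  # hypothetical address
        "nightly_backup",
        {"instance": "db1"},
        {"backup_duration_seconds": ("gauge", 42.5)},
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=5)
```

Because pushed metrics stay in the gateway until they are removed, a job that should not leave stale data behind can send a DELETE request to the same grouping URL once its metrics are no longer meaningful.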

Kubernetes Deployment

A full deployment consists of:

Prometheus StatefulSet with three replicas, sidecar container (Thanos), and a watch container that reloads configuration on ConfigMap changes.

RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding) granting access to services, pods, nodes, and configmaps.

Thanos components : Query, Store‑gateway, and Ruler Deployments with appropriate arguments (e.g., --store=dnssrv+thanos-store-gateway.default.svc).

Pushgateway Deployment (15 replicas) exposing port 9091.

Alertmanager Deployment (3 replicas) with clustering configuration.

Services for each component (Prometheus, Thanos Query, Thanos Store‑gateway, Pushgateway, Alertmanager) using LoadBalancer or ClusterIP as needed.

Ingress rules routing / to Thanos Query, /alertmanager to Alertmanager, /rule to Thanos Ruler, and /metrics to Pushgateway.

After applying the manifests, the Prometheus UI shows healthy scrape targets, confirming a successful monitoring stack.
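As an illustration of the first item in the list above, the following sketch generates a Prometheus StatefulSet with a Thanos sidecar as JSON, which kubectl accepts just like YAML. It is not the article's original manifest: the resource names, the headless Service, the ServiceAccount, the ConfigMap, the storage size, and the image tags (left as REPLACE_ME) are all assumptions, and the config-reloading watch container is omitted for brevity.

```python
import json


def prometheus_statefulset(replicas: int = 3, namespace: str = "default") -> dict:
    """Build a StatefulSet manifest: Prometheus plus a Thanos sidecar per pod."""
    labels = {"app": "prometheus"}
    prometheus = {
        "name": "prometheus",
        "image": "prom/prometheus:REPLACE_ME",  # pin a real version here
        "args": [
            "--config.file=/etc/prometheus/prometheus.yml",
            "--storage.tsdb.path=/prometheus",
            # Enables the HTTP reload endpoint a watch container can call.
            "--web.enable-lifecycle",
        ],
        "ports": [{"name": "http", "containerPort": 9090}],
        "volumeMounts": [
            {"name": "config", "mountPath": "/etc/prometheus"},
            {"name": "data", "mountPath": "/prometheus"},
        ],
    }
    sidecar = {
        "name": "thanos-sidecar",
        "image": "quay.io/thanos/thanos:REPLACE_ME",  # pin a real version here
        "args": [
            "sidecar",
            "--tsdb.path=/prometheus",
            "--prometheus.url=http://localhost:9090",
            "--grpc-address=0.0.0.0:10901",
            "--http-address=0.0.0.0:10902",
        ],
        "ports": [{"name": "grpc", "containerPort": 10901}],
        "volumeMounts": [{"name": "data", "mountPath": "/prometheus"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "prometheus", "namespace": namespace, "labels": labels},
        "spec": {
            "serviceName": "prometheus-headless",  # hypothetical headless Service
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "serviceAccountName": "prometheus",  # bound by the RBAC objects
                    "containers": [prometheus, sidecar],
                    "volumes": [
                        {"name": "config", "configMap": {"name": "prometheus-config"}}
                    ],
                },
            },
            # One PersistentVolumeClaim per replica keeps TSDB data across restarts.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": "10Gi"}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(prometheus_statefulset(), indent=2))
```

The output can be piped to kubectl apply -f - once the placeholders are filled in. The sidecar mounts the same data volume as Prometheus because it reads the TSDB directory directly, which is why both containers must live in the same pod.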
