How to Build a Scalable Prometheus Monitoring Stack on Kubernetes with Thanos
This article explains why monitoring is essential for production stability, covers Prometheus fundamentals (metric naming conventions, metric types, and PromQL queries), compares high‑availability options such as federation and Thanos, and then walks through a complete Kubernetes deployment including StatefulSets, RBAC, Pushgateway, Alertmanager, and Ingress configuration.
Why Monitoring Matters
Monitoring is a core part of infrastructure that enables developers and operators to detect, locate, and resolve service anomalies quickly, improving overall system reliability.
Prometheus Overview
Prometheus was chosen for its flexible PromQL query language, single‑binary deployment, Go‑based integration, built‑in Web UI, and rich ecosystem (Alertmanager, Pushgateway, Exporters).
Metric Naming and Types
Metric names may contain only ASCII letters, digits, underscores, and colons, and must not begin with a digit.
Best practice: prefix with the application namespace (e.g., process_cpu_seconds_total, http_request_duration_seconds).
Use units as suffixes (e.g., *_seconds, *_bytes).
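The naming rules above can be checked mechanically. The following is a minimal sketch, assuming the classic Prometheus data-model pattern for metric names; the helper names and the suffix list are hypothetical and exist only for this illustration.

```python
import re

# Classic data-model rule: first character is a letter, underscore, or colon;
# the rest may also include digits.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# Hypothetical, non-exhaustive list of conventional suffixes for this sketch.
CONVENTIONAL_SUFFIXES = ("_seconds", "_bytes", "_total")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def has_conventional_suffix(name: str) -> bool:
    """Return True if the name ends with one of the suffixes listed above."""
    return name.endswith(CONVENTIONAL_SUFFIXES)
```

A check like this could run in code review or CI to catch names such as http-requests (hyphen) or 2xx_total (leading digit) before they reach an exporter.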
Prometheus supports four basic metric types:
Counter : monotonically increasing values (e.g., request count).
Gauge : values that can go up or down (e.g., CPU usage).
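Histogram and Summary : both track distributions such as latency or size; a Histogram counts observations in configurable buckets, while a Summary computes quantiles on the client side.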
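To make the differences between these types concrete, here is a small pure-Python sketch of a counter, a gauge, and a histogram with cumulative buckets. It is not the official client library; the class names and bucket bounds are assumptions for illustration, and Summary is left out because its client-side quantile estimation is more involved.

```python
import bisect


class Counter:
    """A value that only goes up (for example, requests served)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """A value that can go up or down (for example, memory in use)."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket and exposes cumulative counts."""

    def __init__(self, bounds) -> None:
        self.bounds = sorted(bounds)
        # One slot per upper bound plus a final slot for the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # Index of the first upper bound that is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, count of observations <= upper_bound) pairs."""
        result, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            result.append((bound, running))
        return result
```

The cumulative form mirrors how histogram buckets are exposed: each bucket counts every observation less than or equal to its upper bound, so the +Inf bucket always equals the total number of observations.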
Time‑Series Fundamentals
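Each sample is a (timestamp, value) pair, and a time series is the sequence of samples that share one metric name and label set. Labels add dimensions: an instant query returns one sample per matching series (a vector), while a query over a time range returns a window of samples per series (a matrix).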
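The data model described above can be sketched as a dictionary keyed by metric name and label set. This is a toy in-memory model, not how the Prometheus TSDB is implemented; the class and method names are hypothetical, samples are assumed to arrive in timestamp order, and the instant lookup ignores the staleness handling a real server applies.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]
Sample = Tuple[float, float]      # (timestamp, value)
SeriesKey = Tuple[str, Labels]    # metric name plus its label set


class TinySeriesStore:
    def __init__(self) -> None:
        self.series: Dict[SeriesKey, List[Sample]] = {}

    def append(self, name: str, labels: Dict[str, str], ts: float, value: float) -> None:
        """Add one sample to the series identified by name and labels."""
        key = (name, frozenset(labels.items()))
        self.series.setdefault(key, []).append((ts, value))

    def _select(self, name: str, matchers: Dict[str, str]):
        """Yield every series whose labels include all equality matchers."""
        wanted = set(matchers.items())
        for (series_name, labels), samples in self.series.items():
            if series_name == name and wanted <= labels:
                yield (series_name, labels), samples

    def instant_query(self, name: str, matchers: Dict[str, str], at: float):
        """One sample per series: the latest one at or before `at`."""
        result = {}
        for key, samples in self._select(name, matchers):
            older = [s for s in samples if s[0] <= at]
            if older:
                result[key] = older[-1]
        return result

    def range_query(self, name: str, matchers: Dict[str, str], start: float, end: float):
        """A list of samples per series: everything with start <= ts <= end."""
        result = {}
        for key, samples in self._select(name, matchers):
            window = [s for s in samples if start <= s[0] <= end]
            if window:
                result[key] = window
        return result
```

The instant result maps each series to a single sample, while the range result maps each series to a list of samples, which is the vector versus matrix distinction.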
PromQL Query Examples
Instant vector query:
http_requests{host="host1",service="web",code="200",env="test"}
Range vector query (last 5 minutes):
http_requests{host="host1",service="web",code="200",env="test"}[5m]
Calculate the per-second request rate over the last 5 minutes:
rate(http_requests{host="host1",service="web",code="200",env="test"}[5m])
Calculate the increase over 5 minutes:
increase(http_requests{host="host1",service="web",code="200",env="test"}[5m])
90th percentile over 10 minutes:
histogram_quantile(0.9, rate(employee_age_bucket_bucket[10m]))
High‑Availability with Thanos
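Federation alone cannot eliminate single points of failure. Thanos adds a global query view on top of multiple Prometheus instances: a Sidecar runs next to each Prometheus instance and exposes its data, and a Store Gateway serves historical blocks from object storage, so queries can span clusters and long time ranges.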
Querier receives requests, forwards them to Sidecars, merges results, and executes PromQL.
Ruler runs rules on the global view and sends alerts.
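The merge step performed by the Querier can be illustrated with a simplified sketch. It assumes each Prometheus replica attaches an external label (called replica here, a name chosen for this example) and that series differing only in that label are duplicates. Real Thanos deduplication is more sophisticated; this version simply keeps the first value seen for each timestamp.

```python
def merge_and_deduplicate(responses, replica_label="replica"):
    """Merge per-replica query results into one deduplicated view.

    `responses` is a list with one entry per replica; each entry is a list of
    (labels_dict, samples) pairs, where samples are (timestamp, value) tuples.
    """
    merged = {}
    for series_list in responses:
        for labels, samples in series_list:
            # Drop the replica label so HA copies collapse into one series.
            key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
            points = merged.setdefault(key, {})
            for ts, value in samples:
                # First replica to report a timestamp wins (a simplification).
                points.setdefault(ts, value)
    # Return samples sorted by timestamp for each deduplicated series.
    return {key: sorted(points.items()) for key, points in merged.items()}
```

With two replicas scraping the same target, a gap in one replica (for example during a restart) is filled by samples from the other, which is the availability benefit the global view provides.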
Storage Options
Prometheus stores data on local disk, which limits elasticity and retention. Remote read/write lets Prometheus send samples to, and query them back from, external TSDBs such as M3DB, InfluxDB, or OpenTSDB for high‑availability storage.
Data Collection Modes
Prometheus primarily uses pull (scrape) mode. Static configs work for a few targets, but service discovery (Consul, file‑based, etc.) scales to dynamic environments. Pushgateway handles short‑lived batch jobs that may exit before a scrape happens: the job pushes its metrics to the gateway, and Prometheus scrapes the gateway.
Pushgateway drawbacks: it keeps pushed metrics until they are explicitly deleted, so stale data lingers, and Prometheus cannot health‑check the jobs that pushed them.
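A push from a batch job can be sketched with the standard library alone. The example below builds HTTP requests against the Pushgateway's /metrics/job/<job> endpoint using the text exposition format; the host name, job name, and metric name are placeholders, and label values are assumed to be simple strings without slashes. The requests are only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def _group_path(job, grouping=None):
    """Build the grouping-key path: /metrics/job/<job>/<label>/<value>..."""
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for label, value in (grouping or {}).items():
        path += "/" + urllib.parse.quote(label, safe="")
        path += "/" + urllib.parse.quote(value, safe="")
    return path


def build_push_request(base_url, job, metrics, grouping=None):
    """Return a PUT request that replaces all metrics in the group."""
    # Text exposition format: one "name value" pair per line, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        base_url.rstrip("/") + _group_path(job, grouping),
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def build_delete_request(base_url, job, grouping=None):
    """Return a DELETE request that removes the group's metrics."""
    return urllib.request.Request(
        base_url.rstrip("/") + _group_path(job, grouping),
        method="DELETE",
    )


# Usage (placeholder host; requires a reachable Pushgateway):
#   req = build_push_request("http://pushgateway.example:9091",
#                            "nightly_backup", {"backup_duration_seconds": 12.5})
#   urllib.request.urlopen(req)
```

Sending the delete request once a job's metrics are no longer relevant is one way to limit the stale-data problem, since the gateway otherwise keeps serving the last pushed values.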
Kubernetes Deployment
A full deployment consists of:
Prometheus StatefulSet with three replicas, a Thanos sidecar container, and a watch container that reloads the configuration when the ConfigMap changes.
RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding) granting access to services, pods, nodes, and configmaps.
Thanos components : Query, Store‑gateway, and Ruler Deployments with appropriate arguments (e.g., --store=dnssrv+thanos-store-gateway.default.svc).
Pushgateway Deployment (15 replicas) exposing port 9091.
Alertmanager Deployment (3 replicas) with clustering configuration.
Services for each component (Prometheus, Thanos Query, Thanos Store‑gateway, Pushgateway, Alertmanager) using LoadBalancer or ClusterIP as needed.
Ingress rules routing / to Thanos Query, /alertmanager to Alertmanager, /rule to Thanos Ruler, and /metrics to Pushgateway.
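As a rough illustration of the first item in this list, the sketch below generates a StatefulSet manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. It is not the article's original manifest: the image references, ServiceAccount name, storage size, and resource names are placeholders, and the watch container and ConfigMap volume are omitted for brevity.

```python
import json


def prometheus_statefulset(
    name="prometheus",
    replicas=3,
    prometheus_image="PROMETHEUS_IMAGE",  # placeholder, e.g. a pinned prom/prometheus tag
    thanos_image="THANOS_IMAGE",          # placeholder, e.g. a pinned quay.io/thanos/thanos tag
    storage="10Gi",                       # assumed size for this sketch
):
    """Return a StatefulSet with a Prometheus container and a Thanos sidecar."""
    labels = {"app": name}
    data_mount = {"name": "data", "mountPath": "/prometheus"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "serviceAccountName": name,  # assumes the RBAC objects use this name
                    "containers": [
                        {
                            "name": "prometheus",
                            "image": prometheus_image,
                            "args": [
                                "--config.file=/etc/prometheus/prometheus.yml",
                                "--storage.tsdb.path=/prometheus",
                            ],
                            "ports": [{"containerPort": 9090}],
                            "volumeMounts": [data_mount],
                        },
                        {
                            # The sidecar reads the same TSDB directory as Prometheus.
                            "name": "thanos-sidecar",
                            "image": thanos_image,
                            "args": [
                                "sidecar",
                                "--tsdb.path=/prometheus",
                                "--prometheus.url=http://localhost:9090",
                            ],
                            "volumeMounts": [data_mount],
                        },
                    ],
                },
            },
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(prometheus_statefulset(), indent=2))
```

Generating the manifest from code keeps the shared volume mount and labels consistent between the two containers; the output could be piped to kubectl apply -f - once the placeholders are replaced with real values.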
After applying the manifests, the Prometheus UI should list the scrape targets as healthy, confirming that the monitoring stack is collecting metrics.
dbaplus Community