Why Master Prometheus + Grafana for Full‑Stack Monitoring on Kubernetes
In cloud-native environments, Prometheus and Grafana have become the de facto standard for full-stack monitoring across servers, databases, containers, and applications. The stack's multi-dimensional data model, flexible queries, and built-in alerting make it a core competency for developers, SREs, and operations engineers. It is reportedly used by more than 90% of enterprises and is valued highly in the job market.
What is Prometheus?
Prometheus, originally open-sourced by SoundCloud, is a monitoring and alerting system built around a time-series database (TSDB). It was the second project to join the CNCF, after Kubernetes, and also the second to graduate. It offers a multi-dimensional data model, the PromQL query language, efficient time-series storage, and native alerting.
Core Features
Multi-dimensional data model and flexible queries – every metric carries labels, so series can be filtered and aggregated along any combination of dimensions with PromQL, either in the UI or over the HTTP API.
Local storage on the server node – the built-in TSDB writes to local disk, handles millions of samples per second, and can forward data to external TSDBs such as OpenTSDB through its remote-storage integrations.
Open metric data standard – metrics are collected via HTTP pull, with optional push support through a gateway.
Static configuration and service discovery – targets can be listed statically in files or discovered automatically through mechanisms such as Kubernetes and Consul.
Easy to maintain – a single binary and ready‑to‑run container image.
Scalable collection – scraping can be split across several Prometheus servers and combined through federation, which supports large-scale cluster monitoring.
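To make the "PromQL and an HTTP API" feature concrete, here is a minimal sketch that builds a request URL for the instant-query endpoint, /api/v1/query. The server address http://localhost:9090, the helper name instant_query_url, and the example expression are assumptions for illustration; the sketch only constructs the URL and does not contact a server.

```python
from typing import Optional
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, time: Optional[float] = None) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query).

    base_url is an assumed server address; nothing is sent over the network.
    """
    params = {"query": promql}
    if time is not None:
        # Optional evaluation timestamp, expressed in Unix seconds.
        params["time"] = f"{time:.3f}"
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Per-instance rate of non-idle CPU seconds over the last 5 minutes.
PROMQL = 'sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'

if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", PROMQL))
```

Grafana issues requests of this kind against the same API when it renders a panel, which is why a query tested in the Prometheus UI can usually be pasted into a dashboard unchanged.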
Core Components
Prometheus Server
Retrieval – pulls metrics from configured targets.
TSDB – stores the scraped time‑series locally, with optional remote storage.
PromQL engine – evaluates PromQL queries and performs aggregation.
HTTP Server – exposes the API used by Grafana, Alertmanager, etc.
Exporter – metric collectors deployed on monitored hosts that expose a /metrics endpoint.
Official exporters: Node Exporter, Blackbox Exporter, etc.
Third‑party exporters: MySQL Exporter, Redis Exporter, Nginx Exporter, and many more.
Custom exporters: SDK‑based instrumentation embedded in application code.
Pushgateway – receives metrics pushed from short‑lived jobs and makes them available for Prometheus to scrape.
Alertmanager – receives alerts from Prometheus, then deduplicates, groups, routes, and silences them before forwarding them to notification channels, which curbs alert storms and duplicate notifications.
Rules files
Recording rules – pre‑compute frequently used PromQL expressions and store the results as new series to boost query performance in high‑traffic scenarios.
Alerting rules – define the trigger condition, how long it must hold, and the labels and annotations to attach; once the condition has held for that duration, Prometheus sends an alert to Alertmanager.
Grafana – open‑source visualization platform that natively integrates with Prometheus, providing dashboards, panels, and templating to turn time‑series data into clear visual insights.
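What an exporter serves at /metrics is plain text in the Prometheus exposition format. The sketch below renders one metric family in that format. The function names and the metric app_requests_total are hypothetical, and a real exporter would normally use an official client library and serve the result over HTTP rather than formatting strings by hand.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(
    name: str,
    help_text: str,
    metric_type: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render one metric family as exposition-format text.

    samples is a list of (labels, value) pairs; labels may be empty.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable from scrape to scrape.
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "app_requests_total",  # hypothetical metric name
        "Total requests handled.",
        "counter",
        [({"method": "GET", "code": "200"}, 1027)],
    ))
```

Prometheus scrapes this text on each interval and adds the timestamp itself, so the exporter only needs to report current values.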
Time‑Series Data Model
Each sample consists of four parts:
Metric name – e.g. node_cpu_seconds_total
Label set – key‑value pairs such as cpu="0",mode="idle" that enable multi‑dimensional queries.
Timestamp – millisecond‑precision capture time.
Value – floating‑point measurement of the metric.
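The four parts can be pictured as a small data structure: the metric name plus the label set identifies a series, and each series holds a list of (timestamp, value) samples. The following is a toy in-memory sketch of that idea only. The class and function names are made up for illustration and do not reflect how the real TSDB lays out data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


@dataclass(frozen=True)
class Sample:
    timestamp_ms: int  # millisecond-precision capture time
    value: float       # floating-point measurement


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """A series is identified by its metric name and its sorted label set."""
    return (name, tuple(sorted(labels.items())))


class TinySeriesStore:
    """Toy store: maps each series identity to its list of samples."""

    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Sample]] = {}

    def append(self, name: str, labels: Dict[str, str], timestamp_ms: int, value: float) -> None:
        key = series_key(name, labels)
        self._series.setdefault(key, []).append(Sample(timestamp_ms, float(value)))

    def select(self, name: str, **matchers: str) -> Dict[SeriesKey, List[Sample]]:
        """Return series with this name whose labels equal every matcher."""
        selected = {}
        for (series_name, label_items), samples in self._series.items():
            labels = dict(label_items)
            if series_name == name and all(labels.get(k) == v for k, v in matchers.items()):
                selected[(series_name, label_items)] = samples
        return selected
```

Selecting with mode="idle" here plays the role of the PromQL selector node_cpu_seconds_total{mode="idle"}: any label can serve as a query dimension without being declared in advance.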
Metric Types
Counter – a monotonically increasing value that resets to zero only when the process restarts; used for total requests, errors, CPU time, and traffic. Example: http_requests_total.
Gauge – can increase or decrease, represents current values such as CPU usage, memory, connections. Example: node_memory_MemAvailable_bytes.
Histogram – observations are counted into buckets, so quantiles (P95/P99) can be estimated on the server side and aggregated across instances; suited to latency and request-size distributions. Example: http_request_duration_seconds.
Summary – client‑side quantile calculation, not aggregatable across instances; used for per‑instance latency statistics. Example: rpc_request_duration_seconds.
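Two of these types have behavior that is easier to see in code: a counter's increase has to tolerate resets, and a histogram quantile is estimated by interpolating inside a bucket. The sketch below is a simplified illustration of both ideas. The function names mirror the PromQL functions increase() and histogram_quantile(), but the real implementations handle more edge cases (increase() also extrapolates to the window boundaries), so treat this as an approximation under those assumptions.

```python
import math
from typing import List, Sequence, Tuple


def counter_increase(values: Sequence[float]) -> float:
    """Total increase across raw counter samples, tolerating resets.

    A drop between two samples is read as a restart from zero, so the
    whole new value counts as increase.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The highest bucket is expected to have upper_bound = +Inf. Values are
    assumed to be spread evenly inside each bucket (linear interpolation),
    and the lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


if __name__ == "__main__":
    print(counter_increase([100, 120, 5, 15]))
    latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, latency))
```

Because bucket counts are plain counters, they can be summed across instances before the quantile is estimated. That is not possible with a summary, whose quantiles are computed inside each client.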
Jobs and Instances
Instance – a single scrape target exposing a /metrics endpoint, typically identified by IP and port (e.g., 192.168.1.10:9100).
Job – a collection of instances that share the same role, such as all Node Exporter instances across hosts.
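The relationship between jobs and instances shows up as two labels, job and instance, that Prometheus attaches to every series it scrapes. The sketch below imitates that step for a hypothetical job named node. The helper names are invented, and the conflict handling (renaming a scraped label to exported_<name>) mirrors the default behavior when honor_labels is not enabled.

```python
from typing import Dict, List


def expand_targets(scrape_jobs: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn {job name: [host:port, ...]} into one label set per instance."""
    targets = []
    for job, addresses in scrape_jobs.items():
        for address in addresses:
            targets.append({"job": job, "instance": address})
    return targets


def attach_target_labels(scraped: Dict[str, str], target: Dict[str, str]) -> Dict[str, str]:
    """Merge target labels into a scraped sample's labels.

    If the scraped sample already has a label the target also sets, the
    scraped one is kept under an exported_ prefix and the target wins.
    """
    merged = {}
    for key, value in scraped.items():
        merged["exported_" + key if key in target else key] = value
    merged.update(target)
    return merged


if __name__ == "__main__":
    jobs = {"node": ["192.168.1.10:9100", "192.168.1.11:9100"]}  # assumed hosts
    for target in expand_targets(jobs):
        print(attach_target_labels({"cpu": "0", "mode": "idle"}, target))
```

With these labels in place, a selector such as {job="node"} covers every Node Exporter at once, while adding instance="192.168.1.10:9100" narrows the query to a single host.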
Common Ports
9090 – Prometheus
9100 – Node Exporter
9093 – Alertmanager
3000 – Grafana
Linux Cloud-Native Ops Stack
Focused on practical internet operations: server monitoring, troubleshooting, automated deployment, and cloud-native technology. Coverage runs from Linux basics to advanced Kubernetes and from ops tooling to architecture optimization, to help engineers avoid pitfalls and grow quickly.
