Mastering Prometheus on Kubernetes: A Step‑by‑Step Guide for Cloud‑Native Monitoring
This article introduces Prometheus fundamentals, its architecture, and its metric types. It then walks through a complete Kubernetes deployment (namespace, RBAC, ConfigMap, and various exporters), showing how to collect metrics, configure alerts, and visualize data with Grafana, and closes with limitations and future improvements.
Background
The feature platform uses PyFlink on Kubernetes for ETL, pulling data from HBase, Hive, and relational databases into a unified feature store for data scientists, engineers, and ML engineers. It addresses four problems: scattered storage, duplicated features, complex extraction, and difficult usage.
The project built its own Kubernetes container management platform running Flink, Zeppelin, etc. Monitoring the K8s cluster and alerting on anomalies is essential; Prometheus is the chosen solution.
Prometheus Overview
What is Prometheus?
Prometheus is an open-source monitoring system, originally developed at SoundCloud and written in Go, and now a CNCF graduated project. It offers a multi-dimensional data model, pull-based collection over HTTP, the PromQL query language, autonomous single-server nodes with no dependency on distributed storage, service discovery, and a rich ecosystem.
It combines a monitoring/alerting system with a built‑in time‑series database (TSDB).
Architecture
Key components include Prometheus Server (scrapes targets, stores data, provides PromQL), Pushgateway (short‑lived jobs), Exporters (expose metrics), Alertmanager (deduplicate, group, route alerts), and service‑discovery mechanisms.
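To make the wiring between these components concrete, here is a minimal sketch of a prometheus.yml. It is not the platform's actual configuration: the Service hostnames, ports, intervals, rule file path, and the prometheus.io/scrape annotation convention are all assumptions chosen for illustration.

```yaml
# prometheus.yml: illustrative sketch, not the platform's real configuration.
global:
  scrape_interval: 30s      # how often targets are scraped (assumed value)
  evaluation_interval: 30s  # how often rules are evaluated (assumed value)

# Where firing alerts are sent; Alertmanager deduplicates, groups, and routes them.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.prometheus.svc:9093"]  # hypothetical Service name

# Alerting and recording rules loaded by the server.
rule_files:
  - /etc/prometheus/rules/*.yml  # assumed mount path

scrape_configs:
  # Prometheus scraping its own /metrics endpoint.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Short-lived jobs push to the Pushgateway; Prometheus then pulls from it.
  - job_name: pushgateway
    honor_labels: true  # keep the job/instance labels set by the pushing job
    static_configs:
      - targets: ["pushgateway.prometheus.svc:9091"]  # hypothetical Service name

  # Service discovery: find pods through the Kubernetes API.
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      # (a common convention, not something Prometheus enforces).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Exporters need no dedicated section: each one is simply another scrape target, found either statically or through service discovery.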
Metric Types
Metric format
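Each sample consists of a metric name with labels, a timestamp (millisecond precision), and a float64 value. For example, http_requests_total{method="GET",status="200"} names one time series, and every scrape appends a (timestamp, value) pair to it.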
Sample types
Counter – only increases (e.g., total HTTP requests).
Gauge – can increase or decrease (e.g., current memory usage).
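Histogram – counts observations into configurable buckets and also exposes a sum and a count, so quantiles can be estimated at query time.
Summary – exposes quantiles pre-computed on the client side, plus a sum and a count; these quantiles cannot be meaningfully aggregated across instances.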
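Each type is typically queried in a different way. The sketch below shows one recording rule per type, assuming the metrics that Prometheus itself and node-exporter usually expose; the rule names and the file name are hypothetical.

```yaml
# metric-types.rules.yml: illustrative recording rules, one per metric type.
groups:
  - name: metric-type-examples
    rules:
      # Counter: the raw value only grows, so query its per-second rate.
      - record: job:prometheus_http_requests:rate5m
        expr: sum by (job) (rate(prometheus_http_requests_total[5m]))

      # Gauge: the current value is meaningful on its own.
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Histogram: estimate a quantile at query time from the _bucket series.
      - record: job:prometheus_http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(prometheus_http_request_duration_seconds_bucket[5m]))
          )

      # Summary: the quantile was computed by the client and is exposed as a label.
      - record: instance:go_gc_duration_seconds:p75
        expr: go_gc_duration_seconds{quantile="0.75"}
```

Note that the histogram rule aggregates across instances before computing the quantile, which is not possible with the summary's pre-computed values.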
Limitations
Not suitable for logs, tracing, or events; requires complementary tools such as Fluentd and Elasticsearch.
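The pull model requires the server to reach every target over the network, which may need careful network planning at scale.
Local storage is intended for short-term data (≈1 month; the default retention is 15 days); long-term storage needs remote back-ends.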
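The storage limitation is usually handled in two places, sketched below as two separate fragments. The file paths, the 30d value, and the remote endpoint URL are assumptions; the source does not say which remote back-end, if any, the platform uses.

```yaml
# Fragment of a Prometheus container spec: retention is a command-line flag,
# not a prometheus.yml setting.
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=30d  # keep roughly one month locally
---
# Fragment of prometheus.yml: forward samples to a long-term store.
remote_write:
  - url: http://long-term-store.example.internal/api/v1/write  # hypothetical endpoint
```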
Practical Deployment on the Feature Platform
The platform runs Flink, Zeppelin, Elasticsearch, etc., on a Kubernetes cluster (Huawei Cloud). Prometheus monitors node performance (node‑exporter), container performance (cAdvisor), cluster state (kube‑state‑metrics), and Elasticsearch (elastic‑exporter). Alerts are sent to Alertmanager, which forwards them to Feishu via webhook.
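The alert path can be sketched as an alertmanager.yml with a single webhook receiver. This is an assumption-laden illustration: the timing values and grouping labels are arbitrary, and the URL points to a hypothetical adapter service, since the source does not say whether alerts reach Feishu directly or through a translation layer.

```yaml
# alertmanager.yml: illustrative sketch of routing every alert to one webhook.
route:
  receiver: feishu
  group_by: ["alertname", "namespace"]
  group_wait: 30s       # wait to batch alerts that fire together
  group_interval: 5m    # minimum gap between notifications for a group
  repeat_interval: 4h   # re-send while an alert keeps firing

receivers:
  - name: feishu
    webhook_configs:
      # Hypothetical adapter that reshapes Alertmanager's JSON payload into
      # the message format the Feishu bot expects.
      - url: http://feishu-adapter.prometheus.svc:8080/alert
        send_resolved: true  # also notify when the alert clears
```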
Installation Steps
Create a dedicated namespace and RBAC rules for Prometheus.
Store the Prometheus configuration in a ConfigMap (scrape interval, alerting, rule files, target definitions).
Deploy Prometheus Server, Alertmanager, Grafana, kube-state-metrics, node-exporter, and the other exporters as Deployment or DaemonSet resources.
Expose Prometheus and Grafana via NodePort services for external access.
Define alerting rules for pod failures, node issues, Elasticsearch health, CPU/memory thresholds, etc.
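Two of these steps are sketched below: alert rules packaged as a ConfigMap, and a NodePort Service for the Prometheus UI. The object names, thresholds, pod label, and node port are assumptions, and the metric names assume standard kube-state-metrics and node-exporter deployments. Elasticsearch rules are left out because their metric names depend on the exporter in use.

```yaml
# Illustrative alert rules, packaged as a ConfigMap in the prometheus namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules   # hypothetical name
  namespace: prometheus
data:
  platform.rules.yml: |
    groups:
      - name: platform-alerts
        rules:
          - alert: PodRestartingFrequently
            expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"

          - alert: NodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.node }} is not Ready"

          - alert: NodeHighCpu
            expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage on {{ $labels.instance }} is above 80%"

          - alert: NodeHighMemory
            expr: ((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Memory usage on {{ $labels.instance }} is above 85%"
---
# NodePort Service exposing the Prometheus UI outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: prometheus
spec:
  type: NodePort
  selector:
    app: prometheus       # assumed pod label
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090     # assumed; must fall in the cluster's NodePort range
```

The `for` clause keeps short spikes from paging anyone: an expression must stay true for the whole duration before the alert fires.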
Key YAML Snippets
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes","services","endpoints","pods"]
    verbs: ["get","list","watch"]
...
After applying all manifests with kubectl apply -f prometheus-all.yaml, the monitoring stack becomes operational.
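A ClusterRole grants nothing until it is bound to an identity. The snippet above is truncated, so the following is only a sketch of the usual companion objects rather than the platform's actual manifest; the ServiceAccount name prometheus is an assumption.

```yaml
# Identity the Prometheus pod runs as (assumed name).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: prometheus
---
# Binds the prometheus ClusterRole to that ServiceAccount cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: prometheus
```

For the binding to take effect, the Prometheus pod spec would also need serviceAccountName: prometheus.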
Result Showcase
Alert rules displayed in Alertmanager UI.
Target scrape status view.
Feishu group notifications.
Grafana dashboards visualizing metrics.
Conclusion
The article presented Prometheus fundamentals, a complete deployment on a Kubernetes‑based feature platform, and integration with alerting channels. While the current setup is basic and suitable for a small cluster, future work includes high‑availability Prometheus, richer rule sets, and scaling considerations.