How to Build a Flexible Kubernetes Monitoring System for Big Data with kube‑prometheus
This article explains how to design and implement a lightweight, flexible monitoring solution for big‑data components running on Kubernetes using kube‑prometheus, covering metric exposure methods, scrape configurations, alert rule design, exporter deployment, and practical examples with code snippets.
Introduction
This article introduces a monitoring system based on kube‑prometheus that collects metrics from applications running on Kubernetes and provides alerting capabilities in a simple and flexible way.
Background
Monitoring is a critical pain point for big‑data platforms, which require not only stable operation but also performance evaluation and optimization. Prometheus, as the most popular cloud‑native monitoring tool, integrates well with Kubernetes‑based big‑data components.
Design Overview
Monitoring Targets
Big‑data components running as pods in a Kubernetes cluster.
Metric Exposure Methods
Directly expose Prometheus metrics (pull).
Push metrics to prometheus‑pushgateway (push); see the scrape‑config sketch after this list.
Use a custom exporter to convert other formats to Prometheus format (pull).
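Even when metrics are pushed, Prometheus still pulls in the end: it scrapes the pushgateway itself. The scrape job below is a minimal sketch; the job name and service address are assumptions, and honor_labels: true keeps the job/instance labels attached at push time from being overwritten by the gateway's own labels.

scrape_configs:
- job_name: pushgateway                            # hypothetical job name
  honor_labels: true                               # preserve labels set by the pushing client
  static_configs:
  - targets:
    - prometheus-pushgateway.monitoring.svc:9091   # assumed in-cluster service address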
Metric Scraping
Prometheus mainly pulls metrics from exporter or pushgateway endpoints. On Kubernetes, the supported scrape-job types are native Job, PodMonitor, and ServiceMonitor. PodMonitor and ServiceMonitor are the preferred ways to declare targets, while a native Job with kubernetes_sd_config and relabel rules can be used for custom service discovery. A minimal PodMonitor is sketched below.
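The PodMonitor below is a minimal sketch with assumed names (the app: bigdata-exporter label and the metrics port): it tells the Prometheus Operator to scrape the named port of every matching pod every 30 seconds.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: bigdata-exporter
  namespace: monitoring
spec:
  namespaceSelector:
    any: true                  # match pods in every namespace
  selector:
    matchLabels:
      app: bigdata-exporter    # assumed pod label
  podMetricsEndpoints:
  - port: metrics              # named container port to scrape
    path: /metrics
    interval: 30s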
Alert Design
Alert Flow
Service anomaly occurs.
Prometheus generates an alert.
Alertmanager receives the alert.
Alertmanager processes the alert according to configured rules (grouping, silencing, notifications).
Dynamic Alert Configuration
Alert configuration is split into two parts: the Alertmanager configuration (the handling strategy: grouping, silencing, routing) and the alert rules (the specific firing conditions), the latter managed as PrometheusRule resources in Kubernetes. An example Alertmanager configuration:
global:
  resolve_timeout: 5m
receivers:
- name: 'default'
- name: 'test.web.hook'
  webhook_configs:
  - url: 'http://alert-url'
route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  group_by: [groupId, instanceId]
  routes:
  - receiver: 'test.web.hook'
    continue: true
    match:
      groupId: node-disk-usage
  - receiver: 'test.web.hook'
    continue: true
    match:
      groupId: kafka-topic-highstore

Technical Implementation
Deploying Prometheus on Kubernetes
Use the kube‑prometheus project, which provides Jsonnet templates to generate Kubernetes manifests. The deployment includes CRDs, ServiceMonitors, PodMonitors, and alerting components.
# Create namespace and CRDs
$ kubectl create -f manifests/setup
# Wait until the ServiceMonitor CRD is registered
$ until kubectl get servicemonitors --all-namespaces; do date; sleep 1; echo ""; done
# Apply the remaining manifests
$ kubectl create -f manifests/

Enhanced Configuration with kubernetes_sd_config + relabel
Leverage Prometheus's native Kubernetes service discovery to find pods automatically and rewrite their labels with relabel rules, which avoids creating a large number of PodMonitor objects. A sketch of such a scrape job follows.
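The scrape job below is a sketch rather than the article's exact configuration: the job name is hypothetical, and the annotation names assume the bigData.metrics annotations used by the bigdata-exporter in the next section (Prometheus sanitizes the dots and slashes in annotation keys to underscores in the __meta_* labels).

scrape_configs:
- job_name: bigdata-pods        # hypothetical job name
  kubernetes_sd_configs:
  - role: pod                   # discover every pod in the cluster
  relabel_configs:
  # Keep only pods that opt in via the scrape annotation.
  - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_scrape]
    action: keep
    regex: "true"
  # Honor the annotated scheme (http or https).
  - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  # Use the annotated metrics path instead of the default /metrics.
  - source_labels: [__meta_kubernetes_pod_annotation_bigData_metrics_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Rewrite the target address to the annotated port.
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_bigData_metrics_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Carry the pod name through as an ordinary label.
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod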
bigdata‑exporter Implementation
A custom exporter collects metrics from the various big‑data components (HDFS, YARN, HBase, etc.) and exposes them in Prometheus format. It uses pod labels and annotations to discover targets and pick the parsing rules, for example:
labels:
  bigData.metrics.object: pod
annotations:
  bigData.metrics/scrape: "true"
  bigData.metrics/scheme: "https"
  bigData.metrics/path: "/jmx"
  bigData.metrics/port: "29871"
  bigData.metrics/role: "hdfs-nn,common"

Alert Rule Examples
Disk usage alert (${path}, ${thresholdValue}, and similar variables are placeholders filled in when the rule is generated):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-usage
  namespace: monitoring
spec:
  groups:
  - name: node-disk-usage
    rules:
    - alert: node-disk-usage
      expr: 100 * (1 - node_filesystem_avail_bytes{mountpoint="${path}"} / node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}
      for: 1m
      labels:
        groupId: node-disk-usage
        userIds: super
        receivers: SMS
      annotations:
        title: "Disk warning: node {{$labels.instance}} ${path} usage {{ $value }}%"
        content: "Disk warning: node {{$labels.instance}} ${path} usage {{ $value }}%"

Kafka lag alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-topic-highstore-${uniqueName}
  namespace: monitoring
spec:
  groups:
  - name: kafka-topic-highstore
    rules:
    - alert: kafka-topic-highstore-${uniqueName}
      expr: sum(kafka_consumergroup_lag{exporterType="kafka",consumergroup="${consumergroup}"}) > ${thresholdValue}
      for: 1m
      labels:
        groupId: kafka-topic-highstore
        instanceId: ${uniqueName}
        userIds: super
        receivers: SMS
      annotations:
        title: "KAFKA warning: consumer group ${consumergroup} lag {{ $value }}"
        content: "KAFKA warning: consumer group ${consumergroup} lag {{ $value }}"

Exporter Placement
Exporters can run as sidecars (one exporter per target pod) or as independent deployments (one exporter for many targets). A sidecar is bound to the lifecycle of the target pod and can reach it over localhost, while an independent exporter is more loosely coupled and can monitor multiple instances. A minimal sidecar sketch follows.
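The pod spec below is a sketch with assumed image names and port: the exporter container runs beside an HDFS NameNode container and can scrape it over localhost, since containers in a pod share a network namespace; the app label matches the PodMonitor sketched earlier.

apiVersion: v1
kind: Pod
metadata:
  name: hdfs-namenode
  labels:
    app: bigdata-exporter             # matches the PodMonitor selector shown earlier
spec:
  containers:
  - name: namenode
    image: hdfs-namenode:example      # assumed big-data component image
  - name: exporter
    image: bigdata-exporter:example   # assumed exporter image
    ports:
    - name: metrics                   # named port referenced by the PodMonitor
      containerPort: 9999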
Utility Commands
Check metric format with promtool:
# Enter pod
$ kubectl -n monitoring exec -it prometheus-k8s-0 -- sh
# Show help
$ promtool -h
# Validate metrics
$ curl -s http://ip:9999/metrics | promtool check metrics

Port‑forward services for external access:
# Prometheus
$ nohup kubectl port-forward --address 0.0.0.0 service/prometheus-k8s 19090:9090 -n monitoring &
# Grafana
$ nohup kubectl port-forward --address 0.0.0.0 service/grafana 13000:3000 -n monitoring &
# Alertmanager
$ nohup kubectl port-forward --address 0.0.0.0 service/alertmanager-main 9093:9093 -n monitoring &

ARM Image Support
kube‑prometheus pulls images from several registries, and ARM support varies by image: for example, quay.io/prometheus/prometheus:v2.11.0 ships ARM builds, while some of the coreos images do not. Check each image before deploying on ARM nodes.