
How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This guide explains how to design, configure, and implement a Prometheus‑based monitoring solution for big‑data components running in Kubernetes, covering metric exposure methods, scrape configurations, alerting architecture, dynamic rule management, exporter deployment, and practical examples with full YAML snippets.


Big‑data platforms running on Kubernetes need reliable monitoring and alerting to ensure stable operation and to guide performance tuning. Prometheus, the de facto standard for cloud‑native monitoring, can collect metrics from components via direct exposure, a push gateway, or custom exporters.

Design Overview

The monitoring system must answer four questions: what to monitor, how the targets expose metrics, how Prometheus scrapes those metrics, and how alert rules are dynamically configured and managed.

Monitoring Targets

All big‑data services run as pods in a Kubernetes cluster. Each pod may expose metrics directly, push them to prometheus-pushgateway, or use a custom exporter that converts native metrics to the Prometheus format.

Metric Exposure Methods

Directly expose Prometheus metrics (pull model).

Push metrics to prometheus-pushgateway (push model).

Deploy a custom exporter that translates component‑specific metrics into Prometheus format.

Most components already provide official or third‑party exporters; otherwise a custom exporter can be built.
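For the pull model, a component only needs to serve text‑format metrics on an HTTP port. A minimal pod spec might look like the following sketch (the pod name, image, and label are illustrative assumptions; the port matches the annotation example later in this article):

apiVersion: v1
kind: Pod
metadata:
  name: bigdata-component        # illustrative name
  labels:
    app: bigdata-component       # label used later for discovery
spec:
  containers:
    - name: app
      image: example/bigdata-component:latest   # illustrative image
      ports:
        - name: metrics          # named port that scrape configs can reference
          containerPort: 19091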

Scrape Configuration

Prometheus pulls metrics using one of the following job types:

Native job: manual scrape_configs entries.

PodMonitor: a Prometheus Operator CRD that discovers pods via label selectors.

ServiceMonitor: a Prometheus Operator CRD that discovers services and their endpoints.

When running in Kubernetes, PodMonitor is usually preferred for its simplicity. The selectors are defined in prometheus-prometheus.yaml and should include serviceMonitorSelector, podMonitorSelector, ruleSelector, and the alertmanagers endpoint. Alternatively, kubernetes_sd_configs with relabeling can discover annotated pods automatically, as in the annotation example below.
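In prometheus-prometheus.yaml, those selectors might be wired up roughly as follows (an illustrative fragment, not the article's exact file; empty selectors match all monitors under the Operator's defaults):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector: {}     # pick up all ServiceMonitors
  podMonitorSelector: {}         # pick up all PodMonitors
  ruleSelector:
    matchLabels:
      role: alert-rules          # only PrometheusRules carrying this label
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager-main
        port: web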

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/scheme: "http"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "19091"
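With the Operator, the equivalent discovery can be expressed as a PodMonitor. A minimal sketch might look like this (the label selector and port name are assumptions for illustration):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: bigdata-component
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: bigdata-component     # pods to discover (illustrative label)
  namespaceSelector:
    any: true                    # search all namespaces
  podMetricsEndpoints:
    - port: metrics              # named container port
      path: /metrics
      interval: 30s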

Alerting Design

Prometheus alerting follows a five‑step flow: service failure → Prometheus fires an alert → Alertmanager receives it → Alertmanager applies routing, grouping, and inhibition rules → notifications are sent (e.g., SMS, email, webhook).

Dynamic Alert Configuration

Alerting configuration is split into two parts:

alertmanager: defines receivers and routing policies.

alertRule: contains the actual PrometheusRule objects.

Custom alert platforms are integrated via webhook receivers; Alertmanager handles grouping and suppression, while the custom platform performs business‑specific processing.

Alertmanager Example

# Load alertmanager.yaml into the secret read by the Alertmanager pods
kubectl -n monitoring create secret generic alertmanager-main \
  --from-file=alertmanager.yaml --dry-run=client -o yaml | \
  kubectl -n monitoring apply -f -

global:
  resolve_timeout: 5m
receivers:
  - name: 'default'
  - name: 'test.web.hook'
    webhook_configs:
      - url: 'http://alert-url'
route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  group_by: [groupId,instanceId]
  routes:
    - receiver: 'test.web.hook'
      continue: true
      match:
        groupId: node-disk-usage
    - receiver: 'test.web.hook'
      continue: true
      match:
        groupId: kafka-topic-highstore

AlertRule Example (Disk Usage)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-usage
  namespace: monitoring
spec:
  groups:
    - name: node-disk-usage
      rules:
        - alert: node-disk-usage
          expr: 100*(1 - node_filesystem_avail_bytes{mountpoint="${path}"}/node_filesystem_size_bytes{mountpoint="${path}"}) > ${thresholdValue}
          for: 1m
          labels:
            groupId: node-disk-usage
            userIds: super
            receivers: SMS
          annotations:
            title: "Disk warning: node {{ $labels.instance }} path ${path} usage {{ $value }}%"
            content: "Disk warning: node {{ $labels.instance }} path ${path} usage {{ $value }}%"

Exporter Implementation

A generic bigdata-exporter collects JMX metrics from components such as HDFS, YARN, HBase, and Kafka, then exposes them in Prometheus format. Targets are discovered using pod labels and annotations.

labels:
  bigData.metrics.object: pod
annotations:
  bigData.metrics/scrape: "true"
  bigData.metrics/scheme: "https"
  bigData.metrics/path: "/jmx"
  bigData.metrics/port: "29871"
  bigData.metrics/role: "hdfs-nn,common"
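Deployed as an independent service, the exporter needs permission to list and watch pods so it can resolve scrape targets from the labels and annotations above. A hedged Deployment sketch (the image name, service account, and port are assumptions, not from the repository):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bigdata-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bigdata-exporter
  template:
    metadata:
      labels:
        app: bigdata-exporter
    spec:
      serviceAccountName: bigdata-exporter   # needs RBAC to list/watch pods
      containers:
        - name: exporter
          image: example/bigdata-exporter:latest   # illustrative image
          ports:
            - name: metrics
              containerPort: 19091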

Deployment Steps

Deploy kube-prometheus (or prometheus-operator) matching the Kubernetes version (e.g., k8s 1.14 → kube‑prometheus 0.3).

Generate manifests with jsonnet (or use the provided defaults in the manifests directory).

Apply the manifests to the cluster.

Optionally expose services externally with kubectl port-forward:

# Prometheus UI
nohup kubectl port-forward --address 0.0.0.0 service/prometheus-k8s 19090:9090 -n monitoring &
# Grafana UI
nohup kubectl port-forward --address 0.0.0.0 service/grafana 13000:3000 -n monitoring &
# Alertmanager UI
nohup kubectl port-forward --address 0.0.0.0 service/alertmanager-main 9093:9093 -n monitoring &

Validate metric syntax with promtool inside the Prometheus pod:

# Enter the pod
kubectl -n monitoring exec -it prometheus-k8s-0 -- sh
# Check metrics
curl -s http://localhost:9090/metrics | promtool check metrics

Additional Tips

for defines how long a condition must hold before the alert fires.

group_wait controls the initial delay before the first notification for a new group is sent.

group_interval applies when new alerts join an existing group.

repeat_interval controls how often unchanged alerts are re‑sent, including recovery notifications.

Exporters can run as sidecars (1:1) or as independent services (1:many) depending on coupling requirements.
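The sidecar (1:1) variant places the exporter next to the component in the same pod, so it can reach the JMX endpoint over localhost. A hypothetical sketch (container images and the flag name are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: hdfs-namenode
spec:
  containers:
    - name: namenode
      image: example/hdfs-namenode:latest     # exposes JMX on localhost:29871
    - name: jmx-exporter                      # sidecar converts JMX to Prometheus format
      image: example/bigdata-exporter:latest
      args: ["--jmx.url=http://localhost:29871/jmx"]   # illustrative flag
      ports:
        - name: metrics
          containerPort: 19091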

Full configuration files are available in the GitHub repository: https://github.com/linshenkx/kube-prometheus-enhance

Tags: cloud-native, Kubernetes, alerting, Prometheus, Exporters, Big Data Monitoring, kube-prometheus
Written by Java Architect Handbook, a channel focused on Java interview questions and practical articles covering algorithms, databases, Spring Boot, microservices, high concurrency, the JVM, Docker containers, and the ELK stack.
