How to Build a Real‑Time Flink Metrics Dashboard with Prometheus & Grafana
This article explains how to monitor Flink jobs running on YARN by leveraging Flink metrics, configuring reporters, defining custom metrics, and visualizing the data in real time with Prometheus, Grafana, and Graphite‑exporter, complete with deployment diagrams and code examples.
Problem Statement
When running Flink jobs on a YARN cluster, teams often need to answer four key questions:
Are long‑running jobs stable?
Is the real‑time processing throughput sufficient, or do we need more resources?
Is data quality reliable, or is there a risk of loss?
Are existing resources under‑utilized or over‑provisioned?
Flink Metric Overview
Metric Types
Counter – a running count that can be incremented or decremented, such as the number of records processed.
Gauge – an instantaneous value, such as current memory or disk usage.
Histogram – the statistical distribution of a set of values (min, max, mean, percentiles).
Meter – the rate at which events occur, i.e. throughput per second.
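The Example Code section below shows how a Counter is registered; as a hedged sketch (the class name, metric names, and measured values here are illustrative, and the histogram implementation assumes flink-runtime's DescriptiveStatisticsHistogram is available on the classpath), the other three types are registered through the same operator metric group:
// Sketch: registering a Gauge, Histogram, and Meter inside a RichMapFunction's open().
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.{Gauge, Histogram, Meter, MeterView}
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram

class InstrumentedMapper extends RichMapFunction[String, String] {
  @transient private var valueHistogram: Histogram = _
  @transient private var eventRate: Meter = _
  private var lastLength: Long = 0L

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext().getMetricGroup()
    // Gauge: returns the current value whenever metrics are reported.
    group.gauge[Long, Gauge[Long]]("lastRecordLength", new Gauge[Long] {
      override def getValue: Long = lastLength
    })
    // Histogram: DescriptiveStatisticsHistogram is one implementation shipped with Flink.
    valueHistogram = group.histogram("recordLength", new DescriptiveStatisticsHistogram(1000))
    // Meter: MeterView derives a per-second rate over a 60-second window.
    eventRate = group.meter("recordsPerSecond", new MeterView(60))
  }

  override def map(value: String): String = {
    lastLength = value.length.toLong   // stand-in for a real business measurement
    valueHistogram.update(lastLength)
    eventRate.markEvent()
    value
  }
}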
Metric Reporters are configured in flink-conf.yaml; they are instantiated when a job starts and push metrics to external systems at a configurable interval.
System Metrics include CPU, memory, I/O, threads, network, JVM garbage collection, etc.
User‑Defined Metrics allow developers to create custom metrics tailored to business needs.
Example Code
// Register a user-defined Counter inside a RichFunction's open() method.
val counter = getRuntimeContext()
  .getMetricGroup()
  .addGroup("MyMetricsKey", "MyMetricsValue")
  .counter("myCounter")
// Increment it wherever the corresponding business event occurs, e.g. counter.inc().
Monitoring Dashboard Construction
1. Identify metrics to monitor:
System metrics (CPU, memory, GC, etc.).
Job count and restart/failure detection.
Operator message rates (numRecordsIn, numRecordsOut) and their trends.
Message latency (max, min, average) between operators.
Memory and JVM GC status of TaskManagers.
2. Custom metrics:
Source side – monitor Kafka consumer group lag to detect consumption slowdown.
Sink side – implement a buffer cache that flushes batches to ClickHouse, HBase, Hive, Kafka, etc., and expose sinkPushCounter (records entering the cache) and sinkFlushCounter (records flushed) to detect data loss.
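A hedged sketch of such a sink follows; the buffering logic and batch size are illustrative placeholders, and the actual flush would write the batch to ClickHouse, HBase, Hive, or Kafka:
// Sketch of a buffered sink exposing sinkPushCounter / sinkFlushCounter.
import scala.collection.mutable.ArrayBuffer
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class BufferedSink(batchSize: Int) extends RichSinkFunction[String] {
  @transient private var sinkPushCounter: Counter = _
  @transient private var sinkFlushCounter: Counter = _
  @transient private var buffer: ArrayBuffer[String] = _

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext().getMetricGroup().addGroup("MyMetricsKey", "MyMetricsValue")
    sinkPushCounter = group.counter("sinkPushCounter")    // records entering the cache
    sinkFlushCounter = group.counter("sinkFlushCounter")  // records actually flushed
    buffer = ArrayBuffer.empty[String]
  }

  // Single-argument invoke kept for brevity; newer Flink versions add a Context parameter.
  override def invoke(value: String): Unit = {
    buffer += value
    sinkPushCounter.inc()
    if (buffer.size >= batchSize) flush()
  }

  override def close(): Unit = flush()

  private def flush(): Unit = {
    if (buffer.nonEmpty) {
      // Write the batch to the downstream store here (ClickHouse, HBase, Hive, Kafka, ...).
      sinkFlushCounter.inc(buffer.size.toLong)
      buffer.clear()
    }
  }
}
Comparing the two counters on a dashboard makes a growing push/flush gap, and therefore potential data loss, immediately visible.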
System Deployment
Configuration
In flink-conf.yaml, configure the metrics reporter so that metrics are exported automatically. Use metrics.reporter.grph.prefix="${JOB_NAME}" to group metrics dynamically per job, and set metrics.latency.interval: 30000 to report operator latency every 30 seconds.
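A minimal sketch of the corresponding flink-conf.yaml entries, assuming the Graphite reporter shipped in flink-metrics-graphite and a graphite_exporter receiving metrics on its Graphite listen port; host, port, and interval values are placeholders, and ${JOB_NAME} is assumed to be substituted by the launcher script before submission:
# Reporter from flink-metrics-graphite; host/port point at the graphite_exporter (placeholders).
metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: graphite-exporter.example.com
metrics.reporter.grph.port: 9109
metrics.reporter.grph.protocol: TCP
metrics.reporter.grph.interval: 30 SECONDS
# Per-job prefix and operator latency reporting, as described above.
metrics.reporter.grph.prefix: ${JOB_NAME}
metrics.latency.interval: 30000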
Prometheus Mapping
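Before the rules themselves, a hedged illustration of how one incoming name is rewritten (the job name, host, and container ID below are made up): a TaskManager heap metric reported as
flink.my-etl-job.host-01.taskmanager.container_e02_0001.Status.JVM.Memory.Heap.Used
matches the first rule and becomes flink_taskmanager_Status_JVM_Memory_Heap_Used with labels job_name="my-etl-job", host="host-01", and container="container_e02_0001". The remaining rules handle operator-level and JobManager metrics, and drop internal Shuffle and buffer metrics that are rarely charted.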
mappings:
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.Status\.(\w+)\.(\w+)\.([\w-]+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_Status_${4}_${5}_${6}_${7}
    labels:
      host: $2
      container: $3
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.Shuffle\.Netty\.(.*)'
    match_type: regex
    action: drop
    name: dropped
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(Buffers|buffers)\.(.*)$'
    match_type: regex
    action: drop
    name: dropped
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\-(\w+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_operator_${7}_${9}
    labels:
      host: $2
      container: $3
      job_name: $1
      operator: $5
      task: $6
      custom_metric: $7
      sink_instance: $8
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_operator_${7}_${8}
    labels:
      host: $2
      container: $3
      job_name: $1
      operator: $5
      task: $6
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.Status\.(.*)'
    match_type: regex
    name: flink_jobmanager_Status_$3
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(\w+)$'
    match_type: regex
    name: flink_jobmanager_${3}
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(\w+)'
    match_type: regex
    name: flink_jobmanager_${4}
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(.*)\.(.*)'
    match_type: regex
    name: flink_jobmanager_${4}_${5}
    labels:
      host: $2
      job_name: $1
  - match: '.'
    match_type: regex
    action: drop
    name: "dropped"
Real‑Time Dashboard Views
Kafka lag monitoring reveals data backlog and consumption delay.
Job count monitoring detects failed or restarted jobs.
Source and sink throughput charts show processing capacity.
Message latency metrics expose stream processing response delays.
JVM memory and GC charts help allocate resources efficiently.
Comparison of buffer‑cache push vs. flush counters highlights potential data loss.
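A few illustrative PromQL sketches for these panels; the metric names below are assumptions based on the mapping rules above, and the exact names on a real cluster depend on the reporter and job configuration:
# Per-job operator throughput (metric name assumed from the mapping rules above)
sum by (job_name) (rate(flink_taskmanager_operator_numRecordsIn[1m]))

# Jobs that restarted in the last 5 minutes (metric name assumed)
increase(flink_jobmanager_numRestarts[5m]) > 0

# Records pushed into the buffer cache but not yet flushed (custom counters from the sink sketch above)
sum by (job_name) (flink_taskmanager_operator_sinkPushCounter)
  - sum by (job_name) (flink_taskmanager_operator_sinkFlushCounter)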
GrowingIO Tech Team
The official technical account of GrowingIO, sharing our engineering innovations, lessons learned, and cutting‑edge technology.
