
How to Build a Real‑Time Flink Metrics Dashboard with Prometheus & Grafana

This article explains how to monitor Flink jobs running on YARN by leveraging Flink metrics, configuring reporters, defining custom metrics, and visualizing the data in real time with graphite_exporter, Prometheus, and Grafana, complete with a deployment diagram and code examples.

GrowingIO Tech Team

Problem Statement

When running Flink jobs on a YARN cluster, teams often need to answer four key questions: Are long‑running jobs stable? Is the real‑time processing throughput sufficient, or do we need more resources? Is data quality reliable, or is there a risk of loss? Are existing resources under‑utilized or over‑provisioned?

Flink Metric Overview

Metric Types

Counter – a cumulative value that can be incremented or decremented, e.g. the number of records processed.

Gauge – instantaneous values such as memory or disk usage.

Histogram – distribution of values.

Meter – rate of events.

Metric Reporters are configured in flink-conf.yaml; they are instantiated when the job starts and then periodically push metrics to external systems.
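A minimal reporter declaration might look like the following sketch. The reporter name grph matches the prefix key used later in this article; the host and port are assumptions about where the graphite_exporter listens, and the class-based wiring applies to older Flink releases.

metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: monitoring-host   # graphite_exporter host (assumption)
metrics.reporter.grph.port: 9109              # graphite_exporter's default Graphite listen port
metrics.reporter.grph.protocol: TCP
metrics.reporter.grph.interval: 60 SECONDS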

System Metrics include CPU, memory, I/O, threads, network, JVM garbage collection, etc.

User‑Defined Metrics allow developers to create custom metrics tailored to business needs.

Example Code

// Register the counter inside a rich function's open() method,
// then call counter.inc() from the processing method.
val counter = getRuntimeContext()
  .getMetricGroup()
  .addGroup("MyMetricsKey", "MyMetricsValue")
  .counter("myCounter")

Monitoring Dashboard Construction

1. Identify metrics to monitor:

System metrics (CPU, memory, GC, etc.).

Job count and restart/failure detection.

Operator message rates (numRecordsIn, numRecordsOut) and their trends.

Message latency (max, min, average) between operators.

Memory and JVM GC status of TaskManagers.

2. Custom metrics:

Source side – monitor Kafka consumer group lag to detect consumption slowdown.

Sink side – implement a buffer cache that flushes batches to ClickHouse, HBase, Hive, Kafka, etc., and expose sinkPushCounter (records entering the cache) and sinkFlushCounter (records flushed) to detect data loss.
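On the sink side, this can look like the following sketch. The class name, metric group, and batching logic are illustrative, and the actual bulk write to ClickHouse/HBase/Hive/Kafka is elided behind a hypothetical flushBatch helper.

import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

import scala.collection.mutable.ArrayBuffer

// Illustrative sketch: a buffering sink exposing sinkPushCounter / sinkFlushCounter.
class BufferedSink[T](batchSize: Int) extends RichSinkFunction[T] {

  @transient private var sinkPushCounter: Counter = _
  @transient private var sinkFlushCounter: Counter = _
  @transient private var buffer: ArrayBuffer[T] = _

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext.getMetricGroup.addGroup("custom_group", "buffered_sink")
    sinkPushCounter = group.counter("sinkPushCounter")   // records entering the cache
    sinkFlushCounter = group.counter("sinkFlushCounter") // records actually flushed
    buffer = ArrayBuffer.empty[T]
  }

  override def invoke(value: T): Unit = {
    buffer += value
    sinkPushCounter.inc()
    if (buffer.size >= batchSize) {
      flushBatch(buffer)                 // hypothetical: write the batch downstream
      sinkFlushCounter.inc(buffer.size)
      buffer.clear()
    }
  }

  private def flushBatch(batch: Iterable[T]): Unit = {
    // placeholder for the real bulk write to ClickHouse / HBase / Hive / Kafka
  }
}

A persistent gap between sinkPushCounter and sinkFlushCounter in the dashboard is the signal for potential data loss.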

System Deployment

Flink monitoring system deployment diagram
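On the Prometheus side, all that is needed is a scrape job pointing at the graphite_exporter. A minimal prometheus.yml sketch follows; the exporter host and its default web port 9108 are assumptions about the deployment.

scrape_configs:
  - job_name: 'flink_metrics'
    scrape_interval: 30s
    static_configs:
      - targets: ['monitoring-host:9108']   # graphite_exporter's /metrics endpoint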

Configuration

In flink-conf.yaml, configure the metrics reporter so that metrics are exported automatically when the job starts. Set metrics.reporter.grph.prefix to the job name (e.g. via a ${JOB_NAME} placeholder) so that each job's metrics are grouped dynamically, and set metrics.latency.interval: 30000 to report operator latency every 30 seconds.
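For example (the ${JOB_NAME} placeholder is assumed to be filled in by the job submission script):

metrics.reporter.grph.prefix: ${JOB_NAME}   # groups metrics per job
metrics.latency.interval: 30000             # report operator latency every 30 s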

Prometheus Mapping
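The rules below are graphite_exporter mapping definitions: each regular expression matches the dotted metric names produced by the Graphite reporter and rewrites them into Prometheus metric names, extracting host, container, job_name and, for operator-level metrics, operator and task labels. High-cardinality Shuffle/Netty and buffer metrics are dropped, and the final catch-all drops anything no rule matched.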

mappings:
- match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.Status\.(\w+)\.(\w+)\.([\w-]+)\.(\w+)'
  match_type: regex
  name: flink_taskmanager_Status_${4}_${5}_${6}_${7}
  labels:
    host: $2
    container: $3
    job_name: $1
- match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.Shuffle\.Netty\.(.*)'
  match_type: regex
  action: drop
  name: dropped
- match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(Buffers|buffers)\.(.*)$'
  match_type: regex
  action: drop
  name: dropped
- match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\-(\w+)\.(\w+)'
  match_type: regex
  name: flink_taskmanager_operator_${7}_${9}
  labels:
    host: $2
    container: $3
    job_name: $1
    operator: $5
    task: $6
    custom_metric: $7
    sink_instance: $8
- match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\.(\w+)'
  match_type: regex
  name: flink_taskmanager_operator_${7}_${8}
  labels:
    host: $2
    container: $3
    job_name: $1
    operator: $5
    task: $6
- match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.Status\.(.*)'
  match_type: regex
  name: flink_jobmanager_Status_$3
  labels:
    host: $2
    job_name: $1
- match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(\w+)$'
  match_type: regex
  name: flink_jobmanager_${3}
  labels:
    host: $2
    job_name: $1
- match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(\w+)'
  match_type: regex
  name: flink_jobmanager_${4}
  labels:
    host: $2
    job_name: $1
- match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(.*)\.(.*)'
  match_type: regex
  name: flink_jobmanager_${4}_${5}
  labels:
    host: $2
    job_name: $1
- match: '.'
  match_type: regex
  action: drop
  name: "dropped"

Real‑Time Dashboard Views

Kafka lag monitoring reveals data backlog and consumption delay.

Kafka lag dashboard

Job count monitoring detects failed or restarted jobs.

Job count dashboard

Source and sink throughput charts show processing capacity.

Throughput dashboard

Message latency metrics expose stream processing response delays.

Latency dashboard

JVM memory and GC charts help allocate resources efficiently.

JVM memory dashboard

Comparison of buffer‑cache push vs. flush counters highlights potential data loss.

Buffer cache dashboard