How to Build a Real‑Time Flink Metrics Dashboard with Prometheus & Grafana
This article explains how to monitor Flink jobs running on YARN by leveraging Flink metrics, configuring reporters, defining custom metrics, and visualizing the data in real time with Prometheus, Grafana, and Graphite‑exporter, complete with deployment diagrams and code examples.
Problem Statement
When running Flink jobs on a YARN cluster, teams often need to answer four key questions:
Are long‑running jobs stable?
Is the real‑time processing throughput sufficient, or do we need more resources?
Is data quality reliable, or is there a risk of loss?
Are existing resources under‑utilized or over‑provisioned?
Flink Metric Overview
Metric Types
Counter – a running count that can be incremented or decremented, such as the number of records processed.
Gauge – an instantaneous value, such as current memory or disk usage.
Histogram – the statistical distribution of a set of values (min, max, mean, percentiles).
Meter – the rate at which events occur, i.e. throughput per second.
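The Example Code section below shows how a Counter is registered; as a hedged sketch (the class name, metric names, and measured values here are illustrative, and the histogram implementation assumes flink-runtime's DescriptiveStatisticsHistogram is available on the classpath), the other three types are registered through the same operator metric group:
// Sketch: registering a Gauge, Histogram, and Meter inside a RichMapFunction's open().
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.{Gauge, Histogram, Meter, MeterView}
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram

class InstrumentedMapper extends RichMapFunction[String, String] {
  @transient private var valueHistogram: Histogram = _
  @transient private var eventRate: Meter = _
  private var lastLength: Long = 0L

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext().getMetricGroup()
    // Gauge: returns the current value whenever metrics are reported.
    group.gauge[Long, Gauge[Long]]("lastRecordLength", new Gauge[Long] {
      override def getValue: Long = lastLength
    })
    // Histogram: DescriptiveStatisticsHistogram is one implementation shipped with Flink.
    valueHistogram = group.histogram("recordLength", new DescriptiveStatisticsHistogram(1000))
    // Meter: MeterView derives a per-second rate over a 60-second window.
    eventRate = group.meter("recordsPerSecond", new MeterView(60))
  }

  override def map(value: String): String = {
    lastLength = value.length.toLong   // stand-in for a real business measurement
    valueHistogram.update(lastLength)
    eventRate.markEvent()
    value
  }
}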
Metric Reporters are configured in flink-conf.yaml; they are instantiated when a job starts and push metrics to external systems at a configurable interval.
System Metrics include CPU, memory, I/O, threads, network, JVM garbage collection, etc.
User‑Defined Metrics allow developers to create custom metrics tailored to business needs.
Example Code
// Register a user-defined Counter inside a RichFunction's open() method.
val counter = getRuntimeContext()
  .getMetricGroup()
  .addGroup("MyMetricsKey", "MyMetricsValue")
  .counter("myCounter")
// Increment it wherever the corresponding business event occurs, e.g. counter.inc().
Monitoring Dashboard Construction
1. Identify metrics to monitor:
System metrics (CPU, memory, GC, etc.).
Job count and restart/failure detection.
Operator message rates (numRecordsIn, numRecordsOut) and their trends.
Message latency (max, min, average) between operators.
Memory and JVM GC status of TaskManagers.
2. Custom metrics:
Source side – monitor Kafka consumer group lag to detect consumption slowdown.
Sink side – implement a buffer cache that flushes batches to ClickHouse, HBase, Hive, Kafka, etc., and expose sinkPushCounter (records entering the cache) and sinkFlushCounter (records flushed) to detect data loss.
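A hedged sketch of such a sink follows; the buffering logic and batch size are illustrative placeholders, and the actual flush would write the batch to ClickHouse, HBase, Hive, or Kafka:
// Sketch of a buffered sink exposing sinkPushCounter / sinkFlushCounter.
import scala.collection.mutable.ArrayBuffer
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class BufferedSink(batchSize: Int) extends RichSinkFunction[String] {
  @transient private var sinkPushCounter: Counter = _
  @transient private var sinkFlushCounter: Counter = _
  @transient private var buffer: ArrayBuffer[String] = _

  override def open(parameters: Configuration): Unit = {
    val group = getRuntimeContext().getMetricGroup().addGroup("MyMetricsKey", "MyMetricsValue")
    sinkPushCounter = group.counter("sinkPushCounter")    // records entering the cache
    sinkFlushCounter = group.counter("sinkFlushCounter")  // records actually flushed
    buffer = ArrayBuffer.empty[String]
  }

  // Single-argument invoke kept for brevity; newer Flink versions add a Context parameter.
  override def invoke(value: String): Unit = {
    buffer += value
    sinkPushCounter.inc()
    if (buffer.size >= batchSize) flush()
  }

  override def close(): Unit = flush()

  private def flush(): Unit = {
    if (buffer.nonEmpty) {
      // Write the batch to the downstream store here (ClickHouse, HBase, Hive, Kafka, ...).
      sinkFlushCounter.inc(buffer.size.toLong)
      buffer.clear()
    }
  }
}
Comparing the two counters on a dashboard makes a growing push/flush gap, and therefore potential data loss, immediately visible.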
System Deployment
Configuration
In flink-conf.yaml, configure the metrics reporter so that metrics are exported automatically. Use metrics.reporter.grph.prefix="${JOB_NAME}" to group metrics dynamically per job, and set metrics.latency.interval: 30000 to report operator latency every 30 seconds.
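A minimal sketch of the corresponding flink-conf.yaml entries, assuming the Graphite reporter shipped in flink-metrics-graphite and a graphite_exporter receiving metrics on its Graphite listen port; host, port, and interval values are placeholders, and ${JOB_NAME} is assumed to be substituted by the launcher script before submission:
# Reporter from flink-metrics-graphite; host/port point at the graphite_exporter (placeholders).
metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: graphite-exporter.example.com
metrics.reporter.grph.port: 9109
metrics.reporter.grph.protocol: TCP
metrics.reporter.grph.interval: 30 SECONDS
# Per-job prefix and operator latency reporting, as described above.
metrics.reporter.grph.prefix: ${JOB_NAME}
metrics.latency.interval: 30000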
Prometheus Mapping
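Before the rules themselves, a hedged illustration of how one incoming name is rewritten (the job name, host, and container ID below are made up): a TaskManager heap metric reported as
flink.my-etl-job.host-01.taskmanager.container_e02_0001.Status.JVM.Memory.Heap.Used
matches the first rule and becomes flink_taskmanager_Status_JVM_Memory_Heap_Used with labels job_name="my-etl-job", host="host-01", and container="container_e02_0001". The remaining rules handle operator-level and JobManager metrics, and drop internal Shuffle and buffer metrics that are rarely charted.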
mappings:
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.Status\.(\w+)\.(\w+)\.([\w-]+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_Status_${4}_${5}_${6}_${7}
    labels:
      host: $2
      container: $3
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.Shuffle\.Netty\.(.*)'
    match_type: regex
    action: drop
    name: dropped
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(Buffers|buffers)\.(.*)$'
    match_type: regex
    action: drop
    name: dropped
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\-(\w+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_operator_${7}_${9}
    labels:
      host: $2
      container: $3
      job_name: $1
      operator: $5
      task: $6
      custom_metric: $7
      sink_instance: $8
  - match: 'flink\.([\w-]+)\.(.*)\.taskmanager\.(\w+)\.([\w-]+)\.(.+)\.(\d+)\.(\w+)\.(\w+)'
    match_type: regex
    name: flink_taskmanager_operator_${7}_${8}
    labels:
      host: $2
      container: $3
      job_name: $1
      operator: $5
      task: $6
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.Status\.(.*)'
    match_type: regex
    name: flink_jobmanager_Status_$3
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(\w+)$'
    match_type: regex
    name: flink_jobmanager_${3}
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(\w+)'
    match_type: regex
    name: flink_jobmanager_${4}
    labels:
      host: $2
      job_name: $1
  - match: 'flink\.([\w-]+)\.(.*)\.jobmanager\.(.*)\.(.*)\.(.*)'
    match_type: regex
    name: flink_jobmanager_${4}_${5}
    labels:
      host: $2
      job_name: $1
  - match: '.'
    match_type: regex
    action: drop
    name: "dropped"
Real‑Time Dashboard Views
Kafka lag monitoring reveals data backlog and consumption delay.
Job count monitoring detects failed or restarted jobs.
Source and sink throughput charts show processing capacity.
Message latency metrics expose stream processing response delays.
JVM memory and GC charts help allocate resources efficiently.
Comparison of buffer‑cache push vs. flush counters highlights potential data loss.
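A few illustrative PromQL sketches for these panels; the metric names below are assumptions based on the mapping rules above, and the exact names on a real cluster depend on the reporter and job configuration:
# Per-job operator throughput (metric name assumed from the mapping rules above)
sum by (job_name) (rate(flink_taskmanager_operator_numRecordsIn[1m]))

# Jobs that restarted in the last 5 minutes (metric name assumed)
increase(flink_jobmanager_numRestarts[5m]) > 0

# Records pushed into the buffer cache but not yet flushed (custom counters from the sink sketch above)
sum by (job_name) (flink_taskmanager_operator_sinkPushCounter)
  - sum by (job_name) (flink_taskmanager_operator_sinkFlushCounter)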
GrowingIO Tech Team
The official technical account of GrowingIO, sharing our engineering innovations, lessons learned, and cutting‑edge technology.
