Practical Guide to Monitoring Flink Performance, Detecting Backpressure, and Configuring Alerts

This article explains how to use Flink's Web UI, Kafka metrics, and YARN monitoring to observe performance, diagnose backpressure, and set alert thresholds, providing code examples and practical tips for reliable stream processing in production environments.

Big Data Technology & Architecture

Apr 14, 2022

In real Flink projects, monitoring performance, observing runtime status, and configuring alert policies are crucial; this article shares practical experience and step‑by‑step guidance.

1. Flink Web UI

The Flink Web UI is not enabled by default in local debug mode. Use the following code to start a local environment with the Web UI:

StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

Make sure to add the required Maven dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime-web_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

When the job runs on YARN, the standard StreamExecutionEnvironment.getExecutionEnvironment() call is sufficient and the Web UI is accessible without extra setup.

A typical job reads data from Kafka, processes it with Flink, and writes the result back to another Kafka topic. The UI shows sink metrics such as Sink__sink.numRecordsInPerSecond for each parallel instance, allowing you to calculate total throughput.

For example, with 3 parallel sink instances each reporting about 560 records/s, the total sink rate is 560 × 3 = 1680 records/s. The source rate is 1737 records/s, so the two are roughly equal.

To check whether this rate is healthy, compare Flink's consumption rate with the Kafka topic's production rate (e.g., 1.66 k records/s). If the topic's production rate stays above Flink's source and sink rates, the job is falling behind and backpressure is likely.
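The comparison described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class ThroughputCheck, its method names, and the 5% tolerance are hypothetical and not part of Flink, and the per-instance rates are assumed to have been read from the Web UI by hand or by some other collector.

```java
// Hypothetical helper, not a Flink API: sums per-instance rates taken from the
// Web UI and compares the total with the Kafka topic's production rate.
final class ThroughputCheck {

    /** Total throughput is the sum of the per-instance records/s values. */
    static double totalRate(double[] perInstanceRates) {
        double sum = 0.0;
        for (double rate : perInstanceRates) {
            sum += rate;
        }
        return sum;
    }

    /**
     * Returns true when the consumption rate is at least
     * productionRate * (1 - tolerance). The tolerance is an assumed value
     * that absorbs normal fluctuation between two samples.
     */
    static boolean keepsUp(double consumptionRate, double productionRate, double tolerance) {
        return consumptionRate >= productionRate * (1.0 - tolerance);
    }

    public static void main(String[] args) {
        double sinkTotal = totalRate(new double[] {560, 560, 560}); // 560 x 3 from the example
        double kafkaProduction = 1660;                               // 1.66 k/s from the example
        System.out.println("sink total = " + sinkTotal + " records/s");
        System.out.println("keeps up   = " + keepsUp(sinkTotal, kafkaProduction, 0.05));
    }
}
```

A single sample below the production rate is not proof of backpressure; the check is more meaningful when it fails over several consecutive samples.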

2. Kafka Consumption Monitoring

Flink commits offsets to Kafka only during checkpoints, so Kafka lag appears as a saw‑tooth pattern. You can obtain the total lag with a shell command:

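# $2 = consumer group, $3 = pattern (e.g. the topic name) used to select rows; column 5 is LAG in this Kafka version's output
lag=$(kafka/kafka_2.11-2.0.1/bin/kafka-consumer-groups.sh --bootstrap-server *.*.*.*:6667 --describe --group "$2" | grep "$3" | grep -v LAG | awk '{sum+=$5} END {print sum}')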
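The saw-tooth shape of the lag can be reproduced with a small simulation. The sketch below is an assumption-laden toy model, not Flink code: the class SawtoothLag is hypothetical, checkpoints are treated as instantaneous, and the job is assumed to have consumed everything produced by the time each checkpoint completes.

```java
// Toy model (hypothetical, not a Flink API): the lag that Kafka reports when
// offsets are committed only at every checkpoint.
final class SawtoothLag {

    /**
     * Returns the lag observed at the end of each second.
     * Assumes the job fully keeps up, so each commit brings the lag back to zero.
     */
    static long[] simulate(long producedPerSec, int checkpointIntervalSec, int totalSec) {
        long[] lag = new long[totalSec];
        long produced = 0;   // log end offset of the topic
        long committed = 0;  // offset last committed by the consumer group
        for (int s = 1; s <= totalSec; s++) {
            produced += producedPerSec;            // the topic keeps growing
            if (s % checkpointIntervalSec == 0) {
                committed = produced;              // checkpoint completes, offsets are committed
            }
            lag[s - 1] = produced - committed;     // what kafka-consumer-groups.sh would report
        }
        return lag;
    }

    public static void main(String[] args) {
        // 100 records/s produced, checkpoint every 5 s, observed for 12 s
        System.out.println(java.util.Arrays.toString(simulate(100, 5, 12)));
    }
}
```

In this model the lag climbs linearly between checkpoints and drops at each commit, which is why a single lag reading says little unless it is related to the checkpoint interval.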

Define a virtual consumption speed F0 = lag / t, where t is the checkpoint interval. For example, with a peak lag of 30 000 and t = 3 min (180 s), F0 = 30 000 / 180 ≈ 167 records/s.

Compare F0 with the actual consumption speed F1 (obtained from the Web UI). Set a warning multiplier m (e.g., 2). When F0 exceeds F1 × m, trigger an alarm.
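The alert rule can be written down as a short Java sketch. The class LagAlert and its method names are hypothetical, and the F1 value used in main is an assumed reading from the Web UI, not a measured result.

```java
// Hypothetical helper, not a Flink or Kafka API: implements F0 = lag / t and
// the rule "alert when F0 > F1 * m".
final class LagAlert {

    /** Virtual consumption speed F0 in records/s for a lag observed over tSeconds. */
    static double virtualSpeed(long lag, double tSeconds) {
        if (tSeconds <= 0) {
            throw new IllegalArgumentException("tSeconds must be positive");
        }
        return lag / tSeconds;
    }

    /** True when F0 exceeds the actual speed F1 multiplied by the warning multiplier m. */
    static boolean shouldAlert(double f0, double f1, double m) {
        return f0 > f1 * m;
    }

    public static void main(String[] args) {
        double f0 = virtualSpeed(30_000, 180); // peak lag and t = 3 min from the example
        double f1 = 1680;                      // assumed actual speed read from the Web UI
        double m = 2;                          // warning multiplier from the example
        System.out.printf("F0 = %.0f records/s, alert = %b%n", f0, shouldAlert(f0, f1, m));
    }
}
```

A larger m makes the rule more tolerant of short bursts, at the cost of reacting later when the job really falls behind.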

3. YARN Monitoring

In per‑job mode each Flink job has a unique name on YARN. You can check the job count with:

num=$(yarn application -list | grep "FlinkJobName" | wc -l)

If num is less than 1, the job is no longer listed on YARN and has most likely stopped. A missing YARN application does not always mean the job is dead, however; the YARN cluster itself may be down, in which case the check fails for reasons unrelated to Flink. Combine the YARN check with Kafka lag monitoring to tell a YARN failure apart from a job that has really stopped.
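One way to combine the two signals is a small decision table, sketched below in Java. The class JobHealthCheck, the verdict names, and the exact mapping are assumptions made for illustration; the inputs are the application count from the YARN command and the result of the lag alarm.

```java
// Hypothetical decision table, not part of Flink or YARN: combines the YARN
// application count with the Kafka lag alarm.
final class JobHealthCheck {

    enum Verdict { HEALTHY, FALLING_BEHIND, JOB_STOPPED, CHECK_YARN }

    static Verdict assess(int yarnAppCount, boolean lagAlarm) {
        if (yarnAppCount >= 1) {
            // The job is listed on YARN; the lag decides whether it keeps up.
            return lagAlarm ? Verdict.FALLING_BEHIND : Verdict.HEALTHY;
        }
        // The job is not listed. Growing lag confirms it stopped consuming;
        // normal lag suggests the YARN check itself is unreliable (assumption).
        return lagAlarm ? Verdict.JOB_STOPPED : Verdict.CHECK_YARN;
    }

    public static void main(String[] args) {
        System.out.println(assess(1, false));
        System.out.println(assess(0, true));
        System.out.println(assess(0, false));
    }
}
```

Because Kafka lag is only updated at checkpoints, the lag alarm needs a checkpoint interval or two before it can confirm a stopped job.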

Summary

Assess Flink job health by jointly examining YARN status, Kafka lag, and the Flink Web UI.

Choose appropriate values for the warning multiplier (m) and the checkpoint interval (t) to minimize false alarms while still meeting the SLA.

Ensure the Flink program itself is well optimized; aim for a peak throughput of at least twice the normal daily load, because monitoring alone cannot guarantee reliability if the job is poorly tuned.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
