Common Production Issues and Troubleshooting Guide for Apache Flink

This article compiles classic production problems encountered with Apache Flink, covering cluster sizing, checkpoint failures, backpressure diagnosis, client submission errors, resource allocation on YARN, and PyFlink UDF definitions, providing step‑by‑step troubleshooting methods and practical recommendations.

Big Data Technology & Architecture

Jul 12, 2021

I previously wrote about various problems encountered in Flink production environments, and after seeing a collection of issues from Alibaba engineers, I have organized the top classic problems for reference.

How to Plan Cluster Size in Production

The first step is to establish a resource baseline from operational metrics:

Records per second and record size

Number of distinct keys and state size per key

State update frequency and state backend access pattern

Additionally, service‑level agreements (SLA) regarding downtime, latency, and maximum throughput directly affect capacity planning. Then, based on budget, evaluate available resources such as network capacity (including external services like Kafka, HDFS), disk bandwidth (especially when using disk‑based state backends like RocksDB), and the number of machines, CPU, and memory.
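The baseline metrics can be turned into rough numbers before any hardware is ordered. The following is a minimal sketch, not a sizing formula from the original article: the function name and every input value (record rate, record size, key count, state per key, machine count) are hypothetical placeholders.

```python
def estimate_baseline(records_per_sec, record_bytes, num_keys,
                      state_bytes_per_key, num_machines):
    """Rough ingest bandwidth and keyed-state size (all inputs are assumptions)."""
    ingest_bytes_per_sec = records_per_sec * record_bytes
    total_state_bytes = num_keys * state_bytes_per_key
    return {
        "ingest_mb_per_sec_total": ingest_bytes_per_sec / 1e6,
        "ingest_mb_per_sec_per_machine": ingest_bytes_per_sec / num_machines / 1e6,
        "state_gb_total": total_state_bytes / 1e9,
        "state_gb_per_machine": total_state_bytes / num_machines / 1e9,
    }

# Hypothetical workload: 1,000,000 records/s of 2,000 bytes each,
# 500 million keys with 40 bytes of state each, spread over 5 machines.
baseline = estimate_baseline(1_000_000, 2_000, 500_000_000, 40, 5)
for name, value in baseline.items():
    print(f"{name}: {value}")
```

These figures cover only steady-state ingest and state size. Shuffle traffic, checkpoint uploads, and the headroom needed to catch up after a failure come on top, so treat the result as a lower bound.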

Flink Checkpoint Troubleshooting

A Flink checkpoint consists of the following steps:

The JobManager (JM) triggers the checkpoint

Each source receives the trigger, starts its snapshot, and sends barriers downstream

Each downstream task receives barriers (barriers from all of its inputs must arrive before its snapshot starts)

The task performs the synchronous snapshot phase

The task performs the asynchronous snapshot phase

The task completes its snapshot and reports to the JM

If any step fails, the checkpoint fails. Failures are categorized as Checkpoint Decline, Checkpoint Expire, or Checkpoint Slow.

Checkpoint Decline can be identified in logs, e.g.:

Decline checkpoint 10423 by task 0b60f08bf8984085b59f8d9bc74ce2e1 of job 85d268e6fbc19411185f7e4868a44178.

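In this message, 10423 is the checkpoint ID, 0b60f08bf8984085b59f8d9bc74ce2e1 is the execution ID of the declining task, and 85d268e6fbc19411185f7e4868a44178 is the job ID. Search jobmanager.log for the execution ID to find the container it was deployed to, then inspect that container's taskmanager.log for the root cause.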

Checkpoint Expire occurs when a checkpoint exceeds its timeout. Example log:

Checkpoint 1 of job 85d268e6fbc19411185f7e4868a44178 expired before completing.

The JobManager may then log late acknowledgement messages for the expired checkpoint. These help pinpoint the task that was slow or failing.
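The decline and expire messages can be pulled out of jobmanager.log with a short script. The sketch below assumes the log lines have exactly the wording of the two examples in this section. The regular expressions and the function name are illustrative, and the wording may differ between Flink versions.

```python
import re

# Patterns modeled on the two example log lines in this article.
DECLINE_RE = re.compile(
    r"Decline checkpoint (?P<checkpoint_id>\d+) by task (?P<execution_id>[0-9a-f]+) "
    r"of job (?P<job_id>[0-9a-f]+)"
)
EXPIRE_RE = re.compile(
    r"Checkpoint (?P<checkpoint_id>\d+) of job (?P<job_id>[0-9a-f]+) expired before completing"
)

def classify_checkpoint_line(line):
    """Return a dict describing a decline/expire line, or None if neither matches."""
    match = DECLINE_RE.search(line)
    if match:
        return {"kind": "decline", **match.groupdict()}
    match = EXPIRE_RE.search(line)
    if match:
        return {"kind": "expire", **match.groupdict()}
    return None

sample_lines = [
    "Decline checkpoint 10423 by task 0b60f08bf8984085b59f8d9bc74ce2e1 "
    "of job 85d268e6fbc19411185f7e4868a44178.",
    "Checkpoint 1 of job 85d268e6fbc19411185f7e4868a44178 expired before completing.",
    "an unrelated log line",
]
results = [classify_checkpoint_line(line) for line in sample_lines]
for result in results:
    print(result)
```

For a decline, the extracted execution_id is the value to search for when locating the container and its taskmanager.log.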

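Checkpoint Slow is diagnosed by checking each of the following in turn: whether the sources are triggered promptly, whether incremental checkpoints are enabled, whether the job is under backpressure, how long barrier alignment takes, whether the task's main thread is overloaded, and how long the synchronous and asynchronous snapshot phases take.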

Backpressure Troubleshooting

Backpressure indicates that a node cannot keep up with upstream data rate, causing upstream throttling. It can be detected via the Flink Web UI backpressure panel or Task Metrics.

The UI reports, for each subtask, the ratio of samples in which the task was blocked on a buffer request: a ratio up to 0.1 is OK, above 0.1 and up to 0.5 is LOW, and above 0.5 is HIGH.
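The thresholds can be written as a small helper for scripts that post-process backpressure ratios. This is a sketch only: the function name is hypothetical, and treating exactly 0.1 as OK and exactly 0.5 as LOW is an assumption about the boundary cases.

```python
def backpressure_status(ratio):
    """Map a backpressure ratio in [0, 1] to the OK / LOW / HIGH label."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio <= 0.1:   # assumption: the boundary value counts as OK
        return "OK"
    if ratio <= 0.5:   # assumption: the boundary value counts as LOW
        return "LOW"
    return "HIGH"

# Hypothetical ratios for three subtasks of one operator.
sampled_ratios = {0: 0.02, 1: 0.35, 2: 0.9}
statuses = {subtask: backpressure_status(r) for subtask, r in sampled_ratios.items()}
print(statuses)
```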

If backpressure is present, either the node’s sending rate is lower than its data generation rate (often in operators with one input and multiple outputs) or downstream nodes are slow, limiting the upstream node.

The root-cause node does not always show high backpressure itself, because the panel monitors the sending side. Therefore, find the first subtask in the pipeline that shows high backpressure, then check whether the problem lies in that subtask or in the subtask directly downstream of it.

Key Task Metrics include outPoolUsage and inPoolUsage. High outPoolUsage suggests downstream pressure; high inPoolUsage suggests the subtask is propagating backpressure upstream. Mismatched values can indicate intermediate states or the actual root cause.
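The four combinations of the two metrics can be read off with a helper like the one below. It is a rough sketch under stated assumptions: the 0.8 cut-off for "high" is a hypothetical value, the function name is illustrative, and the returned descriptions are hints to investigate rather than a definitive diagnosis.

```python
HIGH_USAGE = 0.8  # hypothetical cut-off for "high"; tune for your job

def interpret_pool_usage(in_pool_usage, out_pool_usage, threshold=HIGH_USAGE):
    """Give a rough reading of one subtask's inPoolUsage / outPoolUsage pair."""
    in_high = in_pool_usage >= threshold
    out_high = out_pool_usage >= threshold
    if in_high and out_high:
        return "pressured by downstream and propagating backpressure upstream"
    if out_high:
        return "pressured by downstream, not yet propagated upstream (intermediate state)"
    if in_high:
        return "input buffers full while output is not: candidate root cause"
    return "no buffer pressure"

# Hypothetical (inPoolUsage, outPoolUsage) readings for three chained subtasks.
readings = {"source": (0.0, 0.95), "map": (0.97, 0.92), "sink": (0.99, 0.05)}
for name, (in_usage, out_usage) in readings.items():
    print(name, "->", interpret_pool_usage(in_usage, out_usage))
```

In this made-up pipeline the sink has full input buffers but idle output buffers, so it is the first place to look.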

Common causes are data skew, inefficient user code (e.g., blocking operations, heavy regex), and TaskManager memory/GC problems. Profiling CPU usage and enabling G1 GC with detailed logs can help.
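Data skew is usually visible as one subtask receiving far more records than its siblings. The sketch below computes a simple skew ratio from per-subtask record counts, for example the numRecordsIn values of each subtask. The function name and the sample counts are hypothetical, and what ratio counts as "skewed" is left to the reader.

```python
from statistics import mean

def skew_ratio(records_per_subtask):
    """Ratio of the busiest subtask's record count to the average across subtasks."""
    average = mean(records_per_subtask)
    if average == 0:
        return 0.0
    return max(records_per_subtask) / average

# Hypothetical record counts for four parallel subtasks of one keyed operator.
counts = [1_000, 1_100, 950, 9_000]
ratio = skew_ratio(counts)
print(f"skew ratio: {ratio:.2f}")
```

A ratio close to 1 means the load is evenly spread. A much larger value means one subtask is doing most of the work, which often points to hot keys.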

Common Client Issues

Application submission error: "Could not build the program from JAR file" often stems from missing Hadoop JARs on the classpath, leading to a ClassNotFoundException for YARN classes.

Jar version conflicts: Errors like NoSuchMethodError or IncompatibleClassChangeError require inspecting the dependency tree (mvn dependency:tree) and using exclusions or the Maven Shade plugin to resolve conflicts.

Flink application resource allocation: If an application does not reach RUNNING, follow these steps:

Check application state (NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) and related YARN components.

Verify ApplicationMaster health via YARN UI or REST API, looking for messages such as queue or user AM resource limits, AM container registration delays, or activation waiting for resources.

Confirm whether YARN has pending resource requests.

Investigate scheduler issues: cluster or queue resource exhaustion, resource fragmentation, high‑priority applications monopolizing resources, or container launch failures.
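The first two steps can be scripted against the ResourceManager REST API, which exposes an application's state and diagnostics at /ws/v1/cluster/apps/{appid}. The following is a minimal sketch: the ResourceManager address, the application ID, and the sample response body are hypothetical, and only the "state" and "diagnostics" fields are read.

```python
import json
from urllib.request import Request, urlopen

def fetch_app(rm_address, app_id):
    """Fetch one application's report from the ResourceManager REST API."""
    url = f"http://{rm_address}/ws/v1/cluster/apps/{app_id}"
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)

def summarize_app(report):
    """Extract the fields most useful when an application has not reached RUNNING."""
    app = report["app"]
    state = app.get("state")
    return {
        "state": state,
        "diagnostics": app.get("diagnostics", ""),
        "stuck_before_running": state in {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED"},
    }

# Hypothetical response body; a real call would be
# summarize_app(fetch_app("rm-host:8088", "<application id>")).
sample = json.loads(
    '{"app": {"state": "ACCEPTED", "diagnostics": "placeholder diagnostics text"}}'
)
summary = summarize_app(sample)
print(summary)
```

When stuck_before_running is true, the diagnostics text is where messages about queue or user AM resource limits typically appear.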

TaskManager startup exception:

org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired.

This occurs when the token expires before the container is launched; Flink has been optimized (FLINK‑13184) to add validity checks and asynchronous launches.

Defining UDFs in PyFlink

In Flink 1.10, a PyFlink scalar UDF can be defined in any of the following ways:

Extend ScalarFunction (PyFlink‑specific)

Lambda function

Named function

Callable class

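from pyflink.table.udf import ScalarFunction

# Option 1: extend ScalarFunction (PyFlink-specific base class)
class HashCodeMean(ScalarFunction):
    def eval(self, i, j):
        return (hash(i) + hash(j)) / 2

# Option 2: lambda function (bound to a name so it can be registered later)
hash_code_mean_lambda = lambda i, j: (hash(i) + hash(j)) / 2

# Option 3: named function
def hash_code_mean(i, j):
    return (hash(i) + hash(j)) / 2

# Option 4: callable class
class CallableHashCodeMean(object):
    def __call__(self, i, j):
        return (hash(i) + hash(j)) / 2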

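A UDF must be registered under the same name that queries use to call it, for example st_env.register_function("hash_code_mean", hash_code_mean). In PyFlink 1.10 the Python callable is first wrapped with udf() from pyflink.table.udf, which declares its input and result types. Once registered, the function can be used in Table API/SQL, e.g.:

my_table.select("hash_code_mean(a, b)").insert_into("Results")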

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
