Common Production Issues and Troubleshooting Guide for Apache Flink
This article compiles classic production problems encountered with Apache Flink, covering cluster sizing, checkpoint failures, backpressure diagnosis, client submission errors, resource allocation on YARN, and PyFlink UDF definitions, providing step‑by‑step troubleshooting methods and practical recommendations.
I previously wrote about various problems encountered in Flink production environments, and after seeing a collection of issues from Alibaba engineers, I have organized the top classic problems for reference.
How to Plan Cluster Size in Production
The first step is to consider operational metrics to establish a resource baseline, focusing on records per second and record size, the number of distinct keys and per‑key state size, and the frequency and access pattern of state updates.
Records per second and record size
Number of keys and state size per key
State update frequency and backend access pattern
Additionally, service‑level agreements (SLA) regarding downtime, latency, and maximum throughput directly affect capacity planning. Then, based on budget, evaluate available resources such as network capacity (including external services like Kafka, HDFS), disk bandwidth (especially when using disk‑based state backends like RocksDB), and the number of machines, CPU, and memory.
Flink Checkpoint Troubleshooting
A Flink checkpoint consists of several steps: JM trigger checkpoint Source receives the trigger and starts a snapshot, sending barriers downstream
Downstream receives barriers (all must arrive before checkpoint starts)
Task performs synchronous snapshot phase
Task performs asynchronous snapshot phase
Task completes snapshot and reports to JM
If any step fails, the checkpoint fails. Failures are categorized as Checkpoint Decline, Checkpoint Expire, or Checkpoint Slow.
Checkpoint Decline can be identified in logs, e.g.:
Decline checkpoint 10423 by task 0b60f08bf8984085b59f8d9bc74ce2e1 of job 85d268e6fbc19411185f7e4868a44178.Locate the execution ID in jobmanager.log, find the corresponding container, and inspect taskmanager.log for the root cause.
Checkpoint Expire occurs when a checkpoint exceeds its timeout. Example log:
Checkpoint 1 of job 85d268e6fbc19411185f7e4868a44178 expired before completing.Followed by possible late‑message logs to pinpoint the failing task.
Checkpoint Slow is diagnosed by checking stages such as source trigger speed, incremental checkpoints, backpressure, barrier alignment, main‑thread load, and synchronous/asynchronous snapshot phases.
Backpressure Troubleshooting
Backpressure indicates that a node cannot keep up with upstream data rate, causing upstream throttling. It can be detected via the Flink Web UI backpressure panel or Task Metrics.
The UI measures the frequency of tasks blocked on buffer requests; values < 0.1 are OK, 0.1‑0.5 are LOW, >0.5 are HIGH.
If backpressure is present, either the node’s sending rate is lower than its data generation rate (often in operators with one input and multiple outputs) or downstream nodes are slow, limiting the upstream node.
Root‑cause nodes may not always show high backpressure because the panel monitors the sender side. Therefore, identify the first subtask with high backpressure and examine whether it is the source of the issue or its downstream.
Key Task Metrics include outPoolUsage and inPoolUsage. High outPoolUsage suggests downstream pressure; high inPoolUsage suggests the subtask is propagating backpressure upstream. Mismatched values can indicate intermediate states or the actual root cause.
Common causes are data skew, inefficient user code (e.g., blocking operations, heavy regex), and TaskManager memory/GC problems. Profiling CPU usage and enabling G1 GC with detailed logs can help.
Common Client Issues
Application submission error: "Could not build the program from JAR file" often stems from missing Hadoop JARs on the classpath, leading to ClassNotFoundException for Yarn classes.
Jar version conflicts: Errors like NoSuchMethodError or IncompatibleClassChangeError require inspecting the dependency tree ( mvn dependency:tree) and using exclusions or the Maven Shade plugin to resolve conflicts.
Flink application resource allocation: If an application does not reach RUNNING, follow these steps:
Check application state (NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) and related YARN components.
Verify ApplicationMaster health via YARN UI or REST API, looking for messages such as queue or user AM resource limits, AM container registration delays, or activation waiting for resources.
Confirm whether YARN has pending resource requests.
Investigate scheduler issues: cluster or queue resource exhaustion, resource fragmentation, high‑priority applications monopolizing resources, or container launch failures.
TaskManager startup exception:
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired.This occurs when the token expires before the container is launched; Flink has been optimized (FLINK‑13184) to add validity checks and asynchronous launches.
Defining UDFs in PyFlink
In Flink 1.10, UDFs can be defined using any Python construct:
Extend ScalarFunction (PyFlink‑specific)
Lambda function
Named function
Callable class
class HashCodeMean(ScalarFunction):
def eval(self, i, j):
return (hash(i) + hash(j)) / 2 lambda i, j: (hash(i) + hash(j)) / 2 def hash_code_mean(i, j):
return (hash(i) + hash(j)) / 2 class CallableHashCodeMean(object):
def __call__(self, i, j):
return (hash(i) + hash(j)) / 2UDFs are registered with st_env.register_function("hash_code", hash_code_mean) and can be used in Table API/SQL, e.g.:
my_table.select("hash_code_mean(a, b)").insert_into("Results")Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
