Big Data 39 min read

Common Production Issues and Troubleshooting Guide for Apache Flink

This article compiles a comprehensive list of common production problems encountered with Apache Flink, covering cluster sizing, checkpoint failures, backpressure analysis, resource allocation, deployment errors, UDF definitions, data skew, Kafka configurations, and provides detailed troubleshooting steps and best‑practice recommendations.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Common Production Issues and Troubleshooting Guide for Apache Flink

This article aggregates a wide range of practical problems that arise when running Apache Flink in production environments and offers step‑by‑step troubleshooting methods.

How to Plan Cluster Size

First, evaluate the operational metrics that define the baseline resources, such as records per second and record size, the number of distinct keys and per‑key state size, and the frequency of state updates together with the state backend access pattern. Then consider SLA requirements (latency, downtime, throughput) and budget constraints, including network capacity, disk bandwidth (especially when using RocksDB), and the number of machines, CPU, and memory available.

Flink Checkpoint Problems

Checkpoint consists of several stages (JM trigger, source snapshot, barrier propagation, task snapshot sync/async, and JM reporting). Any failure in these stages aborts the whole checkpoint. Failures are mainly divided into Checkpoint Decline and Checkpoint Expire. Example log for decline:

Decline checkpoint 10423 by task 0b60f08bf8984085b59f8d9bc74ce2e1 of job 85d268e6fbc19411185f7e4868a44178.

For expiration, the log looks like:

Checkpoint 1 of job 85d268e6fbc19411185f7e4868a44178 expired before completing.

Locate the failing task by searching the execution ID in the JobManager logs and then inspect the corresponding TaskManager logs.

Backpressure Diagnosis

Backpressure indicates that a downstream node cannot keep up with upstream data rates. It can be detected via the Flink Web UI backpressure panel or Task Metrics (outPoolUsage/inPoolUsage). High outPoolUsage usually means downstream pressure, while high inPoolUsage suggests upstream pressure. The root cause is often data skew or inefficient user code.

Common Client and Deployment Issues

Typical errors include "Could not build the program from JAR file" (often caused by missing Hadoop JARs), dependency conflicts (ClassNotFoundException, NoSuchMethodError), and YARN resource limits (AM resource limit exceeded, slot allocation timeout). Use mvn dependency:tree to identify conflicts and apply exclusions or the Maven Shade plugin.

PyFlink UDF Definition

UDFs can be defined in several ways:

class HashCodeMean(ScalarFunction):
    def eval(self, i, j):
        return (hash(i) + hash(j)) / 2
lambda i, j: (hash(i) + hash(j)) / 2
def hash_code_mean(i, j):
    return (hash(i) + hash(j)) / 2

Register and use them with st_env.register_function("hash_code", hash_code_mean) and then call in Table API/SQL.

Data Skew and Kafka Message Size

When a small grouping key causes severe data skew, split the workload into separate pipelines. For oversized Kafka messages (>1 MB), increase fetch.message.max.bytes or enable compression, or slice messages upstream and re‑assemble them downstream.

Kafka Producer Ack Configuration

Setting props.put("acks", "0") reduces latency at the cost of reliability, which may be acceptable for click‑stream data where loss is tolerable.

Additional Error Cases and Fixes

Slot removal errors – increase slot memory or reduce shared tasks per slot.

TaskManager heartbeat timeout – check network stability or increase akka.ask.timeout and web.timeout.

Checkpoint timeout – raise checkpointConf.setCheckpointTimeout from the default 10 min if needed.

MySQL CDC lock issues – grant RELOAD privilege or set debezium.snapshot.locking.mode = 'none'.

State migration errors – avoid changing key serializers or upgrade Flink to a version where the bug is fixed.

General Recommendations

Test Flink jobs on a dedicated cluster before production, monitor resource usage closely, configure appropriate restart strategies (e.g., fixed‑delay with high attempt count), and keep JDK versions up‑to‑date (prefer JDK 8 u181 or newer). Properly isolate slots, enable G1 GC with detailed logs, and adjust state.backend.rocksdb.files.open if too many files are opened.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Apache FlinkResource ManagementKafkaUDFCheckpointbackpressureProduction troubleshooting
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.