Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeouts, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data Technology & Architecture

Feb 24, 2023

This article collects common problems encountered when submitting Flink jobs on a YARN cluster and provides analysis and concrete solutions.

1. Submitting the built‑in WordCount.jar – The job fails with a FutureUtils$RetryException and “Service temporarily unavailable” messages. The cause is an output path conflict between runs; the fix is to give each run its own output directory, as the two commands below do with result1 and result2.

bin/flink run -m yarn-cluster -yjm 1024 -ytm 1024 ./examples/batch/WordCount.jar -input hdfs://hadoop01:9000/test/word -output hdfs://hadoop01:9000/test/result1
bin/flink run -m yarn-cluster -yjm 1024 -ytm 1024 ./examples/batch/WordCount.jar -input hdfs://hadoop01:9000/test/word -output hdfs://hadoop01:9000/test/result2

2. Flink batch job that writes to HBase – After the Maven project is packaged and submitted, the job crashes on the ApplicationMaster with a ClusterDeploymentException. The root cause is a Hadoop version conflict introduced by the HBase dependencies, which are built for Hadoop 2.8 while the cluster runs Hadoop 3.1.4. Excluding the transitive Hadoop dependencies in pom.xml, as in the hbase‑server declaration below, resolves the issue.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>2.4.1</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </exclusion>
        <!-- ... (other Hadoop exclusions) ... -->
    </exclusions>
</dependency>

3. MySQL connection timeout in a Flink streaming job – A timer‑driven daily aggregation keeps a MySQL connection idle for more than the server’s wait_timeout, causing a CommunicationsException. The fix is to catch the exception and re‑establish the connection, or adjust MySQL timeout settings.

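try {
    // Normal path: the existing connection is still usable.
    execUpdate(value);
    System.out.println("MySQL connection still alive; update executed");
} catch (CommunicationsException e) {
    // The server closed the idle connection (wait_timeout exceeded):
    // open a new connection and retry the update once.
    connection = ConnectionUtil.getConnection();
    execUpdate(value);
    System.out.println("MySQL connection was dropped; reconnected and update executed");
}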

4. Checkpoint restore failure – A StateMigrationException on restore indicates that the job's state serializers are incompatible with the state stored in the checkpoint. The solution is to restore from the latest valid checkpoint, passing its path with -s, and to confirm that the checkpoint directory contains the correct files.

/usr/lib/flink/flink/bin/flink run -d -m yarn-cluster -yjm 1024 -ytm 1024 -s hdfs://hadoop01:9000/flink/checkpoints/preprocess/60366092809bbc2b5785591f8014f759/chk-815 /usr/lib/flink/jars/test/xxx.jar

5. Increasing parallelism on YARN – Raise yarn.scheduler.capacity.maximum-am-resource-percent in capacity-scheduler.xml (e.g., from 0.1 to 0.2). This setting caps the share of cluster resources that ApplicationMasters may use, so a higher value allows more ApplicationMasters to run concurrently.
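
As a minimal sketch of that change, the property entry in capacity-scheduler.xml could look like the following. The value 0.2 is the example figure from the text, not a recommendation; a suitable value depends on cluster size and workload.

```xml
<!-- capacity-scheduler.xml: share of cluster resources that ApplicationMasters may use.
     0.2 is the illustrative value from the text; tune it for your own cluster. -->
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.2</value>
</property>
```

After the file is edited, the scheduler configuration usually has to be reloaded, for example with yarn rmadmin -refreshQueues, before the new limit takes effect.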

6. Unexpected job termination – Containers were killed because the temporary Hadoop PID directory under /tmp was cleaned by the OS. Changing hadoop.tmp.dir to a persistent location in core-site.xml prevents the loss.
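
A minimal sketch of the corresponding core-site.xml entry follows. The path /data/hadoop/tmp is a hypothetical placeholder; any directory outside /tmp that is writable by the Hadoop user and not subject to periodic cleanup should serve the same purpose.

```xml
<!-- core-site.xml: move Hadoop's base temporary directory out of /tmp.
     /data/hadoop/tmp is a hypothetical path chosen for illustration only. -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
</property>
```

The Hadoop daemons generally need a restart for this to apply. Several default storage paths derive from hadoop.tmp.dir, so on an existing cluster it is worth checking what currently lives under the old location before switching.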


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
