How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls

This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.

WeiLi Technology Team

Background

WeiLi's big data Flink cluster was built on Huawei Cloud in 2020 on version 1.10.0. It handles real‑time log processing, metrics, ETL, recommendation, and ad bidding. The cluster runs more than 100 Flink jobs on 1,800+ slots and processes over one billion records, roughly 200 GB of data, every day.

Motivation for Upgrade

Issues encountered on the 1.10 cluster:

Incomplete Flink SQL DDL support

Checkpoint alignment failures under back‑pressure

Insufficient connector support

The desire to replace the Canal + Kafka pipeline with Flink CDC

Correctness and performance problems in dimension‑table joins

Lack of Hive integration

The need to stay on a newer release (the latest at the time of writing was 1.16)

Upgrade Plan

Feasibility assessment

Test environment validation

Deploy Flink 1.14.6 and verify capabilities

Integrate Flink CDC ecosystem

Conduct stress tests

Confirm rollback and job‑monitoring scheme

Review jobs, impact, and priority

Deploy Flink 1.14.6 to production

Upgrade job code

Gradual migration (gray release)

Continuous observation and issue handling

Pitfalls & Solutions

3.1 Flink yarn‑session creation failure

./yarn-session.sh -nm test -d

Error: A JNI error has occurred, NoClassDefFoundError for org/apache/hadoop/yarn/exceptions/YarnException.

Fix: add export HADOOP_CLASSPATH=`hadoop classpath` to /etc/profile and run source /etc/profile to apply.
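The fix above can be wrapped in a small guard so that a missing hadoop launcher is reported instead of silently exporting an empty classpath. This is a minimal sketch; the function name set_hadoop_classpath is hypothetical and not part of Flink or Hadoop.

```shell
# Hypothetical helper: export HADOOP_CLASSPATH only when the hadoop launcher is on PATH.
set_hadoop_classpath() {
  if command -v hadoop >/dev/null 2>&1; then
    # Same command as in the fix: ask Hadoop for its own classpath
    HADOOP_CLASSPATH="$(hadoop classpath)"
    export HADOOP_CLASSPATH
  else
    echo "hadoop not found on PATH; HADOOP_CLASSPATH left unchanged" >&2
    return 1
  fi
}
```

Calling set_hadoop_classpath from /etc/profile, or right before ./yarn-session.sh, has the same effect as the plain export when Hadoop is installed.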

3.2 Local mode start error

[root@node1 bin]# sh start-cluster.sh
/data/flink/flink-1.15.2/bin/config.sh: line 32: unexpected '<' syntax error

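Cause: the startup scripts use Bash‑only syntax, which fails when they are interpreted by a plain POSIX sh. Run ./start-cluster.sh directly (or bash start-cluster.sh) instead of sh start-cluster.sh, so that Bash executes the script.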
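The difference can be reproduced with process substitution, one Bash‑only construct. That config.sh fails on exactly this construct is an assumption; the '<' in the error message is consistent with it. The snippet variable below is purely illustrative.

```shell
# "< <(...)" is process substitution: valid in Bash, a syntax error in a strict POSIX sh (e.g. dash).
snippet='while read -r line; do echo "got: $line"; done < <(printf "a\nb\n")'

# Interpreted by Bash, the loop reads the two lines produced by printf.
bash -c "$snippet"

# Interpreted by a strict POSIX shell, the same text is rejected at parse time:
# sh -c "$snippet"
```

On systems where sh is itself Bash in POSIX mode, the exact error text may differ, but the remedy is the same: let Bash run the script.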

3.3 Hive DDL failure

Cause: the Hive connector jars are missing. Place the appropriate hive-common.jar, hive-exec.jar, etc., into $FLINK_HOME/lib and restart the SQL client.

Also adjust the flink-sql-connector-hive pom.xml to match the CDH‑Hive version, rebuild, and replace the jar.

Example DDL and error screenshot are shown in the original article.

3.4 NoClassDefFoundError / NoSuchMethodError

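Typical reasons: a missing jar or a version conflict between jars. Use the following loop to find out which jars contain the class named in the error. jar -tf lists class files only, so for a NoSuchMethodError search for the class that owns the method.

# Run in the directory that holds the jars, e.g. $FLINK_HOME/lib.
# Replace MissingClass with the class from the error, e.g. org/apache/hadoop/yarn/exceptions/YarnException.
for jarfile in *.jar; do
  # Collect the entries of this jar that match the class name
  matches=$(jar -tf "$jarfile" | grep "MissingClass")
  # Print the jar name and its matching entries only when there is a match
  [ -n "$matches" ] && printf '%s\n%s\n' "$jarfile" "$matches"
done

If no jar contains the class, add the missing dependency. If several jars contain it, they are likely conflicting versions; keep the one that matches the Flink and Hadoop versions in use.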

3.5 Zookeeper node quota exhausted

Check quota with:

# View quota configuration
bin/zkCli.sh -server xxx:2181 listquota /flink
# View current usage (quota reached 1000)
get /zookeeper/quota/flink/zookeeper_stats

Delete excess nodes, e.g.:

deleteall /flink/application_1642060369182_4949589

After cleanup, creating a new yarn‑session succeeds.
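Before deleting, it helps to know which znodes belong to applications that are no longer running. The sketch below is one way to compute that list and makes several assumptions: the function name stale_znodes is hypothetical, both inputs are plain text files with one name per line, and the running application IDs have been collected separately (for example from the output of yarn application -list).

```shell
# Hypothetical helper: print application znodes that have no matching running application.
# $1: file with the child znode names of /flink, one per line (assumed format)
# $2: file with the IDs of running YARN applications, one per line (assumed format)
stale_znodes() {
  # Keep only application_* znodes, then drop every name that appears verbatim in $2
  grep '^application_' "$1" | grep -vxF -f "$2"
}
```

Each printed name is a candidate for deleteall /flink/<name>; the list should be reviewed by hand first, because deleting the znode of a running job removes its high‑availability metadata.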

3.6 Direct buffer memory OOM

Increase the off‑heap memory in flink-conf.yaml (the defaults are 128m for framework off‑heap and 0 for task off‑heap):

taskmanager.memory.framework.off-heap.size: 256m
taskmanager.memory.task.off-heap.size: 128m

3.7 Netty RemoteTransportException

TaskManager containers were killed for exceeding physical memory limits. Raise the process memory size or adjust managed memory:

# Default 1728m
taskmanager.memory.process.size: 4096m

3.8 Deprecated / removed methods

Examples of API changes in newer Flink versions:

rocksDBStateBackend.enableTtlCompactionFilter() has been removed; the TTL compaction filter is enabled by default.

split and select on DataStream have been removed; use side outputs instead.

checkpoint.enableExternalizedCheckpoints() is deprecated; replace it with checkpointConf.setExternalizedCheckpointCleanup().
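A quick text search can show which jobs still call the methods named in this section before the code is compiled against the new version. This is a rough sketch: the function name scan_old_flink_apis is hypothetical, the source layout is assumed, and the split pattern also matches unrelated calls such as String.split, so the hits need manual review.

```shell
# Hypothetical helper: list source lines that still use the APIs named in this section.
# $1: root directory of the job's source tree (assumed to contain .java or .scala files)
scan_old_flink_apis() {
  # -r: recurse, -n: show line numbers, -E: extended regex with alternation
  grep -rnE --include='*.java' --include='*.scala' \
    'enableTtlCompactionFilter|enableExternalizedCheckpoints|\.split\(' "$1"
}
```

Running scan_old_flink_apis src/ prints one file:line:text entry per match, which gives a rough estimate of the migration effort per job.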

Precautions

Watch for deprecated APIs and configuration parameters during version migration.

Verify new configuration options; refer to the Flink configuration guide.

Prepare a rollback plan using savepoints before migrating jobs.

Finalize monitoring and runbook procedures prior to upgrade.

Perform the upgrade during low‑traffic windows and inform stakeholders.
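The savepoint‑based rollback mentioned above can be sketched with the Flink CLI. The helpers below only print the commands so they can be reviewed before anything is executed. The function names and the install paths in FLINK_OLD and FLINK_NEW are assumptions, and whether a savepoint taken on 1.10 restores cleanly on 1.14.6 depends on the job (operator UIDs, state schema) and should be verified in the test environment.

```shell
# Assumed install locations; override them to match the real deployment.
FLINK_OLD=${FLINK_OLD:-/opt/flink-1.10.0}
FLINK_NEW=${FLINK_NEW:-/opt/flink-1.14.6}

# Print the command that stops a job on the old cluster and writes a savepoint.
# $1: job ID, $2: savepoint directory
print_stop_with_savepoint() {
  echo "$FLINK_OLD/bin/flink stop --savepointPath $2 $1"
}

# Print the command that starts a job from a savepoint in detached mode.
# $1: Flink home to use, $2: savepoint path, $3: job jar
print_run_from_savepoint() {
  echo "$1/bin/flink run -d -s $2 $3"
}
```

Migration would use print_run_from_savepoint "$FLINK_NEW" with the upgraded jar. Rolling back uses the same savepoint with "$FLINK_OLD" and the previous jar, which is why the savepoint and the old jar should be kept until the new job has proven stable.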

Benefits After Upgrade

Flink CDC simplifies the data integration architecture and improves real‑time synchronization latency.

Unaligned checkpoints mitigate checkpoint failures under back‑pressure.

Richer, more concise Flink SQL boosts development efficiency.

Enhanced Flink‑on‑Hive support (≈96% DML/DDL compatibility) enables smooth migration of Hive SQL to Flink.

References

Apache Flink Documentation

Flink 1.14 Release Notes

Flink version upgrade guide

Youzan real‑time computing Flink 1.13 upgrade practice

Flink 1.10 to 1.12 upgrade benefit assessment (Zhihu)
