
How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls

This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.

WeiLi Technology Team

Background

WeiLi's big data Flink cluster was built on Huawei Cloud in 2020 on version 1.10.0. It handles real-time log processing, metrics, ETL, recommendation, and ad bidding, running more than 100 Flink jobs on 1,800+ slots and processing more than one billion records and roughly 200 GB of data daily.

Motivation for Upgrade

On 1.10 we ran into several limitations: incomplete Flink SQL DDL support, checkpoint failures caused by barrier alignment under back-pressure, and insufficient connector support. We also wanted to replace the Canal + Kafka pipeline with Flink CDC, improve the correctness and performance of dimension-table joins, add the Hive integration we lacked, and stay closer to current releases (the latest at the time was 1.16).

Upgrade Plan

Feasibility assessment

Test environment validation

Deploy Flink 1.14.6 and verify capabilities

Integrate Flink CDC ecosystem

Conduct stress tests

Confirm rollback and job‑monitoring scheme

Review jobs, impact, and priority

Deploy Flink 1.14.6 to production

Upgrade job code

Gradual migration (canary release)

Continuous observation and issue handling

Pitfalls & Solutions

3.1 Flink yarn‑session creation failure

<code>./yarn-session.sh -nm test -d</code>

Error: <code>A JNI error has occurred</code>, with a <code>NoClassDefFoundError</code> for <code>org/apache/hadoop/yarn/exceptions/YarnException</code>.

Fix: add <code>export HADOOP_CLASSPATH=`hadoop classpath`</code> to <code>/etc/profile</code> and run <code>source /etc/profile</code> to apply.
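A small guard can surface this problem before the session is launched. The following is a minimal sketch, not something Flink or Hadoop provides: <code>require_hadoop_classpath</code> is a hypothetical helper name, and it only checks that the variable is non-empty, not that the classpath is correct.

```shell
# Hypothetical helper: fail early when HADOOP_CLASSPATH is not set.
require_hadoop_classpath() {
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is empty; run: export HADOOP_CLASSPATH=\$(hadoop classpath)" >&2
    return 1
  fi
}

# Example (commented out because it needs a Flink installation):
#   require_hadoop_classpath && ./yarn-session.sh -nm test -d
```

Chaining the check with <code>&&</code> means the session script is never started when the variable is missing, so the message above replaces the less obvious JNI error.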

3.2 Local mode start error

<code>[root@node1 bin]# sh start-cluster.sh
/data/flink/flink-1.15.2/bin/config.sh: line 32: unexpected '&lt;' syntax error</code>

Cause: the script uses Bash-only syntax; run <code>./start-cluster.sh</code> directly instead of using <code>sh</code>.

3.3 Hive DDL failure

Missing Hive connector jars; place the appropriate <code>hive-common.jar</code>, <code>hive-exec.jar</code>, etc., into <code>$FLINK_HOME/lib</code> and restart the SQL client.

Also adjust the <code>flink-sql-connector-hive</code> pom.xml to match the CDH-Hive version, rebuild, and replace the jar.

Example DDL and error screenshot are shown in the original article.

3.4 NoClassDefFoundError / NoSuchMethodError

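Typical reasons: a missing jar or a version conflict. Use the following loop to find which jars contain the class named in the stack trace (for a NoSuchMethodError, search for the class that should declare the method):

<code># Replace MissingClass with the class name from the stack trace
for jarfile in *.jar; do
  # Print the jar name only when it contains a matching entry
  if jar -tf "$jarfile" | grep -q "MissingClass"; then
    echo "$jarfile"
  fi
done</code>

If no jar contains the class, add the missing dependency. If the class exists, check whether it appears in more than one jar or in an unexpected version, and resolve the conflict by removing or aligning the duplicate.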
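Version conflicts often show up as the same class being packaged in several jars. The sketch below illustrates one way to list such duplicates. Both function names are hypothetical, and it assumes the <code>jar</code> tool is on the PATH and that the jars sit in a single directory.

```shell
# Hypothetical helper: read "jar<TAB>entry" pairs on stdin and print every
# .class entry that appears in more than one jar.
find_duplicate_classes() {
  awk -F'\t' '
    $2 ~ /\.class$/ {
      # Count each (class, jar) pair once, remembering the jars per class.
      if (!seen[$2 SUBSEP $1]++) {
        jars[$2] = (count[$2]++ ? jars[$2] ", " : "") $1
      }
    }
    END {
      for (c in jars) if (count[c] > 1) print c ": " jars[c]
    }
  ' | sort
}

# Hypothetical helper: emit "jar<TAB>entry" pairs for all jars in a directory.
list_jar_entries() {
  local dir="${1:-.}" jarfile
  for jarfile in "$dir"/*.jar; do
    [ -e "$jarfile" ] || continue
    jar -tf "$jarfile" | awk -v jar="$jarfile" '{ print jar "\t" $0 }'
  done
}

# Example (commented out because it needs a Flink installation):
#   list_jar_entries "$FLINK_HOME/lib" | find_duplicate_classes
```

Not every duplicate is a problem: entries such as <code>module-info.class</code> legitimately appear in many jars, so treat the output as a list of candidates to inspect.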

3.5 ZooKeeper node quota exhausted

Check quota with:

<code># View quota configuration
bin/zkCli.sh -server xxx:2181 listquota /flink
# View current usage (quota reached 1000)
get /zookeeper/quota/flink/zookeeper_stats</code>

Delete excess nodes, e.g.:

<code>deleteall /flink/application_1642060369182_4949589</code>

After cleanup, creating a new yarn‑session succeeds.
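Before deleting anything, it helps to confirm which nodes belong to applications that are no longer running. The following is a rough sketch under two assumptions: the children of <code>/flink</code> are named after YARN application IDs, as in the example above, and both lists have already been exported to plain text. <code>stale_zk_nodes</code> is a hypothetical helper name.

```shell
# Hypothetical helper: print the application_* children of /flink that are
# not in the list of running YARN application IDs.
#   stdin: child node names of /flink, one per line
#   $1:    file with running application IDs, one per line
stale_zk_nodes() {
  local running_file="$1"
  grep '^application_' | grep -v -x -F -f "$running_file" | sort -u
}

# Example (commented out because it needs a live cluster; the file names
# are placeholders):
#   yarn application -list -appStates RUNNING 2>/dev/null \
#     | awk '$1 ~ /^application_/ { print $1 }' > running.txt
#   stale_zk_nodes running.txt < flink_children.txt
```

The function only prints candidates. Review the list before passing any entry to <code>deleteall</code>, since removing the node of a live session would break its high-availability state.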

3.6 Direct buffer memory OOM

Increase off-heap memory in <code>flink-conf.yaml</code>:

<code>taskmanager.memory.framework.off-heap.size: 256m
taskmanager.memory.task.off-heap.size: 128m</code>

3.7 Netty RemoteTransportException

TaskManager containers were killed for exceeding physical memory limits. Raise the process memory size or adjust managed memory:

<code># Default 1728m
taskmanager.memory.process.size: 4096m</code>

3.8 Deprecated / removed methods

Examples of API changes in newer Flink versions:

<code>rocksDBStateBackend.enableTtlCompactionFilter()</code> removed (the TTL compaction filter is enabled by default).

<code>split</code> and <code>select</code> removed; use side outputs instead.

<code>checkpoint.enableExternalizedCheckpoints()</code> deprecated; replace with <code>checkpointConf.setExternalizedCheckpointCleanup()</code>.
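When reviewing many jobs, a quick text search can flag code that still uses these calls. This is a minimal sketch, assuming the job sources are Java or Scala files under one directory. <code>scan_removed_apis</code> is a hypothetical helper name, and the pattern list covers only the three changes named above.

```shell
# Hypothetical helper: list source lines that mention the APIs above.
# ".split(" also matches String.split, so expect false positives.
scan_removed_apis() {
  local src_dir="${1:-.}"
  grep -rnE --include='*.java' --include='*.scala' \
    'enableTtlCompactionFilter|enableExternalizedCheckpoints|\.split\(' "$src_dir"
}

# Example: scan_removed_apis path/to/job/src
```

Each hit is printed as <code>file:line:source</code>, which is enough to build a per-job checklist for the code-upgrade step.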

Precautions

Watch for deprecated APIs and configuration parameters during version migration.

Verify new configuration options; refer to the Flink configuration guide.

Prepare a rollback plan using savepoints before migrating jobs.

Finalize monitoring and runbook procedures prior to upgrade.

Perform the upgrade during low‑traffic windows and inform stakeholders.
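The savepoint-based rollback can be sketched as a dry run that prints the commands for review instead of executing them. <code>flink stop --savepointPath</code> and <code>flink run -s</code> are Flink CLI options. The helper name and all three arguments are placeholders, and any options needed to target a particular YARN session are left out.

```shell
# Hypothetical helper (dry run): print the two CLI calls for one job.
#   $1 = job ID on the old cluster
#   $2 = savepoint target directory
#   $3 = job jar
print_savepoint_plan() {
  local job_id="$1" savepoint_dir="$2" job_jar="$3"
  # Stop the job and write a savepoint under the target directory.
  echo "flink stop --savepointPath $savepoint_dir $job_id"
  # Resume from the exact path that the stop command reports.
  echo "flink run -d -s <savepoint path printed by the stop command> $job_jar"
}

# Example: print_savepoint_plan <jobId> hdfs:///flink/savepoints job.jar
```

The same savepoint serves both directions: the upgraded job can start from it on the new cluster, and the old job can resume from it if the migration has to be rolled back.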

Benefits After Upgrade

Flink CDC simplifies the data integration architecture and improves real‑time synchronization latency.

Unaligned checkpoints mitigate checkpoint failures under back‑pressure.

Richer, more concise Flink SQL boosts development efficiency.

Enhanced Flink‑on‑Hive support (≈96% DML/DDL compatibility) enables smooth migration of Hive SQL to Flink.

