How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls
This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.
Background
The Micro Carp big data Flink cluster was built on Huawei Cloud in 2020 on version 1.10.0. It handles real‑time log processing, metrics, ETL, recommendation, and ad bidding, running over 100 Flink jobs on 1800+ slots and processing more than one billion records and roughly 200 GB of data daily.
Motivation for Upgrade
Issues encountered on 1.10: incomplete Flink SQL DDL support, checkpoint alignment failures under back‑pressure, and insufficient connector support. We also wanted to replace Canal+Kafka with Flink CDC, improve the correctness and performance of dimension‑table joins, add Hive integration, and stay closer to current releases (the latest at the time of writing was 1.16).
Upgrade Plan
Feasibility assessment
Test environment validation
Deploy Flink 1.14.6 and verify capabilities
Integrate Flink CDC ecosystem
Conduct stress tests
…
Confirm rollback and job‑monitoring scheme
Review jobs, impact, and priority
Deploy Flink 1.14.6 to production
Upgrade job code
Gradual migration (gray release)
Continuous observation and issue handling
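The rollback and gray‑release steps in this plan depend on savepoints. As a rough sketch, and not the team's actual runbook, a dry‑run helper can print the Flink CLI commands for review before anyone runs them. The function name and its three arguments (job ID, savepoint directory, job jar) are hypothetical placeholders; `flink stop --savepointPath` and `flink run -s` are the standard CLI forms.

```shell
# Dry-run sketch: print, but do not execute, the commands for moving one job.
# All three arguments are placeholders supplied by the caller.
print_migration_cmds() {
  job_id=$1
  savepoint_dir=$2
  job_jar=$3
  # Step 1 (old cluster): stop the job and write a savepoint.
  echo "flink stop --savepointPath $savepoint_dir $job_id"
  # Step 2 (new cluster): resume from the savepoint path that step 1 prints.
  echo "flink run -d -s <path-printed-by-stop> $job_jar"
}

print_migration_cmds a1b2c3 hdfs:///flink/savepoints job-1.14.jar
```

Keeping the savepoint written by the old cluster is what makes rollback possible: the 1.10 job can be resubmitted from it, whereas a savepoint taken later on the newer version should not be assumed to restore on 1.10.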
Pitfalls & Solutions
3.1 Flink yarn‑session creation failure
<code>./yarn-session.sh -nm test -d</code>
Error: A JNI error has occurred, with a NoClassDefFoundError for org/apache/hadoop/yarn/exceptions/YarnException.
Fix: add <code>export HADOOP_CLASSPATH=`hadoop classpath`</code> to /etc/profile and run <code>source /etc/profile</code> to apply it.
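Because this failure only appears once the session is launched, a small preflight check can catch it earlier. The sketch below is a hypothetical helper (the function name is ours, not part of Flink) that refuses to continue when `HADOOP_CLASSPATH` is empty or unset.

```shell
# Preflight sketch: fail fast when HADOOP_CLASSPATH is empty or unset.
check_hadoop_classpath() {
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is empty; export it before starting yarn-session" >&2
    return 1
  fi
  echo "HADOOP_CLASSPATH is set"
}

# Hypothetical usage: only start the session when the check passes.
# check_hadoop_classpath && ./yarn-session.sh -nm test -d
```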
3.2 Local mode start error
<code>[root@node1 bin]# sh start-cluster.sh
/data/flink/flink-1.15.2/bin/config.sh: line 32: unexpected '<' syntax error</code>
Cause: the script uses Bash‑only syntax, which fails when it is run through sh. Run ./start-cluster.sh directly instead.
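The `unexpected '<'` message is consistent with process substitution (`< <(...)`), a Bash feature that a plain POSIX shell may not parse. The snippet below is an illustrative stand‑in for that kind of construct, not a copy of config.sh, and the function name is hypothetical. It runs the construct explicitly under `bash`.

```shell
# Illustrative stand-in for a Bash-only construct (not copied from config.sh).
# The loop reads from a process substitution, so it runs in the current shell
# and the counter is still visible after the loop ends.
count_lines_with_bash() {
  bash -c 'n=0
while read -r line; do n=$((n + 1)); done < <(printf "a\nb\n")
echo "$n"'
}

count_lines_with_bash   # prints 2
```

Executing the script directly lets its shebang select the interpreter, which is why ./start-cluster.sh works where `sh start-cluster.sh` does not.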
3.3 Hive DDL failure
Cause: the Hive connector jars are missing. Place the appropriate hive-common.jar, hive-exec.jar, etc., into $FLINK_HOME/lib and restart the SQL client.
Also adjust the flink-sql-connector-hive pom.xml to match the CDH Hive version, rebuild, and replace the jar.
The example DDL and error screenshot are shown in the original article.
3.4 NoClassDefFoundError / NoSuchMethodError
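Typical reasons: a missing jar or a version conflict. Use the following command to find which jars contain the class in question (jar -tf lists archive entries, so search by class name rather than method name):
<code># Print each jar name, then any entries matching the class name
for jarfile in *.jar; do
echo "$jarfile"
jar -tf "$jarfile" | grep "MissingClassName"
done</code>
If the class exists, check whether it appears in more than one jar or in the wrong version, and resolve the conflict accordingly.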
3.5 Zookeeper node quota exhausted
Check quota with:
<code># View quota configuration
bin/zkCli.sh -server xxx:2181 listquota /flink
# View current usage (quota reached 1000)
get /zookeeper/quota/flink/zookeeper_stats</code>
Delete excess nodes, e.g.:
<code>deleteall /flink/application_1642060369182_4949589</code>
After cleanup, creating a new yarn‑session succeeds.
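To notice the quota filling up before session creation fails, the usage string can be turned into a percentage. The helper below is a hypothetical sketch. It assumes the stats value has the form `count=N,bytes=M` (newer ZooKeeper releases may print additional fields) and that the node‑count limit is passed in by the caller.

```shell
# Sketch: percentage of a ZooKeeper node-count quota already used.
# Assumes the stats string looks like "count=N,bytes=M".
quota_used_pct() {
  stats=$1
  limit=$2
  count=${stats#*count=}   # drop everything up to and including "count="
  count=${count%%,*}       # drop everything from the first comma onward
  echo $(( count * 100 / limit ))
}

quota_used_pct "count=950,bytes=20480" 1000   # prints 95
```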
3.6 Direct buffer memory OOM
Increase off‑heap memory in flink-conf.yaml:
<code>taskmanager.memory.framework.off-heap.size: 256m
taskmanager.memory.task.off-heap.size: 128m</code>
3.7 Netty RemoteTransportException
The exception was a symptom: TaskManager containers were killed for exceeding physical memory limits, so their peers lost the connections to them. Raise the process memory size or adjust managed memory:
<code># Default 1728m
taskmanager.memory.process.size: 4096m</code>
3.8 Deprecated / removed methods
Examples of API changes in newer Flink versions:
rocksDBStateBackend.enableTtlCompactionFilter() removed (the TTL compaction filter is enabled by default).
split and select removed; use side outputs instead.
checkpointConf.enableExternalizedCheckpoints() deprecated; replace with checkpointConf.setExternalizedCheckpointCleanup().
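When reviewing many jobs, a text search gives a quick first pass over which code bases touch these APIs. The helper below is a hypothetical sketch (the function name and directory argument are ours) that relies on grep's `--include` option. Expect false positives, since `.split(` also matches `String.split` and `.select(` also matches Table API calls that are still valid.

```shell
# Sketch: list Java/Scala source lines that mention the APIs named above.
# Usage: scan_changed_apis <source-dir>
scan_changed_apis() {
  grep -rnE --include='*.java' --include='*.scala' \
    'enableTtlCompactionFilter|enableExternalizedCheckpoints|\.split\(|\.select\(' \
    "$1"
}
```

Each hit is printed as `path:line:content`, which is enough to build a per‑job checklist for the code upgrade step.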
Precautions
Watch for deprecated APIs and configuration parameters during version migration.
Verify new configuration options; refer to the Flink configuration guide.
Prepare a rollback plan using savepoints before migrating jobs.
Finalize monitoring and runbook procedures prior to upgrade.
Perform the upgrade during low‑traffic windows and inform stakeholders.
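TaskManager memory (pitfalls 3.6 and 3.7) is one configuration area worth verifying before rollout. The sketch below approximates how a given taskmanager.memory.process.size is divided when nothing else is set. The hard‑coded defaults (JVM overhead 10% clamped to 192m–1g, metaspace 256m, managed fraction 0.4, network fraction 0.1 clamped to 64m–1g, framework heap and off‑heap 128m each) are assumptions taken from the Flink 1.14 configuration documentation, and integer arithmetic makes the results approximate. The function name is hypothetical.

```shell
# Sketch: approximate TaskManager memory split (all values in MB).
# Defaults hard-coded below are assumed from the Flink 1.14 configuration docs.
# Usage: tm_memory_breakdown <process_mb> [framework_offheap_mb] [task_offheap_mb]
tm_memory_breakdown() {
  process=$1
  fw_offheap=${2:-128}
  task_offheap=${3:-0}
  fw_heap=128
  metaspace=256
  # JVM overhead: 10% of process size, clamped to [192, 1024].
  overhead=$(( process * 10 / 100 ))
  [ "$overhead" -lt 192 ] && overhead=192
  [ "$overhead" -gt 1024 ] && overhead=1024
  flink_total=$(( process - metaspace - overhead ))
  managed=$(( flink_total * 40 / 100 ))
  # Network: 10% of total Flink memory, clamped to [64, 1024].
  network=$(( flink_total * 10 / 100 ))
  [ "$network" -lt 64 ] && network=64
  [ "$network" -gt 1024 ] && network=1024
  task_heap=$(( flink_total - managed - network - fw_heap - fw_offheap - task_offheap ))
  echo "overhead=$overhead managed=$managed network=$network task_heap=$task_heap"
}

tm_memory_breakdown 1728   # prints overhead=192 managed=512 network=128 task_heap=384
```

Passing the off‑heap sizes from 3.6 as the second and third arguments shows how much task heap remains once those are raised, which helps decide whether the process size also needs to grow.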
Benefits After Upgrade
Flink CDC simplifies the data integration architecture and improves real‑time synchronization latency.
Unaligned checkpoints mitigate checkpoint failures under back‑pressure.
Richer, more concise Flink SQL boosts development efficiency.
Enhanced Flink‑on‑Hive support (≈96% DML/DDL compatibility) enables smooth migration of Hive SQL to Flink.
References
Apache Flink Documentation
Flink 1.14 Release Notes
Flink version upgrade guide
Youzan real‑time computing Flink 1.13 upgrade practice
Flink 1.10 to 1.12 upgrade benefit assessment (Zhihu)
WeiLi Technology Team