How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls
This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.
Background
The Micro Carp big data Flink cluster was built on Huawei Cloud in 2020 on version 1.10.0. It handles real-time log processing, metrics, ETL, recommendation, and ad bidding. It runs more than 100 Flink jobs on 1800+ slots and processes over one billion records, roughly 200 GB of data, per day.
Motivation for Upgrade
The 1.10 cluster ran into the following issues: incomplete Flink SQL DDL support; checkpoint alignment failures under back-pressure; insufficient connector support; no way to replace the Canal+Kafka pipeline with Flink CDC; correctness and performance problems with dimension-table joins; and no Hive integration. The version had also fallen well behind the community releases (the latest was 1.16).
Upgrade Plan
Feasibility assessment
Test environment validation
Deploy Flink 1.14.6 and verify capabilities
Integrate Flink CDC ecosystem
Conduct stress tests
…
Confirm rollback and job‑monitoring scheme
Review jobs, impact, and priority
Deploy Flink 1.14.6 to production
Upgrade job code
Gradual migration (gray release)
Continuous observation and issue handling
Pitfalls & Solutions
3.1 Flink yarn‑session creation failure
./yarn-session.sh -nm test -d
Error: A JNI error has occurred, NoClassDefFoundError for org/apache/hadoop/yarn/exceptions/YarnException.
Fix: add export HADOOP_CLASSPATH=`hadoop classpath` to /etc/profile and run source /etc/profile to apply.
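A pre-flight check can catch this before the session is submitted. The sketch below is only an illustration: has_yarn_classpath and start_session_checked are hypothetical helper names, not part of Flink, and the test for the substring "yarn" is a rough heuristic that assumes the YARN jar directories have "yarn" in their path.

```shell
# Hypothetical helper: succeeds if the classpath string passed in appears
# to contain YARN entries (heuristic: the path mentions "yarn").
has_yarn_classpath() {
  case "$1" in
    *yarn*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Resolve HADOOP_CLASSPATH the same way the fix does, verify it, and only
# then start the session with the command from this section.
start_session_checked() {
  : "${HADOOP_CLASSPATH:=$(hadoop classpath 2>/dev/null)}"
  export HADOOP_CLASSPATH
  if ! has_yarn_classpath "$HADOOP_CLASSPATH"; then
    echo "HADOOP_CLASSPATH has no YARN entries; check 'hadoop classpath'" >&2
    return 1
  fi
  ./yarn-session.sh -nm test -d
}
```

Calling start_session_checked from the Flink bin directory then either starts the session or stops early with a readable message instead of the JNI error.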
3.2 Local mode start error
[root@node1 bin]# sh start-cluster.sh
/data/flink/flink-1.15.2/bin/config.sh: line 32: unexpected '<' syntax error
Cause: the script uses Bash-only syntax; run ./start-cluster.sh directly instead of using sh.
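Invoking a script as sh start-cluster.sh bypasses its shebang line, so the interpreter the script asks for is never used. As a minimal sketch, the hypothetical helper script_interpreter below prints the interpreter a script declares, which makes it easy to see whether sh is a safe way to run it.

```shell
# Print the interpreter named on a script's shebang line.
# Returns non-zero if the file has no shebang.
script_interpreter() {
  IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
  case "$first_line" in
    '#!'*) printf '%s\n' "${first_line#??}" | sed 's/^ *//' ;;
    *)     return 1 ;;
  esac
}
```

If script_interpreter start-cluster.sh reports a Bash interpreter, run the script directly (or with bash) rather than with sh.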
3.3 Hive DDL failure
Cause: the Hive connector jars are missing. Place the appropriate hive-common.jar, hive-exec.jar, and related jars into $FLINK_HOME/lib, then restart the SQL client.
Also set the Hive version in the flink-sql-connector-hive pom.xml to match the CDH Hive version in use, rebuild the connector, and replace the jar.
Example DDL and error screenshot are shown in the original article.
3.4 NoClassDefFoundError / NoSuchMethodError
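Typical causes are a missing jar or a version conflict. Because jar -tf lists class files only, for a NoSuchMethodError search for the class that declares the method. Run the following loop in the directory that holds the jars to see which of them contain the class:
for jarfile in *.jar; do
  echo "$jarfile"
  jar -tf "$jarfile" | grep "MissingClassOrMethod"
done
If the class turns up in several jars or in a jar of the wrong version, resolve the version conflict accordingly; if it turns up nowhere, add the missing jar.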
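The same search can be made a little stricter by matching the exact jar entry rather than a substring. This is a sketch only; class_to_entry and find_class_jars are hypothetical helper names, and it assumes the JDK jar tool is on the PATH.

```shell
# Convert a fully qualified class name to its entry path inside a jar,
# e.g. a.b.C becomes a/b/C.class.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar in a directory (default: current directory) that contains
# the class. More than one hit usually points at a version conflict.
find_class_jars() {
  entry=$(class_to_entry "$1")
  for jarfile in "${2:-.}"/*.jar; do
    [ -e "$jarfile" ] || continue
    if jar -tf "$jarfile" | grep -qxF "$entry"; then
      echo "$jarfile"
    fi
  done
}
```

For the error in section 3.1, for example, find_class_jars org.apache.hadoop.yarn.exceptions.YarnException "$FLINK_HOME/lib" would show whether any jar in the Flink lib directory provides that class.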
3.5 Zookeeper node quota exhausted
Check quota with:
# View quota configuration
bin/zkCli.sh -server xxx:2181 listquota /flink
# View current usage (quota reached 1000)
get /zookeeper/quota/flink/zookeeper_stats
Delete excess nodes, e.g.:
deleteall /flink/application_1642060369182_4949589
After cleanup, creating a new yarn-session succeeds.
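To notice the problem before session creation fails, the usage and limit records can be compared in a script. The sketch below assumes both records begin with text of the form count=<n>,bytes=<m>; verify that against the output of your ZooKeeper version first. quota_count and quota_exhausted are hypothetical helper names.

```shell
# Extract the node count from a record assumed to start with
# "count=<n>,bytes=<m>".
quota_count() {
  printf '%s\n' "$1" | sed -n 's/^count=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Succeed when the used count has reached the limit.
# A negative limit is treated as "no limit set".
quota_exhausted() {
  used=$(quota_count "$1")
  limit=$(quota_count "$2")
  [ "$limit" -ge 0 ] && [ "$used" -ge "$limit" ]
}
```

Feeding it the stats and limits strings copied from zkCli.sh, for example quota_exhausted 'count=1000,bytes=52413' 'count=1000,bytes=-1', returns success when the node quota is used up.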
3.6 Direct buffer memory OOM
Increase off‑heap memory in flink-conf.yaml:
taskmanager.memory.framework.off-heap.size: 256m
taskmanager.memory.task.off-heap.size: 128m
3.7 Netty RemoteTransportException
TaskManager containers were killed for exceeding physical memory limits. Raise the process memory size or adjust managed memory:
# Default 1728m
taskmanager.memory.process.size: 4096m
3.8 Deprecated / removed methods
Examples of API changes in newer Flink versions:
rocksDBStateBackend.enableTtlCompactionFilter() has been removed; the TTL compaction filter is enabled by default.
split and select have been removed; use side outputs instead.
checkpointConf.enableExternalizedCheckpoints() is deprecated; replace it with checkpointConf.setExternalizedCheckpointCleanup().
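With more than 100 jobs to review, a quick text search helps locate code that still uses the calls named above. This is a rough sketch: scan_removed_apis is a hypothetical helper, the pattern covers only the examples listed here, and .split( also matches String.split, so every hit needs a manual look.

```shell
# Report Java/Scala source lines under a directory (default: current
# directory) that mention the calls listed above. Output is file:line:text.
scan_removed_apis() {
  find "${1:-.}" -type f \( -name '*.java' -o -name '*.scala' \) \
    -exec grep -nE 'enableTtlCompactionFilter|\.split\(|enableExternalizedCheckpoints' /dev/null {} +
}
```

Running scan_removed_apis over each job repository gives a first list of files to revisit during the code upgrade step.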
Precautions
Watch for deprecated APIs and configuration parameters during version migration.
Verify new configuration options; refer to the Flink configuration guide.
Prepare a rollback plan using savepoints before migrating jobs.
Finalize monitoring and runbook procedures prior to upgrade.
Perform the upgrade during low‑traffic windows and inform stakeholders.
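The savepoint-based rollback plan can be written down per job before the migration window. The sketch below only prints the commands rather than running them. FLINK_OLD, FLINK_NEW, and print_migration_plan are hypothetical names, the install paths are placeholders, and options for targeting a specific YARN session are left out. The savepoint taken on the old cluster is the rollback point, so keep it until the job is stable on the new version.

```shell
# Hypothetical install locations of the two Flink versions.
FLINK_OLD=${FLINK_OLD:-/opt/flink-1.10.0}
FLINK_NEW=${FLINK_NEW:-/opt/flink-1.14.6}

# Print the commands for migrating one job.
# Arguments: job id, savepoint directory, old job jar, new job jar.
print_migration_plan() {
  job_id=$1; sp_dir=$2; old_jar=$3; new_jar=$4
  # 1. Stop the job on the old cluster and take a savepoint.
  echo "$FLINK_OLD/bin/flink stop -p $sp_dir $job_id"
  # 2. Start the upgraded job on the new cluster from that savepoint.
  echo "$FLINK_NEW/bin/flink run -d -s <savepoint-path-from-step-1> $new_jar"
  # 3. Rollback: restart the old job on the old cluster from the same savepoint.
  echo "$FLINK_OLD/bin/flink run -d -s <savepoint-path-from-step-1> $old_jar"
}
```

For example, print_migration_plan with a job id, a savepoint directory, and the two jars produces three lines that can be pasted into the runbook for that job.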
Benefits After Upgrade
Flink CDC simplifies the data integration architecture and improves real‑time synchronization latency.
Unaligned checkpoints mitigate checkpoint failures under back‑pressure.
Richer, more concise Flink SQL boosts development efficiency.
Enhanced Flink‑on‑Hive support (≈96% DML/DDL compatibility) enables smooth migration of Hive SQL to Flink.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.
