How We Upgraded a 1500-Node Flink Cluster to 1.10: Challenges and Solutions
Facing a massive 1500‑node Flink 1.4.2 cluster handling over 12,000 tasks and 30 trillion daily events, we migrated to Flink 1.10, detailing new DDL/Catalog support, SQL enhancements, memory tuning, compatibility patches, extensive testing, and engine optimizations such as task‑load metrics and balanced sub‑task scheduling.
Background
Before the upgrade the production environment ran Flink 1.4.2 with custom enhancements, offering StreamSQL and low‑level API services. The cluster comprised 1,500 physical machines, executing more than 12,000 jobs and processing roughly 30 trillion events per day.
New Features in Flink 1.10
1. Native DDL and Catalog Support
FlinkSQL now supports native DDL statements such as CREATE TABLE and CREATE FUNCTION, allowing metadata registration directly via SQL. An InMemoryCatalog stores metadata in memory by default, while a HiveCatalog enables integration with Hive Metastore, and users can implement custom catalogs for bespoke metadata management.
2. FlinkSQL Enhancements
Top‑N and deduplication syntax based on ROW_NUMBER, expanding StreamSQL use cases.
BinaryRow type for internal data exchange, reducing serialization overhead by deserializing only needed fields.
Many built‑in functions (string handling, FIRST_VALUE, LAST_VALUE, etc.) that run faster than user‑defined functions.
MiniBatch optimization to increase throughput through micro‑batch processing.
3. Memory Configuration Optimizations
Previous memory management, especially with RocksDB, was error‑prone because a single TaskManager could host multiple RocksDB instances, making memory estimation difficult and causing OOM kills. The new setting state.backend.rocksdb.memory.fixed-per-slot caps RocksDB memory per slot, eliminating most OOM incidents.
Challenges and Mitigations
1. Compatibility of Internal Patches
Large architectural gaps between versions meant many internal patches could not be merged directly. The team catalogued historic commits, selected essential changes, and re‑implemented them on the new version to preserve all existing functionality.
2. StreamSQL Syntax Compatibility
In Flink 1.4 the team used a custom Antlr‑based DDL parser. Flink 1.10 provides native DDL, but the syntax differs. Three options were considered: (a) abandon internal syntax, (b) modify the core parser, or (c) add a translation layer on top of the parser. The team chose option (c), creating a plug‑in that converts internal SQL to community‑compatible SQL, minimizing engine coupling and future‑proofing the solution.
Examples include adding support for the internal DDMQ message queue, custom formats like binlog, and new ADD JAR and SET statements for external dependencies and table configuration.
3. Compatibility Testing
The team designed a staged testing process to ensure smooth migration:
Conversion Test: unit‑style verification that all jobs convert correctly.
Compile Test: ensure TablePlanner can generate execution plans and JobGraphs without actual submission.
Regression Test: replay jobs in a test environment after configuration changes.
Comparison Test: run sampled data on both old and new versions, compare deterministic results to guarantee semantic equivalence.
Engine Enhancements
1. Task‑Load Metric
Previous back‑pressure metrics were coarse. Leveraging the new Mailbox thread model, the team measures thread occupancy time to compute a precise task‑load percentage, paving the way for resource‑recommendation based on load.
2. SubTask Balanced Scheduling
After FLIP‑6, Flink removed the --container flag and introduced on‑demand slot allocation, which could cause source tasks to concentrate on a few TaskManagers, creating hotspots. The team added a “minimum slot” configuration and the cluster.evenly-spread-out-slots parameter to evenly distribute SubTasks when slots are abundant.
3. Window Function Improvements
The default tumbling window TUMBLE(time_attr, INTERVAL '1' DAY) aligns to midnight, which cannot express a 12 pm‑to‑12 pm window. The team added an optional offset parameter, e.g., TUMBLE(time_attr, INTERVAL '1' DAY, TIME '12:00:00'). They also introduced a trigger interval argument, allowing windows to emit results every minute: TUMBLE(time_attr, INTERVAL '1' DAY, INTERVAL '1' MINUTE).
4. RexCall Result Reuse
Repeated JSON parsing in SQL queries caused three identical function calls due to planner‑generated projection chains. By storing the digest‑to‑variable mapping in CodeGenContext, identical deterministic calls ( isDeterministic=true) reuse the previously computed result, eliminating redundant work.
Conclusion
After several months of effort the upgraded Flink 1.10 cluster is live as the default StreamSQL engine, achieving a 99.9 % compatibility test pass rate and transparent upgrades for users. New SQL features and performance improvements have already enabled a variety of real‑time data‑warehouse workloads, and future work will focus on deeper Hive integration and broader batch‑processing capabilities.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
