Big Data 13 min read

How We Upgraded a 1500-Node Flink Cluster to 1.10: Challenges and Solutions

Facing a massive 1500‑node Flink 1.4.2 cluster handling over 12,000 tasks and 30 trillion daily events, we migrated to Flink 1.10, detailing new DDL/Catalog support, SQL enhancements, memory tuning, compatibility patches, extensive testing, and engine optimizations such as task‑load metrics and balanced sub‑task scheduling.

dbaplus Community

Jan 27, 2021

How We Upgraded a 1500-Node Flink Cluster to 1.10: Challenges and Solutions

Background

Before the upgrade the production environment ran Flink 1.4.2 with custom enhancements, offering StreamSQL and low‑level API services. The cluster comprised 1,500 physical machines, executing more than 12,000 jobs and processing roughly 30 trillion events per day.

New Features in Flink 1.10

1. Native DDL and Catalog Support

FlinkSQL now supports native DDL statements such as CREATE TABLE and CREATE FUNCTION, allowing metadata registration directly via SQL. An InMemoryCatalog stores metadata in memory by default, while a HiveCatalog enables integration with Hive Metastore, and users can implement custom catalogs for bespoke metadata management.

2. FlinkSQL Enhancements

Top‑N and deduplication syntax based on ROW_NUMBER, expanding StreamSQL use cases.

BinaryRow type for internal data exchange, reducing serialization overhead by deserializing only needed fields.

Many built‑in functions (string handling, FIRST_VALUE, LAST_VALUE, etc.) that run faster than user‑defined functions.

MiniBatch optimization to increase throughput through micro‑batch processing.

3. Memory Configuration Optimizations

Previous memory management, especially with RocksDB, was error‑prone because a single TaskManager could host multiple RocksDB instances, making memory estimation difficult and causing OOM kills. The new setting state.backend.rocksdb.memory.fixed-per-slot caps RocksDB memory per slot, eliminating most OOM incidents.

Challenges and Mitigations

1. Compatibility of Internal Patches

Large architectural gaps between versions meant many internal patches could not be merged directly. The team catalogued historic commits, selected essential changes, and re‑implemented them on the new version to preserve all existing functionality.

2. StreamSQL Syntax Compatibility

In Flink 1.4 the team used a custom Antlr‑based DDL parser. Flink 1.10 provides native DDL, but the syntax differs. Three options were considered: (a) abandon internal syntax, (b) modify the core parser, or (c) add a translation layer on top of the parser. The team chose option (c), creating a plug‑in that converts internal SQL to community‑compatible SQL, minimizing engine coupling and future‑proofing the solution.

Examples include adding support for the internal DDMQ message queue, custom formats like binlog, and new ADD JAR and SET statements for external dependencies and table configuration.

3. Compatibility Testing

The team designed a staged testing process to ensure smooth migration:

Conversion Test: unit‑style verification that all jobs convert correctly.

Compile Test: ensure TablePlanner can generate execution plans and JobGraphs without actual submission.

Regression Test: replay jobs in a test environment after configuration changes.

Comparison Test: run sampled data on both old and new versions, compare deterministic results to guarantee semantic equivalence.

Engine Enhancements

1. Task‑Load Metric

Previous back‑pressure metrics were coarse. Leveraging the new Mailbox thread model, the team measures thread occupancy time to compute a precise task‑load percentage, paving the way for resource‑recommendation based on load.

2. SubTask Balanced Scheduling

After FLIP‑6, Flink removed the --container flag and introduced on‑demand slot allocation, which could cause source tasks to concentrate on a few TaskManagers, creating hotspots. The team added a “minimum slot” configuration and the cluster.evenly-spread-out-slots parameter to evenly distribute SubTasks when slots are abundant.

3. Window Function Improvements

The default tumbling window TUMBLE(time_attr, INTERVAL '1' DAY) aligns to midnight, which cannot express a 12 pm‑to‑12 pm window. The team added an optional offset parameter, e.g., TUMBLE(time_attr, INTERVAL '1' DAY, TIME '12:00:00'). They also introduced a trigger interval argument, allowing windows to emit results every minute: TUMBLE(time_attr, INTERVAL '1' DAY, INTERVAL '1' MINUTE).

4. RexCall Result Reuse

Repeated JSON parsing in SQL queries caused three identical function calls due to planner‑generated projection chains. By storing the digest‑to‑variable mapping in CodeGenContext, identical deterministic calls ( isDeterministic=true) reuse the previously computed result, eliminating redundant work.

Conclusion

After several months of effort the upgraded Flink 1.10 cluster is live as the default StreamSQL engine, achieving a 99.9 % compatibility test pass rate and transparent upgrades for users. New SQL features and performance improvements have already enabled a variety of real‑time data‑warehouse workloads, and future work will focus on deeper Hive integration and broader batch‑processing capabilities.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data Flink SQL testing version upgrade

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

New Features in Flink 1.10

1. Native DDL and Catalog Support

2. FlinkSQL Enhancements

3. Memory Configuration Optimizations

Challenges and Mitigations

1. Compatibility of Internal Patches

2. StreamSQL Syntax Compatibility

3. Compatibility Testing

Engine Enhancements

1. Task‑Load Metric

2. SubTask Balanced Scheduling

3. Window Function Improvements

4. RexCall Result Reuse

Conclusion

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

New Features in Flink 1.10