Upgrading DiDi Real‑time Computing Engine from Flink 1.4 to Flink 1.10: Challenges, Optimizations, and Lessons Learned
DiDi upgraded its massive real‑time computing engine from Flink 1.4.2 to Flink 1.10, implementing a transparent migration across 1500 machines, adding native DDL, binary rows, MiniBatch, improved scheduling and window functions, and establishing a rigorous testing pipeline that achieved 99.9 % compatibility while preventing OOM issues.
DiDi's real‑time computing engine was upgraded from Flink‑1.4.2 to Flink‑1.10, aiming for a seamless, user‑transparent migration while taking advantage of new metrics, scheduling, and SQL engine improvements.
Background : The original cluster consisted of 1500 physical machines, over 12 000 running tasks and a daily throughput of ~30 trillion records. With the integration of Blink into Flink master, many architectural upgrades were required, prompting the migration to the milestone Flink‑1.10 release.
Flink‑1.10 new features include native DDL and catalog support, BinaryRow internal data representation, extensive built‑in functions, MiniBatch optimization, and enhanced memory configuration (e.g., state.backend.rocksdb.memory.fixed-per-slot ) that mitigates OOM risks.
Challenges & solutions :
Ensuring StreamSQL compatibility – three approaches were evaluated; a syntax‑translation layer was chosen to keep internal DSL logic separate from the engine.
Compatibility testing – a staged testing pipeline (conversion, compilation, regression, and comparison tests) was built to achieve a 99.9 % pass rate.
Sub‑task load balancing – new configuration (minimum slot count and cluster.evenly-spread-out-slots ) together with the Mailbox thread model provides precise load metrics and more even distribution.
Window function enhancements – offset and trigger‑interval parameters were added to TUMBLE , enabling custom window offsets and continuous triggers.
RexCall result reuse – the planner now reuses identical deterministic function calls by caching the Digest‑to‑variable mapping in CodeGenContext.
Engine enhancements also cover task‑load metrics, sub‑task balanced scheduling, and window function extensions.
Summary : After several months of effort, the upgraded engine is now the default StreamSQL runtime, achieving transparent upgrades for users, supporting a wide range of business scenarios, and laying the groundwork for future integrations such as Hive batch processing.
Didi Tech
Official Didi technology account
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.