Big Data 14 min read

How to Migrate State in Flink SQL Jobs with DAG Visualization and UID Mapping

This article explains why preserving state during Flink SQL job iterations is crucial, analyzes the challenges caused by DAG changes and serializer incompatibility, and presents a visual‑preview, UID editing, and automatic mapping solution to enable reliable state migration for streaming workloads.

Volcano Engine Developer Services

Feb 28, 2023

How to Migrate State in Flink SQL Jobs with DAG Visualization and UID Mapping

Background

Flink SQL is an important tool for building real‑time data warehouses. It lets users develop streaming jobs quickly by writing SQL, avoiding the complexity of DataStream programs, packaging, and deployment. However, this abstraction hides low‑level details, reducing code‑level flexibility.

State Migration Importance

Streaming jobs often maintain intermediate results as state (e.g., per‑minute order totals). When a job is iterated, preserving this state is essential for continuity. Without state migration, the old state is discarded, the job offset is rolled back, and the job is re‑executed, which wastes resources, introduces latency, and may fail for long‑window aggregations.

Current Situation

State recovery requires two conditions: (1) consistent OperatorID, which binds state to the operator, and (2) compatible State Serializer. In DataStream jobs users can set UID/UIDHash and custom serializers to satisfy these conditions. In SQL jobs the operator layer is hidden, making state a black box; changes to the SQL often break the two conditions, preventing migration.

Problem Classification

SQL job state migration issues fall into two categories:

DAG changes are easy.

State serializer incompatibility.

Issue 1: DAG Changes

Implicit modifications (e.g., adding a constant, enabling minibatch, adding watermarks) or explicit modifications (adding dimension tables, sources, sinks) introduce new nodes in the DAG, causing OperatorID changes and state loss.

Issue 2: Serializer Incompatibility

Even with the same Flink version, modifications to SQL can change the schema of stored RowData (e.g., adding a column to a groupAggregate valueState), making the old and new serializers incompatible.

Solution Overview

We propose a three‑step approach to bring the DataStream UID capability to SQL jobs:

Provide a visual preview of the SQL job DAG.

Allow users to edit operator attributes (UID/UIDHash) in the UI.

Pass the edited UID information to the runtime.

DAG Preview and PlanGraph

We introduce a PlanGraph abstraction that maps StreamGraph properties to a stable ID using StreamGraphHasherV2. The UI shows a task‑level DAG and, upon expansion, the operator chain with attributes such as operator ID, name, and parallelism.

Overall Workflow

Generate a PlanGraph for each job version (SQL + configs) and store it.

When the job is modified, generate a new PlanGraph.

Diff the old and new PlanGraphs; manually or automatically map old operator IDs to the new graph.

Submit the updated PlanGraph together with SQL and configs to Flink; the operator IDs are injected into the JobGraph.

Usability Enhancements

Because manually configuring UID for every operator is costly, we add:

Best‑effort automatic mapping of matching operators based on identical descriptions.

Highlighting of stateful nodes.

JSON code comparison of operator lists to locate remaining mismatches.

Best‑Effort Automatic Mapping Algorithm

Collect operators with identical descriptions in both graphs, compute similarity based on input/output node attributes, push pairs into a max‑heap, and iteratively assign the old Generated OperatorID to the new User‑Provided Hash.

Future Work

We plan to build a comprehensive SQL job state‑recovery capability map, including pre‑check for non‑recoverable states and enhancements for the two typical problem categories. This involves exploring eager state declaration (FLIP‑22) and SQL hints for readable UID assignment, as well as column‑lineage‑based state schema evolution.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink SQL State Migration DAG Visualization OperatorID

Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.