Big Data 29 min read

Understanding Flink CDC 2.0: Core Design, Snapshot & Incremental Reading, and Code Walkthrough

This article introduces Flink CDC 2.0, explains its distributed full‑load and incremental reading mechanisms, details the slice partitioning, snapshot correction, and binlog handling logic, and provides a complete Java example that demonstrates how to configure Flink SQL, MySQL source, and Kafka sink.

Big Data Technology & Architecture

Nov 8, 2021

Understanding Flink CDC 2.0: Core Design, Snapshot & Incremental Reading, and Code Walkthrough

In August, Flink CDC released version 2.0.0, adding distributed full‑load reading, checkpoint support, and lock‑free consistency guarantees for combined full and incremental reads.

The article first presents a Flink SQL example that creates a MySQL CDC source table and a Kafka sink table, then shows the resulting changelog JSON output.

public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    EnvironmentSettings envSettings = EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inStreamingMode()
            .build();
    env.setParallelism(3);
    env.enableCheckpointing(10000);
    StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env, envSettings);
    tableEnvironment.executeSql(" CREATE TABLE demoOrders (
        `order_id` INTEGER,
        `order_date` DATE,
        `order_time` TIMESTAMP(3),
        `quantity` INT,
        `product_id` INT,
        `purchaser` STRING,
        primary key(order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost',
        'port' = '3306',
        'username' = 'cdc',
        'password' = '123456',
        'database-name' = 'test',
        'table-name' = 'demo_orders',
        'scan.startup.mode' = 'initial'
    )");
    tableEnvironment.executeSql("CREATE TABLE sink (
        `order_id` INTEGER,
        `order_date` DATE,
        `order_time` TIMESTAMP(3),
        `quantity` INT,
        `product_id` INT,
        `purchaser` STRING,
        primary key (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'localhost:9092',
        'topic' = 'mqTest02',
        'format' = 'changelog-json'
    )");
    tableEnvironment.executeSql("insert into sink select * from demoOrders");
}

The core design of Flink CDC 2.0 revolves around slice (chunk) partitioning, which can be uniform (auto‑increment integer primary key) or non‑uniform (non‑numeric or non‑auto‑increment keys). Uniform slices are calculated by min/max values and a configurable chunk size, while non‑uniform slices repeatedly query the next maximum key.

During the full‑load phase, each slice is read in parallel without table locks. To guarantee consistency, Flink records a snapshot of the table, captures the current binlog position (SHOW MASTER STATUS), and then replays binlog events that affect the slice, correcting the snapshot data as needed.

After all full‑load slices finish, the MySqlHybridSplitAssigner creates a BinlogSplit for incremental reading. The split’s starting offset is the smallest high‑watermark among completed slices, ensuring no duplicate or missed events.

Key components: MySqlSourceEnumerator – creates and coordinates MySqlSourceReader instances, assigns snapshot and binlog splits, and reports finished splits with their high watermarks. MySqlSourceReader – receives split assignments, creates SnapshotSplitReader or BinlogSplitReader, and forwards records to the MySqlRecordEmitter. SnapshotSplitReader – executes SQL queries for each slice, captures binlog offsets before and after the snapshot, and emits corrected records. BinlogSplitReader – reads binlog events from the calculated start offset, emitting only events whose offset exceeds the high‑watermark of the slice they belong to. MySqlRecordEmitter – converts Debezium SourceRecord objects to Flink RowData, handling insert, delete, and update events via RowDataDebeziumDeserializeSchema.

The article also shows how the enumerator reports finished snapshot splits to the coordinator, how the split assigner decides between snapshot and binlog splits, and how the fetcher manager schedules split fetch tasks.

Overall, the piece provides a deep dive into Flink CDC 2.0’s architecture, illustrating both the high‑level data flow and the low‑level Java implementation details.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Streaming mysql Data Integration CDC Debezium

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.