Flink CDC 2.0: Concepts, Architecture, and Hands‑On Implementation

This article introduces the fundamentals of Flink CDC, explains its application scenarios and underlying technologies, compares query‑based and log‑based CDC, showcases open‑source solutions, and provides detailed Java and SQL examples for building real‑time ETL pipelines with MySQL and Flink.

Big Data Technology & Architecture

Mar 8, 2022

CDC (Change Data Capture) refers to any technology that can capture data changes; in practice it mainly targets database modifications.

Application Scenarios

Typical uses are data synchronization for backup and disaster recovery, data distribution to multiple downstream systems, and data collection for ETL into data warehouses or data lakes.

CDC Techniques

Two mainstream mechanisms exist:

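Query-based CDC – polls tables with scheduled batch queries. It offers no real-time guarantee, and a change made between two polls can be overwritten before it is seen, so consistency is limited.

Log-based CDC – consumes the database change log (the binlog in MySQL) as it is written, which gives low latency and strong consistency because every change is captured in order.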
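The difference between the two mechanisms can be illustrated with a small in-memory simulation. This is only a sketch: the class `CdcModesSketch`, the helper `diffSnapshots`, and the sample rows are made up for illustration and do not use any database or Flink API. A map stands in for the table, and a list stands in for the change log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CdcModesSketch {

    /** Query-based view: compare the snapshot of the previous poll with the current one. */
    static List<String> diffSnapshots(Map<Integer, String> before, Map<Integer, String> after) {
        List<String> changes = new ArrayList<>();
        for (Map.Entry<Integer, String> e : after.entrySet()) {
            String old = before.get(e.getKey());
            if (old == null) {
                changes.add("INSERT " + e.getKey() + "=" + e.getValue());
            } else if (!old.equals(e.getValue())) {
                changes.add("UPDATE " + e.getKey() + "=" + e.getValue());
            }
        }
        for (Integer key : before.keySet()) {
            if (!after.containsKey(key)) {
                changes.add("DELETE " + key);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new LinkedHashMap<>();
        table.put(1, "alice");
        Map<Integer, String> firstPoll = new LinkedHashMap<>(table);

        // Log-based view: every write is also appended to an ordered log.
        List<String> log = new ArrayList<>();
        table.put(2, "bob");
        log.add("INSERT 2=bob");
        table.put(2, "bobby");
        log.add("UPDATE 2=bobby");
        table.remove(2);
        log.add("DELETE 2");
        table.put(1, "alicia");
        log.add("UPDATE 1=alicia");

        // The second poll only sees the net effect of the four writes.
        System.out.println("query-based sees: " + diffSnapshots(firstPoll, table));
        System.out.println("log-based sees:   " + log);
    }
}
```

In this run the poll reports a single update for row 1, while the log holds all four events, including the short-lived row 2 that the poll never observed.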

Open‑Source CDC Solutions

Several diagrams illustrate popular open‑source connectors (e.g., Debezium, Flink CDC).

Flink CDC 2.0 Design Details

The article links to the official GitHub repository and shows supported connectors.

Practical Maven Setup

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>Flink-learning</artifactId>
        <groupId>com.wudl.flink</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>Flink-cdc2.0</artifactId>
    <properties>
        <flink-version>1.13.0</flink-version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- ... (MySQL, Flink, Ververica CDC, FastJSON, etc.) ... -->
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Java Job Example

package com.wud.cdc2;

import com.ververica.cdc.connectors.mysql.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.DebeziumSourceFunction;
import com.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkCDC {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Checkpoint every 5 seconds in exactly-once mode and retain checkpoints when the job is cancelled
        env.enableCheckpointing(5000L);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Set the Hadoop user up front so that later HDFS access runs as this user
        System.setProperty("HADOOP_USER_NAME", "hdfs");
        // Store checkpoint data on HDFS
        env.setStateBackend(new FsStateBackend("hdfs://192.168.1.161:8020/cdc2.0-test/ck"));

        DebeziumSourceFunction<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("192.168.1.180")
                .port(3306)
                .username("root")
                .password("123456")
                .databaseList("test")
                .tableList("test.Flink_iceberg")
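                // Emit each change record as its toString() representation
                .deserializer(new StringDebeziumDeserializationSchema())
                // Snapshot the existing rows first, then continue from the binlog
                .startupOptions(StartupOptions.initial())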
                .build();
        DataStreamSource<String> ds = env.addSource(mySqlSource);
        ds.print();
        env.execute();
    }
}

The job prints captured change records; sample output shows SourceRecord and ConnectRecord structures.

Cluster Submission & Savepoint

Commands for submitting the job, creating a savepoint, and restarting from the savepoint are provided.

# bin/flink run -c com.wud.cdc2.FlinkCDC /opt/datas/Flink-cdc2.0-1.0-SNAPSHOT-jar-with-dependencies.jar
# bin/flink savepoint e8e918c2517a777e817c630cf1d6b932 hdfs://192.168.1.161:8020/cdc-test/savepoint
# bin/flink run -s hdfs://192.168.1.161:8020/cdc-test/savepoint/savepoint-e8e918-9ef094f349be -c com.wud.cdc2.FlinkCDC /opt/datas/Flink-cdc2.0-1.0-SNAPSHOT-jar-with-dependencies.jar
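Restarting from a savepoint matters for a CDC job because the source's read position is part of the saved state, so the restarted job continues where it stopped rather than starting over. The following is a simplified sketch of that idea, not Flink's actual mechanism: the class `SavepointResumeSketch`, the helper `readFrom`, and the integer offset are hypothetical stand-ins for the binlog position kept in state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SavepointResumeSketch {

    /** Returns the log entries from the given offset to the end of the log. */
    static List<String> readFrom(List<String> log, int offset) {
        return new ArrayList<>(log.subList(offset, log.size()));
    }

    public static void main(String[] args) {
        List<String> binlog = new ArrayList<>(Arrays.asList("insert#1", "update#1"));

        // First run: consume everything available, then "take a savepoint".
        List<String> firstRun = readFrom(binlog, 0);
        int savedOffset = firstRun.size(); // stand-in for the position stored in the savepoint

        // More changes arrive while the job is stopped.
        binlog.add("insert#2");
        binlog.add("delete#1");

        // Restart with the saved offset: only the unseen changes are read.
        System.out.println("resumed run reads: " + readFrom(binlog, savedOffset));
        // Restart without it: the log is read again from the beginning.
        System.out.println("fresh run reads:   " + readFrom(binlog, 0));
    }
}
```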

Custom Deserialization

A user‑defined CustomerDeserialization class extracts database, table, before/after fields, and operation type into a JSON string.

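import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.source.SourceRecord;

public class CustomerDeserialization implements DebeziumDeserializationSchema<String> {
    @Override
    public void deserialize(SourceRecord sourceRecord, Collector<String> collector) throws Exception {
        JSONObject result = new JSONObject();
        // The topic has three dot-separated parts; the second is the database, the third is the table
        String[] fields = sourceRecord.topic().split("\\.");
        result.put("db", fields[1]);
        result.put("tableName", fields[2]);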
        // extract before and after structs into JSON objects
        // ... (omitted for brevity) ...
        collector.collect(result.toJSONString());
    }
    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

The job is then updated to use this deserializer, and console output shows JSON records with "op", "before", "after", "db", and "tableName" fields.
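The shape of the record that such a deserializer produces can be sketched without any Flink or Debezium dependency. This is an illustrative stand-in, not the article's omitted code: the class `ChangeRecordSketch`, its helpers, and the sample topic string are assumptions, and a plain `Map` replaces the FastJSON object. The one-letter codes are the operation codes Debezium uses in its change events; the long names they are mapped to here are chosen only for readability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChangeRecordSketch {

    /** Maps a Debezium operation code to a readable name (the names are this sketch's choice). */
    static String opName(String code) {
        switch (code) {
            case "r": return "read";
            case "c": return "create";
            case "u": return "update";
            case "d": return "delete";
            default: throw new IllegalArgumentException("unknown op code: " + code);
        }
    }

    /** Replaces a missing row image with an empty map, so the output always has both fields. */
    static Map<String, Object> orEmpty(Map<String, Object> row) {
        return row == null ? new LinkedHashMap<String, Object>() : row;
    }

    /** Builds the flat record: db, tableName, before, after, op. */
    static Map<String, Object> toRecord(String topic, String opCode,
                                        Map<String, Object> before, Map<String, Object> after) {
        String[] fields = topic.split("\\.");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected <server>.<database>.<table>: " + topic);
        }
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("db", fields[1]);
        result.put("tableName", fields[2]);
        result.put("before", orEmpty(before));
        result.put("after", orEmpty(after));
        result.put("op", opName(opCode));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> after = new LinkedHashMap<>();
        after.put("id", 1L);
        after.put("name", "wudl");
        // A create event has no "before" image, so null is passed for it.
        System.out.println(toRecord("mysql_binlog_source.test.Flink_iceberg", "c", null, after));
    }
}
```

Checking the number of topic parts before indexing into the array avoids an `ArrayIndexOutOfBoundsException` on records whose topic does not follow the three-part pattern.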

Flink CDC 2.0 SQL ETL

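Flink SQL can express the same pipeline declaratively: define a CDC source table and a JDBC sink table, then copy rows from one MySQL table to another with a single INSERT statement. Declare a primary key on the sink table so that the JDBC connector writes in upsert mode and an update overwrites the existing row instead of producing a duplicate. Without a primary key the JDBC sink runs in append mode.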

CREATE TABLE mySqlSource (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING,
    age INT,
    dt STRING
) WITH (
    'connector' = 'mysql-cdc',
    'scan.startup.mode' = 'latest-offset',
    'hostname' = '192.168.1.180',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'test',
    'table-name' = 'Flink_iceberg'
);

CREATE TABLE mySqlSink (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING,
    age INT,
    dt STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://192.168.1.180:3306/test',
    'table-name' = 'Flink_iceberg-cdc',
    'username' = 'root',
    'password' = '123456'
);

INSERT INTO mySqlSink SELECT * FROM mySqlSource;

Running the job produces a series of JSON messages indicating reads, updates, and creates, confirming that CDC captures and forwards changes correctly.
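Why the sink needs a primary key can be seen by replaying a small changelog against a keyed and a keyless target. This is a simplified model rather than the JDBC connector's code: the class `UpsertSinkSketch`, the `Change` type, and the sample rows are hypothetical, and the keyless variant shows what would happen if update images were simply written as new rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UpsertSinkSketch {

    /** One changelog row: kind is "+I" (insert), "+U" (update image) or "-D" (delete). */
    static final class Change {
        final String kind;
        final long id;
        final String name;

        Change(String kind, long id, String name) {
            this.kind = kind;
            this.id = id;
            this.name = name;
        }
    }

    /** Keyless target: every insert or update image is written as a new row. */
    static List<String> applyAppend(List<Change> changes) {
        List<String> rows = new ArrayList<>();
        for (Change c : changes) {
            if (!c.kind.equals("-D")) {
                rows.add(c.id + ":" + c.name);
            }
        }
        return rows;
    }

    /** Keyed target: the primary key selects the row to overwrite or delete. */
    static Map<Long, String> applyUpsert(List<Change> changes) {
        Map<Long, String> table = new LinkedHashMap<>();
        for (Change c : changes) {
            if (c.kind.equals("-D")) {
                table.remove(c.id);
            } else {
                table.put(c.id, c.name);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Change> changes = Arrays.asList(
                new Change("+I", 1L, "wudl"),
                new Change("+U", 1L, "wudl-v2"),
                new Change("+I", 2L, "flink"),
                new Change("-D", 2L, "flink"));
        System.out.println("append: " + applyAppend(changes));
        System.out.println("upsert: " + applyUpsert(changes));
    }
}
```

With these four changes the keyless target ends up with three rows, two of them for id 1 and one for the deleted id 2, while the keyed target holds only the latest version of id 1.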

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
