Flink CDC 2.0: Concepts, Architecture, and Hands‑On Implementation
This article introduces the fundamentals of Flink CDC, explains its application scenarios and underlying technologies, compares query‑based and log‑based CDC, showcases open‑source solutions, and provides detailed Java and SQL examples for building real‑time ETL pipelines with MySQL and Flink.
CDC (Change Data Capture) refers to any technology that can capture data changes; in practice it mainly targets database modifications.
Application Scenarios
Data synchronization for backup and disaster recovery, data distribution to multiple downstream systems, and data collection for ETL into data warehouses or lakes.
CDC Techniques
Two mainstream mechanisms exist:
Query‑based CDC – offline batch queries, no real‑time guarantees, and limited consistency.
Log‑based CDC – real‑time log consumption, strong consistency, and low latency.
Open‑Source CDC Solutions
Several diagrams illustrate popular open‑source connectors (e.g., Debezium, Flink CDC).
Flink CDC 2.0 Design Details
The article links to the official GitHub repository and shows supported connectors.
Practical Maven Setup
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>Flink-learning</artifactId>
<groupId>com.wudl.flink</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>Flink-cdc2.0</artifactId>
<properties>
<flink-version>1.13.0</flink-version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
... (MySQL, Flink, Ververica CDC, FastJSON, etc.) ...
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>Java Job Example
package com.wud.cdc2;
import com.ververica.cdc.connectors.mysql.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.DebeziumSourceFunction;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
public class FlinkCDC {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(5000L);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStateBackend(new FsStateBackend("hdfs://192.168.1.161:8020/cdc2.0-test/ck"));
System.setProperty("HADOOP_USER_NAME", "hdfs");
DebeziumSourceFunction<String> mySqlSource = MySqlSource.<String>builder()
.hostname("192.168.1.180")
.port(3306)
.username("root")
.password("123456")
.databaseList("test")
.tableList("test.Flink_iceberg")
.deserializer(new StringDebeziumDeserializationSchema())
.startupOptions(StartupOptions.initial())
.build();
DataStreamSource<String> ds = env.addSource(mySqlSource);
ds.print();
env.execute();
}
}The job prints captured change records; sample output shows SourceRecord and ConnectRecord structures.
Cluster Submission & Savepoint
Commands for submitting the job, creating a savepoint, and restarting from the savepoint are provided.
# bin/flink run -c com.wud.cdc2.FlinkCDC /opt/datas/Flink-cdc2.0-1.0-SNAPSHOT-jar-with-dependencies.jar
# bin/flink savepoint e8e918c2517a777e817c630cf1d6b932 hdfs://192.168.1.161:8020/cdc-test/savepoint
# bin/flink run -s hdfs://192.168.1.161:8020/cdc-test/savepoint/savepoint-e8e918-9ef094f349be -c com.wud.cdc2.FlinkCDC /opt/datas/Flink-cdc2.0-1.0-SNAPSHOT-jar-with-dependencies.jarCustom Deserialization
A user‑defined CustomerDeserialization class extracts database, table, before/after fields, and operation type into a JSON string.
public class CustomerDeserialization implements DebeziumDeserializationSchema<String> {
@Override
public void deserialize(SourceRecord sourceRecord, Collector<String> collector) throws Exception {
JSONObject result = new JSONObject();
String[] fields = sourceRecord.topic().split("\\.");
result.put("db", fields[1]);
result.put("tableName", fields[2]);
// extract before and after structs into JSON objects
// ... (omitted for brevity) ...
collector.collect(result.toJSONString());
}
@Override
public TypeInformation<String> getProducedType() {
return BasicTypeInfo.STRING_TYPE_INFO;
}
}The job is then updated to use this deserializer, and console output shows JSON records with "op", "before", "after", "db", and "tableName" fields.
Flink CDC 2.0 SQL ETL
Using Flink SQL, a CDC source table and a JDBC sink table are defined; the pipeline inserts data from MySQL to another MySQL table. Primary keys are required to avoid duplicate rows on updates.
CREATE TABLE mySqlSource (
id BIGINT PRIMARY KEY,
name STRING,
age INT,
dt STRING
) WITH (
'connector' = 'mysql-cdc',
'scan.startup.mode' = 'latest-offset',
'hostname' = '192.168.1.180',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'database-name' = 'test',
'table-name' = 'Flink_iceberg'
);
CREATE TABLE mySqlSink (
id BIGINT PRIMARY KEY,
name STRING,
age INT,
dt STRING
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://192.168.1.180:3306/test',
'table-name' = 'Flink_iceberg-cdc',
'username' = 'root',
'password' = '123456'
);
INSERT INTO mySqlSink SELECT * FROM mySqlSource;Running the job produces a series of JSON messages indicating reads, updates, and creates, confirming that CDC captures and forwards changes correctly.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
