
Introduction to Flink CDC: Incremental Snapshot Algorithm and Framework

This article introduces Flink CDC, explains its incremental snapshot algorithm and the 2.0 framework design, compares it with traditional CDC pipelines, discusses the core API and dialect concept, and outlines community growth and future plans, providing a comprehensive technical overview for data engineers.

DataFunTalk

Overview

Flink CDC is a change-data-capture (CDC) technology built on database logs that provides unified full-load and incremental reading capabilities, delivering a consistent snapshot of a table within Flink without requiring users to handle the merging of historical and real-time data.

Why Flink CDC simplifies ETL

Traditional pipelines use tools such as Canal or Debezium to write change logs to Kafka, after which Flink reads both the historical and incremental data and merges them. With Flink CDC the pipeline shortens dramatically: Flink CDC reads directly from the upstream source and leverages Flink's engine for ingestion, computation, and writing to the sink.

Supported APIs

Both the SQL API (easy to use, low entry barrier) and the DataStream API (highly extensible, suited to custom development) are supported, allowing users to choose the approach that best fits their use case.

Incremental Snapshot Algorithm: Problems in CDC 1.0

CDC 1.0 suffered from three major issues:

1. Consistency was ensured by locking, which is unfriendly to business workloads.
2. No horizontal scalability: the full load ran with a parallelism of one, causing long runtimes for large tables.
3. No checkpoint support during the full load, forcing a restart from the beginning on failure.

Lock analysis and redesign in CDC 2.0

CDC 2.0 removes locks entirely: it introduces lock-free reading, supports high-parallelism horizontal scaling, and adds checkpoint-based resumable reads. The design adopts a variant of the DBLog algorithm combined with Flink's FLIP-27 source interface.

DBLog algorithm basics

The algorithm splits a table into multiple chunks (primary-key ranges), and each chunk is assigned to a parallel reader. During the snapshot phase, each reader reads the full data of its chunk while simultaneously capturing the binlog events that touch that chunk, then merges the two to produce the latest consistent view at the chunk's close point.
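The merge step above can be sketched as a simple replay of concurrent change events over the snapshot rows. This is an illustrative sketch only; the function name and data shapes are assumptions, not Flink CDC APIs.

```python
def merge_chunk_snapshot(snapshot_rows, binlog_events):
    """Merge a chunk's full snapshot with the binlog events captured
    while that chunk was being read, yielding the latest consistent
    view of the chunk at its close point.

    snapshot_rows: {primary_key: row} read during the snapshot phase.
    binlog_events: ordered (op, primary_key, row) events; 'insert' and
                   'update' carry a row, 'delete' carries None.
    """
    view = dict(snapshot_rows)          # start from the snapshot
    for op, key, row in binlog_events:  # replay concurrent changes in order
        if op in ("insert", "update"):
            view[key] = row             # upsert: the change wins over the snapshot
        elif op == "delete":
            view.pop(key, None)         # the row was removed before the close point
    return view

# Example: while the chunk was being read, key 2 was updated and key 1 deleted.
snapshot = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
events = [("update", 2, {"id": 2, "v": "b2"}), ("delete", 1, None)]
print(merge_chunk_snapshot(snapshot, events))
# {2: {'id': 2, 'v': 'b2'}}
```

Because each chunk merges only the events that touch its own key range, the readers need no coordination during the snapshot phase, which is what makes lock-free, parallel reading possible.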

Chunk processing workflow

1. Chunk division: based on the primary-key range (min to max) and a configurable step size, the table is partitioned into left-closed, right-open intervals.
2. Chunk distribution: a SourceEnumerator distributes chunks to parallel SourceReaders, enabling horizontal scaling.
3. Snapshot reporting: each reader reports its chunk's close offset.
4. Binlog catch-up: after all snapshot chunks finish, a special binlog chunk aggregates the reported offsets; the BinlogReader then seeks to the appropriate offsets to avoid emitting duplicate data and continues with pure binlog consumption.
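Step 1 of the workflow can be sketched as follows. This is a minimal illustration of splitting a primary-key range into left-closed, right-open intervals; the function name and the integer-key assumption are hypothetical, and the real splitter also handles non-numeric and non-uniform keys.

```python
def split_into_chunks(min_key, max_key, chunk_size):
    """Partition the primary-key range [min_key, max_key] into
    left-closed, right-open chunks of roughly chunk_size keys.
    The tail chunk's upper bound is max_key + 1 so max_key is covered."""
    chunks = []
    lo = min_key
    while lo + chunk_size <= max_key:
        chunks.append((lo, lo + chunk_size))   # [lo, lo + chunk_size)
        lo += chunk_size
    chunks.append((lo, max_key + 1))           # tail chunk includes max_key
    return chunks

print(split_into_chunks(1, 10, 4))
# [(1, 5), (5, 9), (9, 11)]
```

Each resulting interval becomes one unit of work that the SourceEnumerator can hand to any idle SourceReader, which is what allows the full load to scale horizontally and be retried per chunk.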

Core value of the incremental snapshot algorithm

• Parallel, horizontally scalable reads for massive tables.
• Lock-free consistency via the DBLog-style algorithm.
• Checkpoint-driven resumable reads; failed tasks can be retried independently.
• Seamless full-to-incremental integration without manual switching.

Incremental Snapshot Framework

The framework abstracts the common logic into a DataSourceDialect API, allowing different data sources to plug in by implementing only their source-specific methods. For JDBC sources, a JdbcDataSourceDialect is provided, and implementing a MySQLDialect is sufficient to integrate MySQL into the framework.
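The dialect idea can be sketched as an abstract interface whose subclasses supply only the source-specific pieces. This is a toy illustration of the concept, not Flink CDC's actual Java interface; the method names and the single-method MySQL subclass are assumptions.

```python
from abc import ABC, abstractmethod

class DataSourceDialect(ABC):
    """Sketch of the dialect concept: the framework owns chunk splitting,
    distribution, and checkpointing; a dialect supplies only the
    source-specific operations (names here are illustrative)."""

    @abstractmethod
    def discover_tables(self):
        """Return the names of the tables to capture."""

    @abstractmethod
    def chunk_query(self, table, low, high):
        """Return the query reading one left-closed, right-open chunk."""

class MySQLDialect(DataSourceDialect):
    """Toy MySQL-flavoured dialect. The real one also handles binlog
    offsets, schema discovery, quoting, and more."""

    def discover_tables(self):
        return ["orders"]  # hypothetical fixed table list

    def chunk_query(self, table, low, high):
        return f"SELECT * FROM {table} WHERE id >= {low} AND id < {high}"

d = MySQLDialect()
print(d.chunk_query("orders", 1, 5))
# SELECT * FROM orders WHERE id >= 1 AND id < 5
```

The payoff of this split is that the incremental snapshot machinery is written once; adding a new database means implementing a small dialect rather than a whole connector.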

Community development

Since the release of version 2.0, the Flink CDC community has grown to over 6,000 developers, with contributions from many leading internet companies. Future plans focus on promoting the framework, deepening ecosystem integration, and improving usability.

Q&A Highlights

• Chunk count is determined by a default target size (~8 KB); it can be calculated from the min/max primary keys, or chunks can be discovered lazily for non-uniform keys.
• The incremental merge is performed in memory with upserts and can be parallelised across many tasks.
• Currently a single reader handles the binlog phase, as most upstream change logs are single-file streams.

Tags: Big Data, Apache Flink, Flink CDC, Change Data Capture, Data Streaming, Incremental Snapshot
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
