
Real-time CDC Data Read/Write Solutions in Data Lake Architecture with Flink and Iceberg

This article, compiled by community volunteers, examines real‑time CDC read/write solutions for data‑lake architectures, comparing offline HBase, Apache Kudu, Hive, and Spark + Delta before advocating Flink + Iceberg for efficient, correct, and scalable streaming ingestion and analytics.


Abstract

This article, authored by Li Jinsong and Hu Zheng and organized by community volunteers Yang Weihai and Li Peidian, introduces real‑time CDC data read/write solutions within data‑lake architectures. It is divided into four parts: common CDC analysis schemes, reasons for choosing Flink + Iceberg, how to write and read in real time, and future plans.

1. Common CDC Analysis Schemes

Offline HBase cluster analysis of CDC data – HBase provides low‑latency point queries but is unsuitable for OLAP because it uses row‑store indexing, has high maintenance cost, and stores data in HFile, which does not integrate well with columnar formats such as Parquet.

Apache Kudu – combines HBase‑style point lookups with columnar storage, improving OLAP performance, but it still carries high maintenance cost, limited batch‑scan speed, weak delete support, and no incremental pull.

Direct import of CDC into Hive – uses full‑partition plus daily incremental partitions with a merge step; however, it introduces latency (typically T+1), requires full data rewrite on each merge, and is resource‑intensive.
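To make the cost concrete, the classic Hive merge can be sketched as a periodic full rewrite. This is only an illustration: the tables (user_full, user_incr) and columns (id, name, ts) are hypothetical, and the job is expressed through Spark SQL for brevity.

```java
import org.apache.spark.sql.SparkSession;

public class HiveCdcMergeSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hive-cdc-merge")
            .enableHiveSupport()
            .getOrCreate();

        // Union the existing full table with the day's incremental partition,
        // keep only the latest version of each key, then rewrite everything.
        spark.sql(
            "INSERT OVERWRITE TABLE user_full " +
            "SELECT id, name, ts FROM (" +
            "  SELECT id, name, ts, " +
            "         ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn " +
            "  FROM (SELECT id, name, ts FROM user_full " +
            "        UNION ALL " +
            "        SELECT id, name, ts FROM user_incr) unioned " +
            ") ranked WHERE rn = 1");

        spark.stop();
    }
}
```

Even when the incremental partition is tiny, the whole of user_full is read and rewritten each cycle, which is why latency stays at T+1.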

Spark + Delta – leverages Delta’s MERGE INTO syntax to rewrite only changed files, offering faster incremental updates compared with Hive.
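A hedged sketch of the same merge in Delta, reusing the spark session and hypothetical tables above (plus an op column carrying the change type); only the files containing matched keys are rewritten:

```java
// Incremental merge: Delta locates the files holding matched keys and
// rewrites only those, instead of the entire table.
spark.sql(
    "MERGE INTO user_full t " +
    "USING user_incr s " +
    "ON t.id = s.id " +
    "WHEN MATCHED AND s.op = 'DELETE' THEN DELETE " +
    "WHEN MATCHED THEN UPDATE SET t.name = s.name, t.ts = s.ts " +
    "WHEN NOT MATCHED THEN INSERT (id, name, ts) VALUES (s.id, s.name, s.ts)");
```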

2. Why Choose Flink + Iceberg

2.1 Flink’s native CDC support

Flink consumes CDC streams natively via the Debezium format (and similar change‑log formats) without declaring extra flag columns in SQL; the INSERT/UPDATE/DELETE type travels with each row as implicit change‑log metadata.
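For example, a Kafka topic carrying Debezium JSON can be declared as an ordinary table; the topic, brokers, and columns below are placeholders. Note that the schema lists only business columns: Flink maps Debezium's before/after records to internal row kinds (+I, -U, +U, -D) rather than to a visible flag column.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(
    EnvironmentSettings.newInstance().inStreamingMode().build());

// Only business columns are declared; the change type rides along as
// row-kind metadata on each emitted row.
tEnv.executeSql(
    "CREATE TABLE user_binlog (" +
    "  id BIGINT," +
    "  name STRING" +
    ") WITH (" +
    "  'connector' = 'kafka'," +
    "  'topic' = 'user-changes'," +
    "  'properties.bootstrap.servers' = 'kafka:9092'," +
    "  'properties.group.id' = 'cdc-demo'," +
    "  'scan.startup.mode' = 'earliest-offset'," +
    "  'format' = 'debezium-json'" +
    ")");
```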

2.2 Flink’s support for Change‑Log streams

After connecting a change‑log stream, Flink’s topology does not need to handle change‑log flags; the pipeline can be defined purely by business logic and write to Iceberg.
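Concretely, the pipeline body stays plain business SQL. Continuing the sketch above, and assuming an Iceberg catalog named iceberg_catalog with a table db.user_snapshot has been registered:

```java
// Plain projection logic; the -U/+U/-D flags travel with each row
// implicitly and are interpreted by the Iceberg sink.
tEnv.executeSql(
    "INSERT INTO iceberg_catalog.db.user_snapshot " +
    "SELECT id, UPPER(name) AS name FROM user_binlog");
```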

2.3 Evaluation of Flink + Iceberg CDC import

Two main approaches are discussed: Copy‑On‑Write, which rewrites every data file touched by an update and works well when updates concentrate in a few files, and Merge‑On‑Read, which appends CDC‑flagged delete records and merges them with data files at read time, enabling near‑real‑time visibility. Iceberg itself offers a unified table format, columnar storage, snapshot‑based incremental reads, and no online service nodes, making it suitable for both batch and streaming workloads.
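In current Iceberg releases this trade-off is exposed per operation as table properties on a format-version 2 table. A hedged sketch via the Spark session above (catalog and table names are placeholders):

```java
// v2 enables row-level deletes; each operation independently chooses
// copy-on-write (rewrite touched files) or merge-on-read (write delete files).
spark.sql(
    "ALTER TABLE catalog.db.user_snapshot SET TBLPROPERTIES (" +
    "  'format-version' = '2'," +
    "  'write.delete.mode' = 'merge-on-read'," +
    "  'write.update.mode' = 'merge-on-read'" +
    ")");
```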

3. How to Write and Read in Real Time

3.1 Batch update and CDC write scenarios

Batch updates include massive SQL‑based row updates (e.g., GDPR user deletion) and large‑scale conditional deletions without primary‑key constraints.
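Both shapes reduce to row-level SQL against the lake table, for instance (hypothetical tables, via the Spark session above):

```java
// Keyed deletion, e.g. a GDPR erasure request for a single user.
spark.sql("DELETE FROM catalog.db.users WHERE user_id = 42");

// Large-scale conditional deletion with no primary key involved.
spark.sql(
    "DELETE FROM catalog.db.events " +
    "WHERE region = 'EU' AND event_date < DATE '2019-01-01'");
```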

CDC write scenarios involve fast binlog ingestion to the lake and upsert streams that require high‑frequency, low‑latency writes.
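With the Flink connector, upsert behavior is declared on the Iceberg table itself. A sketch of a v2 upsert table that the Debezium stream above could be written into (names remain placeholders):

```java
// PRIMARY KEY defines the upsert key: each incoming key is first deleted
// (via an equality delete) and then re-inserted with its latest values.
tEnv.executeSql(
    "CREATE TABLE iceberg_catalog.db.user_snapshot (" +
    "  id BIGINT," +
    "  name STRING," +
    "  PRIMARY KEY (id) NOT ENFORCED" +
    ") WITH (" +
    "  'format-version' = '2'," +
    "  'write.upsert.enabled' = 'true'" +
    ")");
```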

3.2 Design considerations for Iceberg CDC writes

Correctness – upsert data must stay consistent with upstream sources.

High‑throughput writes – handle frequent upserts efficiently.

Fast reads – support fine‑grained concurrency and columnar acceleration.

Incremental reads – enable ETL‑style incremental processing.
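The last point is directly visible in the Flink source: an Iceberg table can be scanned as an unbounded stream of committed snapshots. A sketch using Flink SQL hint options and the tEnv from earlier:

```java
// Incremental scan: emit the files added by each new snapshot as it
// commits, polling the table at the given interval.
tEnv.executeSql(
    "SELECT * FROM iceberg_catalog.db.user_snapshot " +
    "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */")
    .print();
```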

3.3 Iceberg basics

Iceberg stores data files (e.g., Parquet) alongside metadata files (table metadata, snapshots, manifest lists, and manifests). Each snapshot references a manifest list, whose manifests map data files to partitions.
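This layered metadata can be walked with Iceberg's Java API; a small sketch, assuming a Hadoop-path table at a hypothetical location:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

Table table = new HadoopTables(new Configuration())
    .load("hdfs://namenode/warehouse/db/user_snapshot");

// snapshot -> manifest list -> manifests -> data files
for (Snapshot snap : table.snapshots()) {
    System.out.printf("snapshot %d -> %s%n",
        snap.snapshotId(), snap.manifestListLocation());
}
```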

3.4 INSERT, UPDATE, DELETE handling

Updates are decomposed into delete + insert operations. Position‑delete files record file‑path and row‑offset deletions, while equality‑delete files record key‑based deletions, ensuring correct semantics across transactions.
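To make the split concrete, here is how a single upstream UPDATE of key id = 1 would typically decompose inside one writer, following the same-checkpoint versus earlier-checkpoint rule used by the Flink sink (paths and offsets are invented for illustration):

```java
// Upstream: UPDATE users SET name = 'b' WHERE id = 1
// arrives as a retraction (-U, old row) plus an insert (+U, new row).
//
// Old row committed in a previous checkpoint -> equality delete on the key:
//     { id: 1 }
//
// Old row written earlier within the CURRENT checkpoint -> the writer still
// knows its exact location, so it emits a position delete instead:
//     { file_path: ".../data-00001.parquet", pos: 7 }
//
// The +U half is then written as an ordinary insert into a new data file:
//     { id: 1, name: "b" }
```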

3.5 Manifest file design

Manifests are split into data manifests and delete manifests to quickly locate delete files for each data file, enabling balanced task distribution during merges.

3.6 File‑level concurrency

Iceberg supports concurrent reads at the file and sub‑file level (e.g., splitting a 256 MB file into two 128 MB parallel reads), improving throughput.
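The split granularity is a tunable read option. A sketch with hint syntax, mirroring the 128 MB example above (I am assuming the Flink 'split-size' read option here, which overrides the table's read.split.target-size property):

```java
// Files larger than the target split size are cut into multiple splits,
// each scanned by its own task.
tEnv.executeSql(
    "SELECT COUNT(*) FROM iceberg_catalog.db.user_snapshot " +
    "/*+ OPTIONS('split-size'='134217728') */")  // 128 MB per split
    .print();
```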

3.7 Transaction commit of incremental file sets

IcebergStreamWriter subtasks write the data files, while a single IcebergFilesCommitter gathers their results and commits the transaction to the table metadata at each checkpoint, making the data visible without any online service node.
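At the DataStream level the same two-stage topology is assembled by the sink builder. A minimal sketch, assuming an upstream DataStream&lt;RowData&gt; change-log and the Hadoop-path table from earlier (the equality column id is hypothetical):

```java
import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

static void writeCdcToIceberg(DataStream<RowData> changeLog) {
    TableLoader loader = TableLoader.fromHadoopTable(
        "hdfs://namenode/warehouse/db/user_snapshot");

    // Parallel IcebergStreamWriter subtasks write data and delete files;
    // a single IcebergFilesCommitter collects their results and commits
    // one snapshot per Flink checkpoint.
    FlinkSink.forRowData(changeLog)
        .tableLoader(loader)
        .equalityFieldColumns(Arrays.asList("id"))  // upsert key (assumed)
        .upsert(true)
        .append();
}
```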

4. Future Planning

Planned improvements include core Iceberg optimizations, full‑link stability testing, CDC incremental pull Table API, automatic/manual merge of CDC files in Flink, and broader ecosystem integration with Spark, Presto, and Alluxio acceleration.
