Real‑time CDC Data Read/Write in Data Lakes: Flink + Iceberg and Alternative Solutions

This article reviews common CDC analysis architectures, compares offline HBase, Apache Kudu, Hive and Spark + Delta approaches, and explains why Flink combined with Iceberg offers a more efficient, low‑latency solution for real‑time CDC ingestion, storage, and query in modern data lake environments.

DataFunTalk

Feb 24, 2021

1. Common CDC Analysis Solutions

A CDC pipeline takes either change‑log (CDC) data or an upsert stream as input and writes it to a database or to data‑lake storage for OLAP analysis.

1.1 Analyzing CDC Data with an Offline HBase Cluster

Using Flink to process CDC upsert data and write it to HBase gives low‑latency point lookups. However, HBase’s row‑store design, high maintenance cost, and incompatibility with columnar formats such as Parquet and ORC make it a poor fit for large‑scale OLAP queries.

1.2 Apache Kudu for CDC Data Sets

Kudu combines HBase‑like point lookups with columnar storage, which improves OLAP performance. Its drawbacks are higher operational overhead, limited batch‑scan speed, weak delete support, and no incremental pull.

1.3 Direct CDC Import into Hive

In this approach Hive keeps the CDC data in partitioned full tables and merges each day's increment into the full data. The result is T+1 latency, and every merge rewrites the full data set, so real‑time capability is poor.
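To make the cost of the daily merge concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Hive code: the function name `merge_daily` and the record layout (an `id` key plus an `op` field of `upsert` or `delete`) are assumptions made for this example.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_daily(full: List[Row], increment: List[Row]) -> List[Row]:
    """Build the next full partition from the previous one plus one day of changes.

    Each increment row carries a hypothetical 'op' field ('upsert' or 'delete').
    For the same id, later increment rows win over earlier ones.
    """
    # Every existing row is read again, whether or not it changed.
    merged = {row["id"]: row for row in full}
    for change in increment:
        key = change["id"]
        if change["op"] == "delete":
            merged.pop(key, None)
        else:
            merged[key] = {k: v for k, v in change.items() if k != "op"}
    # The whole result is written out as a new full partition.
    return sorted(merged.values(), key=lambda r: r["id"])


yesterday = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
changes = [
    {"op": "upsert", "id": 2, "name": "b2"},
    {"op": "delete", "id": 3},
    {"op": "upsert", "id": 4, "name": "d"},
]
today = merge_daily(yesterday, changes)
```

Only three keys changed here, yet the output contains every surviving row. That is the full rewrite that drives the T+1 latency.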

1.4 Analyzing CDC Data with Spark + Delta

Spark + Delta leverages the MERGE INTO syntax to rewrite only changed files, offering efficient incremental updates, but still incurs higher latency than true streaming solutions.
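The "rewrite only changed files" behaviour can be modelled in a few lines of Python. This is a conceptual sketch, not Delta's implementation: the table is assumed to be a plain mapping from file name to rows, and `copy_on_write_merge`, the `.rewritten` suffix, and the `id` key are hypothetical names chosen for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Files = Dict[str, List[Row]]  # file name -> rows stored in that file


def copy_on_write_merge(
    files: Files, updates: List[Row], new_file: str
) -> Tuple[Files, List[str]]:
    """Apply upserts keyed by 'id', rewriting only the files that contain a match."""
    by_id = {u["id"]: u for u in updates}
    result: Files = {}
    rewritten: List[str] = []
    matched = set()
    for name, rows in files.items():
        hits = {r["id"] for r in rows if r["id"] in by_id}
        if not hits:
            result[name] = rows  # untouched file is carried over as-is
            continue
        matched |= hits
        rewritten.append(name)
        # A matched file is replaced by a new copy with the updated rows.
        result[name + ".rewritten"] = [by_id.get(r["id"], r) for r in rows]
    # Keys that matched nothing are plain inserts and go to a new file.
    inserts = [u for u in updates if u["id"] not in matched]
    if inserts:
        result[new_file] = inserts
    return result, rewritten


table = {
    "part-0": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "part-1": [{"id": 3, "v": "c"}],
}
merged, rewritten = copy_on_write_merge(
    table, [{"id": 2, "v": "b2"}, {"id": 9, "v": "z"}], "part-2"
)
```

In this run only `part-0` is rewritten and `part-1` is reused unchanged. The work still happens per batch, which is where the extra latency comes from.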

2. Why Choose Flink + Iceberg

2.1 Flink’s Native CDC Support

Flink can consume CDC streams (for example, Debezium) directly. The change type of each row is captured automatically as hidden metadata rather than in an explicit, user‑declared CDC column, which simplifies SQL development.
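As a rough illustration of what that decoding involves, the Python sketch below turns a Debezium‑style JSON event into change rows tagged with a row kind. It assumes the schema‑less Debezium envelope with `before`, `after`, and `op` fields; the function name `decode_debezium` is hypothetical, and the code models the idea rather than Flink's actual format implementation.

```python
import json
from typing import Dict, List, Optional, Tuple

Payload = Optional[Dict[str, object]]
ChangeRow = Tuple[str, Payload]  # (row kind, row image)


def decode_debezium(message: str) -> List[ChangeRow]:
    """Map one Debezium envelope to change rows: +I, -U, +U or -D."""
    event = json.loads(message)
    op = event["op"]
    before, after = event.get("before"), event.get("after")
    if op in ("c", "r"):  # create, or snapshot read
        return [("+I", after)]
    if op == "u":  # an update carries both the old and the new image
        return [("-U", before), ("+U", after)]
    if op == "d":
        return [("-D", before)]
    raise ValueError(f"unsupported op: {op!r}")


update_event = json.dumps(
    {"before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "a2"}, "op": "u"}
)
rows = decode_debezium(update_event)
```

The row kind travels alongside the row rather than inside it, so a query over the table never has to mention it.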

2.2 Flink’s Change‑Log Stream Support

Flink processes change‑log streams without requiring the user to handle CDC flags, allowing seamless downstream writes to Iceberg.
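The sketch below shows, in plain Python, what applying such a change‑log stream to a keyed table amounts to. The `+I`, `-U`, `+U`, and `-D` tags mirror the short names of Flink's row kinds; the function `apply_changelog` and the `(kind, key, value)` tuple layout are assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

Change = Tuple[str, int, str]  # (row kind, primary key, value)


def apply_changelog(changes: Iterable[Change]) -> Dict[int, str]:
    """Materialize a change-log stream into the current state of a keyed table."""
    table: Dict[int, str] = {}
    for kind, key, value in changes:
        if kind in ("+I", "+U"):  # insert, or new image of an update
            table[key] = value
        elif kind in ("-U", "-D"):  # old image of an update, or delete
            table.pop(key, None)
        else:
            raise ValueError(f"unknown row kind: {kind!r}")
    return table


stream = [
    ("+I", 1, "a"),
    ("+I", 2, "b"),
    ("-U", 1, "a"),
    ("+U", 1, "a2"),
    ("-D", 2, "b"),
]
state = apply_changelog(stream)
```

A sink that understands these row kinds can apply the same logic on the user's behalf, which is why the SQL author never handles CDC flags directly.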

2.3 Evaluation of Flink + Iceberg CDC Import

Copy‑On‑Write (CoW) keeps reads fast and suits bulk updates, while Merge‑On‑Read (MoR) makes near‑real‑time upserts practical. Iceberg’s snapshot‑based table format supports columnar storage, multiple compute engines, and incremental reads, all without an online service layer.

3. Real‑time Write and Read

3.1 Batch Update and CDC Write Scenarios

Batch updates cover two cases: GDPR‑style full deletions and large‑scale conditional deletions. CDC writes also cover two cases: fast binlog ingestion and upsert streams that require high‑frequency updates.

3.2 Iceberg CDC Write Design Considerations

Correctness – ensure downstream data matches upstream upserts.

Efficient Write – support high‑throughput, concurrent upserts.

Fast Read – enable fine‑grained parallelism and columnar scan acceleration.

Incremental Read – allow ETL‑style incremental consumption.

3.3 Iceberg Insert, Update, Delete Mechanics

Updates are decomposed into delete + insert operations; delete files can be position‑based, equality‑based, or a mix to guarantee correctness across transactions.
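A small Python model can show how the two kinds of delete files work together at read time. It is a sketch under stated assumptions: the class and function names are hypothetical, and the sequence‑number rule follows the Iceberg format spec as understood here (a position delete applies to data files with a sequence number less than or equal to its own, an equality delete only to data files with a strictly smaller one).

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Row = Tuple[int, str]  # (id, value)


@dataclass
class DataFile:
    name: str
    seq: int  # sequence number of the committing transaction
    rows: List[Row]


@dataclass
class PositionDelete:
    seq: int
    positions: Set[Tuple[str, int]]  # (data file name, row position)


@dataclass
class EqualityDelete:
    seq: int
    ids: Set[int]  # rows whose id is in this set are deleted


def read_file(
    f: DataFile, pos_deletes: List[PositionDelete], eq_deletes: List[EqualityDelete]
) -> List[Row]:
    """Return the live rows of one data file after applying delete files."""
    live = []
    for pos, row in enumerate(f.rows):
        by_pos = any(f.seq <= d.seq and (f.name, pos) in d.positions for d in pos_deletes)
        by_eq = any(f.seq < d.seq and row[0] in d.ids for d in eq_deletes)
        if not (by_pos or by_eq):
            live.append(row)
    return live


# Transaction 1 inserts ids 1 and 2.
f1 = DataFile("f1", 1, [(1, "a"), (2, "b")])
# Transaction 2 updates id 1 (equality delete + insert) and
# inserts then deletes id 3 within the same transaction (position delete).
f2 = DataFile("f2", 2, [(1, "a2"), (3, "tmp")])
eq = [EqualityDelete(2, {1})]
pos = [PositionDelete(2, {("f2", 1)})]
table = read_file(f1, pos, eq) + read_file(f2, pos, eq)
```

The strict comparison for equality deletes is what keeps the new `(1, "a2")` row alive even though the same transaction deleted `id = 1`. Rows written and removed inside one transaction need the position delete instead.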

3.4 Manifest File Design

Separating data manifests from delete manifests enables quick lookup of relevant delete files per data file, facilitating balanced task distribution.
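The planning step this enables might look roughly like the Python sketch below. It is a simplification with hypothetical names (`Entry`, `plan_tasks`): delete entries are indexed by partition once, and each data file then picks up only the delete files that can affect it, using the same sequence‑number assumption as the Iceberg spec (less than or equal for position deletes, strictly less for equality deletes).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    path: str
    partition: str
    seq: int
    kind: str  # "data", "pos-delete" or "eq-delete"


def plan_tasks(data_manifest: List[Entry], delete_manifest: List[Entry]) -> Dict[str, List[str]]:
    """Map each data file to the delete files a reader must apply to it."""
    deletes_by_partition: Dict[str, List[Entry]] = defaultdict(list)
    for d in delete_manifest:
        deletes_by_partition[d.partition].append(d)

    tasks: Dict[str, List[str]] = {}
    for f in data_manifest:
        applicable = []
        for d in deletes_by_partition.get(f.partition, []):
            if d.kind == "pos-delete" and f.seq <= d.seq:
                applicable.append(d.path)
            elif d.kind == "eq-delete" and f.seq < d.seq:
                applicable.append(d.path)
        tasks[f.path] = applicable
    return tasks


data = [
    Entry("d1", "dt=01", 1, "data"),
    Entry("d2", "dt=01", 2, "data"),
    Entry("d3", "dt=02", 1, "data"),
]
deletes = [
    Entry("eq1", "dt=01", 2, "eq-delete"),
    Entry("pos1", "dt=01", 2, "pos-delete"),
    Entry("eq2", "dt=02", 1, "eq-delete"),
]
tasks = plan_tasks(data, deletes)
```

Because each task carries its own short list of delete files, tasks can be sized and distributed independently of one another.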

3.5 File‑Level Concurrency

Iceberg supports parallel reads at the file and sub‑file level (e.g., splitting a 256 MB file into two 128 MB splits), improving task parallelism and overall throughput.
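The split arithmetic itself is simple, as the Python sketch below shows. The function name `plan_splits` is hypothetical, and the sketch cuts purely by byte offset; a real reader would be expected to align splits with the file format's internal block boundaries.

```python
from typing import List, Tuple

MB = 1024 * 1024


def plan_splits(file_size: int, target_split_size: int) -> List[Tuple[int, int]]:
    """Cut one file into (offset, length) splits of at most target_split_size bytes."""
    if target_split_size <= 0:
        raise ValueError("target_split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        length = min(target_split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


two_halves = plan_splits(256 * MB, 128 * MB)
uneven = plan_splits(300 * MB, 128 * MB)
```

Here `two_halves` holds the two 128 MB splits from the example above, and `uneven` holds three splits with a shorter final one. Each split can be handed to a separate read task.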

3.6 Incremental Transaction Commit

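IcebergStreamWriter operators write the data files, while IcebergFilesCommitter collects the files produced for each checkpoint and commits them as one Iceberg transaction. The new snapshot becomes visible to readers once the commit succeeds, with no external service involved.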
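The division of labour between writers and committer can be sketched as a toy Python model. `ToyWriter` and `ToyCommitter` are hypothetical stand‑ins, not the actual Flink operators, and the checkpoint‑driven flow (flush on checkpoint, commit once the checkpoint completes) is an assumption about how the two pieces cooperate.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyWriter:
    writer_id: int
    buffer: List[str] = field(default_factory=list)
    files_written: int = 0

    def write(self, row: str) -> None:
        self.buffer.append(row)

    def flush(self, checkpoint_id: int) -> List[str]:
        """Close the current file at a checkpoint and report its name."""
        if not self.buffer:
            return []
        self.files_written += 1
        name = f"data-w{self.writer_id}-cp{checkpoint_id}-{self.files_written}.parquet"
        self.buffer.clear()
        return [name]


@dataclass
class ToyCommitter:
    pending: Dict[int, List[str]] = field(default_factory=dict)
    snapshots: List[List[str]] = field(default_factory=list)

    def collect(self, checkpoint_id: int, files: List[str]) -> None:
        self.pending.setdefault(checkpoint_id, []).extend(files)

    def checkpoint_complete(self, checkpoint_id: int) -> None:
        """Commit all files up to this checkpoint as one new snapshot."""
        files: List[str] = []
        for cp in sorted(c for c in self.pending if c <= checkpoint_id):
            files.extend(self.pending.pop(cp))
        if files:  # an empty checkpoint produces no snapshot
            self.snapshots.append(files)


writers = [ToyWriter(0), ToyWriter(1)]
committer = ToyCommitter()
writers[0].write("r1")
writers[0].write("r2")
writers[1].write("r3")
for w in writers:
    committer.collect(1, w.flush(1))
committer.checkpoint_complete(1)
```

Writers scale out freely because they never touch table metadata. Only the committer changes what readers can see, and it does so once per checkpoint.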

4. Future Planning

Planned work includes core Iceberg optimizations, CDC incremental pull APIs, automatic and manual file‑merge capabilities in Flink, broader ecosystem integration (Spark, Presto) and Alluxio acceleration.
