Real-time CDC Data Read/Write Solutions in Data Lake Architecture with Flink and Iceberg
This article, compiled by community volunteers, surveys real‑time CDC read/write solutions for data‑lake architectures, comparing offline HBase analysis, Apache Kudu, direct Hive import, and Spark + Delta before advocating Flink + Iceberg for efficient, correct, and scalable streaming ingestion and analytics.
Abstract
This article, authored by Li Jinsong and Hu Zheng and organized by community volunteers Yang Weihai and Li Peidian, introduces real‑time CDC data read/write solutions within data‑lake architectures. It is divided into four parts: common CDC analysis schemes, reasons for choosing Flink + Iceberg, how to write and read in real time, and future plans.
1. Common CDC Analysis Schemes
Offline HBase cluster analysis of CDC data – HBase provides low‑latency point queries but is unsuitable for OLAP because it uses row‑store indexing, has high maintenance cost, and stores data in HFile, which does not integrate well with columnar formats such as Parquet.
Apache Kudu – adds column‑store capabilities to HBase‑like point queries, improving OLAP performance, but suffers from higher maintenance cost, limited batch‑scan speed, weak delete support, and lack of incremental pull.
Direct import of CDC into Hive – uses full‑partition plus daily incremental partitions with a merge step; however, it introduces latency (typically T+1), requires full data rewrite on each merge, and is resource‑intensive.
Spark + Delta – leverages Delta’s MERGE INTO syntax to rewrite only changed files, offering faster incremental updates compared with Hive.
2. Why Choose Flink + Iceberg
2.1 Flink’s native CDC support
Flink can consume CDC streams directly via Debezium and other changelog formats. The change flag of each record (insert/update/delete) travels with the row as internal metadata (Flink's RowKind), so user SQL needs no extra hidden columns to represent it.
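To make the mapping concrete, here is a minimal Python sketch (not Flink code) of how a Debezium‑format change event expands into flagged changelog rows; the `+I`/`-U`/`+U`/`-D` flags mirror Flink's RowKind, and the envelope fields (`op`, `before`, `after`) follow Debezium's JSON format:

```python
import json

# Map Debezium operation codes to Flink-style RowKind flags.
# An update ("u") expands into an UPDATE_BEFORE and an UPDATE_AFTER row.
OP_TO_ROWKIND = {"c": ["+I"], "r": ["+I"], "d": ["-D"], "u": ["-U", "+U"]}

def debezium_to_changelog(event_json: str):
    """Turn one Debezium envelope into (flag, row) pairs."""
    event = json.loads(event_json)
    pairs = []
    for kind in OP_TO_ROWKIND[event["op"]]:
        # Use the before-image for deletes/update-before, after-image otherwise.
        row = event["before"] if kind in ("-D", "-U") else event["after"]
        pairs.append((kind, row))
    return pairs

update = json.dumps({
    "op": "u",
    "before": {"id": 1, "name": "old"},
    "after": {"id": 1, "name": "new"},
})
print(debezium_to_changelog(update))
# [('-U', {'id': 1, 'name': 'old'}), ('+U', {'id': 1, 'name': 'new'})]
```

Because the flag is attached per row rather than declared as a column, downstream SQL operators see an ordinary schema and the engine handles retraction semantics internally.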
2.2 Flink’s support for Change‑Log streams
After connecting a change‑log stream, Flink’s topology does not need to handle change‑log flags; the pipeline can be defined purely by business logic and write to Iceberg.
2.3 Evaluation of Flink + Iceberg CDC import
Two main approaches are discussed: Copy‑On‑Write, which rewrites entire data files containing changed rows and is efficient only when updates touch few files, and Merge‑On‑Read, which cheaply appends change records (delete markers plus new data) and merges them at read time, enabling near‑real‑time ingestion. Iceberg additionally provides a unified table format, columnar storage, snapshot‑based incremental reads, and no online service nodes, making it suitable for both batch and streaming workloads.
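The trade‑off between the two approaches can be sketched in a toy Python model (not Iceberg's actual code) where a "file" is just a list of rows: Copy‑On‑Write pays at write time by rewriting whole files, while Merge‑On‑Read pays at read time by filtering delete markers:

```python
# Toy model: a "file" is a list of (key, value) rows.

def copy_on_write(files, key, value):
    """Rewrite every file containing `key` with the new value (write-heavy)."""
    new_files = []
    for f in files:
        if any(k == key for k, _ in f):
            new_files.append([(k, value if k == key else v) for k, v in f])
        else:
            new_files.append(f)
    return new_files

def merge_on_read_write(deletes, inserts, key, value):
    """Just record a delete marker and the new row (write-cheap)."""
    deletes.append(key)
    inserts.append((key, value))

def merge_on_read_scan(files, deletes, inserts):
    """Pay the merge cost at scan time: filter deleted keys, add new rows."""
    live = [(k, v) for f in files for k, v in f if k not in deletes]
    return live + inserts

files = [[(1, "a"), (2, "b")], [(3, "c")]]
deletes, inserts = [], []
merge_on_read_write(deletes, inserts, 2, "B")
print(merge_on_read_scan(files, deletes, inserts))
# [(1, 'a'), (3, 'c'), (2, 'B')]
```

For high‑frequency upsert streams, the Merge‑On‑Read path keeps write latency low; periodic compaction (discussed later as merge of CDC files) bounds the read‑time merge cost.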
3. How to Write and Read in Real Time
3.1 Batch update and CDC write scenarios
Batch updates include massive SQL‑based row updates (e.g., GDPR user deletion) and large‑scale conditional deletions without primary‑key constraints.
CDC write scenarios involve fast binlog ingestion to the lake and upsert streams that require high‑frequency, low‑latency writes.
3.2 Design considerations for Iceberg CDC writes
Correctness – upsert data must stay consistent with upstream sources.
High‑throughput writes – handle frequent upserts efficiently.
Fast reads – support fine‑grained concurrency and columnar acceleration.
Incremental reads – enable ETL‑style incremental processing.
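The incremental‑read requirement builds on Iceberg's snapshot chain: each commit produces a snapshot recording the files it added, so a downstream ETL job can read only the files committed between two snapshots. A minimal Python sketch of that idea (snapshot IDs and file names are illustrative, not Iceberg's real metadata layout):

```python
# Toy snapshot chain: each snapshot lists the data files its commit added.
snapshots = [
    {"id": 1, "added_files": ["f1.parquet", "f2.parquet"]},
    {"id": 2, "added_files": ["f3.parquet"]},
    {"id": 3, "added_files": ["f4.parquet", "f5.parquet"]},
]

def incremental_files(snapshots, from_id, to_id):
    """Files committed after `from_id`, up to and including `to_id`."""
    return [f for s in snapshots
            if from_id < s["id"] <= to_id
            for f in s["added_files"]]

print(incremental_files(snapshots, 1, 3))
# ['f3.parquet', 'f4.parquet', 'f5.parquet']
```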
3.3 Iceberg basics
Iceberg stores data files (e.g., Parquet) alongside metadata files (snapshots, manifests, table metadata). Snapshots reference manifests, which map data files to partitions.
3.4 INSERT, UPDATE, DELETE handling
Updates are decomposed into delete + insert operations. Position‑delete files record file‑path and row‑offset deletions, while equality‑delete files record key‑based deletions, ensuring correct semantics across transactions.
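The two delete‑file types can be illustrated with a small Python sketch (a toy model, not Iceberg's implementation): position deletes identify a row by its data file path and row offset, while equality deletes match any row whose key columns equal the recorded values:

```python
def apply_deletes(data_file_path, rows, position_deletes, equality_deletes):
    """rows: list of dicts in file order.
    position_deletes: set of (file_path, row_offset) pairs.
    equality_deletes: list of key dicts matched by column equality."""
    survivors = []
    for offset, row in enumerate(rows):
        if (data_file_path, offset) in position_deletes:
            continue  # deleted by file-path + row-offset
        if any(all(row.get(k) == v for k, v in eq.items())
               for eq in equality_deletes):
            continue  # deleted by key equality
        survivors.append(row)
    return survivors

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
pos = {("data-00.parquet", 0)}   # delete the first row of this file by position
eq = [{"id": 3}]                 # delete every row with id=3 by equality
print(apply_deletes("data-00.parquet", rows, pos, eq))
# [{'id': 2, 'v': 'b'}]
```

A position delete can only target rows whose location the writer already knows (e.g. rows written earlier in the same transaction), whereas an equality delete works across files and transactions without scanning for the row first, which is why an update is written as an equality delete plus a new insert.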
3.5 Manifest file design
Manifests are split into data manifests and delete manifests to quickly locate delete files for each data file, enabling balanced task distribution during merges.
3.6 File‑level concurrency
Iceberg supports concurrent reads at the file and sub‑file level (e.g., splitting a 256 MB file into two 128 MB parallel reads), improving throughput.
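The split planning behind this is simple to sketch in Python (byte offsets and the 128 MB target size are illustrative): a data file is cut into byte‑range splits, and each split is handed to a separate reader task:

```python
def plan_splits(file_len, target_split):
    """Cut one data file into (offset, length) byte ranges read in parallel."""
    splits = []
    offset = 0
    while offset < file_len:
        length = min(target_split, file_len - offset)
        splits.append((offset, length))
        offset += length
    return splits

mb = 1024 * 1024
print(plan_splits(256 * mb, 128 * mb))
# [(0, 134217728), (134217728, 134217728)]
```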
3.7 Transaction commit of incremental file sets
IcebergStreamWriter writes data files, while IcebergFileCommitter gathers them and commits the transaction to the metastore, making the data visible without external services.
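The division of labor can be sketched as a toy two‑stage pipeline in Python (class and file names are illustrative, not the connector's real API): many parallel writers close their files at each checkpoint and emit the file names; a single committer gathers them and publishes one snapshot per checkpoint:

```python
class StreamWriter:
    """Per-task writer: buffers rows, rolls them into data files at checkpoint."""
    def __init__(self, task_id):
        self.task_id, self.rows, self.files = task_id, [], []

    def write(self, row):
        self.rows.append(row)

    def prepare_commit(self):
        """On checkpoint: close the open file, emit its name downstream."""
        if self.rows:
            self.files.append(f"task{self.task_id}-file{len(self.files)}")
            self.rows = []
        emitted, self.files = self.files, []
        return emitted

class FileCommitter:
    """Single committer: gathers all writers' files, commits one snapshot."""
    def __init__(self):
        self.snapshots = []

    def commit(self, checkpoint_id, file_lists):
        files = [f for fl in file_lists for f in fl]
        self.snapshots.append({"checkpoint": checkpoint_id, "files": files})

writers = [StreamWriter(0), StreamWriter(1)]
writers[0].write({"id": 1})
writers[1].write({"id": 2})
committer = FileCommitter()
committer.commit(1, [w.prepare_commit() for w in writers])
print(committer.snapshots)
# [{'checkpoint': 1, 'files': ['task0-file0', 'task1-file0']}]
```

Funneling all file lists through one committer is what makes the commit a single atomic transaction: readers either see the whole checkpoint's files or none of them, with no external service in the read/write path.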
4. Future Planning
Planned improvements include core Iceberg optimizations, full‑link stability testing, CDC incremental pull Table API, automatic/manual merge of CDC files in Flink, and broader ecosystem integration with Spark, Presto, and Alluxio acceleration.