Integrating Apache Flink with Data Lakes Using Apache Iceberg: Architecture, Use Cases, and Future Roadmap
This article explains how Apache Flink combines with Apache Iceberg to build unified stream‑batch data lake solutions, covering data lake fundamentals, architectural layers, classic business scenarios, reasons for choosing Iceberg, streaming ingestion design, and upcoming community enhancements.
Apache Flink is a widely used unified stream‑batch engine, and data lakes have become a new cloud‑native architecture; this article examines what happens when Flink meets a data lake.
Data lakes store raw data from diverse sources, support multiple compute models, provide comprehensive data management (schema, permissions, lineage), and rely on flexible storage such as S3, OSS, or HDFS with specialized file formats.
The typical open‑source data lake stack consists of four layers: (1) distributed file system (e.g., S3, OSS, HDFS); (2) data acceleration layer (e.g., Alluxio, JindoFS) to cache hot data and reduce remote reads; (3) table‑format layer (e.g., Apache Iceberg, Delta, Hudi) that offers ACID, snapshots, schema and partition semantics; and (4) compute engines (e.g., Flink, Spark, Presto, Hive) that can read the same tables.
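To make the table‑format layer's role concrete, here is a toy model (plain Python, not the real Iceberg API; `ToyTable` and its methods are invented for illustration) of the core idea: every commit publishes an immutable snapshot, so different compute engines can scan the same table and each sees a consistent file list.

```python
# Toy model of a table-format layer: each commit produces an immutable
# snapshot, so concurrent readers see a consistent file list regardless
# of in-flight writes. This is an illustration, not the Iceberg API.

class ToyTable:
    def __init__(self):
        self._snapshots = [[]]  # snapshot 0 is the empty table

    def commit(self, new_files):
        # Atomically publish a new snapshot = previous files + new files.
        latest = self._snapshots[-1]
        self._snapshots.append(latest + list(new_files))
        return len(self._snapshots) - 1  # new snapshot id

    def scan(self, snapshot_id=None):
        # Readers pin a snapshot; later commits never change what they see.
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])

table = ToyTable()
s1 = table.commit(["data-001.parquet"])
reader_view = table.scan(s1)           # a "batch engine" pins snapshot 1
table.commit(["data-002.parquet"])     # a "streaming writer" commits later
assert reader_view == ["data-001.parquet"]  # pinned view is unchanged
assert table.scan() == ["data-001.parquet", "data-002.parquet"]
```

Pinning a snapshot is what lets Flink, Spark, Presto, and Hive read the same table concurrently without coordinating with writers.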
Several classic business scenarios illustrate the power of Flink + Iceberg: real‑time data pipelines that ingest Kafka streams into Iceberg tables and chain incremental jobs; CDC ingestion from relational databases (MySQL binlog) using Flink CDC connectors; replacing lambda architectures (separate streaming and batch pipelines) with a unified near‑real‑time Flink + Iceberg pipeline; bootstrapping new Flink jobs by replaying full historical data from Iceberg and then switching to Kafka increments; and using Iceberg as a low‑cost historical store to correct streaming results.
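The bootstrap scenario above can be sketched as a two‑phase read: a bounded replay of the table's history followed by an unbounded tail of increments. The helper below is hypothetical (not a real connector API); it only illustrates the hand‑off.

```python
# Sketch of the bootstrap pattern: a new job first replays the full
# history stored in the lake table, then switches to consuming new
# increments (Kafka in the article). Hypothetical helper, not a real API.

def bootstrap_then_tail(historical_rows, incremental_rows):
    # Phase 1: bounded replay of the table's historical snapshot.
    for row in historical_rows:
        yield ("history", row)
    # Phase 2: unbounded tail of new increments.
    for row in incremental_rows:
        yield ("increment", row)

history = ["order-1", "order-2"]   # stand-in for an Iceberg snapshot scan
increments = ["order-3"]           # stand-in for a Kafka stream
events = list(bootstrap_then_tail(history, increments))
assert events == [("history", "order-1"), ("history", "order-2"),
                  ("increment", "order-3")]
```

In a real deployment the switch point must align with the Kafka offsets captured at the snapshot, so no records are lost or duplicated between the two phases.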
Iceberg was chosen because it provides a compute‑agnostic table format, decouples storage from processing, supports fine‑grained snapshots and manifests for both batch and streaming, and has a strong, active community with high code‑review standards.
Streaming ingestion is implemented in Iceberg 0.10.0: the sink is split into two operators. IcebergStreamWriter writes records to Avro/Parquet/ORC files and creates Iceberg DataFile objects; IcebergFilesCommitter collects these files per checkpoint and commits them in a single Iceberg transaction. Iceberg uses optimistic locking, so concurrent commits retry until one succeeds. The committer keeps its state as a map<Long, List<DataFile>> so that data files from failed or not‑yet‑committed checkpoints are retried rather than lost.
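A minimal sketch of the committer side, in plain Python rather than the real Iceberg classes (`ToyFilesCommitter` and the failure simulation are invented for illustration): it keeps the checkpoint‑id‑to‑files map from the article and retries the commit under optimistic locking until it succeeds.

```python
# Toy committer mirroring the two-operator sink design: the writer hands
# over data files at each checkpoint, and the committer tracks them in a
# checkpoint-id -> files map until a transaction commit succeeds.
import random

class ToyFilesCommitter:
    def __init__(self):
        self.pending = {}          # map<Long, List<DataFile>>, as in the article
        self.committed_files = []

    def on_checkpoint(self, checkpoint_id, data_files):
        # Files produced by IcebergStreamWriter for this checkpoint.
        self.pending[checkpoint_id] = list(data_files)
        self._commit_up_to(checkpoint_id)

    def _commit_up_to(self, checkpoint_id):
        files = [f for cid in sorted(self.pending) if cid <= checkpoint_id
                 for f in self.pending[cid]]
        # Optimistic locking: retry the transaction until it succeeds.
        while not self._try_commit(files):
            pass
        # Only drop state once the commit went through.
        for cid in list(self.pending):
            if cid <= checkpoint_id:
                del self.pending[cid]

    def _try_commit(self, files):
        if random.random() < 0.5:  # simulate losing the optimistic-lock race
            return False
        self.committed_files.extend(files)
        return True

committer = ToyFilesCommitter()
committer.on_checkpoint(1, ["f1.parquet"])
committer.on_checkpoint(2, ["f2.parquet"])
assert committer.committed_files == ["f1.parquet", "f2.parquet"]
assert committer.pending == {}
```

The key property, as in the real design, is that pending files are removed from state only after the Iceberg transaction commits, so a failure between checkpoint and commit cannot drop data.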
Future community plans include, in Iceberg 0.11.0, automatic small‑file merging triggered by Flink checkpoints and a Flink streaming reader, while Iceberg 0.12.0 will address row‑level deletes and upsert support for CDC workloads.
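The checkpoint‑triggered compaction idea can be sketched as follows; the real Iceberg 0.11.0 mechanics may differ, and `maybe_compact`, its threshold, and the merged‑file naming are all assumptions made for illustration.

```python
# Hedged sketch of checkpoint-triggered small-file merging: after a
# checkpoint commit, if enough small files have accumulated, rewrite
# them into one larger file. Illustrative only, not the Iceberg design.

def maybe_compact(small_files, threshold=3):
    """Merge accumulated small files into one file once a threshold is hit."""
    if len(small_files) < threshold:
        return small_files               # not enough files yet; keep as-is
    merged_name = "merged-" + "-".join(f.split(".")[0] for f in small_files)
    return [merged_name + ".parquet"]

files = ["a.parquet", "b.parquet"]
assert maybe_compact(files) == files              # below threshold: untouched
files.append("c.parquet")
assert maybe_compact(files) == ["merged-a-b-c.parquet"]
```

Tying compaction to checkpoints keeps the merge work incremental and bounds the number of small files a streaming writer can leave behind.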
The presentation was authored by Hu Zheng (Zi‑yi), an Alibaba technical expert who contributes to Flink, Iceberg, and HBase projects.