How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming
This article explains what a data lake is, outlines a four-layer open-source architecture, presents several classic Flink + Iceberg use cases, explains why Iceberg was chosen, and describes the design of the Flink streaming sink along with the community's upcoming roadmap.
Data Lake Overview
A data lake is a unified platform that stores all enterprise data—structured, semi‑structured, unstructured, and binary—in its raw form, enabling diverse processing workloads such as streaming jobs, batch jobs, and machine‑learning analytics.
The lake has four key characteristics:
It stores raw data from many sources.
It supports multiple compute models.
It provides comprehensive data‑management capabilities, including schema evolution and access control.
It relies on cheap distributed storage (e.g., S3, OSS, HDFS) with suitable file formats and caching.
Typical Open‑Source Architecture
The architecture is divided into four layers:
Bottom layer: distributed file system (cloud object stores like S3/OSS or on‑premise HDFS).
Acceleration layer: caches hot data on compute nodes to achieve hot‑cold separation, often using Alluxio or Alibaba Cloud JindoFS.
Table‑format layer: wraps data files into logical tables offering ACID, snapshots, schema, and partitioning. Popular projects include Apache Iceberg, Delta Lake, and Apache Hudi.
Compute layer: engines such as Flink, Spark, Hive, Presto, etc., can read/write the same tables.
Classic Business Scenarios
Real‑time data pipeline: ingest logs into Kafka, process with Flink, and write the results into an Iceberg table.
Change‑data‑capture (CDC) from relational databases: Flink’s CDC connector parses binlog events, and Iceberg’s equality‑delete feature enables streaming deletions.
Near‑real‑time lambda architecture: Flink writes incremental data to Iceberg for fast queries, while batch jobs read snapshots for full‑history analysis.
Bootstrapping new Flink jobs: use a full‑history Iceberg snapshot plus recent Kafka increments to replay a year of data before switching to live streams.
Correcting real‑time results: historical Iceberg data can be re‑processed to adjust earlier streaming outputs.
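The CDC scenario above relies on equality deletes: rather than rewriting data files, a delete is recorded as "drop every row whose key equals X" and applied when the table is read. The following is a minimal in-memory sketch of that idea, not the Iceberg API. `ToyTable`, its fields, and the event tuples are hypothetical names chosen for illustration. It assumes a delete applies only to rows written in an earlier commit (a lower sequence number), and it does not handle a row inserted and deleted within the same commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table with equality deletes (hypothetical, not Iceberg)."""
    data_rows: list = field(default_factory=list)         # (sequence_number, row)
    equality_deletes: list = field(default_factory=list)  # (sequence_number, key_value)
    sequence: int = 0

    def commit(self, events, key="id"):
        """Apply one batch of CDC events as a single append-only commit."""
        self.sequence += 1
        for op, row in events:
            if op in ("delete", "update"):
                # Record the key to delete instead of rewriting existing data.
                self.equality_deletes.append((self.sequence, row[key]))
            if op in ("insert", "update"):
                self.data_rows.append((self.sequence, row))

    def scan(self, key="id"):
        """Return live rows: a delete hides only rows from earlier commits."""
        live = []
        for seq, row in self.data_rows:
            deleted = any(
                del_seq > seq and del_key == row[key]
                for del_seq, del_key in self.equality_deletes
            )
            if not deleted:
                live.append(row)
        return live
```

An update is modeled as a delete of the old key plus an insert of the new row in the same commit. The new row survives because the delete carries the same sequence number, not a higher one.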
Why Choose Apache Iceberg
Iceberg decouples compute engines from storage, making it a universal table format that works with many processors. Its design supports both batch and streaming workloads, offers efficient manifest and snapshot handling, and enjoys strong community backing from companies such as Netflix, Apple, LinkedIn, Adobe, and Tencent.
Compared with Delta Lake and Hudi, Iceberg is not tied to Spark, so Flink can apply its own unified streaming and batch capabilities to the same tables. Data also becomes visible without the 15-minute partition-level latency typical of Hive.
Flink + Iceberg Streaming Sink
Iceberg 0.10.0 introduced a streaming sink for Flink. The sink is split into two operators:
IcebergStreamWriter: writes incoming records to Avro, Parquet, or ORC files, builds an Iceberg DataFile object describing each finished file, and forwards it downstream.
IcebergFilesCommitter: on each checkpoint, gathers all pending DataFile objects and commits them to the Iceberg table in a single transaction.
Iceberg commits rely on optimistic locking: when several transactions try to commit at the same time, one succeeds and the others must retry against the updated table state. Funnelling all commits through a single committer therefore avoids a flood of transaction conflicts.
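The optimistic commit pattern can be sketched as a compare-and-swap on a table version with a retry loop. This is an illustrative model only. `ToyCatalog`, `CommitConflict`, `commit_with_retry`, and the `on_attempt` hook are hypothetical names, and the code does not reflect how Iceberg implements its commit path.

```python
class CommitConflict(Exception):
    """Raised when the table changed between reading its version and committing."""


class ToyCatalog:
    """Hypothetical table pointer guarded by a version number."""

    def __init__(self):
        self.version = 0
        self.files = []

    def swap(self, expected_version, new_files):
        # Compare-and-swap: succeed only if nobody committed in the meantime.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected v{expected_version}, table is at v{self.version}"
            )
        self.files = self.files + list(new_files)
        self.version += 1


def commit_with_retry(catalog, new_files, max_attempts=3, on_attempt=None):
    """Try to commit, refreshing the base version after each conflict.

    on_attempt is a hook used only to simulate a rival writer in examples.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        base = catalog.version          # refresh our view of the table
        if on_attempt is not None:
            on_attempt(attempt)         # a rival may commit here
        try:
            catalog.swap(base, new_files)
            return attempt
        except CommitConflict:
            continue                    # lost the race, retry on the new state
    raise CommitConflict(f"gave up after {max_attempts} attempts")
```

With many parallel writers each running this loop, most attempts would lose the race and retry, which is the cost a single commit path sidesteps.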
The committer keeps its state as a Map<Long, List<DataFile>>, keyed by checkpoint ID. If the transaction for a checkpoint fails, its files stay in state and are retried on a later checkpoint.
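The retry behavior of the committer state can be sketched as follows. This is a simplified model under stated assumptions, not the actual IcebergFilesCommitter. `ToyFilesCommitter`, its method names, and `commit_fn` are hypothetical. The map key is assumed to be the checkpoint ID, and each checkpoint is committed separately here for clarity.

```python
class ToyFilesCommitter:
    """Hypothetical committer that keeps uncommitted files per checkpoint."""

    def __init__(self, commit_fn):
        self.commit_fn = commit_fn  # callable(checkpoint_id, files); may raise
        self.pending = {}           # checkpoint id -> list of data file names
        self.current = []           # files received since the last checkpoint

    def on_data_file(self, data_file):
        """Receive a finished data file from the upstream writer."""
        self.current.append(data_file)

    def snapshot_state(self, checkpoint_id):
        """At a checkpoint, file the buffered data files under its ID."""
        self.pending[checkpoint_id] = self.current
        self.current = []

    def notify_checkpoint_complete(self, checkpoint_id):
        """Commit every pending checkpoint up to checkpoint_id, oldest first.

        Returns the checkpoint IDs committed. On failure, the failed entry
        and all later ones stay in self.pending for the next attempt.
        """
        committed = []
        for cid in sorted(self.pending):
            if cid > checkpoint_id:
                break
            try:
                self.commit_fn(cid, self.pending[cid])
            except Exception:
                break               # keep the files in state and retry later
            del self.pending[cid]
            committed.append(cid)
        return committed
```

Because entries are removed only after a successful commit, a failed checkpoint is picked up again, in order, the next time a checkpoint completes.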
Future Community Plans
Iceberg 0.11.0 will add automatic small‑file merging triggered by Flink checkpoints and introduce a streaming reader for Flink. Iceberg 0.12.0 targets row‑level deletes and upsert support, enabling full CDC pipelines where Flink can write and update Iceberg tables in real time.
Reference documentation and source code are available at https://github.com/apache/iceberg/blob/master/site/docs/flink.md
dbaplus Community
