Big Data 14 min read

How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming

This article explains the concept of data lakes, outlines a four‑layer open‑source architecture, presents several classic Flink‑Iceberg use cases, details why Iceberg was chosen, and describes the design of Flink’s streaming sink and upcoming community roadmap.

dbaplus Community

Jun 5, 2021

How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming

Data Lake Overview

A data lake is a unified platform that stores all enterprise data—structured, semi‑structured, unstructured, and binary—in its raw form, enabling diverse processing workloads such as streaming jobs, batch jobs, and machine‑learning analytics.

The lake has four key characteristics:

It stores raw data from many sources.

It supports multiple compute models.

It provides comprehensive data‑management capabilities, including schema evolution and access control.

It relies on cheap distributed storage (e.g., S3, OSS, HDFS) with suitable file formats and caching.

Typical Open‑Source Architecture

The architecture is divided into four layers:

Bottom layer: distributed file system (cloud object stores like S3/OSS or on‑premise HDFS).

Acceleration layer: caches hot data on compute nodes to achieve hot‑cold separation, often using Alluxio or Alibaba Cloud JindoFS.

Table‑format layer: wraps data files into logical tables offering ACID, snapshots, schema, and partitioning. Popular projects include Apache Iceberg, Delta Lake, and Apache Hudi.

Compute layer: engines such as Flink, Spark, Hive, Presto, etc., can read/write the same tables.

Classic Business Scenarios

Real‑time data pipeline: ingest logs into Kafka, process with Flink, and write the results into an Iceberg table.

Change‑data‑capture (CDC) from relational databases: Flink’s CDC connector parses binlog events, and Iceberg’s equality‑delete feature enables streaming deletions.

Near‑real‑time lambda architecture: Flink writes incremental data to Iceberg for fast queries, while batch jobs read snapshots for full‑history analysis.

Bootstrapping new Flink jobs: use a full‑history Iceberg snapshot plus recent Kafka increments to replay a year of data before switching to live streams.

Correcting real‑time results: historical Iceberg data can be re‑processed to adjust earlier streaming outputs.

Why Choose Apache Iceberg

Iceberg decouples compute engines from storage, making it a universal table format that works with many processors. Its design supports both batch and streaming workloads, offers efficient manifest and snapshot handling, and enjoys strong community backing from companies such as Netflix, Apple, LinkedIn, Adobe, and Tencent.

Compared with Delta Lake and Hudi, Iceberg is not tied to Spark, allowing Flink to leverage its own streaming‑batch capabilities without the 15‑minute partition‑level latency typical of Hive.

Flink + Iceberg Streaming Sink

Iceberg 0.10.0 introduced a streaming sink for Flink. The sink is split into two operators:

IcebergStreamWriter : converts incoming records into Avro/Parquet/ORC files, creates an Iceberg DataFile, and forwards it downstream.

IcebergFilesCommitter : on each checkpoint, gathers all pending DataFile objects and commits a transaction to Iceberg.

Iceberg uses optimistic locking; concurrent commits retry until the earlier transaction succeeds. Therefore a single‑threaded commit path avoids massive transaction conflicts.

The committer maintains state as Map<Long, List<DataFile>>, ensuring that if a checkpoint’s transaction fails, its files remain in state and can be retried in a later checkpoint.

Future Community Plans

Iceberg 0.11.0 will add automatic small‑file merging triggered by Flink checkpoints and introduce a streaming reader for Flink. Iceberg 0.12.0 targets row‑level deletes and upsert support, enabling full CDC pipelines where Flink can write and update Iceberg tables in real time.

Reference documentation and source code are available at https://github.com/apache/iceberg/blob/master/site/docs/flink.md

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Apache Flink Streaming Data Lake Apache Iceberg table format

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Data Lake Overview

Typical Open‑Source Architecture

Classic Business Scenarios

Why Choose Apache Iceberg

Flink + Iceberg Streaming Sink

Future Community Plans

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Flink + Iceberg Streaming Sink