Data Lake Evolution and a Practical Flink + Iceberg Implementation Guide

This article explores the evolution of data lakes, compares major cloud providers' lake architectures, introduces the emerging lakehouse concept, and provides a step‑by‑step Flink‑Iceberg implementation—including dependencies, catalog setup, table creation, checkpointing, and Kafka ingestion—demonstrating practical big‑data streaming solutions.

Big Data Technology & Architecture

The Past and Present of Data Lakes

In today’s internet‑driven era, data has become one of the most valuable assets for companies. Collecting, storing, and analyzing data are now core technical components of technology firms. Over the past decade, the big‑data field has rapidly advanced, with both real‑time and batch processing, data warehouses, and data middle‑platforms deeply embedded in business operations.

The “data lake” concept became a hot topic around mid‑2020, although the term itself is considerably older and no single definition has been universally accepted. Descriptions from Wikipedia, AWS, and Alibaba Cloud agree on several points: a data lake is a platform that supports multiple compute engines (e.g., Flink, Spark, Hive), both streaming and batch workloads, various storage engines (structured stores such as MySQL, HBase, and OLAP databases, and unstructured stores such as HDFS), ACID‑compliant updates, and unified metadata management with enterprise‑grade access control.

Data Lake Architectures of Major Cloud Vendors

Alibaba Cloud

Alibaba Cloud advertises a cloud‑native enterprise data lake solution with four key advantages: massive elasticity via compute‑storage separation, ecosystem openness to Hadoop tools, cost‑effectiveness through a unified storage pool with tiered hot/cold layers, and simplified management (encryption, authorization, lifecycle, cross‑region replication).

It also provides an open‑source‑based data lake construction approach, supporting a wide range of data sources (logs, messages, databases, HDFS) and seamless integration with Hive, Spark, Presto, Impala, etc., while offering Data Lake Formation for metadata management and acceleration.

AWS

AWS introduced Lake Formation in 2018, built on S3 and NoSQL storage. Lake Formation handles metadata definition, ingesting crawled data, ETL outputs, logs, etc., and provides a comprehensive permission model.

Huawei Cloud

Huawei’s Data Lake Governance Center fully supports Spark and Flink ecosystems, offering serverless, one‑click streaming, batch, and interactive analytics. It supports standard SQL, Spark SQL, and Flink SQL, multiple ingestion methods, and heterogeneous data formats without complex ETL.

Overall, data lakes are not a brand‑new technology but an evolution of data philosophy, with maturity judged by governance, metadata management, compute capabilities, and access control.

Is the Lakehouse the Future?

The Lakehouse architecture merges the strengths of traditional data warehouses and data lakes. Originating from Databricks’ “What is a Lakehouse?” concept, it offers cheaper, elastic storage and improved upstream data quality. Key Lakehouse features include transaction support, schema evolution, end‑to‑end streaming, and compute‑storage separation.

Open‑source projects such as Iceberg, Hudi, and Delta Lake provide unified table formats that enable multi‑engine queries and reduce management, storage, and compute costs.

Flink + Iceberg Practical Guide

2.1 The Three “Swords” of Data Lakes

Data lake solutions must bridge the gap between storage formats and the engines that read and write them (e.g., Flink, Spark), as well as upstream sources such as Kafka. The three mature open‑source table formats, Delta Lake, Apache Iceberg, and Apache Hudi, address this need, each offering distinct trade‑offs that users should evaluate based on workload requirements.

2.2 Flink + Iceberg Development Case

The Apache Iceberg project describes itself as an open table format for huge analytic datasets. It adds tables to compute engines such as Trino and Spark, using a high‑performance format that behaves like a traditional SQL table.

Key Iceberg capabilities include schema evolution, hidden partitioning, partition layout evolution, snapshot control, version rollback, fast scans, data pruning, broad compatibility, ACID transactions, and high‑concurrency optimistic writes.

These features directly address common pain points such as ACID support, batch/stream read‑write, and multi‑engine compatibility. Iceberg’s community actively collaborates with Flink, providing connectors that simplify development. At the time of writing, Iceberg had reached version 0.12.0.
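The snapshot, rollback, and optimistic‑write ideas above can be illustrated with a small toy model in plain Java. This is only a sketch of the concept, not Iceberg's API: the class `ToySnapshotTable` and its methods are hypothetical names invented for illustration. It assumes that a table is a pointer to an immutable snapshot, and that a commit is an atomic swap of that pointer which fails if another writer committed first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only (hypothetical, NOT Iceberg's API).
final class ToySnapshotTable {

    // An immutable snapshot: an id plus every data file visible at that point.
    static final class Snapshot {
        final long id;
        final List<String> files;

        Snapshot(long id, List<String> files) {
            this.id = id;
            this.files = Collections.unmodifiableList(new ArrayList<>(files));
        }
    }

    // The table itself is just a pointer to the current snapshot.
    private final AtomicReference<Snapshot> head =
            new AtomicReference<>(new Snapshot(0L, Collections.<String>emptyList()));

    Snapshot current() {
        return head.get();
    }

    // Optimistic commit: succeeds only if no other writer has committed
    // since `base` was read; otherwise returns false and changes nothing.
    boolean tryAppend(Snapshot base, String newFile) {
        List<String> files = new ArrayList<>(base.files);
        files.add(newFile);
        Snapshot next = new Snapshot(base.id + 1, files);
        return head.compareAndSet(base, next);
    }

    // Rollback is just pointing the table back at an older snapshot.
    void rollbackTo(Snapshot older) {
        head.set(older);
    }
}
```

In this model, a writer whose commit is rejected would re‑read the current snapshot and retry, which is roughly what "optimistic" refers to. Readers holding an older snapshot keep seeing a consistent file list.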

2.3 Integrating Flink with Iceberg

The classic Lambda architecture is used: Kafka streams are consumed by Flink, which writes data into an Iceberg table stored on HDFS.

First, add the Maven dependency:

<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-flink-runtime</artifactId>
    <version>0.11.1</version>
</dependency>

Create an Iceberg catalog that maps Iceberg tables to HDFS locations:

CREATE CATALOG iceberg_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'warehouse'='hdfs://localhost/user/hive/warehouse',
  'uri'='thrift://localhost:9083'
);

Define the target table within the catalog:

CREATE TABLE iceberg_catalog.iceberg_hadoop_db.iceberg_table (
    user_id STRING,
    amount DOUBLE,
    time_stamp STRING
) PARTITIONED BY (time_stamp)
WITH (
  'connector'='iceberg',
  'write.format.default'='orc'
);
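Partitioning directly on the raw `time_stamp` string yields one partition per distinct value, which can produce many tiny partitions if the column holds second‑level timestamps. The hidden partitioning capability listed earlier addresses this by deriving a coarser partition value, such as a day, from the column. The sketch below illustrates that derivation in plain Java. It is not Iceberg's implementation: the class name `ToyDayTransform` is hypothetical, and the `yyyy-MM-dd HH:mm:ss` format is an assumption, since the source does not state how `time_stamp` is formatted.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Toy illustration only (hypothetical, NOT Iceberg's implementation).
final class ToyDayTransform {

    // Assumed timestamp layout; adjust to match the real data.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Maps a timestamp string to the number of days since 1970-01-01,
    // so that all rows from the same calendar day share one partition value.
    static long dayPartition(String timeStamp) {
        return LocalDateTime.parse(timeStamp, FORMAT).toLocalDate().toEpochDay();
    }
}
```

Under this assumption, `2021-08-01 10:00:00` and `2021-08-01 23:59:59` map to the same value, so rows from one day land in one partition.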

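Enable checkpointing in Flink. The Iceberg sink commits the data files it has written only when a checkpoint completes, so checkpointing is required both for data to become visible and for exactly‑once semantics: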

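StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
// Table environment used by the executeSql calls in the following steps.
StreamTableEnvironment tenv = StreamTableEnvironment.create(env);

// Trigger a checkpoint every 5 minutes; abort any checkpoint that runs longer than 60 seconds.
env.enableCheckpointing(300 * 1000);
env.getCheckpointConfig().setCheckpointTimeout(60000);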
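Why checkpointing matters for the sink can be sketched with a toy model in plain Java. It assumes, as a simplification, that the sink stages records between checkpoints and publishes them only when a checkpoint completes, and that after a failure the job resumes from the last checkpointed source offset. The class `ToyCheckpointedSink` and its methods are hypothetical and are not part of Flink or Iceberg.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical, NOT Flink's or Iceberg's API).
final class ToyCheckpointedSink {
    private final List<String> pending = new ArrayList<>();   // written, not yet committed
    private final List<String> committed = new ArrayList<>(); // visible to readers
    private long pendingOffset = 0;    // source offset reached by staged writes
    private long committedOffset = 0;  // source offset covered by the last commit

    // Stage one record read from the source.
    void write(String record) {
        pending.add(record);
        pendingOffset++;
    }

    // A checkpoint completed: publish everything staged so far in one step.
    void onCheckpointComplete() {
        committed.addAll(pending);
        pending.clear();
        committedOffset = pendingOffset;
    }

    // The job failed: discard uncommitted work and report where to resume reading.
    long recover() {
        pending.clear();
        pendingOffset = committedOffset;
        return committedOffset;
    }

    // What a reader of the table would see.
    List<String> visible() {
        return new ArrayList<>(committed);
    }
}
```

Because uncommitted records are dropped on failure and re‑read from the checkpointed offset, each record ends up visible exactly once in this model. With the 5‑minute interval configured above, the model also suggests that new rows would become visible to readers roughly every five minutes; a shorter interval lowers latency at the cost of more frequent, smaller commits.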

Register a Hive catalog and create a Kafka source table to read streaming data:

// Catalog settings referenced by the later steps; adjust to your environment.
String HIVE_CATALOG = "hive_catalog";
String DEFAULT_DATABASE = "hive_catalog_database";
String HIVE_CONF_DIR = "/user/hive/resources"; // directory containing hive-site.xml

Catalog catalog = new HiveCatalog(HIVE_CATALOG, DEFAULT_DATABASE, HIVE_CONF_DIR);
tenv.registerCatalog(HIVE_CATALOG, catalog);
tenv.useCatalog(HIVE_CATALOG);
// create kafka source table
tenv.executeSql("DROP TABLE IF EXISTS kafka_source_iceberg");
tenv.executeSql(
    "CREATE TABLE kafka_source_iceberg (\n" +
    "  user_id STRING,\n" +
    "  amount DOUBLE,\n" +
    "  time_stamp STRING\n" +
    ") WITH (\n" +
    "  'connector'='kafka',\n" +
    "  'topic'='kafka_topic',\n" +
    "  'scan.startup.mode'='latest-offset',\n" +
    "  'properties.bootstrap.servers'='localhost:9092',\n" +
    "  'properties.group.id'='iceberg_group',\n" +
    "  'format'='json'\n" +
    ")"
);

Finally, insert the streamed data into the Iceberg table:

tenv.executeSql(
    "INSERT INTO iceberg_catalog.iceberg_hadoop_db.iceberg_table " +
    "SELECT user_id, amount, time_stamp FROM hive_catalog.hive_catalog_database.kafka_source_iceberg"
);

This completes the end‑to‑end real‑time data ingestion pipeline.

Conclusion

Data lakes are evolving rapidly, with open‑source communities delivering continuous innovations. The convergence of data lakes and data warehouses into the lakehouse is poised to become the dominant architecture for data engineering, making mastery of these technologies essential for modern data developers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
