
Comprehensive Overview of Data Lake Technologies: Iceberg, Hudi, and Delta Lake

This article provides an in-depth overview of data lake concepts, definitions, and essential features, followed by detailed case studies of enterprise data lake implementations and comparative analysis of leading data lake table formats—Iceberg, Hudi, and Delta Lake—highlighting their architectures, capabilities, and trade‑offs.

Big Data Technology & Architecture

Aug 24, 2021


Is it hype or a future trend?

Data lakes have become a hot concept, and many enterprises are building or planning to build their own. Before starting such a project, it is crucial to understand what a data lake is, what its basic components are, and how its architecture should be designed.

Wikipedia defines a data lake as a system that stores data in its natural/raw format, typically as object blobs or files, including raw copies of source system data and transformed data for various tasks. It can store structured data (rows/columns), semi‑structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). AWS describes a data lake as a centralized repository that allows you to store all structured and unstructured data at any scale.

Microsoft’s definition is broader, focusing on the capabilities that enable developers, data scientists, and analysts to store and process data of any scale, type, and velocity across platforms and languages.

Key characteristics of a data lake include:

Providing sufficient storage capacity to hold all data of an organization.

Storing massive amounts of data of any type—structured, semi‑structured, and unstructured.

Preserving raw, complete copies of business data.

Offering comprehensive data‑management capabilities, including metadata, schema, source, format, connection info, and access control.

Supporting diverse analytics: batch, streaming, interactive, and machine‑learning, along with scheduling and management.

Enabling full data‑lifecycle management, from raw ingestion to intermediate results and detailed provenance.

Providing robust data acquisition and publishing capabilities for various sources and downstream applications.

Supporting big‑data scale storage and extensible processing.

In summary, a data lake is an evolving, scalable big‑data infrastructure that ingests, stores, processes, and manages data of any source, speed, scale, and type throughout its lifecycle.

Typical enterprise applications

Huawei production‑scene data lake platform

The platform is organized into three logical modules (storage, processing, services) as shown in the diagram below.

Typical data‑application scenarios include:

(Green) Structured data processed in batch, materialized into Hive tables, pre‑aggregated by Kylin into cubes, and exposed via REST APIs for sub‑second query latency and material‑quality monitoring.
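The pre‑aggregation step in the green path can be illustrated with a toy cube. This is only a sketch of the general idea behind cube pre‑computation, not Kylin's implementation or API; the rows, the dimension names, and the `build_cube` and `query` helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows for material-quality monitoring.
ROWS = [
    {"material": "cable", "plant": "A", "defects": 3},
    {"material": "cable", "plant": "B", "defects": 1},
    {"material": "board", "plant": "A", "defects": 2},
]
DIMENSIONS = ("material", "plant")


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate lookup from the cube without scanning the rows."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube.get((dims, tuple(filters[d] for d in dims)), 0)


CUBE = build_cube(ROWS, DIMENSIONS, "defects")
```

Because every dimension combination is computed ahead of time, a call such as `query(CUBE, material="cable")` is a single dictionary lookup. That trade of storage for lookup speed is what makes low‑latency queries over pre‑built cubes plausible.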

(Red) IoT sensor data ingested through MQS, routed via Storm to HBase, then enriched by algorithmic models for ICT material early‑warning monitoring.

(Yellow) Barcode data loaded via ETLloader into an IQ columnar lake, cleaned, and made available for trillion‑scale barcode scanning operations.

Unstructured quality‑inspection images are uploaded and queried via a web front‑end and data‑API service. Images are stored in HBase with a unique ID as the row key, enabling fast storage and retrieval. Data‑asset construction includes unified indexing for unstructured data, metadata enrichment (type, size, dimensions), and cross‑type data association.

Real‑time financial data lake

The architecture comprises six functional layers: data sources, unified ingestion, storage, development, services, and applications. It supports structured, semi‑structured, and unstructured data, intelligent ingestion, hot‑cold‑warm storage distribution, task development, scheduling, monitoring, visual programming, interactive queries, APIs, SQL quality assessment, metadata and lineage management, as well as digital marketing, risk control, operations, and customer profiling.

The logical architecture consists of four layers: storage (MPP warehouse + OSS/HDFS lake), compute (unified metadata service), service (federated query engine and data‑service API), and product (RPA, document recognition, language analysis, customer profiling, intelligent recommendation, self‑service analytics, visualization, and data‑development platform).

Data flow: real‑time data is ingested into Kafka, processed by Flink, and written to a data lake built on open‑source components (HDFS/S3 storage, Iceberg table format). Flink can write intermediate results back to the lake for further processing, and final results are queried via engines such as Flink, Spark, or Presto.

Soul’s Delta Lake practice

Data from various endpoints is reported to Kafka, then Spark jobs write it to HDFS in Delta format at minute‑level intervals. Hive automatically creates mapping tables for Delta tables, enabling direct queries via Hive MR, Tez, Presto, etc.

A generic ETL tool built on Spark provides configuration‑driven ingestion without requiring user code. Key features include hidden‑partition support (e.g., creating a year column from a date), regex validation for dynamic partitions, custom event‑time fields, configurable nested‑JSON parsing depth, and SQL‑based dynamic partition configuration for mitigating data skew.
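Three of these features (hidden partitions, regex validation of partition values, and a configurable JSON parsing depth) can be sketched in plain Python. Soul's tool is built on Spark and its code is not shown in the source, so everything below is an assumption made for illustration: the function names, the `YYYY-MM-DD` pattern, and the choice to keep over‑deep values as JSON strings.

```python
import json
import re
from datetime import date

# Hypothetical rule: a dynamic partition value must look like YYYY-MM-DD.
PARTITION_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def hidden_partitions(event_date: str) -> dict:
    """Derive partition columns from a date field the record already has."""
    if not PARTITION_PATTERN.fullmatch(event_date):
        raise ValueError(f"invalid partition value: {event_date!r}")
    parsed = date.fromisoformat(event_date)
    return {"year": parsed.year, "month": parsed.month}


def flatten(obj: dict, max_depth: int, prefix: str = "", depth: int = 1) -> dict:
    """Flatten nested JSON down to max_depth levels.

    Objects nested deeper than max_depth are kept as JSON strings.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and depth < max_depth:
            flat.update(flatten(value, max_depth, f"{name}.", depth + 1))
        elif isinstance(value, dict):
            flat[name] = json.dumps(value, sort_keys=True)
        else:
            flat[name] = value
    return flat
```

Rejecting malformed partition values before writing avoids scattering bad records across unexpected partition directories, which is presumably why the tool validates them with a regex.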

Data lake solution research

1. Iceberg

Iceberg abstracts a "table format" layer that decouples compute engines (Spark, Flink) and query engines (Hive, Presto) from underlying file formats (Parquet, ORC, Avro). It provides ACID transactions, time‑travel, rich type and partition abstractions, evolvable schemas, implicit partition handling, and cloud‑storage optimizations.

Iceberg’s architecture is engine‑agnostic, enabling integration with Flink, Hive, Spark, and other engines. At the time of writing (August 2021), its row‑level update/delete support was still limited compared with Hudi and Delta Lake.
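The snapshot idea behind time travel can be shown with a toy model in which a table is a series of immutable snapshots, each listing the data files that belong to it. This is a conceptual sketch only; `ToyTable` and its methods are hypothetical and do not reflect Iceberg's actual metadata layout or API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class ToyTable:
    """Toy table format: each snapshot is an immutable set of data file paths."""

    # Snapshot 0 is the empty table that exists before the first commit.
    snapshots: Dict[int, FrozenSet[str]] = field(
        default_factory=lambda: {0: frozenset()}
    )
    current: int = 0

    def commit(self, added=(), removed=()) -> int:
        """Create a new snapshot; older snapshots are never modified."""
        base = self.snapshots[self.current]
        new_id = self.current + 1
        self.snapshots[new_id] = (base - frozenset(removed)) | frozenset(added)
        self.current = new_id
        return new_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        """List the files a reader would plan for a snapshot.

        Passing an older snapshot_id is the toy equivalent of time travel.
        """
        sid = self.current if snapshot_id is None else snapshot_id
        return sorted(self.snapshots[sid])
```

Since a commit only swaps which snapshot is current, readers working from an older snapshot keep a consistent view while writers proceed, and any engine that can read the snapshot's file list can query the table.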

2. Hudi

Hudi (Hadoop Upserts Deletes and Incrementals) enables minute‑level upserts on HDFS/S3, supporting both read‑optimized columnar (Parquet) and write‑optimized row‑based (Avro) storage. It maintains a timeline of operations, allowing point‑in‑time queries and efficient incremental reads.

Hudi’s storage formats:

Read‑optimized columnar format (ROFormat), known as Copy on Write in current Hudi releases, stores data in Parquet; an update rewrites each affected Parquet file in full, which suits read‑heavy workloads.

Write‑optimized row‑based format (WOFormat), known as Merge on Read in current Hudi releases, combines Parquet and Avro: updates are appended to Avro log files and later compacted into Parquet, which suits write‑heavy workloads.
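The difference between the two formats can be sketched with dictionaries standing in for files. This is a simplified model under stated assumptions: a dict plays the role of a Parquet base file, a list plays the role of the appended Avro log, and the names `copy_on_write` and `MergeOnReadFile` are hypothetical rather than Hudi APIs.

```python
def copy_on_write(base: dict, updates: dict) -> dict:
    """Read-optimized path: produce a full new base file with updates applied."""
    return {**base, **updates}


class MergeOnReadFile:
    """Write-optimized path: append updates to a log and merge at read time."""

    def __init__(self, base: dict):
        self.base = dict(base)  # stands in for a Parquet base file
        self.log = []           # stands in for appended Avro log records

    def upsert(self, key, value) -> None:
        """Cheap write: append to the log without touching the base file."""
        self.log.append((key, value))

    def read(self) -> dict:
        """Merge base and log on the fly; later log entries win."""
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def compact(self) -> None:
        """Fold the log into a new base file and clear the log."""
        self.base = self.read()
        self.log = []
```

The sketch makes the trade‑off visible: the first path pays the rewrite cost on every update so that reads stay simple, while the second defers that cost to read time and to periodic compaction.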

3. Delta Lake

Delta Lake sits between Spark and storage, providing a metadata layer that solves metastore overload, supports ACID transactions, and offers rich update semantics (merge, update, delete). It stores data as Parquet with a transaction log for versioning and optimistic concurrency control.

Delta Lake is a library rather than a service. At the time of writing it is used primarily through Spark, and adopting it can be as simple as switching the file format from Parquet to Delta.
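The combination of a versioned transaction log and optimistic concurrency control can be modeled in a few lines. This is a toy sketch, not Delta Lake's implementation: `ToyLog` and `CommitConflict` are hypothetical names, the log lives in memory instead of in files next to the data, and a real system would check whether a conflicting commit actually overlaps before failing or retrying.

```python
class CommitConflict(Exception):
    """Raised when another writer has committed since this writer last read."""


class ToyLog:
    """Toy transaction log: each version is one immutable list of actions."""

    def __init__(self):
        self.commits = {}  # version -> list of ("add" | "remove", path)

    def latest_version(self) -> int:
        return max(self.commits, default=-1)

    def try_commit(self, read_version: int, actions) -> int:
        """Optimistic commit: succeed only if nobody wrote since read_version."""
        if read_version != self.latest_version():
            raise CommitConflict(
                f"read version {read_version}, log is at {self.latest_version()}"
            )
        target = read_version + 1
        self.commits[target] = list(actions)
        return target

    def files_at(self, version=None) -> list:
        """Replay the log up to a version to find the live data files."""
        version = self.latest_version() if version is None else version
        live = set()
        for v in range(version + 1):
            for op, path in self.commits[v]:
                if op == "add":
                    live.add(path)
                else:
                    live.discard(path)
        return sorted(live)
```

Writers never lock the table; they prepare data files, then try to claim the next version number, and a writer that loses the race has to re‑read and try again. Replaying the log up to an older version gives the table's state at that point, which is how versioning falls out of the same structure.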

Data lake technology comparison

Conclusion


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
