Understanding Data Lakes and a Comparative Overview of Iceberg, Hudi, and Delta Lake
This article explains what a data lake is, outlines its key characteristics, and compares three major data lake frameworks—Iceberg, Hudi, and Delta Lake—highlighting their architectures, features, and trade‑offs for large‑scale data storage and processing.
What is a Data Lake
The data lake is a widely discussed concept, and many enterprises are building or planning to build their own. Before starting a data lake project, it is essential to understand what a data lake is, identify its basic components, and design its fundamental architecture.
According to Wikipedia, a data lake is a system or repository that holds data in its natural/raw format, typically as object blobs or files. It includes raw copies of source system data as well as transformed data for various tasks, covering structured data (rows and columns), semi‑structured data (CSV, logs, XML, JSON), unstructured data (email, documents, PDFs), and binary data (images, audio, video).
AWS defines a data lake as a centralized repository that allows you to store all structured and unstructured data at any scale.
Microsoft’s definition is more functional: a data lake provides capabilities that enable developers, data scientists, and analysts to store and process data of any size, type, and ingestion speed, across platforms and languages, for all kinds of analysis and processing.
Although many definitions exist, they generally revolve around the following characteristics:
1. A data lake must provide sufficient storage capacity to hold all data of an enterprise or organization.
2. It can store massive amounts of data of any type, including structured, semi‑structured, and unstructured data.
3. The data in a lake is raw and represents a complete copy of the business data, preserving its original form.
4. A data lake needs comprehensive data management capabilities (metadata) to handle data sources, formats, connection information, schemas, and permission management.
5. It must support diverse analytics capabilities, such as batch processing, stream processing, interactive analysis, and machine learning, along with task scheduling and management.
6. Lifecycle management is required: the lake should store raw data and also retain intermediate results of analyses, recording the full processing lineage to enable detailed traceability.
7. Robust data ingestion and publishing capabilities are needed to ingest full or incremental data from various sources and to expose processed results to appropriate storage engines for different applications.
8. Support for big‑data workloads, including ultra‑large storage and scalable processing, is essential.
In summary, a data lake can be viewed as an evolving, scalable big‑data storage, processing, and analysis infrastructure that is data‑centric, capable of ingesting, storing, processing, and managing data of any source, speed, scale, and type throughout its lifecycle.
Data Lake Survey
1. Iceberg
Iceberg, as an emerging data‑lake framework, introduces the concept of a "table format" as an intermediate layer that is independent of both the compute engines (e.g., Spark, Flink) and query engines (e.g., Hive, Presto), while also being decoupled from underlying file formats such as Parquet, ORC, and Avro.
Additional capabilities of Iceberg include:
ACID transactions;
Time travel to access previous versions of data;
Rich custom types, partitioning, and operation abstractions;
Evolving column and partition schemas without user impact;
Implicit (hidden) partitioning: partition values are derived from column values, so queries need no hand-written partition predicates at the SQL level;
Optimizations for cloud storage.
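The implicit partitioning item in the list above can be illustrated with a small toy model in plain Python. This is only a sketch of the idea, not Iceberg's API: the names `day_transform`, `ToyTable`, `append`, and `scan` are hypothetical, and the "table" is an in-memory dictionary. The point it tries to show is that the writer never supplies a partition column and the reader filters only on the data column, while the table layer derives the partition and prunes with it.

```python
from collections import defaultdict
from datetime import datetime, date


def day_transform(ts: datetime) -> date:
    """Hypothetical partition transform: derive the partition from a data column."""
    return ts.date()


class ToyTable:
    """Toy in-memory table that groups rows by a derived (hidden) partition value."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row: dict) -> None:
        # Writers never pass a partition column; it is derived from event_time.
        self.partitions[day_transform(row["event_time"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Return (rows with start <= event_time < end, partitions actually read)."""
        touched = 0
        rows = []
        for part, part_rows in self.partitions.items():
            # Skip partitions whose day cannot overlap the requested range.
            if part < start.date() or part > end.date():
                continue
            touched += 1
            rows.extend(r for r in part_rows if start <= r["event_time"] < end)
        return rows, touched


table = ToyTable()
table.append({"id": 1, "event_time": datetime(2024, 1, 1, 9, 0)})
table.append({"id": 2, "event_time": datetime(2024, 1, 2, 10, 30)})
table.append({"id": 3, "event_time": datetime(2024, 1, 3, 23, 59)})

# The query filters on event_time only; partition pruning happens inside scan().
rows, touched = table.scan(datetime(2024, 1, 2), datetime(2024, 1, 3))
print([r["id"] for r in rows], "partitions read:", touched)
```

In this toy the pruning is deliberately coarse (it keeps the partition for the end day even when the range ends at midnight), which is safe because the row-level filter still runs afterwards.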
Iceberg’s architecture is not tied to a specific engine; it implements a universal data‑organization format that can be easily integrated with different engines such as Flink, Hive, and Spark.
The architecture is elegant, with a complete, evolution-friendly definition of data formats and the type system. When this comparison was originally written, Iceberg lacked row‑level update and delete capabilities and the community was still working on them; later versions of the table format specification added row‑level deletes.
2. Hudi
Generally, large volumes of data are stored in HDFS or S3, with new data appended incrementally while older data changes rarely, especially after data cleansing for a data warehouse.
Traditional data warehouses like Hive provide limited support for updates, making such operations costly. Moreover, scenarios that require analysis of only recent incremental data lack native support in Hive, Presto, or HBase, often requiring timestamp‑based filtering.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) enables datasets on HDFS to support changes with minute‑level latency and allows downstream systems to consume incremental updates.
Hudi datasets are compatible with the Hadoop ecosystem via a custom InputFormat, supporting Apache Hive, Parquet, Presto, and Spark, enabling seamless integration for end users.
The core of the architecture is a timeline that records a timestamp for every operation on the dataset (write, delete, merge). Queries can use the timeline to read only the data committed after, or before, a given point in time, which avoids scanning irrelevant time ranges and lets consumers read only the files that changed.
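The timeline idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Hudi's implementation or API: `Instant`, `Timeline`, `files_changed_after`, and `instants_up_to` are hypothetical names, timestamps are assumed to be sortable strings, and each recorded action simply lists the files it wrote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    """One completed action on the toy timeline."""
    timestamp: str   # sortable commit time, e.g. "20240101093000"
    action: str      # e.g. "commit", "delete", "compaction"
    files: tuple     # files written by this action


class Timeline:
    def __init__(self):
        self.instants = []

    def add(self, timestamp, action, files):
        self.instants.append(Instant(timestamp, action, tuple(files)))
        self.instants.sort(key=lambda i: i.timestamp)

    def files_changed_after(self, begin):
        """Files touched by instants strictly after `begin` (incremental pull)."""
        changed = []
        for instant in self.instants:
            if instant.timestamp > begin:
                for f in instant.files:
                    if f not in changed:
                        changed.append(f)
        return changed

    def instants_up_to(self, end):
        """Instants at or before `end` (point-in-time view)."""
        return [i for i in self.instants if i.timestamp <= end]


timeline = Timeline()
timeline.add("20240101090000", "commit", ["part=a/f1.parquet"])
timeline.add("20240101100000", "commit", ["part=a/f2.parquet", "part=b/f3.parquet"])
timeline.add("20240101110000", "delete", ["part=b/f4.parquet"])

# A downstream consumer that last synced at 09:00 reads only the newer files.
incremental = timeline.files_changed_after("20240101090000")
snapshot = timeline.instants_up_to("20240101100000")
print(incremental)
print([i.timestamp for i in snapshot])
```

The consumer never lists or scans `f1.parquet` here, which is the saving the timeline is meant to provide.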
Data is organized in a directory structure similar to Hive tables, with each partition identified by a unique path. Hudi stores data in two formats:
Read‑Optimized Format (ROFormat): Uses columnar files (Parquet) only. Every update rewrites the affected base file as a new version, which makes writes expensive but reads cheap. It suits read‑heavy workloads and corresponds to what current Hudi documentation calls the Copy‑on‑Write table type.
Write‑Optimized Format (WOFormat): Combines columnar (Parquet) base files with row‑based (Avro) log files. Updates are appended to the Avro log files and later compacted into new Parquet base files, which favors write‑heavy workloads. It corresponds to the Merge‑on‑Read table type.
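The trade-off between the read-optimized and write-optimized formats described above can be made concrete with a toy model. The sketch below is plain Python and makes no use of Hudi itself: `ReadOptimizedTable` and `WriteOptimizedTable` are hypothetical classes, a dictionary stands in for a Parquet base file, and a list stands in for the Avro log files. It only aims to show where the cost lands in each layout.

```python
class ReadOptimizedTable:
    """Every upsert rewrites the whole base file; reads return it directly."""

    def __init__(self, rows):
        self.base = dict(rows)       # stands in for a Parquet base file
        self.rows_rewritten = 0      # crude measure of write cost

    def upsert(self, key, value):
        new_base = dict(self.base)   # copy the full file...
        new_base[key] = value        # ...with the change applied
        self.rows_rewritten += len(new_base)
        self.base = new_base

    def read(self):
        return dict(self.base)


class WriteOptimizedTable:
    """Upserts append to a log; reads merge log and base; compaction folds them."""

    def __init__(self, rows):
        self.base = dict(rows)       # stands in for a Parquet base file
        self.log = []                # stands in for row-based Avro log files

    def upsert(self, key, value):
        self.log.append((key, value))    # cheap append, nothing is rewritten

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:      # later log entries win
            merged[key] = value
        return merged

    def compact(self):
        self.base = self.read()          # fold the log into a new base file
        self.log = []


initial = {"k1": "a", "k2": "b", "k3": "c"}
ro = ReadOptimizedTable(initial)
wo = WriteOptimizedTable(initial)
for table in (ro, wo):
    table.upsert("k2", "B")
    table.upsert("k4", "d")

same_result = ro.read() == wo.read()
pending_log_entries = len(wo.log)
print(same_result, ro.rows_rewritten, pending_log_entries)

wo.compact()   # after compaction the write-optimized table has a fresh base file
```

Both tables return the same rows, but the read-optimized one paid for two full rewrites at write time, while the write-optimized one deferred that work to read time and compaction.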
3. Delta Lake
The traditional Lambda architecture requires maintaining separate batch and stream processing systems, leading to high resource consumption and complexity. Existing Hive‑based warehouses or file formats (Parquet/ORC) suffer from issues such as small files, concurrent read/write conflicts, limited update support, and metadata overload.
Delta Lake acts as a storage layer between Spark and the underlying storage, adding schema information and offering key features:
Metadata system built on HDFS that alleviates metastore pressure;
Support for richer update modes (Merge, Update, Delete) and streaming writes/reads, enabling real‑time data lakes;
Unified batch and streaming operations on the same table;
Versioning that allows rollback to previous states, preventing catastrophic data loss.
Delta Lake stores all data in Parquet, leveraging Parquet’s native compression and encoding. It provides ACID transaction guarantees across concurrent writes, with a transaction log that records file‑level writes and uses optimistic concurrency control. Conflicts raise exceptions that applications can handle and retry.
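A rough sketch of how a file-level transaction log with optimistic concurrency control can work is shown below. It is a toy in plain Python, not Delta Lake's implementation or API: `ToyLog`, `try_commit`, `files_at`, `commit_with_retry`, and `ConcurrentModificationError` are hypothetical names, and the conflict check is deliberately simplistic (any intervening commit counts as a conflict, whereas a real system would normally check whether the concurrent changes actually overlap).

```python
class ConcurrentModificationError(Exception):
    """Raised when another writer committed after we read the table."""


class ToyLog:
    """Ordered list of commits; each commit records files added and removed."""

    def __init__(self):
        self.commits = []    # index in this list == table version

    def latest_version(self):
        return len(self.commits) - 1   # -1 means the table is empty

    def try_commit(self, read_version, add=(), remove=()):
        # Optimistic check: succeed only if nobody committed since we read.
        if read_version != self.latest_version():
            raise ConcurrentModificationError(
                f"read v{read_version}, table is at v{self.latest_version()}"
            )
        self.commits.append({"add": list(add), "remove": list(remove)})
        return self.latest_version()

    def files_at(self, version):
        """Replay the log up to `version` to get the live file set."""
        files = set()
        for commit in self.commits[: version + 1]:
            files |= set(commit["add"])
            files -= set(commit["remove"])
        return files


def commit_with_retry(log, add=(), remove=(), max_attempts=3):
    for _ in range(max_attempts):
        read_version = log.latest_version()
        try:
            return log.try_commit(read_version, add, remove)
        except ConcurrentModificationError:
            continue     # re-read the latest version and try again
    raise ConcurrentModificationError("gave up after retries")


log = ToyLog()
v0 = log.try_commit(-1, add=["part-0.parquet"])
stale = log.latest_version()                      # writer A reads version 0
log.try_commit(stale, add=["part-1.parquet"])     # writer B commits version 1 first

try:
    log.try_commit(stale, add=["part-2.parquet"])  # writer A is now stale
    conflict = False
except ConcurrentModificationError:
    conflict = True

v2 = commit_with_retry(log, add=["part-2.parquet"], remove=["part-0.parquet"])
print(conflict, v2, sorted(log.files_at(v2)), sorted(log.files_at(v0)))
```

Because the live file set is always derived by replaying the log, reading an older version (`files_at(v0)` here) is the same operation as reading the latest one, which is what makes rollback to a previous state straightforward in this kind of design.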
Delta Lake is a library rather than a standalone service; it does not require separate deployment and works primarily with the Spark engine, making its adoption low‑cost.
4. Data Lake Technology Comparison
Summary
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
