Comparative Analysis of Hudi, Iceberg, and Delta Lake for Data Lake Storage

This article compares three open-source data-lake storage layers (Hudi, Iceberg, and Delta Lake). It examines their shared reliance on meta files for schema and transaction handling, and details their differing designs for upserts, streaming support, query performance, and ecosystem integration.

Big Data Technology Architecture

Author: Xin Yong, EMR Technical Expert at Alibaba Computing Platform Division, contributor to Apache Hadoop and Apache Spark, with deep research on Hadoop, Spark, Hive, Druid and current work on big‑data cloudification.

Common Points

All three are intermediate layers on top of Data Lake storage, and their data-management functions rely on a set of meta files. These meta files play the role of a database catalog and write-ahead log (WAL), providing schema management, transaction management, and data management. Unlike in a traditional database, the meta files sit alongside the data files in the storage system and are directly visible to users. This follows the big-data tradition of keeping data user-visible, but it also raises the risk of accidental damage: deleting a meta directory can corrupt a table and make recovery difficult.

The meta files contain the table's schema, which lets the system track schema evolution, and a transaction log, which provides ACID guarantees and multi-version support. Every change to a table generates a new meta file, so earlier versions remain accessible. In these respects, the three systems are similar.
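The idea of one meta file per table change can be illustrated with a toy log in Python. This is a simplified sketch, not the on-disk format of any of the three systems: the `MetaLog` class, the `_meta` directory name, and the JSON layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class MetaLog:
    """Toy transaction log: one JSON meta file per commit, stored beside the data."""

    def __init__(self, table_dir):
        self.meta_dir = Path(table_dir) / "_meta"  # hypothetical directory name
        self.meta_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        versions = [int(p.stem) for p in self.meta_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, schema, added=(), removed=()):
        """Write the next meta file; returns the new table version."""
        version = self.latest_version() + 1
        entry = {"schema": schema, "add": list(added), "remove": list(removed)}
        path = self.meta_dir / f"{version:08d}.json"
        # Mode "x" raises FileExistsError if another writer already took this version.
        with open(path, "x") as f:
            json.dump(entry, f)
        return version

    def snapshot(self, version=None):
        """Replay meta files up to `version` to get the schema and live data files."""
        if version is None:
            version = self.latest_version()
        schema, files = None, set()
        for v in range(version + 1):
            entry = json.loads((self.meta_dir / f"{v:08d}.json").read_text())
            schema = entry["schema"]
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return schema, sorted(files)


with tempfile.TemporaryDirectory() as table_dir:
    log = MetaLog(table_dir)
    log.commit(["id", "name"], added=["part-0.parquet"])
    log.commit(["id", "name", "ts"], added=["part-1.parquet"], removed=["part-0.parquet"])
    print(log.snapshot(0))  # (['id', 'name'], ['part-0.parquet'])
    print(log.snapshot())   # (['id', 'name', 'ts'], ['part-1.parquet'])
```

Because every commit is an immutable file, reading an older version only means replaying fewer files. It also shows why deleting the meta directory is so damaging: the data files remain, but nothing records which of them are live.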

Hudi

Hudi (Hadoop Upserts Deletes and Incrementals) focuses on upserts, deletes, and incremental processing. It offers three write modes through the Spark HudiDataSource API and DeltaStreamer: UPSERT, INSERT, and BULK_INSERT. Deletes are expressed through write-time options rather than a dedicated delete API.
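As a rough illustration, the write mode is selected through options passed to the Spark writer. The sketch below only builds the option map, so it runs without Spark. The helper `hudi_write_options` is hypothetical; the option keys are the ones documented for Hudi's Spark datasource as far as I am aware, and the delete payload class name follows the 0.5.x-era documentation, so both should be checked against the release in use.

```python
WRITE_OPERATIONS = {"upsert", "insert", "bulk_insert"}


def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert", delete=False):
    """Build the option map for a Hudi write (hypothetical helper)."""
    if operation not in WRITE_OPERATIONS:
        raise ValueError(f"unsupported write operation: {operation}")
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if delete:
        # Assumption: deletes are requested by swapping in an "empty" payload
        # class on an upsert; the class name differs between Hudi releases.
        options["hoodie.datasource.write.payload.class"] = (
            "org.apache.hudi.EmptyHoodieRecordPayload"
        )
    return options


opts = hudi_write_options("trips", "uuid", "day", "ts")
# With a Spark session and the Hudi bundle on the classpath, the map would be
# used roughly like this (not executed here):
#   df.write.format("org.apache.hudi").options(**opts).mode("append").save(base_path)
```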

Typical usage streams upstream data from Kafka or Sqoop into Hudi via DeltaStreamer, a long‑running service that pulls data in batches and can trigger small‑file compaction automatically. Hudi supports Hive, Spark, and Presto for queries.

Performance relies on the HoodieKey, a primary-key-like identifier, combined with per-file min/max statistics and Bloom filters for fast record location. On an upsert, Hudi checks the key against these first: if the key is definitely absent, the record is inserted; otherwise the file that holds the key is located and the record is updated. This avoids a full-table join.
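The lookup described above can be sketched in plain Python. This is an illustration of the general technique (range pruning, then a Bloom filter, then a confirming lookup), not Hudi's actual index code; `BloomFilter`, `DataFile`, and `tag_record` are hypothetical names, and the in-memory key set stands in for reading keys from a real data file.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    def __init__(self, name, keys):
        self.name = name
        self.keys = set(keys)  # stands in for the keys stored in the file
        self.min_key, self.max_key = min(keys), max(keys)
        self.bloom = BloomFilter()
        for key in keys:
            self.bloom.add(key)


def tag_record(key, files):
    """Decide whether an incoming key is an update (and to which file) or an insert."""
    for f in files:
        if not f.min_key <= key <= f.max_key:
            continue  # pruned by min/max statistics
        if not f.bloom.might_contain(key):
            continue  # Bloom filter says the key is definitely not in this file
        if key in f.keys:  # confirm, since a Bloom filter can give false positives
            return ("update", f.name)
    return ("insert", None)


files = [DataFile("file-1", ["a001", "a050"]), DataFile("file-2", ["b001", "b099"])]
print(tag_record("a050", files))  # ('update', 'file-1')
print(tag_record("c500", files))  # ('insert', None)
```

Only files that survive both cheap checks are actually opened, which is what keeps the cost of an upsert well below that of joining the incoming batch against the whole table.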

Hudi provides two table types: Copy-On-Write, which merges changes into the data files at write time (slightly slower writes, faster reads), and Merge-On-Read, which defers the merge to read time and so enables near-real-time analytics. It also includes a run_sync_tool script to sync the schema to Hive and a command-line tool for table management.
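The trade-off between the two table types can be shown with a minimal in-memory model. The classes below are hypothetical and ignore file formats entirely; they only show where the merge cost is paid.

```python
class CopyOnWriteTable:
    """Merge at write time: every upsert rewrites the base data."""

    def __init__(self, rows):
        self.base = dict(rows)

    def upsert(self, rows):
        merged = dict(self.base)  # rewrite the base with the changes applied
        merged.update(rows)
        self.base = merged

    def read(self):
        return dict(self.base)    # reads see already-merged data


class MergeOnReadTable:
    """Merge at read time: upserts only append to a delta log."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.delta_log = []

    def upsert(self, rows):
        self.delta_log.append(dict(rows))  # cheap append, data visible quickly

    def read(self):
        merged = dict(self.base)
        for delta in self.delta_log:       # the merge cost is paid by the reader
            merged.update(delta)
        return merged

    def compact(self):
        """Fold the delta log into the base so later reads are cheap again."""
        self.base = self.read()
        self.delta_log = []


cow = CopyOnWriteTable({1: "a", 2: "b"})
mor = MergeOnReadTable({1: "a", 2: "b"})
for table in (cow, mor):
    table.upsert({2: "B", 3: "c"})
print(cow.read() == mor.read())  # True
```

Both tables return the same rows; the difference is that the Merge-On-Read table still carries an unmerged delta log until `compact()` runs.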

Iceberg

Iceberg does not use a HoodieKey design and does not emphasize a primary key. Updates, deletes, and merges must be implemented via joins, requiring an SQL execution engine. Iceberg does not bind to a specific engine; it supports Spark and Presto for queries but lacks native delete/update APIs. Users typically perform overwrites on affected partitions for updates.
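The "join, then overwrite the affected partitions" pattern can be sketched without any engine. The function below is a hypothetical illustration in plain Python, with a table modelled as a mapping from partition value to a list of rows; in practice the join would be executed by an SQL engine such as Spark.

```python
def merge_by_partition_overwrite(table, updates, key, partition_field):
    """Return a new table in which every partition touched by `updates` is rewritten."""
    # Group the incoming rows by partition, indexed by key.
    changes = {}
    for row in updates:
        changes.setdefault(row[partition_field], {})[row[key]] = row

    result = dict(table)  # untouched partitions are carried over as-is
    for partition, changed in changes.items():
        rewritten = []
        for old in table.get(partition, []):
            # Join on key: take the new version if one exists, else keep the old row.
            rewritten.append(changed.pop(old[key], old))
        rewritten.extend(changed.values())  # keys with no match are inserts
        result[partition] = rewritten       # overwrite the whole partition
    return result


table = {
    "2019-12-01": [{"id": 1, "day": "2019-12-01", "v": "a"},
                   {"id": 2, "day": "2019-12-01", "v": "b"}],
    "2019-12-02": [{"id": 3, "day": "2019-12-02", "v": "c"}],
}
updates = [{"id": 2, "day": "2019-12-01", "v": "B"},
           {"id": 9, "day": "2019-12-01", "v": "new"}]
merged = merge_by_partition_overwrite(table, updates, "id", "day")
print([row["v"] for row in merged["2019-12-01"]])  # ['a', 'B', 'new']
```

Note that the whole partition is rewritten even when a single row changes, which is the cost of doing updates without a key-based index.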

Iceberg’s query performance benefits from the hidden‑partition feature, where users can transform columns (e.g., hour(timestamp)) to create invisible partition columns used for data organization and pruning. It also collects extensive column statistics (size, value count, null count, min/max) for data skipping.
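A minimal sketch of how a partition transform and column statistics combine to skip data is shown below. It is an illustration of the idea only: `hour_transform`, `write_files`, and `scan` are hypothetical names, and the `ts` and `amount` columns are invented for the example.

```python
from datetime import datetime


def hour_transform(ts):
    """Hidden partition value derived from a timestamp column, like hour(timestamp)."""
    return ts.strftime("%Y-%m-%d-%H")


def write_files(rows):
    """Group rows into one file per hidden partition and record column statistics."""
    grouped = {}
    for row in rows:
        grouped.setdefault(hour_transform(row["ts"]), []).append(row)
    files = []
    for partition, part_rows in sorted(grouped.items()):
        amounts = [r["amount"] for r in part_rows]
        files.append({
            "partition": partition,
            "rows": part_rows,
            "stats": {"value_count": len(part_rows),
                      "amount_min": min(amounts),
                      "amount_max": max(amounts)},
        })
    return files


def scan(files, ts_from, ts_to, min_amount):
    """Filter on ts and amount; returns matching rows and the partitions opened."""
    # The user filters on ts only; the partition range is derived from it.
    lo, hi = hour_transform(ts_from), hour_transform(ts_to)
    opened, result = [], []
    for f in files:
        if not lo <= f["partition"] <= hi:
            continue  # partition pruning
        if f["stats"]["amount_max"] < min_amount:
            continue  # data skipping via min/max statistics
        opened.append(f["partition"])
        result.extend(r for r in f["rows"]
                      if ts_from <= r["ts"] <= ts_to and r["amount"] >= min_amount)
    return result, opened


rows = [
    {"ts": datetime(2019, 12, 1, 10, 5), "amount": 5},
    {"ts": datetime(2019, 12, 1, 10, 40), "amount": 7},
    {"ts": datetime(2019, 12, 1, 11, 10), "amount": 50},
    {"ts": datetime(2019, 12, 1, 12, 30), "amount": 3},
]
matches, opened = scan(write_files(rows),
                       datetime(2019, 12, 1, 10, 0),
                       datetime(2019, 12, 1, 11, 59), min_amount=20)
print(opened)  # ['2019-12-01-11']
```

The query never mentions the partition column. The 12:00 file is pruned by the derived partition range, and the 10:00 file is skipped because its maximum `amount` is below the threshold.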

Table creation is done through an API that specifies name, schema, and partition information, registering the table in a Hive catalog.

Delta

Delta targets a unified streaming‑batch Data Lake layer, supporting update/delete/merge. Originating from Databricks, it integrates tightly with Spark for all write modes (batch, streaming, SQL INSERT/INSERT OVERWRITE). Like Iceberg, Delta does not emphasize a primary key; updates are implemented via Spark joins.
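One way to picture the unified streaming-batch design is a single versioned log that serves both kinds of reader. The sketch below is a hypothetical in-memory model, not Delta's implementation: a batch read replays the log up to a version, while a streaming reader remembers the last version it saw and returns only what was committed since.

```python
class LogTable:
    """Append-only table in which each commit is a new version."""

    def __init__(self):
        self.commits = []  # commits[i] holds the rows appended in version i

    def append(self, rows):
        self.commits.append(list(rows))
        return len(self.commits) - 1  # the version just written

    def read_batch(self, version=None):
        """Batch read: all rows as of `version` (latest if omitted)."""
        end = len(self.commits) if version is None else version + 1
        return [row for commit in self.commits[:end] for row in commit]


class StreamReader:
    """Streaming read: returns only rows committed since the previous poll."""

    def __init__(self, table):
        self.table = table
        self.next_version = 0

    def poll(self):
        new_commits = self.table.commits[self.next_version:]
        self.next_version = len(self.table.commits)
        return [row for commit in new_commits for row in commit]


table = LogTable()
stream = StreamReader(table)
table.append(["r1", "r2"])
print(stream.poll())       # ['r1', 'r2']
table.append(["r3"])
print(stream.poll())       # ['r3']
print(table.read_batch())  # ['r1', 'r2', 'r3']
```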

Delta’s write path is strongly coupled with Spark, unlike Hudi which can use Spark or its own tools. For queries, open‑source Delta supports Spark and Presto, but Presto requires a preceding Spark job to generate a SymlinkTextInputFormat file. EMR has added a DeltaInputFormat to allow direct Presto queries without Spark.

Performance‑wise, open‑source Delta lacks many optimizations (e.g., hidden partitions, column stats). Databricks adds Data Skipping and file‑level caching, but the open version does not. EMR is working on closing these gaps.

Despite some shortcomings, Delta’s strengths lie in its deep Spark integration, unified streaming‑batch design, and support for Lambda/Kappa architecture improvements, making it flexible for analytics, machine learning, and CDC scenarios.

Conclusion

The three engines serve different primary scenarios: Hudi focuses on incremental upserts, Iceberg on high‑performance analytics and reliable data management, and Delta on unified streaming‑batch processing. Their design differences reflect these goals, though all are evolving and may converge over time.

The table below summarizes capabilities of each engine as of the end of 2019.

| Feature | Delta | Hudi | Iceberg |
| --- | --- | --- | --- |
| Incremental Ingestion | Spark | Spark | Spark |
| ACID updates | HDFS, S3 (Databricks), OSS | HDFS | HDFS, S3 |
| Upserts/Delete/Merge/Update | Delete/Merge/Update | Upserts/Delete | No |
| Streaming sink | Yes | Yes | Yes (not ready?) |
| Streaming source | Yes | No | No |
| File Formats | Parquet | Avro, Parquet | Parquet, ORC |
| Data Skipping | File-Level Max-Min stats + Z-Ordering (Databricks) | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering |
| Concurrency control | Optimistic | Optimistic | Optimistic |
| Data Validation | Yes (Databricks) | No | Yes |
| Merge on read | No | Yes | No |
| Schema Evolution | Yes | Yes | Yes |
| File I/O Cache | Yes (Databricks) | No | No |
| Cleanup | Manual | Automatic | No |
| Compaction | Manual | Automatic | No |

Note: The author acknowledges possible inaccuracies and welcomes feedback.
