Comparative Analysis of Hudi, Iceberg, and Delta Lake for Data Lake Storage
This article compares three open‑source data‑lake storage layers—Hudi, Iceberg, and Delta Lake—examining their shared reliance on meta‑files for schema and transaction handling, and detailing their differing designs for upserts, streaming support, query performance, and ecosystem integration.
Author: Xin Yong, EMR Technical Expert at Alibaba Computing Platform Division, contributor to Apache Hadoop and Apache Spark, with deep research on Hadoop, Spark, Hive, Druid and current work on big‑data cloudification.
Common Points
All three are intermediate layers of data lake storage whose data-management functions rely on a set of meta files. These meta files play a role similar to a database catalog and write-ahead log (WAL), providing schema management, transaction management, and data management. Unlike in a traditional database, the meta files are stored alongside the data files in the storage engine and are directly visible to users. This follows the big-data tradition of keeping data visible to users, but it also raises the risk of accidental damage: deleting a meta directory can corrupt a table and make recovery difficult.
The meta files contain the table's schema information, which lets the system track schema evolution, and they act as a transaction log that provides ACID guarantees and multi-version support. Every table change generates a new meta file, so earlier versions of the table remain accessible. In these respects, the three systems are similar.
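The idea of one new meta file per table change can be illustrated with a toy sketch. The `_meta` directory name, the JSON layout, and the `commit` and `read_snapshot` helpers are all hypothetical and do not match the actual on-disk format of Hudi, Iceberg, or Delta; a real implementation would also need an atomic commit step, which is omitted here.

```python
import json
import os
import tempfile


def commit(table_dir, schema, data_files):
    """Write a new numbered meta file describing the table after this change."""
    meta_dir = os.path.join(table_dir, "_meta")  # hypothetical directory name
    os.makedirs(meta_dir, exist_ok=True)
    version = len(os.listdir(meta_dir))  # next version = number of existing commits
    snapshot = {"version": version, "schema": schema, "files": sorted(data_files)}
    with open(os.path.join(meta_dir, f"{version:08d}.json"), "w") as f:
        json.dump(snapshot, f)
    return version


def read_snapshot(table_dir, version=None):
    """Read the latest snapshot, or an older one when a version is given."""
    meta_dir = os.path.join(table_dir, "_meta")
    names = sorted(os.listdir(meta_dir))
    name = names[-1] if version is None else f"{version:08d}.json"
    with open(os.path.join(meta_dir, name)) as f:
        return json.load(f)


table = tempfile.mkdtemp()
commit(table, {"id": "long"}, ["part-0.parquet"])
commit(table, {"id": "long", "name": "string"}, ["part-0.parquet", "part-1.parquet"])
latest = read_snapshot(table)            # current schema and file list
first = read_snapshot(table, version=0)  # history access to the first commit
```

Because old meta files are never rewritten, both the schema history and the earlier file lists stay readable, which is the basis for the multi-version support described above.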
Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) focuses on upserts, deletes, and incremental processing. It offers three write modes through the Spark HudiDataSource API and DeltaStreamer: UPSERT, INSERT, and BULK_INSERT. Deletions are supported through write-time options rather than a dedicated delete API.
Typical usage streams upstream data from Kafka or Sqoop into Hudi via DeltaStreamer, a long‑running service that pulls data in batches and can trigger small‑file compaction automatically. Hudi supports Hive, Spark, and Presto for queries.
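Performance relies on HoodieKey (a primary-key-like identifier), for which Hudi keeps min/max statistics and Bloom filters so that records can be located quickly. An upsert first checks the min/max range and Bloom filter of each file: if no file can contain the key, the record is inserted; otherwise the candidate file is checked and the matching record is updated. This avoids a join against the whole table.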
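The lookup described above can be sketched in plain Python. This is a simplified illustration, not Hudi's index implementation: the `BloomFilter`, `DataFile`, and `upsert` names are hypothetical, files are modeled as in-memory dictionaries, and every insert goes to a new file.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    """A data file with min/max key statistics and a Bloom filter over its keys."""

    def __init__(self, records):
        self.records = dict(records)
        self.min_key = min(self.records)
        self.max_key = max(self.records)
        self.bloom = BloomFilter()
        for key in self.records:
            self.bloom.add(key)

    def may_hold(self, key):
        return self.min_key <= key <= self.max_key and self.bloom.might_contain(key)


def upsert(files, key, value):
    for data_file in files:
        # Cheap checks first; the exact lookup confirms the hit,
        # because a Bloom filter can return false positives.
        if data_file.may_hold(key) and key in data_file.records:
            data_file.records[key] = value
            return "update"
    files.append(DataFile({key: value}))  # no file holds the key: insert
    return "insert"


files = [DataFile({"k001": 1, "k002": 2}), DataFile({"k100": 3})]
first_result = upsert(files, "k002", 20)   # key exists in the first file
second_result = upsert(files, "k500", 5)   # key is outside every file's range
```

Only files that pass both the range check and the Bloom filter are opened, so most files are skipped without reading their contents.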
Hudi provides both Copy‑On‑Write (writes merge data, slightly slower writes but faster reads) and Merge‑On‑Read (merges at read time, enabling near‑real‑time analytics). It also includes a run_sync_tool script to sync schema to Hive and a command‑line tool for table management.
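The trade-off between the two table types can be shown with a toy model. The `cow_write` function and `MorTable` class below are hypothetical names for illustration only; real Hudi tables work on columnar base files and row-based log files rather than dictionaries.

```python
def cow_write(base, updates):
    """Copy-On-Write: merge at write time and produce a new base file."""
    merged = dict(base)
    merged.update(updates)
    return merged  # readers scan this merged file directly


class MorTable:
    """Merge-On-Read: writes append to a log, reads merge base and log."""

    def __init__(self, base):
        self.base = dict(base)
        self.log = []  # append-only list of delta batches

    def write(self, updates):
        self.log.append(dict(updates))  # cheap: the base is not rewritten

    def read(self):
        view = dict(self.base)
        for delta in self.log:  # merge cost is paid on every read
            view.update(delta)
        return view

    def compact(self):
        """Fold the log into the base so later reads are cheap again."""
        self.base = self.read()
        self.log = []


base = {"a": 1, "b": 2}
cow_result = cow_write(base, {"b": 20, "c": 3})

mor = MorTable(base)
mor.write({"b": 20})
mor.write({"c": 3})
mor_result = mor.read()
```

Both paths return the same rows; they differ in whether the merge cost is paid by the writer or by the reader, with compaction moving a Merge-On-Read table back toward the read-optimized state.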
Iceberg
Iceberg does not use a HoodieKey design and does not emphasize a primary key. Updates, deletes, and merges must be implemented via joins, requiring an SQL execution engine. Iceberg does not bind to a specific engine; it supports Spark and Presto for queries but lacks native delete/update APIs. Users typically perform overwrites on affected partitions for updates.
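The partition-overwrite approach to updates can be sketched as follows. This is a toy model with hypothetical names (`overwrite_affected_partitions`, the `id` key, and the `day` partition column); it does not use the Iceberg API, and a table is represented as a dictionary from partition value to rows.

```python
from collections import defaultdict


def overwrite_affected_partitions(table, updates, key="id", partition_col="day"):
    """Rewrite only the partitions that receive changed rows."""
    changed = defaultdict(list)
    for row in updates:
        changed[row[partition_col]].append(row)

    new_table = dict(table)  # untouched partitions are carried over as-is
    for partition, rows in changed.items():
        by_key = {row[key]: row for row in table.get(partition, [])}
        by_key.update({row[key]: row for row in rows})  # join old and new on key
        new_table[partition] = list(by_key.values())    # whole partition replaced
    return new_table


table = {
    "2019-12-01": [{"id": 1, "day": "2019-12-01", "v": "a"}],
    "2019-12-02": [{"id": 2, "day": "2019-12-02", "v": "b"}],
}
updates = [
    {"id": 2, "day": "2019-12-02", "v": "B"},
    {"id": 3, "day": "2019-12-02", "v": "c"},
]
new_table = overwrite_affected_partitions(table, updates)
```

The cost of an update is proportional to the size of the affected partitions rather than to the number of changed rows, which is why this approach suits batch corrections better than frequent small upserts.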
Iceberg’s query performance benefits from the hidden‑partition feature, where users can transform columns (e.g., hour(timestamp)) to create invisible partition columns used for data organization and pruning. It also collects extensive column statistics (size, value count, null count, min/max) for data skipping.
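A minimal sketch of how a derived partition value and per-file statistics combine to prune files at planning time is shown below. The function names and the `amount` column are assumptions made for illustration; this is not Iceberg's planner, and each "file" is simply a dictionary holding its partition value, min/max statistics, and rows.

```python
from collections import defaultdict
from datetime import datetime


def hour_transform(ts):
    """Derive the hidden partition value from a timestamp column."""
    return ts.strftime("%Y-%m-%d-%H")


def write_table(rows):
    """Group rows by the derived partition and record per-file min/max stats."""
    groups = defaultdict(list)
    for row in rows:
        groups[hour_transform(row["ts"])].append(row)
    return [
        {
            "partition": partition,
            "min_amount": min(r["amount"] for r in group),
            "max_amount": max(r["amount"] for r in group),
            "rows": group,
        }
        for partition, group in sorted(groups.items())
    ]


def plan_scan(files, start, end, min_amount):
    """Select files that may match: start <= ts <= end and amount >= min_amount."""
    lo, hi = hour_transform(start), hour_transform(end)
    return [
        f for f in files
        if lo <= f["partition"] <= hi      # partition pruning on the derived value
        and f["max_amount"] >= min_amount  # data skipping on column statistics
    ]


rows = [
    {"ts": datetime(2019, 12, 1, 9, 15), "amount": 5},
    {"ts": datetime(2019, 12, 1, 9, 45), "amount": 50},
    {"ts": datetime(2019, 12, 1, 10, 5), "amount": 7},
    {"ts": datetime(2019, 12, 1, 11, 30), "amount": 90},
]
files = write_table(rows)
selected = plan_scan(
    files, datetime(2019, 12, 1, 9, 0), datetime(2019, 12, 1, 10, 59), min_amount=20
)
```

The query filters on the original `ts` column only; the partition value is derived by the table layer, so users never need to know or reference the partition column.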
Table creation is done through an API that specifies name, schema, and partition information, registering the table in a Hive catalog.
Delta
Delta targets a unified streaming‑batch Data Lake layer, supporting update/delete/merge. Originating from Databricks, it integrates tightly with Spark for all write modes (batch, streaming, SQL INSERT/INSERT OVERWRITE). Like Iceberg, Delta does not emphasize a primary key; updates are implemented via Spark joins.
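A join-based merge of the kind described above can be sketched without Spark. The `merge` function, the `id` key, and the `op` column are hypothetical; in Delta the equivalent work is done by a Spark join between the source and the target table, not by this code.

```python
def merge(target, source, key="id"):
    """Full outer join of target and source on key, applying source changes."""
    target_by_key = {row[key]: row for row in target}
    source_by_key = {row[key]: row for row in source}

    merged = []
    for k in sorted(target_by_key.keys() | source_by_key.keys()):
        change = source_by_key.get(k)
        if change is None:
            merged.append(target_by_key[k])  # no change for this key: keep the row
        elif change["op"] == "delete":
            continue                         # delete: drop the row from the result
        else:
            # upsert: the source row replaces or adds the row, minus the op marker
            merged.append({c: v for c, v in change.items() if c != "op"})
    return merged


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
source = [
    {"id": 2, "v": "B", "op": "upsert"},
    {"id": 3, "op": "delete"},
    {"id": 4, "v": "d", "op": "upsert"},
]
merged = merge(target, source)
```

Without a key index, every merge has to match source rows against the target through a join, which is the cost that Hudi's HoodieKey design tries to avoid.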
Delta's write path is strongly coupled with Spark, unlike Hudi, which can write through Spark or its own tools. For queries, open-source Delta supports Spark and Presto, but Presto requires a preceding Spark job to generate a symlink manifest file, which Presto then reads through SymlinkTextInputFormat. EMR has added a DeltaInputFormat that lets Presto query Delta tables directly, without the Spark step.
Performance‑wise, open‑source Delta lacks many optimizations (e.g., hidden partitions, column stats). Databricks adds Data Skipping and file‑level caching, but the open version does not. EMR is working on closing these gaps.
Despite some shortcomings, Delta's strengths lie in its deep Spark integration, its unified streaming-batch design, and its ability to improve on Lambda and Kappa architectures, which makes it flexible for analytics, machine learning, and CDC scenarios.
Conclusion
The three engines serve different primary scenarios: Hudi focuses on incremental upserts, Iceberg on high‑performance analytics and reliable data management, and Delta on unified streaming‑batch processing. Their design differences reflect these goals, though all are evolving and may converge over time.
The table below summarizes capabilities of each engine as of the end of 2019.
| Capability | Delta | Hudi | Iceberg |
| --- | --- | --- | --- |
| Incremental Ingestion | Spark | Spark | Spark |
| ACID updates | HDFS, S3 (Databricks), OSS | HDFS | HDFS, S3 |
| Upserts/Delete/Merge/Update | Delete/Merge/Update | Upserts/Delete | No |
| Streaming sink | Yes | Yes | Yes (not ready?) |
| Streaming source | Yes | No | No |
| File Formats | Parquet | Avro, Parquet | Parquet, ORC |
| Data Skipping | File-Level Max-Min stats + Z-Ordering (Databricks) | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering |
| Concurrency control | Optimistic | Optimistic | Optimistic |
| Data Validation | Yes (Databricks) | No | Yes |
| Merge on read | No | Yes | No |
| Schema Evolution | Yes | Yes | Yes |
| File I/O Cache | Yes (Databricks) | No | No |
| Cleanup | Manual | Automatic | No |
| Compaction | Manual | Automatic | No |
Note: The author acknowledges possible inaccuracies and welcomes feedback.