Comparison of Hudi, Iceberg, and Delta Lake Table Formats
This article compares the design goals of three data‑lake table formats (Hudi, Iceberg, and Delta Lake), highlighting their common reliance on meta files and their distinct strengths: upserts for Hudi, analytics for Iceberg, and unified streaming‑batch processing for Delta.
All three technologies (Hudi, Iceberg, and Delta) serve as an intermediate storage layer in a data lake and manage data through a set of meta files. These meta files act like a catalog or write‑ahead log (WAL): they store schema, transaction logs, and versioning information alongside the data files. Because they sit in the same storage as the data, they are visible to users but also vulnerable to accidental deletion.
The meta files record the table schema, which lets the system handle schema evolution. Each change is committed by writing a new meta file; this is what underpins the ACID guarantees, and it also allows multi‑version support and access to historical data.
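The idea of "one new meta file per change" can be illustrated with a toy model. The sketch below is not the on-disk layout of Hudi, Iceberg, or Delta; the `_meta` directory, the JSON fields, and the function names `commit` and `snapshot` are all hypothetical. It only shows how replaying numbered meta files yields both the current table state and any earlier version.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence


def commit(table_dir: Path, schema: List[str], added: Sequence[str],
           removed: Sequence[str] = ()) -> int:
    """Record one change as a new numbered meta file (hypothetical layout)."""
    log_dir = table_dir / "_meta"
    log_dir.mkdir(parents=True, exist_ok=True)
    # The next version number is simply the count of existing meta files.
    version = len(list(log_dir.glob("*.json")))
    entry = {
        "version": version,
        "schema": schema,          # the schema travels with every commit
        "add": list(added),        # data files that become visible
        "remove": list(removed),   # data files that stop being visible
    }
    (log_dir / f"{version:08d}.json").write_text(json.dumps(entry))
    return version


def snapshot(table_dir: Path, as_of: Optional[int] = None) -> dict:
    """Replay meta files in order, stopping after version `as_of` if given."""
    files = set()
    schema: List[str] = []
    for path in sorted((table_dir / "_meta").glob("*.json")):
        entry = json.loads(path.read_text())
        if as_of is not None and entry["version"] > as_of:
            break
        schema = entry["schema"]
        files -= set(entry["remove"])
        files |= set(entry["add"])
    return {"schema": schema, "files": sorted(files)}
```

Calling `snapshot(table_dir)` returns the latest state, while `snapshot(table_dir, as_of=0)` returns the table as it looked after the first commit. A real format also has to make each commit atomic under concurrent writers, which this sketch does not attempt.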
Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) focuses on upserts, deletes, and incremental processing. It provides three write modes—UPSERT, INSERT, and BULK_INSERT—through the Spark HudiDataSource API and its own DeltaStreamer service, which can ingest data from Kafka or Sqoop and automatically merge small files.
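As a rough illustration of how the three write modes are selected through the Spark data source, the helper below builds the option map that is typically passed to a Hudi write. The helper name `hudi_write_options` and the sample field names are hypothetical; the option keys are the commonly documented Hudi Spark write options, but they should be checked against the Hudi version in use.

```python
VALID_OPERATIONS = {"upsert", "insert", "bulk_insert"}


def hudi_write_options(table_name, record_key, precombine_field,
                       partition_field, operation="upsert"):
    """Build a Hudi write option map for one of the three write modes."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        # Selects UPSERT, INSERT, or BULK_INSERT behaviour.
        "hoodie.datasource.write.operation": operation,
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest record when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Column that determines the partition path.
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Hypothetical table and column names, for illustration only.
opts = hudi_write_options("orders", "order_id", "updated_at", "order_date")

# With a Spark session and the Hudi bundle on the classpath, the write
# would look roughly like this (not executed here):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Switching `operation` to `"bulk_insert"` is the usual choice for an initial load, while `"upsert"` is the mode that performs the key lookup described for Hudi.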
Hudi supports query engines such as Hive, Spark, and Presto. Its upsert performance relies on the HoodieKey (the record key plus partition path), together with Min/Max key statistics and a BloomFilter kept for each data file, which allow fast record location. For each incoming record, an upsert checks the HoodieKey against these statistics and the BloomFilter to decide whether to insert or update, avoiding a full‑table join. Hudi also offers two storage types: Copy‑On‑Write (higher read performance) and Merge‑On‑Read (near‑real‑time writes). Additional tools include run_sync_tool for syncing schemas to Hive and a command‑line utility for table management.
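The insert-or-update decision can be sketched in plain Python. This is a simplified model, not Hudi's implementation: the classes `BloomFilter` and `DataFile`, the function `tag_record`, and the filter sizing are all assumptions made for illustration. The point is the order of checks: the key range first, then the Bloom filter, and only then an actual lookup in the candidate file.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter; sizes are arbitrary choices for this sketch."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    """Stands in for one data file plus the key statistics kept for it."""

    def __init__(self, name, keys):
        self.name = name
        self.keys = set(keys)          # the file's actual record keys
        self.min_key = min(self.keys)
        self.max_key = max(self.keys)
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)


def tag_record(key, files):
    """Return ("update", file_name) if the key exists, else ("insert", None)."""
    for f in files:
        if not (f.min_key <= key <= f.max_key):
            continue                   # key range rules this file out
        if not f.bloom.might_contain(key):
            continue                   # Bloom filter says definitely absent
        if key in f.keys:              # confirm, since false positives happen
            return ("update", f.name)
    return ("insert", None)
```

For example, with one file holding keys `a001` and `a050`, the key `a050` is tagged as an update, while `a025` falls inside the key range but is still tagged as an insert because the final lookup does not find it.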
Iceberg
Iceberg does not use a HoodieKey design and does not emphasize primary keys. Updates, deletes, and merges are performed via joins, requiring an external execution engine. Iceberg is engine‑agnostic; it provides a Spark DataFrame API for writes and supports streaming writes through StreamWriteSupport, though documentation on streaming is limited.
Iceberg queries are supported by Spark and Presto. It features a “hidden partition” mechanism where users can define transformed columns (e.g., hour(timestamp)) that are used for partition pruning without appearing in the table schema. Iceberg also collects extensive column statistics (size, value count, null count, min/max) to aid query pruning. Table creation is done via an API that specifies name, schema, and partitioning, and the table is registered in a Hive catalog.
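Hidden partitioning and statistics-based pruning can be modelled with a small sketch. Nothing here is Iceberg's API: `hour_transform`, `write_files`, `plan_scan`, the row fields `ts` and `amount`, and the string partition labels are assumptions chosen for readability. The sketch derives the partition from the timestamp at write time, so the reader filters on the timestamp itself and never names a partition column.

```python
from datetime import datetime


def hour_transform(ts):
    """Derive the hidden partition value from a timestamp (toy label format)."""
    return ts.strftime("%Y-%m-%d-%H")


def write_files(rows):
    """Group rows by hour(ts) and record per-file column statistics."""
    groups = {}
    for row in rows:
        groups.setdefault(hour_transform(row["ts"]), []).append(row)
    manifest = []
    for partition, part_rows in sorted(groups.items()):
        amounts = [r["amount"] for r in part_rows if r["amount"] is not None]
        manifest.append({
            "partition": partition,
            "value_count": len(part_rows),
            "null_count": len(part_rows) - len(amounts),
            "min_amount": min(amounts) if amounts else None,
            "max_amount": max(amounts) if amounts else None,
        })
    return manifest


def plan_scan(manifest, ts_from, ts_to, min_amount=None):
    """Pick partitions that may hold rows with ts in range and amount >= min_amount."""
    lo, hi = hour_transform(ts_from), hour_transform(ts_to)
    selected = []
    for f in manifest:
        if not (lo <= f["partition"] <= hi):
            continue  # pruned by the hidden partition value
        if min_amount is not None and (
            f["max_amount"] is None or f["max_amount"] < min_amount
        ):
            continue  # pruned by min/max statistics
        selected.append(f["partition"])
    return selected
```

The query side passes plain timestamps to `plan_scan`; the transform is applied internally, which is the essence of a hidden partition. The statistics check then skips files whose maximum `amount` cannot satisfy the predicate.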
Delta
Delta Lake positions itself as a unified streaming‑and‑batch storage layer, supporting update, delete, and merge operations. Originating from Databricks, it tightly integrates with Spark; all write modes—including batch DataFrame writes, streaming writes, and SQL INSERT/INSERT OVERWRITE—are supported. Like Iceberg, Delta does not enforce primary keys; updates are implemented via Spark joins.
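To make "updates implemented via joins" concrete, the following sketch performs a merge as a full outer join on a key, in plain Python rather than Spark. The function `merge_by_join`, the `id` key, and the `_deleted` flag are hypothetical; a real engine would run the same logic as a distributed join and rewrite only the affected files.

```python
def merge_by_join(target, updates, key="id"):
    """Merge `updates` into `target` by joining both sides on `key`."""
    target_by_key = {row[key]: row for row in target}
    updates_by_key = {row[key]: row for row in updates}
    merged = []
    # A full outer join: visit every key that appears on either side.
    for k in sorted(target_by_key.keys() | updates_by_key.keys()):
        old = target_by_key.get(k)
        new = updates_by_key.get(k)
        if new is None:
            merged.append(old)   # no incoming change: keep the existing row
        elif new.get("_deleted"):
            continue             # incoming delete: drop the row
        else:
            merged.append(new)   # matched update, or unmatched insert
    return merged
```

Because no primary key or index is maintained, every merge has to compare the incoming rows with the existing data through the join. That is the trade-off relative to a key-and-BloomFilter lookup.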
Open‑source Delta currently supports Spark and Presto. Presto queries require a Spark job to first generate a symlink manifest, which Presto then reads through SymlinkTextInputFormat, although EMR adds a DeltaInputFormat to avoid this extra step. The open‑source version lacks many query‑time optimizations (e.g., hidden partitions or column stats), which Databricks’ proprietary version addresses with Data Skipping. Delta’s main advantage is its deep Spark integration, which enables multi‑hop pipelines for analytics, machine learning, and CDC, and it claims to simplify Lambda/Kappa architectures.
Conclusion
The three formats target different primary scenarios: Hudi excels at incremental upserts, Iceberg focuses on high‑performance analytics and robust data management, and Delta aims for seamless streaming‑batch processing. Each project is gradually adding the capabilities it lacks, so they may converge over time, but their distinct strengths may also keep them specialized.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
