Delta Lake 2.0, Iceberg, Hudi: A Comparative Study and the Arctic Lakehouse Service
The article reviews recent developments in data‑lake table formats—Delta Lake 2.0, Iceberg, and Hudi—examining their features, benchmark results, and ecosystem impact, and then introduces Arctic, an open‑source streaming lakehouse service built on Iceberg that aims to bridge batch‑stream gaps for enterprises.
Recent industry news highlighted the open‑source release of Delta Lake 2.0, which sparked discussion after Databricks published a performance comparison that positioned Delta against Iceberg and Hudi.
Table formats define how the files that constitute a table are organized and tracked, enabling any engine to read and write data consistently while supporting ACID guarantees and schema evolution. The three dominant open‑source formats—Delta, Iceberg, and Hudi—each have distinct histories and design goals.
Delta was initiated by Databricks in 2017 and open‑sourced in 2019. It was created to address the shortcomings of traditional data lakes in transaction handling, streaming, and BI analytics, promoting the “lakehouse” concept that unifies batch and real‑time workloads on Spark.
Iceberg, which originated at Netflix and became a top‑level Apache project in 2020, emphasizes data skipping, efficient query planning, S3‑friendly design, and robust schema evolution. It is widely adopted by Cloudera, Snowflake, StarRocks, and Amazon Athena.
Hudi began as an upsert and incremental‑processing library on Hadoop and has since evolved into a broader platform with CDC support, streaming upserts, and a self‑managing database layer.
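Hudi's upsert model is driven by write options that identify a record key (for matching rows) and a precombine field (for resolving duplicates). As a minimal sketch—the table and field names here are hypothetical, though the option keys are standard Hudi ones—a streaming upsert configuration might look like:

```
# Illustrative Hudi write options for an upsert workload
# (table/field names are hypothetical)
hoodie.table.name=orders
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.precombine.field=updated_at
hoodie.datasource.write.operation=upsert
```

Records arriving with the same `order_id` are deduplicated by the latest `updated_at` value rather than appended, which is the core behavior distinguishing Hudi's incremental model from plain append‑only lakes.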
Benchmark tests conducted by the author’s team using a Trino‑based TPC‑H workload showed Delta 2.0 to be 1.7× faster than Iceberg and 4.3× faster than Hudi under default settings. However, differences in default compression codecs (SNAPPY vs. ZSTD) and read‑target split size (32 MiB vs. 128 MiB) explained much of the gap; once these parameters were equalized across formats, the performance advantage disappeared.
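Both parameters discussed above are tunable per table rather than fixed properties of a format. As one illustrative way to align them on the Iceberg side—the table name and chosen values are assumptions for the sketch, though both property keys are genuine Iceberg table properties—one could set, in Spark SQL:

```sql
-- Align Iceberg's Parquet codec and read split target with the comparison run
-- (table name illustrative; 134217728 bytes = 128 MiB)
ALTER TABLE lakehouse.tpch.lineitem SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'snappy',
  'read.split.target-size'          = '134217728'
);
```

The broader point of the benchmark critique is exactly this: apples‑to‑apples comparisons require pinning such defaults explicitly, since each format ships different ones.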
The article then asks a strategic question: “What kind of data lake does an enterprise really need?” It argues that a lakehouse should provide a unified storage layer that supports both batch and streaming, offers engine‑agnostic access, and integrates with data‑ops practices.
To address these needs, the team released Arctic, an open‑source streaming lakehouse service built on Iceberg. Arctic offers self‑optimizing capabilities, dual compatibility with Hive and Iceberg tables, multi‑engine concurrent writes with primary‑key conflict resolution, standardized metrics, and a Thrift API for management.
Performance testing of Arctic (using a custom HTAP benchmark based on CH‑benCHmark) demonstrated merge‑on‑read latency competitive with Hudi, with sub‑minute data freshness.
The article concludes that the standardization of table formats is accelerating, and enterprises should consider adopting lakehouse technologies—such as Delta, Iceberg, or Arctic—to break the batch‑stream divide, improve data quality, and lower operational costs.
For more details, the author provides links to Arctic documentation, the GitHub repository, and an upcoming online presentation.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.