Delta Lake 2.0, Iceberg, Hudi: A Comparative Study and the Arctic Lakehouse Service
The article reviews recent developments in data‑lake table formats—Delta Lake 2.0, Iceberg, and Hudi—examining their features, benchmark results, and ecosystem impact, and then introduces Arctic, an open‑source streaming lakehouse service built on Iceberg that aims to bridge batch‑stream gaps for enterprises.
Recent industry news highlighted the open‑source release of Delta Lake 2.0, which sparked discussion after Databricks published a performance comparison that positioned Delta against Iceberg and Hudi.
Table formats define how the files that constitute a table are organized and tracked, enabling any engine to read and write data consistently while supporting ACID guarantees and schema evolution. The three dominant open‑source formats—Delta, Iceberg, and Hudi—each have distinct histories and design goals.
Delta was initiated by Databricks in 2017 and open‑sourced in 2019. It was created to address the shortcomings of traditional data lakes in transaction handling, streaming, and BI analytics, promoting the “lakehouse” concept that unifies batch and real‑time workloads on Spark.
Iceberg, which originated at Netflix and became a top‑level Apache project in 2020, emphasizes data skipping, efficient query planning, S3‑friendly design, and robust schema evolution. It is widely adopted by Cloudera, Snowflake, StarRocks, and Amazon Athena.
Hudi began as an upsert and incremental‑processing library on Hadoop and has since evolved into a broader platform with CDC support, streaming upserts, and a self‑managing database layer.
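Hudi's upsert model is driven by write options that identify a record key (for matching rows) and a precombine field (for resolving duplicates). As a minimal sketch—the table and field names here are hypothetical, though the option keys are standard Hudi ones—a streaming upsert configuration might look like:

```
# Illustrative Hudi write options for an upsert workload
# (table/field names are hypothetical)
hoodie.table.name=orders
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.precombine.field=updated_at
hoodie.datasource.write.operation=upsert
```

Records arriving with the same `order_id` are deduplicated by the latest `updated_at` value rather than appended, which is the core behavior distinguishing Hudi's incremental model from plain append‑only lakes.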
Benchmark tests conducted by the author’s team using a Trino‑based TPC‑H workload showed Delta 2.0 to be 1.7× faster than Iceberg and 4.3× faster than Hudi under default settings. However, differences in default compression codecs (SNAPPY vs. ZSTD) and read‑target split size (32 MiB vs. 128 MiB) explained much of the gap; once these parameters were equalized across formats, the performance advantage disappeared.
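Both parameters discussed above are tunable per table rather than fixed properties of a format. As one illustrative way to align them on the Iceberg side—the table name and chosen values are assumptions for the sketch, though both property keys are genuine Iceberg table properties—one could set, in Spark SQL:

```sql
-- Align Iceberg's Parquet codec and read split target with the comparison run
-- (table name illustrative; 134217728 bytes = 128 MiB)
ALTER TABLE lakehouse.tpch.lineitem SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'snappy',
  'read.split.target-size'          = '134217728'
);
```

The broader point of the benchmark critique is exactly this: apples‑to‑apples comparisons require pinning such defaults explicitly, since each format ships different ones.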
The article then asks a strategic question: “What kind of data lake does an enterprise really need?” It argues that a lakehouse should provide a unified storage layer that supports both batch and streaming, offers engine‑agnostic access, and integrates with data‑ops practices.
To address these needs, the team released Arctic, an open‑source streaming lakehouse service built on Iceberg. Arctic offers self‑optimizing capabilities, dual compatibility with Hive and Iceberg tables, multi‑engine concurrent writes with primary‑key conflict resolution, standardized metrics, and a Thrift API for management.
Performance testing of Arctic (using a custom HTAP benchmark based on CH‑benCHmark) demonstrated merge‑on‑read latency competitive with Hudi, with sub‑minute data freshness.
The article concludes that the standardization of table formats is accelerating, and enterprises should consider adopting lakehouse technologies—such as Delta, Iceberg, or Arctic—to break the batch‑stream divide, improve data quality, and lower operational costs.
For more details, the author provides links to Arctic documentation, the GitHub repository, and an upcoming online presentation.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.