
Analyzing Lakehouse Storage Systems: Metadata, Merge‑On‑Read, and Performance Optimizations for Delta Lake, Hudi, and Iceberg

This article examines the design of lakehouse storage systems by comparing Delta Lake, Apache Hudi, and Apache Iceberg, focusing on metadata management, Merge‑On‑Read mechanisms, and a series of query and write performance optimizations with real‑world EMR case studies.

DataFunTalk

The article introduces lakehouse storage systems as a new data management paradigm that combines the high performance of data warehouses with the low‑cost openness of data lakes, built around open formats such as Delta Lake, Apache Iceberg, and Apache Hudi.

It then dives into two core design aspects:

1. Metadata – Describes how each format persists schema, configuration, and the list of valid data files in the file system: Delta Lake’s newline‑delimited JSON log compacted into Parquet checkpoints, Iceberg’s three‑layer metadata (table metadata file, manifest list, manifests), and Hudi’s file‑group layout with property‑based metadata.
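As an illustration of the log‑replay idea behind Delta Lake’s metadata, the following minimal Python sketch resolves the set of valid data files from a sequence of newline‑delimited JSON commits carrying add/remove actions. The action shape is deliberately simplified; the real log also records statistics, partition values, and the periodic checkpoints mentioned above.

```python
import json

def replay_log(commits):
    """Replay ordered commit files (00000.json, 00001.json, ...) to find
    the current set of valid data files.

    Each commit is a string of newline-delimited JSON actions; an "add"
    action makes a file visible, a "remove" action retires it (e.g. after
    a rewrite). Simplified sketch of the Delta-style protocol.
    """
    valid_files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                valid_files.add(action["add"]["path"])
            elif "remove" in action:
                valid_files.discard(action["remove"]["path"])
    return valid_files

commits = [
    '{"add": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
    # A later commit rewrote part-0001 into part-0003.
    '{"remove": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0003.parquet"}}',
]
# After replay, only part-0002 and part-0003 are valid.
```

Checkpoints exist precisely because this replay grows linearly with commit count: a Parquet checkpoint materializes the replayed state so readers only replay commits after it.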

2. Merge‑On‑Read (MOR) – Explains the write‑amplification problem of Copy‑On‑Write tables and how Hudi pioneered MOR, while Delta Lake uses Deletion Vectors and Iceberg V2 provides position‑based MOR, comparing their read‑time merging strategies.
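The read‑time merging that MOR relies on can be sketched as position‑based delete filtering, the mechanism behind Iceberg V2 position‑delete files and Delta Lake’s Deletion Vectors. This is a simplified illustration of the concept, not either project’s actual reader:

```python
def scan_with_position_deletes(base_rows, deleted_positions):
    """Yield rows from a base data file, skipping deleted positions.

    base_rows: rows of the immutable base file in storage order.
    deleted_positions: set of 0-based row positions recorded separately
    (a position-delete file or deletion vector). The write path only
    appends these markers, avoiding the Copy-On-Write rewrite of the
    whole base file; the merge cost moves to read time.
    """
    for pos, row in enumerate(base_rows):
        if pos not in deleted_positions:
            yield row

base = ["alice", "bob", "carol", "dave"]
# A delete touched rows 1 and 3 without rewriting the base file.
survivors = list(scan_with_position_deletes(base, {1, 3}))
```

Compaction later folds accumulated deletes back into new base files, bounding the read‑time overhead.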

The article then presents performance optimization techniques, organized into three parts:

(1) Query Optimization – Discusses metadata loading (single‑node vs. distributed), plan optimization using statistics, adaptive query execution, and cost‑based optimization, and shows benchmark results from the LHBench suite.

(2) Plan Optimization – Highlights how Spark leverages table‑level and column‑level statistics to reorder joins, push down predicates, and apply cost models.
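A toy illustration of statistics‑driven planning: with table‑level row counts (as collected by ANALYZE TABLE), join the smallest relations first so intermediate results stay small. This greedy heuristic is far simpler than Spark’s actual cost‑based optimizer, which searches costed plans using both table‑ and column‑level statistics:

```python
def greedy_join_order(row_counts):
    """Pick a join order by estimated cardinality, smallest first.

    row_counts: {table_name: estimated_rows}. Joining small relations
    early keeps build sides cheap to hash or broadcast; a real CBO also
    uses column statistics (NDV, min/max) to estimate join selectivity.
    """
    return sorted(row_counts, key=row_counts.get)

stats = {"orders": 10_000_000, "dim_date": 3_650, "customers": 200_000}
order = greedy_join_order(stats)
# Dimension tables come first; the large fact table joins last.
```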

(3) Table Scan and Write Optimization – Covers file‑size tuning, small‑file merging, data skipping, Z‑order clustering, Bloom filters, and the impact of vectorized reads; it also describes write paths, including update handling, compaction, and table services such as cleaning and checkpointing.
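The Z‑order technique mentioned above can be sketched as bit interleaving: sorting rows by a key that interleaves the bits of several columns clusters rows that are close in any of those columns into the same files, so per‑file min/max statistics can skip files for predicates on either column. A simplified two‑column, unsigned‑integer sketch:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two column values into one Z-order key.

    Bit i of x lands at even position 2*i, bit i of y at odd position
    2*i + 1, so sorting by the key walks a Z-shaped space-filling curve
    over (x, y).
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even positions: x bits
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd positions: y bits
    return key

points = [(0, 0), (1, 1), (7, 0), (0, 7), (1, 0)]
# Sorting by the interleaved key groups nearby points together, which
# tightens per-file min/max ranges on both columns at once.
ordered = sorted(points, key=lambda p: z_order_key(*p))
```

A plain sort on one column gives perfect skipping on that column and none on the others; Z‑order trades a little of the first column’s locality for usable min/max ranges on all interleaved columns.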

Two real‑world EMR case studies are shared: (a) an EMR Manifest solution that pre‑computes partition‑level manifests to reduce metadata loading time from 90 seconds to sub‑second latency, and (b) an EMR DataLake Metastore that centralizes metadata for Hudi and Iceberg, enabling versioned snapshots and advanced pruning.
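The partition‑level manifest idea in case (a) can be sketched generically: precompute a map from partition to valid files so query planning does a direct lookup instead of replaying table metadata on every query. The function and field names here are illustrative, not the EMR API:

```python
def build_manifest(valid_files):
    """Offline step: group valid data files by partition directory.

    valid_files: paths like 'dt=2024-01-01/part-1.parquet'. The result
    is persisted per partition, so planning never replays the log.
    """
    manifest = {}
    for path in valid_files:
        partition, _, _ = path.rpartition("/")
        manifest.setdefault(partition, []).append(path)
    return manifest

def prune(manifest, wanted_partitions):
    """Planning step: one lookup per queried partition."""
    return [f for p in wanted_partitions for f in manifest.get(p, [])]

files = [
    "dt=2024-01-01/part-1.parquet",
    "dt=2024-01-01/part-2.parquet",
    "dt=2024-01-02/part-3.parquet",
]
manifest = build_manifest(files)
# A query filtered to dt=2024-01-01 touches only that partition's entry.
```

The trade‑off is freshness: the precomputed manifest must be refreshed (or invalidated) when commits land, which is why the source article pairs it with a metastore that tracks table versions.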

Finally, the article concludes with a summary of key takeaways and thanks the audience.

Tags: Performance Optimization, Big Data, Metadata, Apache Iceberg, Lakehouse, Apache Hudi, Delta Lake
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
