
Analyzing Lakehouse Storage Systems: Metadata, Merge‑On‑Read, and Performance Optimizations for Delta Lake, Hudi, and Iceberg

This article examines the design of lakehouse storage systems by comparing Delta Lake, Apache Hudi, and Apache Iceberg, focusing on metadata management, Merge‑On‑Read mechanisms, and a series of query and write performance optimizations with real‑world EMR case studies.

DataFunTalk

The article introduces lakehouse storage systems as a new data management paradigm that combines the high performance of data warehouses with the low‑cost openness of data lakes, built around open formats such as Delta Lake, Apache Iceberg, and Apache Hudi.

It then dives into two core design aspects:

1. Metadata – Describes how each format persists schema, configuration, and the list of valid data files in the file system: Delta Lake’s newline‑delimited JSON log compacted into Parquet checkpoints, Iceberg’s three‑layer metadata (table metadata file, manifest list, manifests), and Hudi’s file‑group layout with property‑based metadata.
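As an illustration of the log‑replay idea behind Delta Lake’s metadata, the following minimal Python sketch resolves the set of valid data files from a sequence of newline‑delimited JSON commits carrying add/remove actions. The action shape is deliberately simplified; the real log also records statistics, partition values, and the periodic checkpoints mentioned above.

```python
import json

def replay_log(commits):
    """Replay ordered commit files (00000.json, 00001.json, ...) to find
    the current set of valid data files.

    Each commit is a string of newline-delimited JSON actions; an "add"
    action makes a file visible, a "remove" action retires it (e.g. after
    a rewrite). Simplified sketch of the Delta-style protocol.
    """
    valid_files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                valid_files.add(action["add"]["path"])
            elif "remove" in action:
                valid_files.discard(action["remove"]["path"])
    return valid_files

commits = [
    '{"add": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
    # A later commit rewrote part-0001 into part-0003.
    '{"remove": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0003.parquet"}}',
]
# After replay, only part-0002 and part-0003 are valid.
```

Checkpoints exist precisely because this replay grows linearly with commit count: a Parquet checkpoint materializes the replayed state so readers only replay commits after it.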

2. Merge‑On‑Read (MOR) – Explains the write‑amplification problem of Copy‑On‑Write tables and how Hudi pioneered MOR, while Delta Lake uses Deletion Vectors and Iceberg V2 provides position‑based MOR, comparing their read‑time merging strategies.
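The read‑time merging that MOR relies on can be sketched as position‑based delete filtering, the mechanism behind Iceberg V2 position‑delete files and Delta Lake’s Deletion Vectors. This is a simplified illustration of the concept, not either project’s actual reader:

```python
def scan_with_position_deletes(base_rows, deleted_positions):
    """Yield rows from a base data file, skipping deleted positions.

    base_rows: rows of the immutable base file in storage order.
    deleted_positions: set of 0-based row positions recorded separately
    (a position-delete file or deletion vector). The write path only
    appends these markers, avoiding the Copy-On-Write rewrite of the
    whole base file; the merge cost moves to read time.
    """
    for pos, row in enumerate(base_rows):
        if pos not in deleted_positions:
            yield row

base = ["alice", "bob", "carol", "dave"]
# A delete touched rows 1 and 3 without rewriting the base file.
survivors = list(scan_with_position_deletes(base, {1, 3}))
```

Compaction later folds accumulated deletes back into new base files, bounding the read‑time overhead.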

The article then presents performance optimization techniques, organized into three parts:

(1) Query Optimization – Discusses metadata loading (single‑node vs. distributed), plan optimization using statistics, adaptive query execution, and cost‑based optimization, and shows benchmark results from the LHBench suite.

(2) Plan Optimization – Highlights how Spark leverages table‑level and column‑level statistics to reorder joins, push down predicates, and apply cost models.
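A toy illustration of statistics‑driven planning: with table‑level row counts (as collected by ANALYZE TABLE), join the smallest relations first so intermediate results stay small. This greedy heuristic is far simpler than Spark’s actual cost‑based optimizer, which searches costed plans using both table‑ and column‑level statistics:

```python
def greedy_join_order(row_counts):
    """Pick a join order by estimated cardinality, smallest first.

    row_counts: {table_name: estimated_rows}. Joining small relations
    early keeps build sides cheap to hash or broadcast; a real CBO also
    uses column statistics (NDV, min/max) to estimate join selectivity.
    """
    return sorted(row_counts, key=row_counts.get)

stats = {"orders": 10_000_000, "dim_date": 3_650, "customers": 200_000}
order = greedy_join_order(stats)
# Dimension tables come first; the large fact table joins last.
```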

(3) Table Scan and Write Optimization – Covers file‑size tuning, small‑file merging, data skipping, Z‑order clustering, Bloom filters, and the impact of vectorized reads; it also describes write paths, including update handling, compaction, and table services such as cleaning and checkpointing.
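The Z‑order technique mentioned above can be sketched as bit interleaving: sorting rows by a key that interleaves the bits of several columns clusters rows that are close in any of those columns into the same files, so per‑file min/max statistics can skip files for predicates on either column. A simplified two‑column, unsigned‑integer sketch:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two column values into one Z-order key.

    Bit i of x lands at even position 2*i, bit i of y at odd position
    2*i + 1, so sorting by the key walks a Z-shaped space-filling curve
    over (x, y).
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even positions: x bits
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd positions: y bits
    return key

points = [(0, 0), (1, 1), (7, 0), (0, 7), (1, 0)]
# Sorting by the interleaved key groups nearby points together, which
# tightens per-file min/max ranges on both columns at once.
ordered = sorted(points, key=lambda p: z_order_key(*p))
```

A plain sort on one column gives perfect skipping on that column and none on the others; Z‑order trades a little of the first column’s locality for usable min/max ranges on all interleaved columns.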

Two real‑world EMR case studies are shared: (a) an EMR Manifest solution that pre‑computes partition‑level manifests to reduce metadata loading time from 90 seconds to sub‑second latency, and (b) an EMR DataLake Metastore that centralizes metadata for Hudi and Iceberg, enabling versioned snapshots and advanced pruning.
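The partition‑level manifest idea in case (a) can be sketched generically: precompute a map from partition to valid files so query planning does a direct lookup instead of replaying table metadata on every query. The function and field names here are illustrative, not the EMR API:

```python
def build_manifest(valid_files):
    """Offline step: group valid data files by partition directory.

    valid_files: paths like 'dt=2024-01-01/part-1.parquet'. The result
    is persisted per partition, so planning never replays the log.
    """
    manifest = {}
    for path in valid_files:
        partition, _, _ = path.rpartition("/")
        manifest.setdefault(partition, []).append(path)
    return manifest

def prune(manifest, wanted_partitions):
    """Planning step: one lookup per queried partition."""
    return [f for p in wanted_partitions for f in manifest.get(p, [])]

files = [
    "dt=2024-01-01/part-1.parquet",
    "dt=2024-01-01/part-2.parquet",
    "dt=2024-01-02/part-3.parquet",
]
manifest = build_manifest(files)
# A query filtered to dt=2024-01-01 touches only that partition's entry.
```

The trade‑off is freshness: the precomputed manifest must be refreshed (or invalidated) when commits land, which is why the source article pairs it with a metastore that tracks table versions.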

Finally, the article concludes with a summary of key takeaways and thanks the audience.

Tags: Performance Optimization, Big Data, Metadata, Apache Iceberg, Lakehouse, Apache Hudi, Delta Lake
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
