Key Requirements for Building PB‑Scale Data Lakes and How Apache Hudi Meets Them
The article outlines the essential requirements for constructing petabyte‑scale data lakes—such as incremental CDC ingestion, log deduplication, storage management, ACID transactions, fast ETL, and compliance—and explains how Apache Hudi's Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) storage types, asynchronous compaction, and advanced features address each need.
A recent talk by Nishith Agarwal, Hudi PMC member and Senior Engineering Manager at Uber, introduced the fundamentals of data lakes: centralized repositories that store raw structured and unstructured data at any scale and support dashboards, big‑data analytics, real‑time analysis, and machine learning.
The talk highlighted six critical requirements for building petabyte‑scale data lakes:
Incremental ingestion (CDC) to avoid costly full‑load backups.
Log‑event deduplication to ensure correctness of massive time‑series streams.
Storage management that mitigates small‑file problems and balances write latency.
Transactional (ACID) writes for atomic upserts, rollbacks, and repeatable reads.
Fast derivation/ETL of incremental data to keep downstream tables fresh.
Legal compliance and efficient delete/update capabilities for data‑retention regulations.
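Of these, log‑event deduplication is the most mechanical: keep, for each record key, only the event with the latest ordering field. A minimal sketch of that idea (field names like "id" and "ts" are illustrative; Hudi itself does this with its record key and precombine field):

```python
# Minimal log-deduplication sketch: for each record key, keep only the event
# with the highest ordering field ("ts"). Field names are illustrative, not
# Hudi's API; Hudi does this via its record key and precombine field.
def deduplicate(events):
    latest = {}
    for event in events:
        key = event["id"]
        if key not in latest or event["ts"] > latest[key]["ts"]:
            latest[key] = event
    return list(latest.values())

events = [
    {"id": "a", "ts": 1, "value": "old"},
    {"id": "a", "ts": 2, "value": "new"},
    {"id": "b", "ts": 1, "value": "only"},
]
deduped = deduplicate(events)
```

Applied to a stream with duplicate keys, only the newest copy per key survives, which is what makes massive time‑series streams safe to replay.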
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is presented as a system that satisfies all these requirements. Hudi supports three query views—incremental, snapshot, and real‑time—and offers two storage types: Copy‑On‑Write (COW) and Merge‑On‑Read (MOR).
In COW, Hudi writes data to Parquet files and uses a commit timeline with an "inflight" marker to guarantee atomic commits. While COW provides strong read performance, heavy cross‑partition updates can cause write amplification and latency.
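The commit protocol described above can be sketched as a two-phase state flip: files are written while the instant is marked inflight, and a single state change makes them visible. The class below is a simplified illustration of that idea, not Hudi's actual timeline code:

```python
# Simplified sketch of a COW commit timeline: a writer registers an
# "inflight" instant, writes data files, then atomically completes the
# instant. Readers only see files belonging to completed commits.
# Illustrative only; not Hudi internals.
class Timeline:
    def __init__(self):
        self.instants = {}   # timestamp -> state ("inflight" or "commit")
        self.files = {}      # timestamp -> data files written at that instant

    def begin(self, ts, files):
        self.instants[ts] = "inflight"
        self.files[ts] = files          # files exist but are not yet visible

    def commit(self, ts):
        self.instants[ts] = "commit"    # single atomic state flip

    def visible_files(self):
        return [f for ts, state in self.instants.items()
                if state == "commit" for f in self.files[ts]]

tl = Timeline()
tl.begin("20240101T000000", ["part-0001.parquet"])
assert tl.visible_files() == []         # inflight data is invisible to readers
tl.commit("20240101T000000")
assert tl.visible_files() == ["part-0001.parquet"]
```

A crash before the commit flip leaves only an inflight marker behind, so readers never observe a half-written commit.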
MOR groups updates into separate delta files, allowing the real‑time view to merge base Parquet data with these deltas at query time, thus reducing ingestion latency at the cost of slightly more complex reads.
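The query-time merge in the real-time view amounts to overlaying delta records on base records, with the newer delta copy winning per key. A minimal sketch (plain dictionaries stand in for Parquet base files and Avro delta logs):

```python
# Sketch of a MOR real-time view: merge base-file records with delta-log
# updates at query time; the delta (newer) record wins per key. Structures
# are illustrative; Hudi's base files are Parquet and deltas are log files.
def realtime_view(base_records, delta_records):
    merged = {r["id"]: r for r in base_records}
    merged.update({r["id"]: r for r in delta_records})  # deltas override base
    return list(merged.values())

base = [{"id": 1, "fare": 10}, {"id": 2, "fare": 20}]
delta = [{"id": 2, "fare": 25}, {"id": 3, "fare": 30}]
view = realtime_view(base, delta)
```

This is the source of MOR's trade-off: writes only append small deltas, but every real-time read pays this merge cost until compaction folds the deltas back in.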
To further reduce write amplification, Hudi introduces asynchronous compaction, which merges delta files into base files in the background while ingestion continues.
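Conceptually, compaction takes a base file plus its accumulated deltas and emits a fresh base file with an empty delta log, so subsequent reads no longer pay the merge cost. A hedged sketch of that step (illustrative structures, not Hudi's compactor):

```python
# Sketch of compaction: fold accumulated delta records into the base file,
# emitting a new base file and an empty delta log. Hudi schedules this
# asynchronously so ingestion can keep appending deltas meanwhile.
# Illustrative only.
def compact(base_records, delta_records):
    merged = {r["id"]: r for r in base_records}
    merged.update({r["id"]: r for r in delta_records})  # newest copy wins
    return list(merged.values()), []   # (new base file, cleared delta log)

new_base, new_deltas = compact(
    [{"id": 1, "v": "old"}],
    [{"id": 1, "v": "new"}, {"id": 2, "v": "x"}],
)
```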
Hudi also brings ACID guarantees to large‑scale data processing, maintaining a hidden .hoodie folder with monotonically increasing timestamps, supporting atomic multi‑row, multi‑partition commits, MVCC isolation, and automatic rollback of partial failures.
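The rollback half of those guarantees can be sketched as a recovery pass over the timeline: any instant still marked inflight is a partial failure, so its files are discarded and the instant erased, restoring the last consistent snapshot. Names below are illustrative, not Hudi internals:

```python
# Sketch of automatic rollback on recovery: instants still in the "inflight"
# state are partial failures; their data files are discarded and the instant
# removed from the timeline. Illustrative only, not Hudi's recovery code.
def rollback_failed(instants, files):
    for ts in [t for t, state in instants.items() if state == "inflight"]:
        files.pop(ts, None)   # drop partially written data files
        instants.pop(ts)      # erase the failed instant from the timeline
    return instants, files

instants = {"001": "commit", "002": "inflight"}
files = {"001": ["a.parquet"], "002": ["b.parquet"]}
rollback_failed(instants, files)
```

After the pass, only the committed instant and its files remain, which is what makes partially failed writes invisible to readers.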
At Uber, Hudi manages over 150 PB of data across more than 10 000 tables, ingesting 5 × 10¹¹ records daily. Its upsert primitives enable sub‑5‑minute data freshness for dashboards and support incremental views for handling data spikes.
Developer‑friendly tools include HoodieDeltaStreamer for near‑real‑time ingestion from DFS/Kafka, Spark datasource integration, and PySpark support. Advanced features such as savepoints, file versioning, and time‑travel queries (e.g., via the hoodie.table_name.consume.end.timestamp option) enhance reliability and auditability.
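As a hedged sketch of the Spark datasource path, the options below use standard Hudi config keys, but the table name, record‑key and precombine fields, and target path are made‑up examples:

```python
# Hypothetical upsert write via the Hudi Spark datasource. The option keys
# are standard Hudi configs; "trips", "trip_id", "event_ts", and the path
# are made-up examples.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
}

# In a live Spark session (not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/trips")
```

The record‑key and precombine fields are what drive the upsert and deduplication behavior described earlier.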
Upcoming Hudi 0.6.0 will improve bulk import of existing Parquet tables, add O(1) query planning, column indexes, unified metadata management, row‑level indexing, async clustering, Presto snapshot queries for MOR tables, and Flink integration.
The talk concluded with an invitation to explore these capabilities further.