Apache Hudi: Architecture, Uber’s Use Cases, Improvements, and Future Roadmap
This article explains the design of Apache Hudi, its core concepts such as upserts and incremental pulls, how Uber leverages it for large‑scale data‑lake operations, the enhancements made over time, and the project’s future plans within the Apache ecosystem.
From ensuring accurate estimated arrival times to predicting optimal travel routes, Uber needed a reliable, high‑performance, large‑scale data storage and analytics solution. In 2016 Uber built the incremental processing framework Apache Hudi, open‑sourced it a year later, and donated it to the Apache Software Foundation in 2019, where it later graduated to a top‑level project.
Apache Hudi is a storage‑abstraction framework that helps organizations build and manage petabyte‑scale data lakes using primitives such as upserts and incremental pulls, bringing stream‑processing capabilities to batch‑style big data and running on HDFS or cloud storage.
Hudi provides ACID semantics on data lakes. Its two most widely used features—upserts and incremental pulls—allow users to capture change data and apply it to the lake, supported by pluggable indexing mechanisms and multiple query engines like Presto, Hive, Spark, and Impala.
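The upsert primitive can be pictured as a key‑to‑record index applied to a batch of change records. The following is a minimal toy model of that idea in Python, not the real Hudi API; the function and field names are illustrative assumptions.

```python
# Toy model of Hudi-style upserts (illustrative, not Hudi's real API):
# a record-key index maps each key to its current row, so each incoming
# change record either overwrites an existing row or inserts a new one.

def upsert(table: dict, changes: list[dict], key: str = "id") -> dict:
    """Apply a batch of change records to the keyed table in place."""
    for record in changes:
        table[record[key]] = record  # insert new key or overwrite old value
    return table

trips = {1: {"id": 1, "fare": 10.0}}
upsert(trips, [{"id": 1, "fare": 12.5},   # update: key 1 already exists
               {"id": 2, "fare": 7.0}])   # insert: key 2 is new
```

In real Hudi deployments the index lookup is pluggable (for example, Bloom‑filter or HBase‑backed indexes), which is what makes record‑level updates feasible at data‑lake scale.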
Figure 1. Apache Hudi provides different table views to ingest change logs, events, and incremental streams for various application scenarios.
Conceptually, Hudi consists of three main components: the raw data being stored, index data that enables upserts, and metadata that manages the dataset. Internally it maintains a timeline of actions (called instants) that gives atomic, time‑consistent views of the table and supports real‑time, read‑optimized, and incremental query views.
Data is organized under a base path with partitions, file groups, and file slices containing Parquet data files and accompanying log files. Hudi uses MVCC; compaction merges logs into new Parquet files, while cleaning removes obsolete slices to reclaim space.
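The MVCC layout above can be sketched as a base file plus a log of later updates, with compaction merging the two into a new file slice. This is a simplified conceptual model; the names are assumptions, not Hudi internals.

```python
# Simplified sketch of a file slice under MVCC (assumed names, not Hudi
# internals): a base Parquet-like file holds committed rows keyed by
# record key, a log file holds subsequent updates, and compaction merges
# the log into a new base file, producing the next file slice.

def compact(base: dict, log: list[tuple[str, dict]]) -> dict:
    """Merge log entries into a copy of the base file; newest entry wins."""
    merged = dict(base)  # leave the old slice intact for readers (MVCC)
    for record_key, row in log:
        merged[record_key] = row
    return merged  # the base file of the new file slice

base_file = {"t1": {"city": "SF"}, "t2": {"city": "NY"}}
log_file = [("t2", {"city": "LA"}),   # update to an existing record
            ("t3", {"city": "SEA"})]  # newly inserted record
new_base = compact(base_file, log_file)
```

Because the old base file is left untouched, in‑flight queries keep a consistent view until cleaning eventually removes the obsolete slice.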
Hudi supports two table types: Copy‑on‑Write (COW) which stores data only in columnar files and rewrites whole files on updates, and Merge‑on‑Read (MOR) which stores a combination of columnar (Parquet) and row‑based (Avro) files, applying updates incrementally and compacting them asynchronously.
It also offers snapshot queries (reading the latest consistent view) and incremental queries (reading changes since a given commit), with COW providing fast column‑file performance and MOR delivering near‑real‑time data through dynamic merging.
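The difference between the two query modes can be shown with a toy commit timeline, where each record carries the commit time that last wrote it. This is an illustrative sketch; the field name `_commit_time` mirrors Hudi's commit‑time metadata but the code is not Hudi's implementation.

```python
# Sketch of snapshot vs. incremental reads over a commit timeline
# (illustrative only): each record carries the commit time that wrote
# it; a snapshot query returns the latest state of every record, while
# an incremental query returns only records changed after a given commit.

def snapshot_query(table: dict) -> list[dict]:
    """Latest consistent view: the current version of every record."""
    return list(table.values())

def incremental_query(table: dict, since_commit: str) -> list[dict]:
    """Only records committed strictly after `since_commit`."""
    return [r for r in table.values() if r["_commit_time"] > since_commit]

table = {
    "a": {"key": "a", "val": 1, "_commit_time": "20190101"},
    "b": {"key": "b", "val": 2, "_commit_time": "20190301"},
}
changed = incremental_query(table, since_commit="20190201")  # only "b"
```

This is what lets downstream pipelines pull only the delta since their last run instead of rescanning the whole table.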
At Uber, Hudi is used in many scenarios—from providing fast, accurate trip data for the ride‑hailing platform to fraud detection and restaurant recommendations on UberEats. Before Hudi, large Spark jobs rewrote entire datasets (over 20 TB with 1,000 executors) every few minutes, which was inefficient and hard to scale.
In late 2016 Uber built the first generation of Hudi using the COW table type, processing roughly 20 GB per job and reducing I/O and write amplification by a factor of 100; by the end of 2017 all of Uber's raw tables had migrated to Hudi, forming one of the world's largest transactional data lakes.
Figure 2. Hudi’s copy‑on‑write capability enables file‑level updates, dramatically improving data freshness.
As Uber’s data volume grew, the limitations of COW (high write amplification, large rewrites, and HDFS NameNode pressure) became apparent. To address this, Uber introduced the MOR table type, deploying three jobs: an ingest job for inserts/updates/deletes, a minor compaction job that asynchronously merges recent updates, and a major compaction job that slowly compacts older partitions. This model delivers fresh columnar data for thousands of queries while keeping merge costs low.
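The split between minor and major compaction amounts to treating partitions differently by age: recent, hot partitions are compacted frequently, older ones only occasionally. A hypothetical sketch of that scheduling decision follows; the function name and the seven‑day threshold are assumptions for illustration, not Uber's actual policy.

```python
# Hypothetical sketch of splitting partitions between minor (frequent)
# and major (slow, background) compaction by partition age. Names and
# the threshold are assumptions, not Uber's actual configuration.

from datetime import date, timedelta

def plan_compaction(partitions: dict[str, date], today: date,
                    recent_days: int = 7) -> tuple[list[str], list[str]]:
    """Split partitions into minor (recent) and major (older) compaction sets."""
    minor, major = [], []
    for name, partition_day in partitions.items():
        if today - partition_day <= timedelta(days=recent_days):
            minor.append(name)  # hot data: compact often for fresh columnar reads
        else:
            major.append(name)  # cold data: compact slowly in the background
    return minor, major

parts = {"2019-06-30": date(2019, 6, 30), "2019-01-01": date(2019, 1, 1)}
minor, major = plan_compaction(parts, today=date(2019, 7, 1))
```

Keeping recent partitions columnar is what lets thousands of interactive queries avoid paying the row‑to‑column merge cost at read time.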
Figure 3. Uber's Hudi team devised a compaction strategy for MOR tables that frequently converts recent partitions to columnar storage, reducing query‑side compute cost.
Today Hudi powers over 150 PB of data at Uber, ingesting more than 500 billion (5 × 10¹¹) records daily and serving over 1 million queries per week across 10,000+ tables and thousands of pipelines.
When Uber open‑sourced Hudi in 2017, the team highlighted several key considerations: improving storage and processing efficiency, ensuring high‑quality tables in the lake, delivering low‑latency data at scale, and providing a unified service layer for minute‑level use cases.
Without proper standardization and primitives, a data lake quickly turns into a “data swamp,” requiring costly cleanup and complex algorithms. Hudi addresses these challenges by seamlessly ingesting and managing large analytical datasets on distributed file systems.
Beyond Uber, many companies—including Alibaba Cloud, Tencent Cloud, AWS, and Udemy—have adopted Apache Hudi in production.
Future plans include contributing new features such as intelligent metadata for column indexes and O(1) query planning, efficient Parquet table bootstrapping, and record‑level indexing to accelerate inserts, as well as further optimizations for storage management and query performance based on access patterns.
As Hudi graduates to a top‑level Apache project, the community looks forward to a roadmap that enhances speed, reliability, and transactional capabilities for data lakes, enabling richer, portable data applications.
Apache Hudi is an evolving open‑source project; contributions are welcome.
Past Recommendations
Data Lake | A Complete Guide to Concepts, Features, Architecture, and Cases
Announcement! Apache Software Foundation Announces Apache Hudi as a Top‑Level Project
Understanding Delta Lake: Evolving Data Warehouses into Data Lakes
Kafka Core Knowledge Summary with Mind Map
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies