Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management
This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.
1. Hudi Basic Concepts
Apache Hudi (pronounced “Hoodie”) adds two streaming primitives on top of datasets stored on a distributed file system (DFS): upsert (INSERT/UPDATE) and incremental pull. Together they enable efficient record-level updates and change-capture queries.
Timeline
Hudi maintains a timeline that records every operation on the dataset with three key attributes: operation type, instant time (a monotonically increasing timestamp), and state. The timeline guarantees atomicity and consistency of operations.
COMMITS – atomic write of a batch of records.
CLEANS – background removal of obsolete file versions.
DELTA_COMMIT – incremental commit for Merge‑On‑Read tables.
COMPACTION – merges log files into columnar files.
ROLLBACK – records that a failed commit or delta commit was rolled back, removing any partial files it produced.
SAVEPOINT – marks file groups to protect them from cleaning.
Each instant can be in one of three states: REQUESTED, INFLIGHT, or COMPLETED.
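The timeline model above can be sketched in a few lines of plain Java. This is a toy illustration under stated assumptions, not Hudi's API: `TimelineSketch`, `Instant`, `completedTimes`, and `latestCompleted` are hypothetical names, and instant times are assumed to be `yyyyMMddHHmmss` strings so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Toy model of a Hudi-style timeline. Class and method names are
// illustrative assumptions, not Hudi's API.
final class TimelineSketch {
    enum Action { COMMIT, DELTA_COMMIT, CLEAN, COMPACTION, ROLLBACK, SAVEPOINT }
    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static final class Instant {
        final Action action;
        final String time;  // "yyyyMMddHHmmss" strings sort chronologically
        final State state;

        Instant(Action action, String time, State state) {
            this.action = action;
            this.time = time;
            this.state = state;
        }
    }

    private final List<Instant> instants = new ArrayList<>();

    void add(Action action, String time, State state) {
        instants.add(new Instant(action, time, state));
    }

    // Instant times of completed actions, oldest first.
    List<String> completedTimes() {
        return instants.stream()
                .filter(i -> i.state == State.COMPLETED)
                .map(i -> i.time)
                .sorted()
                .collect(Collectors.toList());
    }

    // Newest completed instant time, or null when nothing has completed yet.
    String latestCompleted() {
        List<String> done = completedTimes();
        return done.isEmpty() ? null : done.get(done.size() - 1);
    }

    public static void main(String[] args) {
        TimelineSketch timeline = new TimelineSketch();
        timeline.add(Action.COMMIT, "20230101000000", State.COMPLETED);
        timeline.add(Action.COMMIT, "20230102000000", State.COMPLETED);
        timeline.add(Action.COMMIT, "20230103000000", State.INFLIGHT);
        // The in-flight commit is not yet visible to readers.
        System.out.println(timeline.latestCompleted());
    }
}
```

Filtering on the COMPLETED state is what gives readers an atomic view in this sketch: a batch that is still REQUESTED or INFLIGHT never shows up in a query.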
File Organization
Datasets are partitioned into directories, much like Hive tables. Within each partition, files are organized into file groups, each identified by a unique file ID. A file group holds one or more file slices; each slice consists of a columnar base file (Parquet) written at a given commit plus zero or more log files that capture incremental updates to that base file.
Storage Types and Views
Hudi supports two storage types:
Copy‑On‑Write (COW): data is stored only in columnar base files; each commit that updates records rewrites the affected files.
Merge‑On‑Read (MOR): updates are appended to row-based log files (Avro) that are later compacted into columnar files, providing near‑real‑time reads.
Corresponding query views are:
Read‑Optimized View – reads the latest snapshot of base files.
Incremental View – reads only data written after a given instant.
Realtime View – merges base and log files on‑the‑fly for low‑latency access.
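To make the difference between the three views concrete, here is a minimal sketch of a single Merge-On-Read file group in plain Java. Everything in it is an assumption for illustration: `FileGroupViews`, `LogEntry`, and the method names are hypothetical, maps stand in for Parquet and Avro files, and the incremental view only scans the log, which is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of one Merge-On-Read file group; not Hudi's API.
final class FileGroupViews {
    // One logged change: the value written for a key at a given commit time.
    static final class LogEntry {
        final String commitTime;
        final String key;
        final String value;

        LogEntry(String commitTime, String key, String value) {
            this.commitTime = commitTime;
            this.key = key;
            this.value = value;
        }
    }

    final Map<String, String> baseFile = new LinkedHashMap<>(); // stands in for the Parquet base file
    final List<LogEntry> logFile = new ArrayList<>();           // stands in for the Avro log file

    void appendToLog(String commitTime, String key, String value) {
        logFile.add(new LogEntry(commitTime, key, value));
    }

    // Read-optimized view: the base file only, log entries are ignored.
    Map<String, String> readOptimized() {
        return new LinkedHashMap<>(baseFile);
    }

    // Realtime view: base file merged with the log at read time.
    Map<String, String> realtime() {
        Map<String, String> merged = new LinkedHashMap<>(baseFile);
        for (LogEntry e : logFile) {
            merged.put(e.key, e.value);
        }
        return merged;
    }

    // Incremental view: only changes written strictly after beginInstant.
    Map<String, String> incremental(String beginInstant) {
        Map<String, String> changed = new LinkedHashMap<>();
        for (LogEntry e : logFile) {
            if (e.commitTime.compareTo(beginInstant) > 0) {
                changed.put(e.key, e.value);
            }
        }
        return changed;
    }

    // Compaction: fold the log into the base file and start a fresh log.
    void compact() {
        for (LogEntry e : logFile) {
            baseFile.put(e.key, e.value);
        }
        logFile.clear();
    }
}
```

In this model the read-optimized view lags behind the realtime view until `compact()` runs, which is the trade-off between query cost and freshness that the two views represent.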
2. Writing to Hudi
Write Operations
Three primary write operations are supported:
UPSERT – default; records are upserted based on the record key.
INSERT – skips index lookup; useful when duplicate elimination is not required.
BULK_INSERT – optimized for initial bulk loads; writes are sorted for scalability.
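The difference between UPSERT and INSERT can be sketched as follows. This is a simplified model, not Hudi's write path: `WriteOpsSketch`, `Rec`, and `preCombine` are hypothetical names, a map keyed by record key stands in for the index plus storage, and the rule "largest ordering value wins within a batch, then the batch overwrites stored rows" is an assumption modeled loosely on an overwrite-with-latest payload.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy comparison of upsert and insert semantics; not Hudi's API.
final class WriteOpsSketch {
    static final class Rec {
        final String key;
        final long orderingValue; // plays the role of the precombine field
        final String value;

        Rec(String key, long orderingValue, String value) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.value = value;
        }
    }

    // Keep one record per key in the batch: the one with the largest ordering value.
    static Map<String, Rec> preCombine(List<Rec> batch) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : batch) {
            Rec seen = latest.get(r.key);
            if (seen == null || r.orderingValue > seen.orderingValue) {
                latest.put(r.key, r);
            }
        }
        return latest;
    }

    // UPSERT: deduplicate the batch, then overwrite or add rows by record key.
    static void upsert(Map<String, Rec> table, List<Rec> batch) {
        table.putAll(preCombine(batch));
    }

    // INSERT: no key lookup, so duplicate keys are simply appended.
    static void insert(List<Rec> table, List<Rec> batch) {
        table.addAll(batch);
    }

    public static void main(String[] args) {
        List<Rec> batch = new ArrayList<>();
        batch.add(new Rec("k1", 5, "a"));
        batch.add(new Rec("k1", 3, "b"));

        Map<String, Rec> upserted = new LinkedHashMap<>();
        upsert(upserted, batch);
        List<Rec> inserted = new ArrayList<>();
        insert(inserted, batch);
        System.out.println(upserted.size() + " row after upsert, " + inserted.size() + " rows after insert");
    }
}
```

The sketch shows why INSERT is cheaper (no lookup per key) and why it is only safe when duplicates are acceptable or known not to occur.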
DeltaStreamer
The HoodieDeltaStreamer utility ingests data from sources such as Kafka, files on DFS, or Sqoop incremental imports, and accepts JSON, Avro, or custom record types as input. Example command‑line options:
<code style="padding:16px;display:block;">[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar \
--op UPSERT \
--payload-class org.apache.hudi.OverwriteWithLatestAvroPayload \
--source-class org.apache.hudi.utilities.sources.JsonDFSSource \
--target-base-path file:///tmp/hudi-deltastreamer-op \
--target-table my_hudi_table</code>
DeltaStreamer can also read from Kafka and supports schema registry integration.
Datasource Writer
Using the Spark DataSource API, a DataFrame can be written to Hudi as follows:
<code style="padding:16px;display:block;">inputDF.write()
.format("org.apache.hudi")
.options(clientOpts)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
</code>
Hive Sync
Both DeltaStreamer and the DataSource writer can sync the Hudi table schema to Hive Metastore using the HiveSyncTool:
<code style="padding:16px;display:block;">cd hudi-hive
./run_sync_tool.sh --base-path /path/to/hudi --database default --table hudi_tbl --user hive_user --pass hive_pass --jdbc-url jdbc:hive2://host:10000</code>
Delete Operations
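Hudi supports two kinds of deletes. A soft delete keeps the record key and sets the other fields to null through a normal upsert. A hard delete removes the record entirely and is issued with a custom payload such as org.apache.hudi.EmptyHoodieRecordPayload: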
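<code style="padding:16px;display:block;">// deleteDF holds only the records to delete
deleteDF.write()
.format("org.apache.hudi")
// also set the record key, partition path, precombine field and table name, as for a normal write
.option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), "org.apache.hudi.EmptyHoodieRecordPayload")
.mode(SaveMode.Append) // Spark's default save mode fails when the target path already exists
.save(basePath);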
</code>
3. Querying Hudi
Hive
To query Hudi tables in Hive, add hudi-hadoop-mr-bundle-*.jar to HiveServer2’s auxiliary classpath and set:
<code style="padding:16px;display:block;">set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
</code>
Hive provides read‑optimized, realtime, and incremental views via the corresponding Hudi input formats.
Spark
Spark can access Hudi via the DataSource API or as a Hive table. For the read‑optimized view, set a path filter:
<code style="padding:16px;display:block;">// Scala: the path filter makes Spark SQL read only the latest base files of a Hive-registered table
spark.sparkContext.hadoopConfiguration.setClass(
"mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter])
// Scala: read the same view through the DataSource API
val hoodieROViewDF = spark.read
.format("org.apache.hudi")
.load("/path/to/hudi/table")
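</code>
The realtime view requires turning off Spark's built‑in Parquet conversion for Hive Metastore tables, so that Spark reads the table through the Hive SerDe and the Hudi input format: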
<code style="padding:16px;display:block;">spark.sql.hive.convertMetastoreParquet = false;</code>
Incremental queries are expressed with options:
<code style="padding:16px;display:block;">Dataset<Row> incDF = spark.read()
.format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(), DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20230101000000")
.load(basePath);
</code>
Presto
Place hudi-presto-bundle.jar in Presto’s plugin/hive-hadoop2 directory to query the read‑optimized view directly.
4. Hudi Management & Operations
Storage Management
Hudi mitigates small‑file problems by grouping inserts, configuring cleaners, and tuning base‑file and log‑file sizes. Compaction (inline or asynchronous) merges log files into column files for MOR tables.
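The idea of "grouping inserts" to avoid small files can be sketched as a simple packing step: route new data to existing files that are still small before opening new ones. The code below is an illustration under stated assumptions, not Hudi's partitioner; `SmallFilePacking` and `assign` are hypothetical names, and the two thresholds only loosely correspond to settings such as hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy packing of incoming bytes into small existing files first; not Hudi's API.
final class SmallFilePacking {
    // Returns fileId -> bytes assigned. New files get the hypothetical ids "new-1", "new-2", ...
    static Map<String, Long> assign(Map<String, Long> existingSizes, long incomingBytes,
                                    long smallFileLimit, long maxFileSize) {
        if (maxFileSize <= 0) {
            throw new IllegalArgumentException("maxFileSize must be positive");
        }
        Map<String, Long> plan = new LinkedHashMap<>();
        long remaining = incomingBytes;

        // First top up files that are below the small-file threshold.
        for (Map.Entry<String, Long> file : existingSizes.entrySet()) {
            if (remaining <= 0) {
                break;
            }
            long size = file.getValue();
            if (size < smallFileLimit && size < maxFileSize) {
                long share = Math.min(maxFileSize - size, remaining);
                plan.put(file.getKey(), share);
                remaining -= share;
            }
        }

        // Whatever is left goes into new files of at most maxFileSize each.
        int next = 1;
        while (remaining > 0) {
            long share = Math.min(maxFileSize, remaining);
            plan.put("new-" + next++, share);
            remaining -= share;
        }
        return plan;
    }
}
```

For example, with existing files of 100 and 20 units, a small-file limit of 50, and a maximum size of 120, an incoming batch of 250 units first fills the 20-unit file up to 120 and then spills into two new files.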
Indexing
Indexes map record keys to file groups, accelerating upserts. Supported indexes include:
HoodieBloomIndex (default)
HoodieGlobalBloomIndex
HBaseIndex
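A Bloom-filter-based index can be pictured as one small filter per base file: a lookup asks every filter whether it might contain the key and only opens the files that say yes. The sketch below is a minimal, self-contained illustration of that technique, not Hudi's implementation; `BloomIndexSketch`, `register`, and `candidateFiles` are hypothetical names, and the double-hashing scheme is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy per-file Bloom filter index; not Hudi's API.
final class BloomIndexSketch {
    static final class Bloom {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        Bloom(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        // i-th bit position for a key, derived from two base hashes (double hashing).
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1;
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < hashes; i++) {
                bits.set(index(key, i));
            }
        }

        // False means "definitely absent"; true means "possibly present".
        boolean mightContain(String key) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final Map<String, Bloom> filterByFileId = new LinkedHashMap<>();

    // Build a filter for one file from the record keys it stores.
    void register(String fileId, List<String> keys) {
        Bloom bloom = new Bloom(1024, 3);
        for (String key : keys) {
            bloom.add(key);
        }
        filterByFileId.put(fileId, bloom);
    }

    // Files that may hold the key; each candidate still has to be checked against real keys.
    List<String> candidateFiles(String key) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Bloom> e : filterByFileId.entrySet()) {
            if (e.getValue().mightContain(key)) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }
}
```

A Bloom filter never reports a stored key as absent, so no update is missed; it can report false positives, which is why the candidates are only a shortlist.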
Cleaner
The cleaner removes obsolete file versions after commits. The retention can be tuned via hoodie.cleaner.commits.retained.
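As a rough illustration of commit-based retention, the sketch below decides which versions of one file group could be deleted when only the last N commits must stay queryable. It is a simplified rule written for this article, not Hudi's cleaner: `CleanerSketch` and `versionsToClean` are hypothetical names, and the rule keeps every version written at or after the earliest retained commit plus the newest version written before it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy retention rule for the versions of a single file group; not Hudi's API.
final class CleanerSketch {
    // completedCommits: all completed commit times on the timeline.
    // versionCommitTimes: commit times at which this file group got a new version.
    static List<String> versionsToClean(List<String> completedCommits,
                                        List<String> versionCommitTimes,
                                        int commitsRetained) {
        if (commitsRetained <= 0) {
            throw new IllegalArgumentException("commitsRetained must be positive");
        }
        List<String> commits = new ArrayList<>(completedCommits);
        Collections.sort(commits);
        if (commits.size() <= commitsRetained) {
            return new ArrayList<>(); // nothing is old enough to clean
        }
        String earliestRetained = commits.get(commits.size() - commitsRetained);

        List<String> versions = new ArrayList<>(versionCommitTimes);
        Collections.sort(versions);

        // Newest version written before the earliest retained commit: a reader
        // positioned at that commit still needs it, so it must survive.
        int newestOlder = -1;
        for (int i = 0; i < versions.size(); i++) {
            if (versions.get(i).compareTo(earliestRetained) < 0) {
                newestOlder = i;
            }
        }
        if (newestOlder <= 0) {
            return new ArrayList<>();
        }
        // Every version older than that one is unreachable from the retained commits.
        return new ArrayList<>(versions.subList(0, newestOlder));
    }
}
```

Raising the retained-commit count keeps more history available for long-running and incremental queries at the cost of extra storage.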
Schema Evolution
Hudi uses Avro as its internal representation for records, which allows backwards‑compatible schema evolution. As long as changes are additive (new fields are appended and no existing fields are deleted), Hudi can read and write older and newer data side by side.
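Why additive changes are safe can be shown with a small projection step: a record written before a field existed is read through the newer schema, and the missing field takes its default. The sketch below uses plain maps rather than Avro, so it is only an analogy; `SchemaEvolutionSketch` and `project` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy schema projection with defaults; an analogy, not Avro's or Hudi's API.
final class SchemaEvolutionSketch {
    // readerDefaults: field name -> default used when the stored record lacks that field.
    static Map<String, Object> project(Map<String, Object> stored,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Use the stored value when present, otherwise fall back to the default.
            out.put(name, stored.containsKey(name) ? stored.get(name) : field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", 1);
        oldRecord.put("name", "a");

        Map<String, Object> newSchema = new LinkedHashMap<>();
        newSchema.put("id", null);
        newSchema.put("name", null);
        newSchema.put("city", "unknown"); // field added after oldRecord was written

        System.out.println(project(oldRecord, newSchema));
    }
}
```

Deleting or renaming a field breaks this projection for existing data, which is why only appended fields are considered safe here.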
Compaction & Performance
Copy‑On‑Write offers fast read‑optimized queries at the cost of heavier writes. Merge‑On‑Read provides low‑latency writes and near‑real‑time reads but requires periodic compaction. Performance tuning includes file‑size limits, log‑file size, and parallelism settings.
Best Practices
Choose COW for simple replacement of existing Parquet tables or when write latency is not critical.
Choose MOR for workloads needing fast ingestion and near‑real‑time visibility.
Configure cleaners and compaction to balance storage cost and query freshness.
Leverage HiveSync for downstream analytics tools.
Overall, Hudi provides snapshot isolation, atomic batch writes, incremental pull, and deduplication, making it a powerful framework for building efficient big‑data lakes.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
