
Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Big Data Technology & Architecture

1. Hudi Basic Concepts

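Apache Hudi (pronounced “Hoodie”) adds streaming primitives on top of datasets stored on a distributed file system (DFS), primarily upsert (insert/update) and incremental pull. Upsert applies record‑level inserts and updates to an existing dataset; incremental pull returns only the records that changed after a given point in time, which enables change‑capture queries.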

Timeline

Hudi maintains a timeline that records every action performed on the dataset. Each entry, called an instant, has three attributes: action type, instant time (a monotonically increasing timestamp), and state. Actions on the timeline are atomic and are ordered consistently by instant time.

COMMITS – atomic write of a batch of records.

CLEANS – background removal of obsolete file versions.

DELTA_COMMIT – incremental commit for Merge‑On‑Read tables.

COMPACTION – merges log files into columnar files.

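ROLLBACK – records that a commit or delta commit failed and was rolled back, removing any partial files it produced.

SAVEPOINT – marks file groups as saved so the cleaner does not delete them, which allows the dataset to be restored to that point.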

Each instant is in one of three states: REQUESTED (scheduled but not yet started), INFLIGHT (currently running), or COMPLETED (finished on the timeline).
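
The way states gate visibility can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `TimelineSketch` and its members are hypothetical names, not Hudi classes, and instant times are plain strings that sort lexicographically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of a Hudi-style timeline; not Hudi's actual classes. */
final class TimelineSketch {
    enum Action { COMMIT, DELTA_COMMIT, CLEAN, COMPACTION, ROLLBACK, SAVEPOINT }
    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static final class Instant {
        final Action action;
        final String time;              // e.g. "20230101000000"
        State state = State.REQUESTED;  // every instant starts as REQUESTED
        Instant(Action action, String time) { this.action = action; this.time = time; }
    }

    private final List<Instant> instants = new ArrayList<>();

    /** Schedules an action; instant times must strictly increase. */
    Instant request(Action action, String time) {
        if (!instants.isEmpty()
                && instants.get(instants.size() - 1).time.compareTo(time) >= 0) {
            throw new IllegalArgumentException("instant time must increase: " + time);
        }
        Instant instant = new Instant(action, time);
        instants.add(instant);
        return instant;
    }

    /** Moves an instant one step forward: REQUESTED -> INFLIGHT -> COMPLETED. */
    static void advance(Instant instant) {
        if (instant.state == State.COMPLETED) {
            throw new IllegalStateException("already completed");
        }
        instant.state = (instant.state == State.REQUESTED) ? State.INFLIGHT : State.COMPLETED;
    }

    /** Readers only trust completed instants, so half-written commits stay invisible. */
    Optional<String> latestCompletedTime() {
        String latest = null;
        for (Instant instant : instants) {   // list is ordered by instant time
            if (instant.state == State.COMPLETED) latest = instant.time;
        }
        return Optional.ofNullable(latest);
    }

    public static void main(String[] args) {
        TimelineSketch timeline = new TimelineSketch();
        Instant first = timeline.request(Action.COMMIT, "20230101000000");
        advance(first);
        advance(first);                      // first is now COMPLETED
        Instant second = timeline.request(Action.COMMIT, "20230101010000");
        advance(second);                     // second is still INFLIGHT
        System.out.println(timeline.latestCompletedTime().get()); // prints 20230101000000
    }
}
```

A query that resolves its snapshot through `latestCompletedTime()` never sees the second, still-running commit, which is the essence of the atomicity guarantee.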

File Organization

Datasets are partitioned into directories, similar to Hive tables. Within each partition, files are organized into file groups, each identified by a unique file ID. A file group holds one or more file slices; each slice consists of a base columnar file (Parquet) written at a given commit or compaction instant, plus zero or more log files that capture later inserts and updates to that base file.

Storage Types and Views

Hudi supports two storage types:

Copy‑On‑Write (COW): data is stored only in columnar files (Parquet); each commit that updates records rewrites the affected files as new versions.

Merge‑On‑Read (MOR): updates are appended to row‑based log files (Avro) and later compacted into columnar files, so fresh data can be read in near real time.

Corresponding query views are:

Read‑Optimized View – reads the latest snapshot of base files.

Incremental View – reads only data written after a given instant.

Realtime View – merges base and log files on‑the‑fly for low‑latency access.
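
The difference between the three views can be sketched on a single file group held in memory. Everything here is illustrative: `ViewsSketch`, `Versioned`, and the short instant times such as "001" are assumptions made for the example, and real Hudi reads Parquet and Avro files rather than maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of one file group and the three views; names are illustrative. */
final class ViewsSketch {
    /** A record value tagged with the instant time of the commit that wrote it. */
    static final class Versioned {
        final String value;
        final String commitTime;
        Versioned(String value, String commitTime) {
            this.value = value;
            this.commitTime = commitTime;
        }
    }

    final Map<String, Versioned> baseFile = new LinkedHashMap<>(); // columnar base file
    final Map<String, Versioned> logFile = new LinkedHashMap<>();  // row-based delta log

    /** Read-optimized view: base file only, so log updates are not visible yet. */
    Map<String, String> readOptimized() {
        Map<String, String> out = new LinkedHashMap<>();
        baseFile.forEach((key, rec) -> out.put(key, rec.value));
        return out;
    }

    /** Realtime view: base file overlaid with log records at query time. */
    Map<String, String> realtime() {
        Map<String, String> out = readOptimized();
        logFile.forEach((key, rec) -> out.put(key, rec.value));
        return out;
    }

    /** Incremental view: only records written strictly after the given instant. */
    Map<String, String> incremental(String beginInstant) {
        Map<String, String> out = new LinkedHashMap<>();
        baseFile.forEach((key, rec) -> {
            if (rec.commitTime.compareTo(beginInstant) > 0) out.put(key, rec.value);
        });
        logFile.forEach((key, rec) -> {
            if (rec.commitTime.compareTo(beginInstant) > 0) out.put(key, rec.value);
        });
        return out;
    }

    public static void main(String[] args) {
        ViewsSketch fileGroup = new ViewsSketch();
        fileGroup.baseFile.put("k1", new Versioned("a", "001"));
        fileGroup.baseFile.put("k2", new Versioned("b", "001"));
        fileGroup.logFile.put("k2", new Versioned("b2", "002")); // update not yet compacted
        System.out.println(fileGroup.readOptimized());    // prints {k1=a, k2=b}
        System.out.println(fileGroup.realtime());         // prints {k1=a, k2=b2}
        System.out.println(fileGroup.incremental("001")); // prints {k2=b2}
    }
}
```

The read-optimized result is cheaper to produce but lags behind; the realtime result pays a merge cost on every query in exchange for freshness.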

2. Writing to Hudi

Write Operations

Three primary write operations are supported:

UPSERT – default; records are upserted based on the record key.

INSERT – skips index lookup; useful when duplicate elimination is not required.

BULK_INSERT – optimized for initial bulk loads; writes are sorted for scalability.
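
The semantic difference between UPSERT and INSERT can be sketched without Spark. The model below is a simplification with hypothetical names (`WriteOpsSketch`, `Rec`): it assumes an overwrite-with-latest style payload, where duplicates inside the incoming batch are first reduced to the record with the largest ordering value and the survivor then replaces whatever is stored under the same key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of upsert vs. insert semantics; not Hudi's write client. */
final class WriteOpsSketch {
    static final class Rec {
        final String key;   // record key
        final long ts;      // ordering (precombine) field
        final String value;
        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    /** Collapses duplicates within one batch, keeping the largest ts per key. */
    static Map<String, Rec> precombine(List<Rec> batch) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec rec : batch) {
            latest.merge(rec.key, rec, (seen, next) -> next.ts >= seen.ts ? next : seen);
        }
        return latest;
    }

    /** UPSERT: look up each key and replace the stored record, or add it if new. */
    static Map<String, Rec> upsert(Map<String, Rec> table, List<Rec> batch) {
        Map<String, Rec> out = new LinkedHashMap<>(table);
        out.putAll(precombine(batch));
        return out;
    }

    /** INSERT: no key lookup, so the table may end up holding duplicates. */
    static List<Rec> insert(List<Rec> table, List<Rec> batch) {
        List<Rec> out = new ArrayList<>(table);
        out.addAll(batch);
        return out;
    }

    public static void main(String[] args) {
        Rec old = new Rec("k1", 1, "old");
        List<Rec> batch = List.of(
                new Rec("k1", 5, "new"), new Rec("k1", 3, "stale"), new Rec("k2", 1, "x"));
        Map<String, Rec> table = upsert(Map.of("k1", old), batch);
        System.out.println(table.size() + " " + table.get("k1").value); // prints 2 new
        System.out.println(insert(List.of(old), batch).size());         // prints 4
    }
}
```

Skipping the lookup is what makes INSERT cheaper, and it is also why it only suits data where duplicates are acceptable or impossible.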

DeltaStreamer

The HoodieDeltaStreamer utility ingests data from sources such as Kafka, DFS folders, or Sqoop incremental imports, and accepts incoming data as JSON, Avro, or a custom record type. Example command‑line options:

<code style="padding:16px;display:block;">[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar \
    --op UPSERT \
    --payload-class org.apache.hudi.OverwriteWithLatestAvroPayload \
    --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
    --target-base-path file:///tmp/hudi-deltastreamer-op \
    --target-table my_hudi_table</code>

DeltaStreamer can also read from Kafka and supports schema registry integration.

Datasource Writer

Using the Spark DataSource API, a DataFrame can be written to Hudi as follows:

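<code style="padding:16px;display:block;">import org.apache.hudi.DataSourceWriteOptions;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.sql.SaveMode;

inputDF.write()
    .format("org.apache.hudi")
    .options(clientOpts) // any other Hudi client options, as a Map<String, String>
    // field that uniquely identifies a record
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    // field that determines the partition path
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    // when two incoming records share a key, the one with the larger value here wins
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath);
</code>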

Hive Sync

Both DeltaStreamer and the DataSource writer can sync the Hudi table schema to Hive Metastore using the HiveSyncTool:

<code style="padding:16px;display:block;">cd hudi-hive
./run_sync_tool.sh --base-path /path/to/hudi --database default --table hudi_tbl --user hive_user --pass hive_pass --jdbc-url jdbc:hive2://host:10000</code>

Delete Operations

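Hudi supports two kinds of deletes. A soft delete keeps the record key and sets all other fields to null, which can be done with an ordinary upsert. A hard delete removes the record from the dataset entirely; with the DataSource writer this is done by setting the payload class to org.apache.hudi.EmptyHoodieRecordPayload, which deletes every record in the DataFrame being written: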

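<code style="padding:16px;display:block;">deleteDF.write() // DataFrame containing only the records to delete
    .format("org.apache.hudi")
    // also set the table name, record key, partition path, and precombine field as for a normal write
    .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), "org.apache.hudi.EmptyHoodieRecordPayload")
    .mode(SaveMode.Append) // write into the existing table; Spark's default mode fails if the path exists
    .save(basePath);
</code>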

3. Querying Hudi

Hive

To query Hudi tables in Hive, add hudi-hadoop-mr-bundle-*.jar to HiveServer2’s auxiliary classpath and set:

<code style="padding:16px;display:block;">set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
</code>

Hive provides read‑optimized, realtime, and incremental views via the corresponding Hudi input formats.

Spark

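Spark can access Hudi via the DataSource API or as a Hive table. To read the read‑optimized view as a Hive table through Spark SQL, register the Hudi path filter so that only the latest version of each base file is read; alternatively, load the dataset path directly through the DataSource API:

<code style="padding:16px;display:block;">// Option 1: register the path filter, then query the Hive table with Spark SQL
spark.sparkContext().hadoopConfiguration().setClass(
    "mapreduce.input.pathFilter.class",
    org.apache.hudi.hadoop.HoodieROTablePathFilter.class,
    org.apache.hadoop.fs.PathFilter.class);

// Option 2: read the dataset path through the Hudi DataSource
Dataset<Row> hoodieROViewDF = spark.read()
    .format("org.apache.hudi")
    .load("/path/to/hudi/table");
</code>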

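To query the realtime view, read the table as a Hive table and set spark.sql.hive.convertMetastoreParquet to false so that Spark falls back to the Hive SerDe instead of its built‑in Parquet reader, for example when launching the shell:

<code style="padding:16px;display:block;">spark-shell --conf spark.sql.hive.convertMetastoreParquet=false</code>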

Incremental queries are expressed with options:

<code style="padding:16px;display:block;">Dataset<Row> incDF = spark.read()
    .format("org.apache.hudi")
    .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(), DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20230101000000")
    .load(basePath);
</code>

Presto

Place hudi-presto-bundle.jar in Presto’s plugin/hive-hadoop2 directory to query the read‑optimized view directly.

4. Hudi Management & Operations

Storage Management

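Hudi mitigates the small‑file problem by routing new inserts into existing undersized file groups rather than always creating new files, by cleaning old file versions, and by letting you tune the target sizes of base files and log files. For MOR tables, compaction (inline or asynchronous) merges log files into columnar base files.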
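
The idea of steering inserts toward small files can be sketched as a simple packing routine. This is not Hudi's partitioner: `SmallFileSketch` and its parameters are hypothetical, sizes are abstract units, and the two thresholds only loosely mirror settings such as hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size. The sketch assumes the small-file limit does not exceed the maximum file size.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of small-file handling: top up small files before creating new ones. */
final class SmallFileSketch {
    /**
     * Returns the file sizes after routing `incoming` units of new inserts.
     * Assumes smallFileLimit <= maxFileSize.
     */
    static List<Long> assignInserts(List<Long> fileSizes, long incoming,
                                    long smallFileLimit, long maxFileSize) {
        List<Long> out = new ArrayList<>(fileSizes);
        // First fill existing files that are below the small-file limit.
        for (int i = 0; i < out.size() && incoming > 0; i++) {
            long size = out.get(i);
            if (size < smallFileLimit) {
                long added = Math.min(maxFileSize - size, incoming);
                out.set(i, size + added);
                incoming -= added;
            }
        }
        // Whatever is left goes into new files, each capped at maxFileSize.
        while (incoming > 0) {
            long size = Math.min(maxFileSize, incoming);
            out.add(size);
            incoming -= size;
        }
        return out;
    }

    public static void main(String[] args) {
        // One small file (20) and one full file (120); 150 units of inserts arrive.
        List<Long> result = assignInserts(List.of(20L, 120L), 150L, 100L, 120L);
        System.out.println(result); // prints [120, 120, 50]
    }
}
```

Without the first loop the same batch would have produced two extra files instead of one, which is how small files accumulate over many commits.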

Indexing

Indexes map record keys to file groups, accelerating upserts. Supported indexes include:

HoodieBloomIndex (default)

HoodieGlobalBloomIndex

HBaseIndex
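
How a Bloom-filter-based index can map a record key to a file can be sketched as follows. This is a deliberately small model, not HoodieBloomIndex: `BloomIndexSketch`, the filter size, and the hash scheme are assumptions for illustration, and the in-memory key set stands in for reading the actual keys from a candidate file.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy Bloom-filter index: one filter per file, confirmed against the file's keys. */
final class BloomIndexSketch {
    static final int BITS = 1024;

    static final class FileEntry {
        final BitSet bloom = new BitSet(BITS);
        final Set<String> keys = new HashSet<>(); // stands in for keys stored in the file
    }

    private final Map<String, FileEntry> files = new LinkedHashMap<>();

    /** Derives three bit positions for a key from two cheap hash values. */
    private static int[] positions(String key) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // second hash, forced odd
        int[] result = new int[3];
        for (int i = 0; i < result.length; i++) {
            result[i] = Math.floorMod(h1 + i * h2, BITS);
        }
        return result;
    }

    void add(String fileId, String key) {
        FileEntry file = files.computeIfAbsent(fileId, id -> new FileEntry());
        for (int p : positions(key)) file.bloom.set(p);
        file.keys.add(key);
    }

    /** Returns the file ID holding the key, or null if the key is new (an insert). */
    String lookup(String key) {
        int[] pos = positions(key);
        for (Map.Entry<String, FileEntry> entry : files.entrySet()) {
            FileEntry file = entry.getValue();
            boolean maybePresent = true;
            for (int p : pos) maybePresent &= file.bloom.get(p);
            // A Bloom filter can report false positives, so confirm against real keys.
            if (maybePresent && file.keys.contains(key)) return entry.getKey();
        }
        return null;
    }

    public static void main(String[] args) {
        BloomIndexSketch index = new BloomIndexSketch();
        index.add("file-1", "uuid-a");
        index.add("file-2", "uuid-b");
        System.out.println(index.lookup("uuid-a")); // prints file-1
        System.out.println(index.lookup("uuid-z")); // prints null
    }
}
```

The filter lets most files be skipped without opening them; only files whose filter says "maybe" need the more expensive key check.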

Cleaner

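The cleaner removes obsolete file versions after commits. Retention is tuned via hoodie.cleaner.commits.retained, which sets how many recent commits keep their file versions and therefore how far back incremental pulls can reach.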
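
A commit-based retention rule can be sketched for a single file group. The model is simplified and the names (`CleanerSketch`, `versionsToDelete`) are hypothetical: it assumes version times are sorted in ascending order, that at least one commit is retained, and that the newest version older than the retention window must also be kept because a reader positioned at the earliest retained commit may still depend on it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of commit-based cleaning for one file group. */
final class CleanerSketch {
    /**
     * @param versionTimes     commit times of this file group's versions, ascending
     * @param completedCommits all completed commit times on the timeline, ascending
     * @param commitsRetained  number of most recent commits to protect (>= 1)
     * @return version times that are safe to delete
     */
    static List<String> versionsToDelete(List<String> versionTimes,
                                         List<String> completedCommits,
                                         int commitsRetained) {
        List<String> toDelete = new ArrayList<>();
        if (completedCommits.size() <= commitsRetained) {
            return toDelete; // nothing is old enough to clean yet
        }
        String earliestRetained =
                completedCommits.get(completedCommits.size() - commitsRetained);
        // Newest version written before the retention window; it must survive.
        String keepBefore = null;
        for (String version : versionTimes) {
            if (version.compareTo(earliestRetained) < 0) keepBefore = version;
        }
        for (String version : versionTimes) {
            if (version.compareTo(earliestRetained) < 0 && !version.equals(keepBefore)) {
                toDelete.add(version);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> commits = List.of("001", "002", "003", "004", "005");
        List<String> versions = List.of("001", "002", "003", "005");
        System.out.println(versionsToDelete(versions, commits, 2)); // prints [001, 002]
    }
}
```

Raising the retained count keeps more versions on disk, which costs storage but lets long-running queries and incremental consumers reach further back.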

Schema Evolution

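Hudi uses Avro as its internal record representation, which gives it well‑defined schema evolution. As long as the schema passed to a write is backward compatible (for example, fields are only appended and never deleted), Hudi can read and write older and newer data together.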
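
Why appended fields are safe can be sketched with plain maps. This does not use the Avro library; `SchemaEvolutionSketch` and `resolve` are hypothetical names, and the reader schema is reduced to an ordered map from field name to default value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of reading an old record with a newer schema that appended a field. */
final class SchemaEvolutionSketch {
    /**
     * @param stored         the record as it was written under the old schema
     * @param readerDefaults reader schema: field name -> default value, in field order
     */
    static Map<String, Object> resolve(Map<String, Object> stored,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Use the stored value when present; otherwise fall back to the default.
            out.put(name, stored.containsKey(name) ? stored.get(name) : field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> newSchema = new LinkedHashMap<>();
        newSchema.put("id", "");
        newSchema.put("amount", 0L);
        newSchema.put("currency", "USD"); // appended field with a default

        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", "r1");
        oldRecord.put("amount", 42L);

        System.out.println(resolve(oldRecord, newSchema));
        // prints {id=r1, amount=42, currency=USD}
    }
}
```

Deleting a field breaks this symmetry: data written with the old schema would carry values the new schema has no place for, and older readers would find nothing to read for the field they expect.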

Compaction & Performance

Copy‑On‑Write offers fast read‑optimized queries at the cost of heavier writes. Merge‑On‑Read provides low‑latency writes and near‑real‑time reads but requires periodic compaction. Performance tuning includes file‑size limits, log‑file size, and parallelism settings.
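
The write-versus-read trade-off of Merge‑On‑Read can be sketched with a file group that compacts inline after a fixed number of delta commits. The names are hypothetical (`CompactionSketch`, `maxDeltaCommits`); the threshold is loosely analogous to a setting such as hoodie.compact.inline.max.delta.commits, and maps stand in for the base and log files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a Merge-On-Read file group with inline compaction. */
final class CompactionSketch {
    final Map<String, String> base = new LinkedHashMap<>();          // columnar base file
    final List<Map<String, String>> deltaLogs = new ArrayList<>();   // one block per delta commit
    final int maxDeltaCommits;

    CompactionSketch(int maxDeltaCommits) {
        this.maxDeltaCommits = maxDeltaCommits;
    }

    /** A delta commit only appends to the log, which keeps writes cheap. */
    void deltaCommit(Map<String, String> updates) {
        deltaLogs.add(new LinkedHashMap<>(updates));
        if (deltaLogs.size() >= maxDeltaCommits) {
            compact();
        }
    }

    /** Compaction folds all log blocks into the base file, oldest first. */
    void compact() {
        for (Map<String, String> block : deltaLogs) {
            base.putAll(block);
        }
        deltaLogs.clear();
    }

    public static void main(String[] args) {
        CompactionSketch fileGroup = new CompactionSketch(2);
        fileGroup.deltaCommit(Map.of("k1", "a"));
        System.out.println(fileGroup.base + " " + fileGroup.deltaLogs.size()); // prints {} 1

        Map<String, String> second = new LinkedHashMap<>();
        second.put("k1", "b");
        second.put("k2", "c");
        fileGroup.deltaCommit(second); // reaches the threshold and triggers compaction
        System.out.println(fileGroup.base + " " + fileGroup.deltaLogs.size()); // prints {k1=b, k2=c} 0
    }
}
```

A lower threshold keeps the read-optimized view fresher at the price of more frequent rewrites; a higher one favors ingestion speed and leaves more merging to realtime queries.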

Best Practices

Choose COW for simple replacement of existing Parquet tables or when write latency is not critical.

Choose MOR for workloads needing fast ingestion and near‑real‑time visibility.

Configure cleaners and compaction to balance storage cost and query freshness.

Leverage HiveSync for downstream analytics tools.

Overall, Hudi provides snapshot isolation, atomic batch writes, incremental pull, and deduplication, making it a powerful framework for building efficient big‑data lakes.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
