Near Real-Time Ingestion, Analysis, Incremental Pipelines, and Data Distribution with Apache Hudi
This article explains how Apache Hudi enables near-real-time data ingestion from a variety of sources, supports low-latency analytics, provides incremental processing pipelines, and simplifies data distribution on Hadoop, improving efficiency and reducing operational complexity.
1. Near Real-Time Ingestion

Extracting data from external sources such as event logs or databases into a Hadoop data lake is common, but many deployments still rely on ad hoc tooling. Hudi speeds up RDBMS ingestion with upserts: changes captured from a MySQL binlog or a Sqoop incremental extract can be applied directly to a Hudi table, avoiding costly batch merges that rewrite entire partitions. For NoSQL stores such as Cassandra, Voldemort, or HBase, periodic bulk loads are impractical at scale; Hudi's upsert-based ingestion instead keeps pace with their frequent updates. Even append-only sources such as Kafka benefit, because Hudi enforces minimum file sizes on DFS, protecting NameNode health when ingesting large event streams. Finally, Hudi publishes new data to consumers atomically, so a failed extraction never exposes partial results.
2. Near Real-Time Analytics

Specialized real-time data marts (e.g., Druid, MemSQL, OpenTSDB) deliver sub-second queries, but they typically serve smaller datasets and add operational dependencies outside Hadoop. Interactive SQL engines such as Presto and SparkSQL can answer queries in seconds over much larger tables stored on DFS. By shrinking data freshness to minutes, Hudi makes these tables a practical alternative for near-real-time analysis without standing up an external serving system.
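Querying a Hudi table interactively looks like querying any other Spark table; a snapshot read sees data as of the latest completed commit. A minimal sketch, with the path, view name, and column names assumed for illustration:

```scala
// Sketch: interactive SQL over a Hudi table with SparkSQL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-query").getOrCreate()

// Snapshot query: reads the table as of the latest completed commit.
spark.read.format("hudi")
  .load("hdfs:///lake/users")
  .createOrReplaceTempView("users")

spark.sql("SELECT ds, COUNT(*) AS cnt FROM users GROUP BY ds").show()
```

For Presto or Hive, Hudi's Hive-sync mechanism registers the same table in the metastore, so those engines query it in place with no extra copy.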
3. Incremental Processing Pipelines

Hadoop workflows commonly chain jobs that wait for upstream output partitions (e.g., Hive partitions) to appear, introducing hour-level latency. Hudi removes this bottleneck by letting consumers pull new data at the record level rather than whole folders: a downstream Hudi table (HD) can process updates from an upstream Hudi table (HU) every 15 minutes, achieving end-to-end latency of roughly 30 minutes. Hudi also integrates with streaming frameworks (Spark Streaming), pub/sub systems (Kafka), and database replication tools (Oracle XStream), bringing incremental-processing advantages over pure batch or pure streaming approaches.
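The HU-to-HD chaining described above rests on Hudi's incremental query type, which returns only records committed after a given instant. A sketch of one 15-minute run, where the paths, the derived aggregation, and the checkpointed instant value are all hypothetical; the `hoodie.datasource.query.type` and `begin.instanttime` options are standard Hudi read options:

```scala
// Sketch: a scheduled job that consumes only new HU commits and
// upserts the derived result into HD.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hu-to-hd").getOrCreate()

// Commit instant up to which the previous run consumed HU
// (in practice persisted in a checkpoint store).
val lastInstant = "20240101120000"

val newRecords = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", lastInstant)
  .load("hdfs:///lake/HU")

// Derive and upsert into HD; only the affected keys are rewritten,
// instead of recomputing whole partitions.
newRecords.groupBy("user_id").count()
  .withColumnRenamed("count", "events")
  .write.format("hudi")
  .option("hoodie.table.name", "HD")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "user_id")
  .option("hoodie.datasource.write.precombine.field", "events")
  .mode(SaveMode.Append)
  .save("hdfs:///lake/HD")
```

Running this every 15 minutes on both hops yields the roughly 30-minute end-to-end latency the article cites.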
4. Data Distribution on DFS

A common pattern pushes processed Hadoop results to online stores (e.g., ElasticSearch) through a queue such as Kafka, which duplicates the data on both DFS and Kafka. Hudi can eliminate the extra hop: the Spark pipeline upserts its results into a Hudi table, and downstream services perform incremental reads against that table, much as they would consume a Kafka topic, yielding a single, unified copy of the data.
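Treating the Hudi table itself as the change feed can be sketched as follows. The path, checkpoint value, and `pushToSearchIndex` sink are hypothetical stand-ins; the incremental-read options are the same standard ones used for pipeline chaining:

```scala
// Sketch: distributing new results to an online store without a Kafka hop.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("hudi-distribute").getOrCreate()

// Last commit instant already delivered downstream (persisted elsewhere).
val checkpoint = "20240101120000"

val delta: DataFrame = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", checkpoint)
  .load("hdfs:///lake/results")

// Hypothetical sink: e.g. bulk-index the changed rows into ElasticSearch.
def pushToSearchIndex(batch: DataFrame): Unit = {
  // batch.foreachPartition { rows => /* bulk writes to the online store */ }
}
pushToSearchIndex(delta)
```

Each run advances the checkpoint to the latest consumed instant, giving Kafka-style at-least-once delivery semantics while the Hudi table on DFS remains the only stored copy.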