
Implementing Change Data Capture (CDC) on Data Lake Formats with Apache Hudi

This article reviews lake‑format concepts, Apache Hudi architecture, CDC fundamentals, design considerations for CDC on lake formats, implementation details of Hudi CDC, and streaming optimizations including automated lake‑table management and a simplified StreamingSQL for Spark.

Big Data Technology Architecture
This article, based on a presentation by an Alibaba Cloud big-data platform expert at ApacheCon Asia, is organized into four parts: lake formats, Hudi, and CDC fundamentals; design considerations for implementing CDC on lake formats; the Hudi CDC implementation; and streaming optimizations.

Key lakehouse features include ACID transactions, schema enforcement and evolution, support for structured and unstructured data, batch-stream integration, and compute-storage separation. These are enabled by the data organization of the underlying lake formats, such as Delta Lake, Apache Hudi, and Apache Iceberg.

Apache Hudi is a data‑lake platform built on a self‑managed storage layer, offering fine‑grained file groups, multiple indexing strategies for upserts, and rich automated table‑management capabilities.

Change Data Capture (CDC) captures row‑level changes (insert, update, delete) with timestamps and before/after values; traditional methods include timestamps, table diffs, triggers, and transaction‑log parsing, each with distinct drawbacks.
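To make one of those drawbacks concrete, the sketch below (plain Python, with hypothetical rows and column names) shows timestamp-based CDC: rows are captured by filtering on an audit column against the last watermark. Its well-known weakness is that hard deletes leave no row behind and are therefore missed.

```python
from datetime import datetime

# Hypothetical rows from a source table with a last_modified audit column.
rows = [
    {"id": 1, "name": "a", "last_modified": datetime(2023, 1, 1)},
    {"id": 2, "name": "b", "last_modified": datetime(2023, 1, 3)},
    {"id": 3, "name": "c", "last_modified": datetime(2023, 1, 5)},
]

def capture_since(rows, watermark):
    """Timestamp-based CDC: pick up rows modified after the last watermark.
    Inherent drawback: a hard delete removes the row entirely, so this
    method never sees it."""
    return [r for r in rows if r["last_modified"] > watermark]

changed = capture_since(rows, datetime(2023, 1, 2))
print([r["id"] for r in changed])  # → [2, 3]
```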

Typical CDC scenarios require full row information, inclusion of old values for delete/update, and the ability to trace every change.

CDC output formats such as Debezium and Delta Lake’s custom format provide operation type, timestamp/version, and before/after values for downstream consumption.
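A simplified Debezium-style envelope can be sketched as below; real Debezium records additionally carry source metadata, schema information, and transaction details, and the field values here are hypothetical.

```python
import json

def debezium_style_record(op, before, after, ts_ms):
    """Build a simplified Debezium-style change envelope: operation type,
    before/after row images, and a timestamp for downstream consumers."""
    assert op in ("c", "u", "d")  # create / update / delete
    return {"op": op, "before": before, "after": after, "ts_ms": ts_ms}

# An update carries both the old and the new row image.
update = debezium_style_record(
    op="u",
    before={"id": 1, "amount": 10},
    after={"id": 1, "amount": 25},
    ts_ms=1690000000000,
)
print(json.dumps(update, sort_keys=True))
```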

Designing CDC for lake formats must consider multi‑version snapshots, file‑level metadata changes, lack of always‑on services, and performance impacts on reads and writes.

Two main approaches are table‑diff CDC (leveraging time‑travel queries) and Change Data Feed (CDF) as used by Delta Lake, which writes CDC data directly during commits, offering better query performance at the cost of additional write overhead.
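The table-diff approach can be illustrated with a minimal sketch, assuming two keyed snapshots of a table such as time-travel queries could return; comparing them by key yields the change rows with operation type and before/after values.

```python
def table_diff_cdc(old_snapshot, new_snapshot):
    """Table-diff CDC sketch: compare two versions of a table (keyed by
    primary key) and emit change rows with op type and before/after values."""
    changes = []
    for key in sorted(old_snapshot.keys() | new_snapshot.keys()):
        before = old_snapshot.get(key)
        after = new_snapshot.get(key)
        if before is None:
            changes.append({"op": "insert", "before": None, "after": after})
        elif after is None:
            changes.append({"op": "delete", "before": before, "after": None})
        elif before != after:
            changes.append({"op": "update", "before": before, "after": after})
    return changes

# Hypothetical snapshots at two table versions.
old = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
new = {2: {"id": 2, "v": "b2"}, 3: {"id": 3, "v": "c"}}
changes = table_diff_cdc(old, new)  # one delete, one update, one insert
```

The cost profile matches the trade-off above: nothing extra is written at commit time, but every CDC query must read and join two full snapshots.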

For Hudi CDC, the write path goes through HoodieWriteHandle subclasses; HoodieMergeHandle, which handles upserts, is where CDC data must be persisted. On the query side, CDCFileSplit objects determine how changes are extracted: from persisted CDC files, from files containing only new records, or from files whose records were all deleted.

Enabling CDC in Hudi is done by setting the table property hoodie.table.cdc.enabled=true and configuring the query type:

hoodie.datasource.query.type = incremental
hoodie.datasource.query.incremental.format = cdc
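Taken together, a CDC incremental read could be configured roughly as follows. This is a sketch: the instant-time bounds are hypothetical placeholder values, added here because incremental queries in Hudi are bounded by begin/end commit instants.

```
hoodie.table.cdc.enabled                   = true
hoodie.datasource.query.type               = incremental
hoodie.datasource.query.incremental.format = cdc
hoodie.datasource.read.begin.instanttime   = 20230101000000
hoodie.datasource.read.end.instanttime     = 20230102000000
```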

The supplemental logging mode can be chosen with hoodie.table.cdc.supplemental.logging.name, defaulting to data_before_after, with alternatives op_key_only and data_before.
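The trade-off between these modes can be sketched as below (plain Python, hypothetical field names): richer modes persist more per change record, costing write-side storage but sparing CDC queries from reconstructing values by re-reading data files.

```python
def cdc_payload(mode, op, key, before, after):
    """Sketch of what each supplemental logging mode persists per change.
    op_key_only is the smallest on disk; data_before_after (the default)
    makes CDC queries cheapest because nothing must be reconstructed."""
    if mode == "op_key_only":
        return {"op": op, "key": key}
    if mode == "data_before":
        return {"op": op, "key": key, "before": before}
    if mode == "data_before_after":
        return {"op": op, "key": key, "before": before, "after": after}
    raise ValueError(f"unknown mode: {mode}")

full = cdc_payload("data_before_after", "u", 1, {"v": "a"}, {"v": "b"})
```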

Future work includes full Flink and Spark CDC support, extending Spark SQL for CDC queries, and providing a flat CDC output format similar to Delta Lake.

Streaming optimizations address two challenges:

(1) Lake-table management tasks (vacuum, clean, optimize, clustering) can block streaming jobs. Alibaba Cloud EMR's Data Lake Formation automates these tasks outside the streaming workflow, triggered by real-time metrics.

(2) Developing streaming jobs is complex. EMR extends Spark SQL with StreamingSQL, allowing declarative creation of streaming views and streams that consume Kafka data and perform MERGE INTO writes, greatly simplifying development and operations.
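As an illustration only, a StreamingSQL pipeline of the kind described might look like the sketch below. The DDL keywords (CREATE SCAN, CREATE STREAM), option names, and table names are assumptions based on the description above, not verbatim EMR syntax, which varies by version.

```sql
-- Declare a source table over a Kafka topic (broker and topic are placeholders).
CREATE TABLE kafka_orders
USING kafka
OPTIONS (kafka.bootstrap.servers = 'broker:9092', subscribe = 'orders');

-- Declare a streaming view (scan) over it.
CREATE SCAN orders_stream ON kafka_orders USING STREAM;

-- A long-running stream that merges Kafka rows into a Hudi table.
CREATE STREAM orders_to_hudi
OPTIONS (checkpointLocation = '/checkpoints/orders_to_hudi')
MERGE INTO hudi_orders AS t
USING orders_stream AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```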

Tags: Big Data, Streaming, data lake, Apache Hudi, CDC, Delta Lake
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
