Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System
This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.
Mutable data handling is a long‑standing difficulty for large‑scale real‑time systems. After evaluating several solutions, Linkflow adopted a CDC‑to‑Hudi ingestion pipeline that delivers minute‑level data freshness in production.
1. Background
Linkflow is a Customer Data Platform (CDP) that collects massive amounts of data via SDKs and third‑party sources (WeChat, Weibo, etc.). Data is classified as immutable (facts) and mutable (dimensions) and stored in MySQL, which leads to fragmentation, difficult multi‑dimensional queries, and online DDL risks.
2. CDC and Data Lake
CDC (Change Data Capture) captures changes from MySQL binlog. After testing Canal, Maxwell, and Debezium, the team chose Debezium embedded in Flink via flink‑cdc‑connectors, enabling snapshot and incremental streams. Kafka is used as an ODS to route each table’s changes to separate topics, allowing replay and precise state reconstruction.
For storage, the team selected Apache Hudi because it provides upsert support on HDFS, incremental queries, Hive sync, and optimized COW/MOR modes.
The resulting architecture uses Kafka → Flink (CDC) → Hudi (COW tables) → Presto for ad‑hoc queries, with Spark Streaming handling the initial data load due to version constraints.
3. Technical Challenges
3.1 CDC Run‑Mode Customization
Full‑snapshot mode: Large tables make snapshots slow, and a failure partway through would normally force a full re‑snapshot. To avoid that, the team used the snapshot.include.collection.list parameter so that a resumed snapshot covers only a specific list of tables. The Debezium documentation describes the parameter as follows:
"An optional, comma‑separated list of regular expressions that match names of schemas specified in table.include.list for which you want to take the snapshot."
They also introduced an INITIAL_ONLY mode that performs only the snapshot and then stops, preventing unwanted binlog consumption after a resumed snapshot.
    /**
     * Perform a snapshot and then stop before attempting to read the binlog.
     */
    INITIAL_ONLY("initial_only", true);

    if (taskContext.isInitialSnapshotOnly()) {
        logger.warn("This connector will only perform a snapshot, and will stop after that completes.");
        chainedReaderBuilder.addReader(new BlockingReader("blocker", "Connector has completed all of its work but will continue in the running state. It can be shut down at any time."));
        chainedReaderBuilder.completionMessage("Connector configured to only perform snapshot, and snapshot completed successfully. Connector will terminate.");
    }

Incremental mode: After the snapshot, the job must continue from the original binlog position. The team enabled schema_only_recovery and manually set the binlog file and position:
    // Build an offset that points at an explicit binlog file and position.
    DebeziumOffset specificOffset = new DebeziumOffset();
    Map<String, Object> sourceOffset = new HashMap<>();
    sourceOffset.put("file", startupOptions.specificOffsetFile);
    sourceOffset.put("pos", startupOptions.specificOffsetPos);
    specificOffset.setSourceOffset(sourceOffset);

3.2 Patch Update (Partial Update)
Hudi’s default payload overwrites the whole record. To support partial updates (PATCH semantics), the team contributed OverwriteNonDefaultsWithLatestAvroPayload, which merges only the non‑null fields of an incoming record into the stored record. The change was filed upstream as HUDI‑1255.
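The PATCH semantics can be illustrated with a minimal sketch in plain Java. It does not use Hudi's payload API: the class PartialUpdateSketch, the method mergeNonNull, and the map-based record representation are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper illustrating PATCH-style merging; this is not Hudi's payload API. */
final class PartialUpdateSketch {

    /**
     * Returns a new record in which every non-null field of {@code incoming}
     * replaces the stored value, while null fields leave the stored value untouched.
     */
    static Map<String, Object> mergeNonNull(Map<String, Object> stored, Map<String, Object> incoming) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        for (Map.Entry<String, Object> field : incoming.entrySet()) {
            if (field.getValue() != null) {
                merged.put(field.getKey(), field.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 42L);
        stored.put("name", "Alice");
        stored.put("city", "Shanghai");

        Map<String, Object> patch = new LinkedHashMap<>();
        patch.put("id", 42L);
        patch.put("name", null);      // not supplied by this change event
        patch.put("city", "Beijing");

        // Prints {id=42, name=Alice, city=Beijing}
        System.out.println(mergeNonNull(stored, patch));
    }
}
```

A whole-record overwrite would have replaced name with null here; the merge keeps the stored name and updates only city.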
3.3 Row‑Key Merging Within a Batch
Because CDC can emit multiple changes for the same row key within a micro‑batch, the streaming job merges records by timestamp to ensure later updates overwrite earlier ones.
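A rough sketch of this per-batch merge follows, assuming each change event carries a row key and an event timestamp. The Change class, its fields, and latestPerKey are hypothetical names; the team's actual streaming job is not shown in the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-batch row-key merging; class and field names are illustrative. */
final class BatchMergeSketch {

    /** One CDC change event: the row key, the event time in milliseconds, and an opaque payload. */
    static final class Change {
        final String rowKey;
        final long timestampMs;
        final String payload;

        Change(String rowKey, long timestampMs, String payload) {
            this.rowKey = rowKey;
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    /** Keeps one change per row key: the one with the greatest timestamp. */
    static List<Change> latestPerKey(List<Change> batch) {
        Map<String, Change> latest = new LinkedHashMap<>();
        for (Change change : batch) {
            // On equal timestamps the later arrival wins, so batch order breaks ties.
            latest.merge(change.rowKey, change,
                    (held, incoming) -> incoming.timestampMs >= held.timestampMs ? incoming : held);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Change> batch = Arrays.asList(
                new Change("u1", 100L, "a"),
                new Change("u2", 105L, "b"),
                new Change("u1", 120L, "c"),
                new Change("u1", 110L, "d"));   // arrives late, must not win
        for (Change c : latestPerKey(batch)) {
            System.out.println(c.rowKey + " -> " + c.payload);
        }
    }
}
```

Comparing timestamps rather than trusting arrival order means an out-of-order event within the batch (the "d" change above) cannot overwrite a newer one.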
3.4 Schema Evolution
Hudi leverages Avro’s schema compatibility. Four compatibility modes are described (backward, forward, full, none). The team uses backward‑compatible schema extensions but notes large files and GC pressure when many columns are added.
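The idea behind a backward‑compatible extension can be sketched without the Avro library: a reader using the newer schema can still read old records as long as each added field has a default to fall back on. The sketch below is a simplification under that assumption. It models a schema as a map from field name to default value, and SchemaEvolutionSketch and readWithNewSchema are hypothetical names, not Avro or Hudi APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of reading old data with a newer schema; not the Avro API. */
final class SchemaEvolutionSketch {

    /**
     * Projects {@code oldRecord} onto the newer schema. The schema is modelled as
     * field name -> default value; a field missing from the old record takes its default.
     */
    static Map<String, Object> readWithNewSchema(Map<String, Object> oldRecord,
                                                 Map<String, Object> newSchemaDefaults) {
        Map<String, Object> projected = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : newSchemaDefaults.entrySet()) {
            String name = field.getKey();
            projected.put(name, oldRecord.containsKey(name) ? oldRecord.get(name) : field.getValue());
        }
        return projected;
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", 7L);
        oldRecord.put("name", "Alice");

        Map<String, Object> newSchema = new LinkedHashMap<>();
        newSchema.put("id", 0L);
        newSchema.put("name", "");
        newSchema.put("email", null);   // column added later, nullable with a null default

        // Prints {id=7, name=Alice, email=null}
        System.out.println(readWithNewSchema(oldRecord, newSchema));
    }
}
```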
3.5 Simultaneous Query and Write Exceptions
Presto queries sometimes failed with NPE due to concurrent metadata updates. The fix replaced hoodiePathCache with a thread‑safe ConcurrentHashMap and repackaged hudi‑hadoop‑mr.jar and hudi‑common.jar in Presto’s plugin directory.
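The effect of switching to a ConcurrentHashMap can be shown with a small stand‑alone sketch. It is not the patched Hudi code: PathCacheSketch, lookup, the String value type, and the example paths are assumptions made for illustration, since the source does not show the cache's real types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.IntStream;

/** Hypothetical path-keyed cache that is safe to share between query threads. */
final class PathCacheSketch {

    private final Map<String, String> entries = new ConcurrentHashMap<>();

    /** Loads the entry for {@code path} at most once, even when many threads ask at the same time. */
    String lookup(String path, Function<String, String> loader) {
        return entries.computeIfAbsent(path, loader);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        PathCacheSketch cache = new PathCacheSketch();
        AtomicInteger loads = new AtomicInteger();

        // 1000 concurrent lookups spread over 4 distinct (made-up) paths.
        IntStream.range(0, 1000).parallel().forEach(i ->
                cache.lookup("/warehouse/contact/part-" + (i % 4), path -> {
                    loads.incrementAndGet();
                    return "metadata for " + path;
                }));

        // Prints "4 entries, 4 loads"
        System.out.println(cache.size() + " entries, " + loads.get() + " loads");
    }
}
```

With a plain HashMap, concurrent puts during a resize can lose or corrupt entries, which is one way a reader can end up observing a null it did not expect.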
4. Effects
Simplified real‑time data writes; developers no longer need to distinguish inserts from updates.
Minute‑level latency from ingestion to queryable state despite using COW tables.
Reduced offline processing volume by leveraging Hudi’s incremental view, shortening batch runtimes.
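The saving from the incremental view comes from reading only rows committed after the last processed commit instead of rescanning the table. A minimal sketch of that idea follows. It assumes commit times are fixed‑width timestamp strings that sort lexicographically, and Row, changedAfter, and the sample values are hypothetical; a real job would use Hudi's incremental query support rather than filtering in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of an incremental pull; a real job would use Hudi's incremental query. */
final class IncrementalPullSketch {

    /** A stored row tagged with the commit that last wrote it. */
    static final class Row {
        final String key;
        final String commitTime;   // assumed fixed-width, e.g. yyyyMMddHHmmss

        Row(String key, String commitTime) {
            this.key = key;
            this.commitTime = commitTime;
        }
    }

    /** Returns only the rows written by commits strictly after {@code lastProcessedCommit}. */
    static List<Row> changedAfter(List<Row> table, String lastProcessedCommit) {
        List<Row> changed = new ArrayList<>();
        for (Row row : table) {
            if (row.commitTime.compareTo(lastProcessedCommit) > 0) {
                changed.add(row);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
                new Row("u1", "20210301090000"),
                new Row("u2", "20210301100000"),
                new Row("u3", "20210301110000"));
        // The previous batch run stopped at the 10:00 commit, so only u3 is new.
        for (Row row : changedAfter(table, "20210301100000")) {
            System.out.println(row.key);
        }
    }
}
```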
5. Future Plans
5.1 Flink Integration
The team is monitoring the progress of native Flink support in Hudi (RFC‑24) and plans to adopt Hudi 0.8.0’s deep Flink integration to eliminate the dual‑engine (Flink + Spark) setup.
5.2 Concurrent Writes
Before Hudi 0.8.0, concurrent writes were unsupported. The team explored vertical partitioning and CDC‑to‑Kafka replay as workarounds, and will evaluate the optimistic‑lock based concurrent write mode introduced in 0.8.0.
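Optimistic locking in general lets writers proceed without holding a lock and checks for conflicts only at commit time. The sketch below shows that generic idea, not Hudi's implementation: OccSketch, begin, tryCommit, and the file names are hypothetical, and a conflict is defined here simply as two writers touching the same file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Generic optimistic-concurrency sketch; names and conflict rule are illustrative, not Hudi's. */
final class OccSketch {

    /** File sets of commits that have succeeded, in commit order. */
    private final List<Set<String>> committed = new ArrayList<>();

    /** A writer records the position of the commit log when it starts. */
    synchronized int begin() {
        return committed.size();
    }

    /**
     * Commits only if no writer that finished after {@code startedAt} touched any of
     * the same files; otherwise the caller is expected to retry from a fresh start.
     */
    synchronized boolean tryCommit(int startedAt, Set<String> touchedFiles) {
        for (int i = startedAt; i < committed.size(); i++) {
            if (!Collections.disjoint(committed.get(i), touchedFiles)) {
                return false;
            }
        }
        committed.add(new HashSet<>(touchedFiles));
        return true;
    }

    public static void main(String[] args) {
        OccSketch timeline = new OccSketch();
        int a = timeline.begin();
        int b = timeline.begin();
        int c = timeline.begin();

        System.out.println(timeline.tryCommit(a, new HashSet<>(Arrays.asList("f1", "f2")))); // true
        System.out.println(timeline.tryCommit(b, new HashSet<>(Arrays.asList("f2", "f3")))); // false, overlaps on f2
        System.out.println(timeline.tryCommit(c, new HashSet<>(Arrays.asList("f4"))));       // true, disjoint files
    }
}
```

Writers that touch disjoint files never block each other under this scheme, which suits workloads where different tables or partitions are written concurrently; writers that frequently overlap pay for it in retries.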
5.3 Performance Optimizations
Indexing bottlenecks with HoodieGlobalBloomIndex lead to long index build times; the team is considering HBaseIndex or custom indexes. They also plan to tune file size/number parameters, experiment with MOR tables for write‑heavy workloads, and balance trade‑offs for faster upserts.
6. Conclusion
The CDC‑to‑Hudi pipeline successfully addresses mutable data challenges, provides schema evolution, separates compute and storage, and enables incremental views and time‑travel queries. Ongoing work includes deeper Flink integration, concurrent write support, and further performance tuning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
