
ByteDance’s Journey to a Unified Data Lake with Flink and Hudi

This article recounts ByteDance’s evolution from batch‑only Flink pipelines to a unified data‑lake integration platform, detailing the three integration modes, challenges with Spark‑based CDC, the decision to adopt Hudi over Iceberg, and how Hudi’s indexing and Merge‑On‑Read formats enable near‑real‑time analytics at massive scale.

Volcano Engine Developer Services

ByteDance Data Integration Overview

ByteDance began building Flink‑based batch sync channels in 2018 to move data from online databases into offline warehouses. It added a real‑time MQ‑to‑Hive/HDFS pipeline in 2020, and in 2021 created a real‑time data‑lake integration channel that unifies batch, streaming, and incremental modes.

The system now supports dozens of pipelines covering MySQL, Oracle, MongoDB, Kafka, RocketMQ, HDFS, Hive, and ClickHouse, and serves most business lines, including Douyin and Toutiao.

Three Integration Modes

Batch mode uses Flink Batch to move data in bulk across more than 20 source types.

Streaming mode pulls data from MQ into Hive and HDFS, offering high stability and low latency.

Incremental (CDC) mode captures binlog changes and writes them to external stores, handling high‑volume, low‑latency updates.
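To make the incremental mode concrete, here is a minimal sketch of how change events can be folded into a keyed table. The `ChangeEvent` shape and the `apply_changes` helper are hypothetical names invented for illustration; real binlog events carry more metadata, and this is not ByteDance's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record; not an actual binlog format."""
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes
    ts: int               # ordering value, assumed to reflect commit order


def apply_changes(table: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply events in ts order, mutating and returning the table."""
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.op in ("insert", "update"):
            table[ev.key] = ev.row      # inserts and updates are both upserts
        elif ev.op == "delete":
            table.pop(ev.key, None)     # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {ev.op}")
    return table


snapshot = {1: {"id": 1, "name": "a"}}
events = [
    ChangeEvent("update", 1, {"id": 1, "name": "a2"}, ts=20),
    ChangeEvent("insert", 2, {"id": 2, "name": "b"}, ts=10),
    ChangeEvent("delete", 2, None, ts=30),
]
result = apply_changes(dict(snapshot), events)
```

Sorting by `ts` before applying matters because events can arrive out of order; the sketch assumes `ts` is a reliable ordering, which a production pipeline would have to guarantee.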

Problems with the original Spark‑based CDC path included heavy resource consumption, long processing chains, and high operational cost.

To address these, ByteDance merged the incremental mode into the streaming integration, eliminating Spark and achieving a unified Flink engine that supports all three scenarios with near‑real‑time latency and lower cost.

Choosing a Data‑Lake Framework

After evaluating Apache Iceberg and Apache Hudi, ByteDance selected Hudi because its CDC capabilities are more mature, its community evolves rapidly, and its integration with Flink is robust.

Why Hudi’s Index System Matters

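Hudi provides multiple index types (Bloom, hash, state, HBase) that dramatically reduce merge costs. An index maps each incoming record to the file that already holds its key, so only the affected files are rewritten. Without an index, every merge needs a full‑table shuffle, which can double processing time, and that cost keeps rising as the table grows.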

Examples:

Log deduplication: log data typically carries a timestamp column and separates into hot (recent) and cold (older) partitions, a pattern that suits Bloom or TTL‑based state indexes.

CDC workloads with random updates benefit from hash, state, or HBase indexes for efficient global indexing.
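The idea behind a hash index can be sketched in a few lines: a record key always maps to the same bucket, so an update only touches the file group for that bucket instead of shuffling the whole table. The bucket count, the use of CRC32, and the function names below are assumptions made for illustration and do not reflect Hudi's internal hashing.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_BUCKETS = 4  # assumed fixed bucket count, chosen for the example


def bucket_for(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a record key to a bucket id."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def group_by_bucket(updates: List[Tuple[str, dict]]) -> Dict[int, List[Tuple[str, dict]]]:
    """Group incoming updates by target bucket; only these buckets need rewriting."""
    grouped: Dict[int, List[Tuple[str, dict]]] = defaultdict(list)
    for key, row in updates:
        grouped[bucket_for(key)].append((key, row))
    return dict(grouped)


updates = [("user-1", {"v": 1}), ("user-2", {"v": 2}), ("user-1", {"v": 3})]
touched = group_by_bucket(updates)
```

Because the mapping is a pure function of the key, no lookup against existing data is needed, which is what makes this style of index attractive for random updates across the whole table.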

Merge‑On‑Read (MOR) Table Format

Hudi’s MOR format stores updates in small Avro log files while keeping historical data in large columnar base files (Parquet/ORC). Writes are low‑latency to log files; reads merge log and base files, and periodic compaction keeps log size in check, enabling real‑time writes and near‑real‑time queries.
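The write, read, and compaction paths described above can be modeled with a toy in-memory structure. `MorFileGroup` is a hypothetical class for illustration only: plain dictionaries and lists stand in for the columnar base file and the Avro log files, and none of this is Hudi's actual API.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
LogEntry = Tuple[str, Optional[Row]]  # (record key, new row; None marks a delete)


class MorFileGroup:
    """Toy model of one Merge-On-Read file group."""

    def __init__(self, base: Dict[str, Row]):
        self.base: Dict[str, Row] = dict(base)  # stands in for the base file
        self.log: List[LogEntry] = []           # stands in for the log files

    def write(self, key: str, row: Optional[Row]) -> None:
        """Cheap write path: append to the log, never rewrite the base."""
        self.log.append((key, row))

    def read(self) -> Dict[str, Row]:
        """Read path: replay the log over the base to get the current view."""
        merged = dict(self.base)
        for key, row in self.log:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self) -> None:
        """Fold the log into a new base so later reads have less to merge."""
        self.base = self.read()
        self.log = []


fg = MorFileGroup({"k1": {"v": 1}})
fg.write("k1", {"v": 2})
fg.write("k2", {"v": 3})
before = fg.read()
fg.compact()
after = fg.read()
```

The trade-off is visible in the sketch: `write` is constant-time, while `read` pays for every uncompacted log entry, which is why periodic compaction is needed to keep query latency bounded.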

Incremental Computation

Hudi’s incremental query feature lets users pull only the data that changed after a given commit timestamp, supporting Lambda‑style architectures that serve both streaming and batch workloads with minute‑level latency.
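As a rough illustration of incremental pulls, the sketch below models a table as a timeline of commits and returns only the rows changed after a given commit time. The `incremental_pull` function, the fixed-width timestamp strings, and the choice to treat the begin time as exclusive are assumptions of this example, not Hudi's API.

```python
from typing import Dict, List, Tuple

# (commit time as a fixed-width string, {record key: row written in that commit})
Commit = Tuple[str, Dict[str, dict]]


def incremental_pull(timeline: List[Commit], begin_time: str) -> Dict[str, dict]:
    """Return the latest row image of every key changed strictly after begin_time."""
    changed: Dict[str, dict] = {}
    for commit_time, rows in sorted(timeline, key=lambda c: c[0]):
        # Fixed-width timestamps compare correctly as plain strings.
        if commit_time > begin_time:
            changed.update(rows)  # later commits overwrite earlier ones
    return changed


timeline = [
    ("20210101100000", {"a": {"v": 1}, "b": {"v": 1}}),
    ("20210101100500", {"a": {"v": 2}}),
    ("20210101101000", {"c": {"v": 1}}),
]
delta = incremental_pull(timeline, "20210101100000")
```

A downstream job that remembers the last commit time it processed can call this repeatedly and handle only the delta each time, rather than rescanning the full table.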

Conclusion

By adopting Hudi as the data‑lake foundation, ByteDance built a customized solution that supports all update‑heavy pipelines, reduces cost, and delivers near‑real‑time analytics. The technology is now exposed through the Volcano Engine DataLeap suite.
