ByteDance’s Journey to a Unified Data Lake with Flink and Hudi
This article recounts ByteDance’s evolution from batch‑only Flink pipelines to a unified data‑lake integration platform, detailing the three integration modes, challenges with Spark‑based CDC, the decision to adopt Hudi over Iceberg, and how Hudi’s indexing and Merge‑On‑Read formats enable near‑real‑time analytics at massive scale.
ByteDance Data Integration Overview
ByteDance began building Flink‑based batch sync channels in 2018 to move online databases into offline warehouses, added a real‑time MQ‑to‑Hive/HDFS pipeline in 2020, and in 2021 created a real‑time data‑lake integration channel that unifies batch, streaming, and incremental modes.
The system now supports dozens of pipelines covering MySQL, Oracle, MongoDB, Kafka, RocketMQ, HDFS, Hive, and ClickHouse, and serves most business lines, including Douyin and Toutiao.
Three Integration Modes
Batch mode uses Flink Batch to move data in bulk across more than 20 source types.
Streaming mode pulls data from MQ into Hive and HDFS, offering high stability and low latency.
Incremental (CDC) mode captures binlog changes and writes them to external stores, handling high‑volume, low‑latency updates.
Problems with the original Spark‑based CDC path included heavy resource consumption, long processing chains, and high operational cost.
To address these, ByteDance merged the incremental mode into the streaming integration, eliminating Spark and achieving a unified Flink engine that supports all three scenarios with near‑real‑time latency and lower cost.
Choosing a Data‑Lake Framework
After evaluating Apache Iceberg and Apache Hudi, ByteDance selected Hudi because its CDC capabilities are more mature, its community evolves rapidly, and its integration with Flink is robust.
Why Hudi’s Index System Matters
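Hudi provides multiple index types (Bloom, hash, state, HBase) that map each incoming record to the file that already holds its key, which sharply reduces merge cost. Without an index, every merge has to shuffle incoming records against the full table; the article puts the penalty at roughly double the processing time, and that cost keeps growing as the table gets larger.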
Examples:
Log deduplication keys on a timestamp column, so writes concentrate in recent (hot) partitions while older (cold) partitions rarely change; such workloads benefit from Bloom or TTL‑based state indexes.
CDC workloads with random updates benefit from hash, state, or HBase indexes for efficient global indexing.
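To make the hash‑index idea concrete, here is a minimal sketch of how a deterministic hash can route each change record to a fixed bucket without scanning the table. It is a toy model and not Hudi's API: the names bucket_for and route_updates, the record layout, and the choice of CRC32 are all assumptions made for illustration.

```python
import zlib


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket id.

    CRC32 is used because it is stable across processes and runs,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def route_updates(updates, num_buckets):
    """Group change records by target bucket, preserving arrival order.

    Each record is a dict with a "key" field (hypothetical layout).
    Because the bucket is computed from the key alone, no lookup
    against existing data is needed to find where an update belongs.
    """
    routed = {}
    for rec in updates:
        routed.setdefault(bucket_for(rec["key"], num_buckets), []).append(rec)
    return routed
```

Because the same key always lands in the same bucket, a random update only touches one bucket's files, which is the property that makes this kind of index attractive for CDC workloads.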
Merge‑On‑Read (MOR) Table Format
Hudi’s MOR format stores updates in small Avro log files while keeping historical data in large columnar base files (Parquet/ORC). Writes are low‑latency to log files; reads merge log and base files, and periodic compaction keeps log size in check, enabling real‑time writes and near‑real‑time queries.
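The write, read, and compaction paths described above can be sketched with a small in‑memory model. This is only an illustration under simplifying assumptions: the class MorFileGroup and its methods are hypothetical, a dict stands in for the columnar base file, and a list stands in for the append‑only log files.

```python
from dataclasses import dataclass, field


@dataclass
class MorFileGroup:
    # key -> row; stands in for the large columnar base file
    base: dict = field(default_factory=dict)
    # append-only (op, key, row) entries; stands in for the log files
    log: list = field(default_factory=list)

    def upsert(self, key, row):
        """Write path: append to the log only, never rewrite the base."""
        self.log.append(("upsert", key, row))

    def delete(self, key):
        """Deletes are also recorded as log entries."""
        self.log.append(("delete", key, None))

    def snapshot_read(self):
        """Read path: replay the log over a copy of the base, in order."""
        merged = dict(self.base)
        for op, key, row in self.log:
            if op == "delete":
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self):
        """Compaction: fold the log into a new base and clear the log."""
        self.base = self.snapshot_read()
        self.log = []
```

The sketch shows the trade‑off in the text: writes stay cheap because they only append, reads pay for the merge, and compaction bounds how much log a read has to replay.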
Incremental Computation
Hudi’s incremental query feature lets users pull data changes by timestamp, supporting Lambda‑style architectures that serve both streaming and batch workloads with minute‑level latency.
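As a rough illustration of pulling changes by timestamp, the sketch below models a table as an ordered list of commits and returns only the rows written after a given checkpoint. The function incremental_read, the integer timestamps, and the commit layout are assumptions for this example and do not reflect Hudi's actual query interface.

```python
def incremental_read(commits, since_ts):
    """Return rows changed after since_ts, plus the new checkpoint.

    commits: list of (commit_ts, {key: row}) tuples in ascending
    commit_ts order (hypothetical layout). Later commits overwrite
    earlier ones for the same key, so the result holds the latest
    version of each changed row.
    """
    changed = {}
    checkpoint = since_ts
    for commit_ts, rows in commits:
        if commit_ts > since_ts:
            changed.update(rows)
            checkpoint = max(checkpoint, commit_ts)
    return changed, checkpoint
```

A downstream job would store the returned checkpoint and pass it back on its next run, so each run processes only the changes it has not seen yet.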
Conclusion
By adopting Hudi as the data‑lake foundation, ByteDance built a customized solution that supports all update‑heavy pipelines, reduces cost, and delivers near‑real‑time analytics. The technology is now exposed through the Volcano Engine DataLeap suite.