How ByteDance Leverages Hudi for a Real‑Time Data Lake Platform
This article introduces ByteDance’s real‑time data lake platform built on Apache Hudi. It covers Hudi fundamentals, table types, indexing, practical use cases, platform optimizations, and the future roadmap, and shows how the system enables low‑latency, scalable analytics across batch and streaming workloads.
Introduction to Hudi and ByteDance Real‑Time Data Lake Platform
Hudi is a streaming data‑lake platform that provides ACID guarantees, supports real‑time incremental consumption and offline batch updates, and can be queried via Spark, Flink, Presto and other engines.
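As a minimal illustration of that write/query path, the sketch below upserts a DataFrame into a Hudi table with Spark and reads it back. It assumes Spark 3.x with the hudi-spark bundle on the classpath; the table name, path, and columns are illustrative placeholders, not details from the presentation.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch: upsert a DataFrame into a Hudi table and read it back.
// Table name, path, and columns (id, ts, dt) are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("hudi-quickstart")
  // Hudi requires Kryo serialization in Spark.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val basePath = "hdfs:///warehouse/events_hudi"

val df = spark.range(0, 10).selectExpr(
  "id",                              // primary (record) key
  "current_timestamp() as ts",       // pre-combine field: newest ts wins
  "cast(id % 4 as string) as dt"     // partition field
)

df.write.format("hudi")
  .option("hoodie.table.name", "events_hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)

// Snapshot query: Spark, Presto, and Flink can all read the same table.
spark.read.format("hudi").load(basePath).show()
```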
Hudi Table Structure
Each Hudi table consists of a timeline of commits and a set of file groups. A commit records the files modified by a write operation. Every record must carry a unique primary key, and within a partition a given key always resides in a single file group. Files are divided into base files and log files: log files capture updates and are periodically compacted into new base files.
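To make this concrete, a simplified, abbreviated on-disk layout for one partition might look like the following; real Hudi file names also embed file-group IDs, write tokens, and instant times:

```
events_hudi/
├── .hoodie/                          # table metadata: the timeline
│   ├── 20240101093000.commit         # completed write (instant time)
│   ├── 20240101094500.deltacommit    # log-file write on a MOR table
│   └── hoodie.properties
└── dt=2024-01-01/                    # partition
    ├── <file-group-1>_20240101093000.parquet   # base file
    ├── .<file-group-1>_20240101094500.log.1    # log file with updates
    └── <file-group-2>_20240101093000.parquet
```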
Table Types: COW and MOR
COW (Copy‑On‑Write) tables: suitable for offline batch updates; the old base file is read, merged with new data, and a new base file is written.
MOR (Merge‑On‑Read) tables: suitable for high‑frequency real‑time updates; updates are appended to log files and merged at read time, with periodic compaction to base files.
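The table type is a single per-table option in the Hudi Spark datasource. Continuing the earlier sketch (option keys from open-source Hudi; table name and path are illustrative):

```scala
// Continuing the earlier sketch: the table type is chosen per table at
// write time. COW rewrites whole base files on update; MOR appends updates
// to log files and merges them on read, with periodic compaction.
df.write.format("hudi")
  .option("hoodie.table.name", "events_cow")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") // or "MERGE_ON_READ"
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/events_cow")

// For MOR tables, compaction can be tuned, e.g. compact inline after
// every N delta commits (keys from Hudi's compaction configs):
//   .option("hoodie.compact.inline", "true")
//   .option("hoodie.compact.inline.max.delta.commits", "5")
```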
Indexes
Hudi supports Bloom Filter, HBase, and Bucket indexes. The Bucket index (at the time of this presentation, not yet merged into the open‑source main branch) hashes the primary key to a bucket that maps to a file group, enabling fast data location without an index lookup.
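Although the bucket index had not yet been merged upstream when this was presented, it later landed in open-source Hudi (from release 0.11 onward), where it can be enabled roughly as follows. A sketch; the bucket count is illustrative and must be sized up front, since it fixes the number of file groups per partition:

```scala
// Bucket index sketch: hash(record key) % num_buckets picks the file group,
// so upserts locate their target file without a Bloom-filter or HBase lookup.
// These keys exist in open-source Hudi >= 0.11; the bucket count is illustrative.
df.write.format("hudi")
  .option("hoodie.table.name", "events_bucketed")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.index.type", "BUCKET")
  .option("hoodie.bucket.index.num.buckets", "64")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/events_bucketed")
```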
ByteDance Real‑Time Data Lake Use Cases
Typical pipeline: MySQL binlog → Kafka → Spark/Flink streaming → Hudi lake for downstream consumption.
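A hedged sketch of the Kafka-to-Hudi leg of this pipeline using Spark Structured Streaming follows; the topic name, schema, and paths are assumptions, and a production pipeline would also parse the actual binlog/CDC format (e.g. Debezium or Canal JSON):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

// Sketch of the binlog -> Kafka -> streaming -> Hudi leg, continuing the
// earlier SparkSession. Topic, schema, and paths are illustrative.
val binlogSchema = new StructType()
  .add("id", LongType)
  .add("ts", TimestampType)
  .add("dt", StringType)
  .add("payload", StringType)

val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "mysql.orders.binlog")
  .load()
  .select(from_json(col("value").cast("string"), binlogSchema).as("r"))
  .select("r.*")

changes.writeStream
  .format("hudi")
  .option("hoodie.table.name", "orders_rt")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // high-frequency updates
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("checkpointLocation", "hdfs:///checkpoints/orders_rt")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start("hdfs:///warehouse/orders_rt")
```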
Real‑time scenario: stream updates directly into the lake for immediate use.
Batch scenario: binlog is dumped to HDFS, then ingested hourly or daily.
Recommendation scenario: CDC from a BigTable‑like store to offline storage for OLAP analysis.
Data‑warehouse scenario: back‑fill of PB‑scale tables with partial row/column updates, reducing compute cost and latency.
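Downstream consumers in scenarios like these can read only the records committed since their last run via Hudi's incremental query, which is what keeps derived tables and back-fills over large tables cheap. A sketch; the begin instant is illustrative, and consumers would normally checkpoint the last instant they processed:

```scala
// Incremental pull sketch: read only records committed after a given
// instant instead of rescanning the whole (potentially PB-scale) table.
val increments = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20240101093000")
  .load("hdfs:///warehouse/orders_rt")

increments.createOrReplaceTempView("orders_increments")
spark.sql("select count(*) from orders_increments").show()
```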
Optimizations and New Features
ByteDance built a Hudi‑compatible Metastore that offers commit‑based metadata management, optimistic‑lock concurrency, snapshot persistence, and partition pruning. The architecture consists of engine, metadata, and storage layers, with a catalog service that abstracts multiple metastores.
Additional enhancements include:
A unified lake‑house metadata service.
Concurrent writes with row‑level and column‑level conflict‑checking strategies (see the concurrency sketch after this list).
A bucket index with pruning and join optimizations.
An append mode for log‑only workloads, implemented via a NonIndex approach that avoids primary‑key joins.
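For reference, open-source Hudi exposes table-level optimistic concurrency control for multi-writer setups; ByteDance's row- and column-level conflict checking is finer-grained than this baseline. A sketch of the open-source configuration, where the ZooKeeper lock provider, its endpoint, and the table names are assumptions:

```scala
// Baseline multi-writer setup in open-source Hudi: optimistic concurrency
// control with an external lock provider. ByteDance's metastore adds
// finer-grained row/column-level conflict checking on top of this model.
// The ZooKeeper endpoint and paths are illustrative.
df.write.format("hudi")
  .option("hoodie.table.name", "orders_multi_writer")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk1")
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.lock_key", "orders_multi_writer")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/orders_multi_writer")
```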
Future Roadmap
Planned improvements include partial‑column updates for binlog consumption, extensible hash indexes, sub‑second data visibility, and merge‑tree‑based file distribution, all of which will be contributed back to the open‑source community.
The described capabilities are already delivered through Volcano Engine’s Lakehouse Analytics Service (LAS), a serverless analytics platform compatible with Spark, Presto, and Flink.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.