
Real‑Time Data Lake Practices at ByteDance and Alibaba: Architecture, Challenges, and Solutions

This article presents detailed case studies of ByteDance and Alibaba implementing real‑time data lake solutions with Hudi and Flink, describing the business drivers, architectural challenges, and the specific technical strategies such as unified metadata layers, optimistic locking, scalable hash indexing, and CDC‑based incremental ETL to achieve low‑latency, high‑throughput data processing.

DataFunTalk

Enterprises increasingly need real‑time data warehouses to monitor metrics such as site‑wide traffic and ad exposure. Traditional databases and architectures cannot meet these demands at massive scale and low latency, which creates the need for distributed real‑time computing frameworks that offer high throughput, low latency, and reliability.

This article shares how two leading companies, ByteDance and Alibaba, apply real‑time data lakes in practice.

01 Real‑Time Data Lake at ByteDance

In recent years, data lakes have become a hot technology, evolving rapidly beyond traditional data warehouses. Hudi, Iceberg, and Delta Lake are often called the "three musketeers" of the data lake field.

ByteDance focuses on six core capabilities: efficient concurrent updates, intelligent query acceleration, unified batch‑stream storage, unified metadata and permission management, extreme query performance, and AI + BI integration.

The company initially built its lake on the open‑source Hudi framework and encountered four main challenges: data that was difficult to manage, weak support for concurrent updates, poor update performance, and log data that was hard to ingest.

How ByteDance addressed these challenges:

Unified metadata layer: Built a metadata abstraction above the lake and warehouse to hide heterogeneity of underlying systems, providing a single source for BI tools, compute engines, data governance, and permission control.
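The idea can be sketched as follows, assuming a far simpler model than ByteDance's actual service: one catalog object fronts several metadata backends and applies a single permission check. Every name here (`UnifiedCatalog`, `InMemoryBackend`, `TableInfo`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Set, Tuple


@dataclass(frozen=True)
class TableInfo:
    name: str
    columns: Tuple[str, ...]
    storage: str  # which underlying system holds the data


class MetadataBackend(Protocol):
    """Anything that can describe a table by name."""

    def load(self, table: str) -> TableInfo: ...


class InMemoryBackend:
    """Stand-in for a real lake or warehouse metadata store."""

    def __init__(self, storage: str, tables: Dict[str, Tuple[str, ...]]):
        self._storage = storage
        self._tables = tables

    def load(self, table: str) -> TableInfo:
        return TableInfo(table, self._tables[table], self._storage)


class UnifiedCatalog:
    """Single entry point that hides which backend owns a table."""

    def __init__(self) -> None:
        self._backends: Dict[str, MetadataBackend] = {}
        self._grants: Set[Tuple[str, str]] = set()

    def register(self, namespace: str, backend: MetadataBackend) -> None:
        self._backends[namespace] = backend

    def grant(self, user: str, qualified_name: str) -> None:
        self._grants.add((user, qualified_name))

    def get_table(self, user: str, qualified_name: str) -> TableInfo:
        # One permission check, regardless of the underlying system.
        if (user, qualified_name) not in self._grants:
            raise PermissionError(f"{user} may not read {qualified_name}")
        namespace, table = qualified_name.split(".", 1)
        return self._backends[namespace].load(table)
```

Callers such as BI tools or compute engines would ask only for `namespace.table`; which system answers is decided inside the catalog.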

Optimistic‑lock‑based concurrent updates: Implemented optimistic locking on the timeline of Hudi's Metastore Server, which enables row‑level and column‑level concurrent write strategies and flexible conflict‑check and merge policies.
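As a rough illustration of optimistic concurrency on a commit timeline, the sketch below lets each writer remember the instant it started from and validates at commit time against every later commit. It is not Hudi's or ByteDance's implementation. The `Timeline` class, the `(record key, column)` cell model, and the `level` parameter are all assumptions made for this example.

```python
import threading
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple

Cell = Tuple[str, str]  # (record key, column name)


class ConflictError(Exception):
    """Raised when a concurrent commit touched overlapping data."""


@dataclass(frozen=True)
class Commit:
    instant: int
    cells: FrozenSet[Cell]


def overlaps(a: FrozenSet[Cell], b: FrozenSet[Cell], level: str) -> bool:
    if level == "column":
        # Conflict only if the same column of the same record was written.
        return bool(a & b)
    if level == "row":
        # Conflict if the same record was written, whatever the column.
        return bool({k for k, _ in a} & {k for k, _ in b})
    raise ValueError(f"unknown level: {level}")


class Timeline:
    def __init__(self) -> None:
        self._commits: List[Commit] = []
        self._lock = threading.Lock()  # makes validate-then-append atomic

    def latest_instant(self) -> int:
        with self._lock:
            return self._commits[-1].instant if self._commits else 0

    def try_commit(self, base_instant: int, cells: Iterable[Cell],
                   level: str = "row") -> int:
        """Validate against commits made after base_instant, then append."""
        new_cells = frozenset(cells)
        with self._lock:
            for c in self._commits:
                if c.instant > base_instant and overlaps(c.cells, new_cells, level):
                    raise ConflictError(f"conflict with instant {c.instant}")
            instant = len(self._commits) + 1
            self._commits.append(Commit(instant, new_cells))
            return instant
```

Two writers that start from the same instant and update different columns of the same record both succeed under `level="column"`, while the second one is rejected under `level="row"`. A rejected writer would typically retry or apply a merge policy.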

Scalable hash index: Replaced the Bloom filter index with an extensible hash index. Records are located directly by hashing their key, so lookups avoid reading historical data, and buckets split or merge automatically as data volume changes.
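To make the bucket‑splitting idea concrete, here is a small sketch of textbook extendible hashing, which may differ from the structure ByteDance actually built. The class names, the use of `zlib.crc32` as the hash function, and the tiny bucket capacity are assumptions for illustration. Bucket merging is omitted to keep the example short.

```python
import zlib
from typing import Dict, List, Optional


class Bucket:
    def __init__(self, local_depth: int) -> None:
        self.local_depth = local_depth
        self.items: Dict[str, str] = {}  # record key -> file location


class ExtendibleHashIndex:
    def __init__(self, bucket_capacity: int = 2) -> None:
        self.capacity = bucket_capacity
        self.global_depth = 1
        self.directory: List[Bucket] = [Bucket(1), Bucket(1)]

    @staticmethod
    def _hash(key: str) -> int:
        # Stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8"))

    def _bucket(self, key: str) -> Bucket:
        slot = self._hash(key) & ((1 << self.global_depth) - 1)
        return self.directory[slot]

    def get(self, key: str) -> Optional[str]:
        # One hash and one bucket probe; no scan over older data.
        return self._bucket(key).items.get(key)

    def put(self, key: str, location: str) -> None:
        # Note: more than `capacity` keys with identical hashes would
        # split forever; a real index needs overflow handling.
        while True:
            bucket = self._bucket(key)
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = location
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory; new slots alias the existing buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if self._hash(k) & high_bit else bucket
            target.items[k] = v
```

Only the overflowing bucket is rewritten on a split, which is why this family of structures can grow without rehashing the whole index.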

Index‑free ingestion: Bypassed Hudi's index mechanism, using shuffle hash join and broadcast join to achieve real‑time data lake ingestion without primary‑key‑based upserts.

02 Alibaba Incremental ETL Architecture Based on Flink + Hudi

Over the past six months, Alibaba's Computing Platform Division SQL Engine team has been developing the Apache Flink SQL module, focusing on integrating Flink with Hudi.

Alibaba chose Hudi over Iceberg and Delta Lake because of Hudi's strong transaction management and upsert capabilities, which enable snapshot‑level transactions and upserts over massive datasets.

Near‑real‑time database ingestion: The Flink CDC connector, which embeds Debezium to read the MySQL binlog, streams changes directly into Hudi. This removes the need for an intermediate Kafka layer and achieves minute‑level ingestion.
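The core of this ingestion path is turning a stream of change events into upserts and deletes on a keyed table. The sketch below shows that logic in plain Python, using the Debezium change‑event fields `op`, `before`, and `after`. The `apply_change` function, the in‑memory dictionary standing in for a Hudi table, and the `id` primary‑key column are assumptions for illustration, not the Flink or Hudi API.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any],
                 key: str = "id") -> None:
    """Apply one Debezium-style change event to a table keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: upsert the new row image
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: remove by the key of the old row image
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


def ingest(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    table: Dict[Any, Row] = {}
    for event in events:
        apply_change(table, event, key)
    return table


binlog = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "b"}},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "a2"}},
    {"op": "d", "before": {"id": 2, "name": "b"}, "after": None},
]
result = ingest(binlog)
```

After the four events, `result` holds a single row for `id` 1 with the updated name. In the real pipeline, the upsert and delete would be performed by the Hudi writer rather than a dictionary.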

Minute‑level incremental data warehouse: Because Hudi supports upserts, the ingested tables stay current, and downstream Flink jobs can continuously consume their change data. Each job computes incrementally on top of its existing state instead of recomputing from scratch, which makes it possible to build a near‑real‑time ODS layer.
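Incremental computation on top of existing state can be illustrated with a running aggregate that is adjusted by each change event instead of being recomputed. This is a simplified sketch, not Flink's implementation: the `IncrementalSum` class and the `ad` and `clicks` column names are hypothetical, and events are assumed to carry `before` and `after` row images.

```python
from collections import defaultdict
from typing import Any, Dict, Optional

Row = Dict[str, Any]


class IncrementalSum:
    """Keeps SUM(value_col) GROUP BY group_col up to date from change events."""

    def __init__(self, group_col: str, value_col: str) -> None:
        self.group_col = group_col
        self.value_col = value_col
        self.totals: Dict[Any, int] = defaultdict(int)

    def apply(self, before: Optional[Row], after: Optional[Row]) -> None:
        if before is not None:
            # Retract the contribution of the old row image.
            self.totals[before[self.group_col]] -= before[self.value_col]
        if after is not None:
            # Add the contribution of the new row image.
            self.totals[after[self.group_col]] += after[self.value_col]


agg = IncrementalSum(group_col="ad", value_col="clicks")
agg.apply(None, {"id": 1, "ad": "a", "clicks": 10})              # insert
agg.apply(None, {"id": 2, "ad": "b", "clicks": 5})               # insert
agg.apply({"id": 1, "ad": "a", "clicks": 10},
          {"id": 1, "ad": "a", "clicks": 12})                    # update
agg.apply({"id": 2, "ad": "b", "clicks": 5}, None)               # delete
```

Each event costs a constant amount of work, so the aggregate stays current at the pace of the change stream.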

03 More Big‑Data Technology Case Studies

The full "Big Data Technology Application Cases" manual (pages 55, 79, etc.) contains additional implementations from companies such as Xiaomi, Tencent, NetEase, JD.com, and Bilibili. The original article provides QR codes for downloading the complete cases for free.

