
Real-Time Data Lake Practice at ByteDance: Architecture, Challenges, and Solutions

ByteDance’s data platform team explains its real‑time data lake implementation, covering the evolving definition of the data lake, six core capabilities, challenges in data management, concurrent updates, update performance, and log ingestion, and detailed case studies of multi‑stage deployment, indexing, metadata services, and the future roadmap.


The talk introduces ByteDance’s real‑time data lake, starting with a historical overview of the data lake concept since Hadoop World and how its interpretation shifted from simple centralized storage to a full‑stack solution that includes transaction layers, indexing, and support for both batch and streaming workloads.

Six essential capabilities are identified: efficient concurrent updates, intelligent query acceleration, unified batch‑stream storage, unified metadata and permission management, extreme query performance, and AI + BI integration.

Implementation challenges are described in four categories: data management difficulty, weak concurrent‑update support, poor update performance, and log ingestion difficulty. For each, the team details root causes and mitigation strategies, such as building a unified metadata layer compatible with Hive, introducing the Hudi Metastore Server for centralized metadata, and adopting optimistic locking for better concurrent writes.
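The optimistic‑locking idea can be sketched as follows: each writer records the table version it started from, and a commit succeeds only if no later commit touched the same files. This is a minimal illustration in Python; the `TableState` class and its conflict rule are hypothetical stand‑ins, not the Hudi Metastore Server API.

```python
import threading

class TableState:
    """Toy table-commit log with optimistic concurrency control."""

    def __init__(self):
        self._lock = threading.Lock()  # guards only the commit log itself
        self.version = 0
        self.commits = []  # list of (version, frozenset of touched files)

    def try_commit(self, base_version: int, touched_files: set) -> bool:
        """Commit iff no commit after base_version touched the same files."""
        with self._lock:
            for version, files in self.commits:
                if version > base_version and files & touched_files:
                    return False  # conflict: caller must rebase and retry
            self.version += 1
            self.commits.append((self.version, frozenset(touched_files)))
            return True

table = TableState()
assert table.try_commit(0, {"file-a"})        # first writer commits
assert not table.try_commit(0, {"file-a"})    # overlapping writer conflicts
assert table.try_commit(0, {"file-b"})        # disjoint writer commits fine
```

Because conflicts are detected at commit time rather than by holding long locks, concurrent writers to disjoint file sets never block each other, which is the property the talk attributes to its concurrency improvements.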

To address update‑performance bottlenecks, the team evolved from Bloom‑filter indexing to a scalable Bucket Index, which hashes primary keys to file groups, enabling fast data location and supporting both row‑level and column‑level concurrency. They also discuss handling bucket split/merge and reducing small‑file proliferation.
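The core of a Bucket Index is a deterministic hash from primary key to a fixed number of buckets, where each bucket corresponds to one file group. A minimal sketch, assuming a static bucket count per table (the names and bucket count here are illustrative, not Hudi’s actual configuration):

```python
import hashlib

NUM_BUCKETS = 4  # fixed per table/partition in a static bucket index

def bucket_of(primary_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a primary key to a bucket id."""
    # A stable hash (not Python's per-process randomized hash()) keeps
    # routing consistent across writer restarts.
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def file_group_for(primary_key: str) -> str:
    """The bucket id doubles as the file-group identifier."""
    return f"fg-{bucket_of(primary_key):04d}"

# An update for a given key always routes to the same file group, so the
# writer can locate the target file directly instead of probing per-file
# Bloom filters across the whole table.
assert file_group_for("user_42") == file_group_for("user_42")
```

This is why lookup cost stays constant as the table grows, and it also explains the bucket split/merge problem the talk mentions: changing `NUM_BUCKETS` changes the key‑to‑bucket mapping, so rescaling must rehash or split existing file groups.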

Four practical deployment stages are presented: (1) near‑real‑time data visibility for verification, (2) real‑time ingestion and interactive analysis with compaction services, (3) real‑time multi‑dimensional aggregation for dashboards and data services, and (4) real‑time data association using column‑level writes and asynchronous compaction.
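The asynchronous‑compaction pattern in stages (2) and (4) can be illustrated with a simple producer/consumer split: the ingestion path only appends delta data and enqueues a compaction plan, while a separate worker merges deltas into base files later, keeping write latency low. Everything here is a hypothetical sketch, not a real Hudi or Flink API:

```python
import queue
import threading

compaction_queue: "queue.Queue" = queue.Queue()
compacted = []  # file groups whose deltas have been merged

def ingest(file_group: str) -> None:
    """Fast path: append-only write, then schedule compaction asynchronously."""
    # (The actual delta write is elided; only the scheduling is shown.)
    compaction_queue.put(file_group)

def compaction_worker() -> None:
    """Slow path: drain the queue and merge deltas into base files."""
    while True:
        file_group = compaction_queue.get()
        if file_group is None:  # sentinel: shut down
            break
        compacted.append(file_group)  # stand-in for the actual merge

worker = threading.Thread(target=compaction_worker, daemon=True)
worker.start()
ingest("fg-0001")
ingest("fg-0002")
compaction_queue.put(None)
worker.join()
assert compacted == ["fg-0001", "fg-0002"]
```

Decoupling the two paths is what lets the writer stay real‑time while compaction runs as a background service at its own pace.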

The roadmap outlines three dimensions: functional enhancements (metadata‑driven query push‑down, data acceleration services, smarter indexing, and automated table optimization), open‑source contributions (enhancements to Apache Hudi such as Bucket Index and Metastore Server, plus connectors for Trino/Presto), and commercial products (LAS lake‑warehouse service and EMR stateless cloud‑native warehouse) that bring the internal practices to external customers.

Tags: Big Data, Indexing, Streaming, Metadata Management, Hudi, Real‑Time Data Lake
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
