Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, describes its Hudi-based architecture, outlines five major challenges encountered during implementation, and presents the solutions and future development plans. It also reports the performance improvements observed in practical use cases across several business scenarios.

DataFunTalk

May 15, 2023

Introduction

Kuaishou built a data lake to achieve core data-construction goals such as standardization, sharing, simplicity, high performance, and reliability. The presentation outlines the lake architecture, the Hudi-based implementation, practical cases, and future plans.

Data Lake Architecture

The traditional Lambda architecture suffers from high latency on the offline path, processing logic that differs between the real-time and offline paths, and data silos. Kuaishou therefore adopted a centralized lake offering massive storage, extensible data types, schema evolution, multi-source support, strong data management, efficient processing, and high-performance analytics. After evaluating open-source options, the team chose Hudi for its strong update capability, stream-batch read/write, pluggable payloads, MOR (Merge-On-Read) table type, and broad engine compatibility.
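To make the MOR table type and pluggable payloads more concrete, here is a minimal sketch in plain Python. It does not use Hudi's API. It assumes a record layout with an `id` key and a `ts` ordering field, and the function names `latest_wins`, `partial_update`, and `merge_on_read` are hypothetical. The idea it illustrates: updates are appended to a log, and a reader merges the log over the base data using whichever merge rule is plugged in.

```python
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, object]
# A "payload" is modeled here as a function that combines the stored record
# (or None if the key is new) with an incoming log record.
MergeFn = Callable[[Optional[Record], Record], Record]


def latest_wins(old: Optional[Record], new: Record) -> Record:
    """Keep whichever record has the larger ordering field ("ts", assumed)."""
    if old is None or new["ts"] >= old["ts"]:
        return new
    return old


def partial_update(old: Optional[Record], new: Record) -> Record:
    """Overwrite only the fields the incoming record actually sets."""
    if old is None:
        return new
    merged = dict(old)
    merged.update({k: v for k, v in new.items() if v is not None})
    return merged


def merge_on_read(
    base: Iterable[Record],
    log: Iterable[Record],
    key: str = "id",
    merge: MergeFn = latest_wins,
) -> Dict[object, Record]:
    """Build the read-time view: base records with log records merged on top."""
    view: Dict[object, Record] = {r[key]: r for r in base}
    for r in log:
        view[r[key]] = merge(view.get(r[key]), r)
    return view


base_file = [{"id": 1, "ts": 10, "name": "a"}, {"id": 2, "ts": 10, "name": "b"}]
log_file = [{"id": 1, "ts": 12, "name": "a2"}, {"id": 2, "ts": 5, "name": "stale"}]
snapshot = merge_on_read(base_file, log_file)
```

Swapping `merge=partial_update` changes the update semantics without touching the read path, which is the property that makes pluggable payloads attractive. In a real MOR table, compaction would periodically fold the log into a new base file so that readers merge less data.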

Hudi-Based Implementation

Using Flink as the stream-batch engine, Kuaishou optimized the write path so that records stream directly to disk, introduced dynamic load balancing for partition distribution, and implemented automatic partition publishing triggered by aligned snapshots. Hudi's capabilities (various write modes, pluggable logic, multiple table types, metadata tracking, and Hadoop-compatible input formats) enable unified, efficient data pipelines.
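The source does not say how the dynamic load balancing works. As one possible illustration only, the sketch below spreads partitions across writer subtasks with a greedy heuristic: the heaviest partition goes first, and each partition goes to the currently least-loaded writer. The function name `assign_partitions` and the per-partition load figures are assumptions, not Kuaishou's implementation.

```python
import heapq
from typing import Dict, List


def assign_partitions(partition_load: Dict[str, int], num_writers: int) -> List[List[str]]:
    """Greedily assign partitions to writers, heaviest partition first.

    partition_load maps a partition name to its observed volume
    (for example, records per minute; the unit is an assumption).
    """
    # Min-heap of (total load assigned so far, writer index).
    heap = [(0, w) for w in range(num_writers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_writers)]

    # Sort by descending load; break ties by name so the result is deterministic.
    for part, load in sorted(partition_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, writer = heapq.heappop(heap)
        assignment[writer].append(part)
        heapq.heappush(heap, (total + load, writer))
    return assignment


plan = assign_partitions({"p_hot": 90, "p1": 30, "p2": 30, "p3": 30}, num_writers=2)
```

With these example loads, the hot partition gets a writer to itself and the three smaller partitions share the other, so both writers carry the same total volume. A "dynamic" variant would recompute the plan as the observed loads change.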

Key Challenges and Solutions

1. Ingestion bottleneck – Reworked the Flink write logic to write records as a stream, balanced the distribution of data across partitions, and added automatic partition publishing, achieving more than 10× throughput while keeping file sizes under control.

2. Snapshot query by data time – Extended Hudi Time‑Travel with version metadata derived from data timestamps, allowing SQL to query snapshots using the correct version.

3. Flink‑On‑Hudi update bottleneck – Separated operations, enabled parallel execution, and used state and bucket indexes with TTL to manage concurrent writes and compaction.

4. Multi‑task merge limitation – Implemented logical bucketing on top of physical buckets, allowing finer‑grained parallel merges and automatic schema propagation.

5. Production reliability – Simplified configuration, added pre‑commit consistency checks, and enriched metrics for stability and monitoring.
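For challenge 2, the idea of resolving a snapshot by data time rather than by commit time can be sketched as follows. This is an illustration under stated assumptions, not Hudi's Time-Travel API. It assumes each commit records a `data_watermark` (the latest data time it fully covers), that watermarks never decrease along the timeline, and that timestamps are fixed-width strings that sort lexicographically. `Commit` and `resolve_version` are hypothetical names.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    instant: str         # commit (processing) time, e.g. "20230515121230"
    data_watermark: str  # latest data time fully covered by this commit (assumed metadata)


def resolve_version(timeline: List[Commit], data_time: str) -> Optional[str]:
    """Return the earliest commit whose watermark has reached data_time.

    Returns None when no commit covers the requested data time yet.
    """
    watermarks = [c.data_watermark for c in timeline]  # assumed non-decreasing
    i = bisect_left(watermarks, data_time)
    return timeline[i].instant if i < len(timeline) else None


timeline = [
    Commit("20230515120730", "20230515115900"),
    Commit("20230515121230", "20230515120000"),
    Commit("20230515121730", "20230515120500"),
]
version = resolve_version(timeline, "20230515120000")
```

A query for "data as of 12:00:00" would then read the table at the returned instant. That commit may also contain records with later data times, so a real query would likely still filter on the data-time column.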

Practical Cases

After building the lake, Kuaishou achieved real-time updates of the DWD layer (latency reduced by more than 50%), minute-level activity snapshot queries (15% resource saving), streamlined retention-table updates (50% faster), and efficient wide-table generation without HBase, which cut processing time by up to 5 hours.

Future Development

Planned improvements include richer metadata services, unified offline-online query support, seamless migration for existing pipelines, and full stream-batch integration to provide high-efficiency unified queries.

Q&A Highlights

Locking and OCC mechanisms are used to avoid write conflicts during concurrent base‑file updates.
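The combination of a lock with optimistic concurrency control (OCC) can be illustrated with a small in-memory sketch. It is not Hudi's implementation. The class `Timeline`, its methods, and the file-group names are hypothetical. Writers do their work without holding a lock; only the final commit step takes the lock and checks whether any commit completed since the writer started touched the same file groups.

```python
import threading
from typing import List, Set, Tuple


class Timeline:
    """In-memory commit log with an optimistic conflict check at commit time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._commits: List[Tuple[int, Set[str]]] = []  # (version, touched file groups)
        self._next_version = 1

    def begin(self) -> int:
        """Return the latest committed version visible when the writer starts."""
        with self._lock:
            return self._next_version - 1

    def try_commit(self, start_version: int, touched: Set[str]) -> bool:
        """Commit unless a newer commit overlaps the same file groups."""
        with self._lock:  # short critical section: validate, then publish
            for version, files in self._commits:
                if version > start_version and files & touched:
                    return False  # conflict: caller should retry from a fresh begin()
            self._commits.append((self._next_version, set(touched)))
            self._next_version += 1
            return True


t = Timeline()
a, b, c = t.begin(), t.begin(), t.begin()
ok_a = t.try_commit(a, {"fg-1"})           # first writer wins
ok_b = t.try_commit(b, {"fg-1", "fg-2"})   # overlaps fg-1, so it is rejected
ok_c = t.try_commit(c, {"fg-3"})           # disjoint files, so it succeeds
```

The rejected writer would start again from a fresh `begin()` and redo its work against the new state. Writers that touch disjoint file groups never block each other beyond the brief commit check.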

Compaction resource mismatches are addressed by separating write stages and scaling memory/CPU per workload, achieving sub‑10‑minute compaction for 1‑2 GB files.
