
Kuaishou’s Data Lake Architecture with Apache Hudi: Design, Challenges, Solutions, and Future Plans

This article presents Kuaishou’s journey in building a data lake using Apache Hudi, detailing the lake architecture, key challenges such as ingestion bottlenecks and update inefficiencies, the solutions implemented, practical case studies, and the roadmap for future enhancements.

DataFunSummit

Jun 3, 2023


Kuaishou introduced its data lake initiative, explaining why a data lake was needed, the problems encountered during construction, the achievements obtained, and the outlook for future development.

Main contents include: 1) Data lake architecture; 2) Building Kuaishou’s data lake based on Apache Hudi; 3) Five critical issues and their solutions; 4) Practical case studies; 5) Development roadmap.

1. Data Lake Architecture: From Offline Warehouse to Lake‑Warehouse Integration

The core goals of building a data platform are standardization, sharing, simplicity, high performance, and reliability. The traditional Lambda architecture suffers from three serious issues: poor timeliness of the offline pipeline, processing logic that differs between the stream and batch paths, and data silos.

To address these, Kuaishou adopted a centralized data lake solution that offers massive storage, extensible data types, schema evolution, support for diverse sources, strong data management, efficient processing, and high‑performance analytics. After evaluating open‑source lake solutions, Hudi was selected for its strong update capability, support for stream‑batch reads/writes, pluggable payloads, MOR table type, and compatibility with multiple query engines.

Kuaishou’s Hudi‑based data lake architecture provides benefits such as CRUD operations, unified stream‑batch processing, and massive data management, reducing cost and improving timeliness.

2. Building the Data Lake with Hudi

Key Hudi capabilities used include multiple write modes, pluggable update logic, multiple table types (e.g., MOR for write‑heavy scenarios), metadata statistics for versioning, and compatibility with Hadoop input formats, so that query engines such as Spark and Trino can read the same tables. Together these enable faster, unified data pipelines.
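As a rough illustration of how these capabilities surface in configuration, the sketch below collects Hudi write options for a MOR table written with upserts. The option keys are standard Hudi configuration names. The table name, column names, and bucket count are hypothetical and do not reflect Kuaishou's actual settings.

```python
# Illustrative Hudi write options. All values are assumptions for this sketch,
# not settings taken from Kuaishou's deployment.
hudi_options = {
    "hoodie.table.name": "example_events",                     # hypothetical table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # MOR suits write-heavy workloads
    "hoodie.datasource.write.operation": "upsert",             # one of several write modes
    "hoodie.datasource.write.recordkey.field": "event_id",     # hypothetical record key column
    "hoodie.datasource.write.precombine.field": "event_time",  # hypothetical ordering column
    "hoodie.datasource.write.partitionpath.field": "dt",       # hypothetical partition column
    "hoodie.index.type": "BUCKET",                             # bucket index instead of the default
    "hoodie.bucket.index.num.buckets": "64",                   # hypothetical bucket count
}


def is_mor_upsert(options):
    """Return True if the options describe an upsert into a merge-on-read table."""
    return (
        options.get("hoodie.datasource.write.table.type") == "MERGE_ON_READ"
        and options.get("hoodie.datasource.write.operation") == "upsert"
    )
```

With PySpark and a Hudi bundle available, a dictionary like this is typically passed to the writer as `df.write.format("hudi").options(**hudi_options).mode("append").save(path)`, where `df` and `path` are supplied by the job.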

Specific optimizations:

Optimized write path: switched from buffered to streaming writes, allowing parallel partition writes and higher CPU utilization.

Dynamic data distribution: a module balances data flow across write nodes, preventing hotspots.

Automatic partition publishing: timestamps drive partition creation and coordinated snapshot triggering.
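The dynamic data distribution idea above can be sketched as a greedy least-loaded assignment. This is a minimal sketch under stated assumptions: the function name `assign_batches`, the notion of a "batch size", and the greedy policy are all hypothetical, since the article does not describe how Kuaishou's balancing module actually decides.

```python
import heapq


def assign_batches(batch_sizes, num_writers):
    """Route each batch to the writer that currently has the fewest pending records.

    Returns (assignment, loads): the writer chosen for each batch, and the
    total number of records routed to each writer.
    """
    # Min-heap of (pending_records, writer_id); the least-loaded writer is on top.
    heap = [(0, writer) for writer in range(num_writers)]
    heapq.heapify(heap)

    assignment = []
    for size in batch_sizes:
        load, writer = heapq.heappop(heap)
        assignment.append(writer)
        heapq.heappush(heap, (load + size, writer))

    loads = [0] * num_writers
    for load, writer in heap:
        loads[writer] = load
    return assignment, loads


# One large batch followed by several small ones: the small batches are
# steered away from the writer that received the large batch.
assignment, loads = assign_batches([5, 1, 1, 1, 1, 1], num_writers=2)
```

A static hash of the record key would ignore current load, which is one way hotspots arise when a few keys or partitions dominate the traffic.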

3. Solving Five Critical Issues

1) Ingestion bottleneck – Flink‑on‑Hudi suffered from uneven bucket distribution and back‑pressure. Kuaishou introduced streaming writes, dynamic load balancing, and coordinated snapshot commits, achieving >10× throughput improvement.

2) Snapshot query using data time – Added time‑version metadata to enable Time‑Travel queries based on data timestamps rather than commit timestamps.

3) Flink‑on‑Hudi update bottleneck – Separated operations, allowed parallel execution, and used state and bucket indexes to reduce resource contention.

4) Insufficient multi‑task merge capability – Implemented logical bucketing and schema‑aware merging, enabling concurrent merge plans and automatic schema evolution.

5) Production reliability – Simplified configuration, introduced pre‑commit validation, and enhanced metrics for stability and consistency.
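To make issue 2 more concrete, here is a minimal sketch of resolving a snapshot by data time rather than by commit time. The class `DataTimeIndex`, its methods, and the timestamp formats are hypothetical. The article states only that time‑version metadata was added, not how it is stored or looked up.

```python
from bisect import bisect_right


class DataTimeIndex:
    """Toy mapping from data-time versions to the commit that completed them."""

    def __init__(self):
        self._data_times = []  # data-time versions, kept in increasing order
        self._commits = []     # commit instant recorded for each version

    def publish(self, data_time, commit_instant):
        """Record that all data up to `data_time` is contained in `commit_instant`."""
        if self._data_times and data_time <= self._data_times[-1]:
            raise ValueError("data-time versions must be published in increasing order")
        self._data_times.append(data_time)
        self._commits.append(commit_instant)

    def commit_as_of(self, data_time):
        """Return the commit for the latest version at or before `data_time`, or None."""
        position = bisect_right(self._data_times, data_time)
        if position == 0:
            return None
        return self._commits[position - 1]


# Fixed-width timestamp strings compare correctly as plain strings.
index = DataTimeIndex()
index.publish("2023060100", "20230601011532")  # hour 00 data completed by this commit
index.publish("2023060101", "20230601020847")  # hour 01 data completed by this commit
snapshot = index.commit_as_of("2023060100")
```

The point of the indirection is that a commit timestamp alone says when data was written, not which event-time range it covers, so a query for "the data as of hour 00" needs the extra mapping.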

4. Practical Case Studies

Kuaishou applied the new lake to several business scenarios, achieving notable gains:

Real‑time DWD layer generation, improving timeliness by over 50% with similar compute cost.

Activity snapshot queries reduced latency from hours to minutes, saving ~15% compute resources.

Retention data processing shifted from multi‑day batch merges to incremental updates, cutting processing time by ~50%.

Wide‑table feature production moved from HBase to Hudi, eliminating extra maintenance and accelerating processing.
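The shift from batch merges to incremental updates in the retention case can be illustrated with a small keyed upsert. This is a sketch under assumptions: the column names `user_id` and `event_time` and the "latest ordering value wins" rule are hypothetical stand-ins, not the logic of Kuaishou's pipeline.

```python
def upsert(table, changes, key="user_id", order_by="event_time"):
    """Apply changed rows to `table` in place.

    `table` maps key -> row (a dict). For each key, the row with the larger
    `order_by` value wins, so late-arriving stale rows are ignored.
    """
    for row in changes:
        current = table.get(row[key])
        if current is None or row[order_by] >= current[order_by]:
            table[row[key]] = row
    return table


retention = {1: {"user_id": 1, "event_time": 1, "retained": False}}
changes = [
    {"user_id": 1, "event_time": 2, "retained": True},   # newer row replaces the old one
    {"user_id": 2, "event_time": 2, "retained": False},  # new key is inserted
    {"user_id": 1, "event_time": 0, "retained": False},  # stale row is ignored
]
upsert(retention, changes)
```

Only the keys present in `changes` are touched, which is why the cost scales with the volume of changed data rather than with the size of the full table.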

5. Development Roadmap

Future work includes improving metadata management, supporting real‑time tables, achieving seamless compatibility with existing pipelines, and delivering a unified stream‑batch data production platform with high‑performance query capabilities.

Q&A Highlights

Q1: Details on locking – Kuaishou relies on optimistic concurrency control (OCC) with compare‑and‑swap (CAS) checks to avoid concurrent write conflicts, and uses placeholder‑based merge plans to keep writes independent of compaction.

Q2: Compaction resource mismatch – Resources are tuned per operation; write tasks use 6‑8 GB memory, while compaction is allocated 4 CPU cores and ~10 GB memory to handle larger files efficiently.
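The optimistic concurrency control with CAS checks mentioned in Q1 can be sketched as follows. This is a toy model under assumptions: the `Timeline` class and `commit_with_retry` helper are hypothetical, and the conflict rule here (any intervening commit counts as a conflict) is stricter than a real system, which would typically check only for overlapping files.

```python
import threading


class Timeline:
    """Toy commit timeline guarded by a compare-and-swap on its version number."""

    def __init__(self):
        self._lock = threading.Lock()  # used only to make the CAS step atomic
        self.version = 0
        self.commits = []

    def compare_and_swap(self, expected_version, payload):
        """Publish `payload` only if nobody has committed since `expected_version`."""
        with self._lock:
            if self.version != expected_version:
                return False  # another writer committed first
            self.version += 1
            self.commits.append(payload)
            return True


def commit_with_retry(timeline, make_payload, max_attempts=5):
    """Optimistic write loop: do the work without a lock, validate only at commit."""
    for _ in range(max_attempts):
        seen = timeline.version       # remember the version the work is based on
        payload = make_payload(seen)  # prepare the write optimistically
        if timeline.compare_and_swap(seen, payload):
            return True
    return False


timeline = Timeline()
first = timeline.compare_and_swap(0, "writer-a")  # succeeds, version becomes 1
stale = timeline.compare_and_swap(0, "writer-b")  # rejected, based on a stale version
retried = commit_with_retry(timeline, lambda version: f"writer-b@{version}")
```

Writers hold no lock while preparing data, so the cost of a conflict is a retry rather than blocked throughput, which fits workloads where conflicts are rare.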

Overall, Kuaishou’s Hudi‑based data lake demonstrates how a well‑designed lake architecture can overcome traditional warehouse limitations, improve data timeliness, reduce operational costs, and provide a scalable foundation for future data‑driven innovations.


