Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges
The article explains how Kuaishou modernized its data lake by adopting Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains, and future plans.
Background and Pain Points
Kuaishou’s original ODS layer suffered from delayed data readiness (5‑6 h), high storage cost due to keeping raw source data, and a variety of read patterns (full, incremental, snapshot) that required separate ingestion pipelines. Merge‑on‑write logic also consumed heavy compute during nightly batch windows. In the AI domain, offline and real‑time training used different storage formats and read logic, causing inconsistent model performance, limited Kafka replay windows (max 1 day) that slowed training, and duplicated feature‑engineering pipelines across HDFS and Kafka.
Hudi‑Based Architecture
To address these issues Kuaishou built a unified AIDataLake on Apache Hudi, providing:
Copy‑on‑write and merge‑on‑read storage so offline and streaming jobs see identical data.
A table‑service that automatically triggers full and minor compactions, keeping Hudi log‑file sizes under configurable thresholds and eliminating small‑file overhead.
Ingestion pipelines kafka2hudi and mysql2hudi that write non‑partitioned or partitioned Hudi tables within minutes.
Hybrid partition design (incremental + full partitions) that isolates recent hot data for fast access while preserving long‑lived data in full partitions.
Bucket‑level joins and an adjustable bucket count to support very large tables without overwhelming the file system.
All read and write paths converge on the same Hudi tables, removing divergent pipelines and reducing compute cost.
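The bucket‑level join idea above can be sketched in a few lines. This is a minimal illustration, not Hudi's actual bucket‑index implementation: the hash function and `bucket_for` name are assumptions, but the principle matches — every record key maps deterministically to one bucket, so two tables sharing a key and bucket count can be joined bucket by bucket without a full shuffle.

```python
import hashlib

# Illustrative sketch (not Hudi's real API): deterministically route a
# record key to one of num_buckets buckets, as a bucket index does.
def bucket_for(record_key: str, num_buckets: int) -> int:
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Because the mapping is a pure function of the key, raising the bucket count for a very large table changes only that table's layout; smaller tables keep their own, smaller counts.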
BI Ingestion Evolution
The BI data flow progressed through three stages:
mysql2hive – traditional batch load into Hive, high latency and storage duplication.
mysql2hudi 1.0 – non‑partitioned Hudi table with time‑travel reads; reduced storage cost and query latency, but all data lived in a single directory, so directory listings (ls) degraded for long‑lived tables.
mysql2hudi 2.0 – partitioned Hudi table with mixed incremental/full partitions; a mapping‑table layer merges delta and full partitions on demand; bucket‑level compaction and adjustable bucket count; 30‑day recent data kept in Hudi, older data cold‑backed to Hive. Result: 50‑60 % storage reduction and data‑ready latency dropped from 5‑6 h to <5 min.
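The mapping‑table layer in mysql2hudi 2.0 can be sketched as a merge of one full partition with the incremental partitions that follow it. This is a hypothetical simplification (the `read_snapshot` function and `id` key are illustrative, not Kuaishou's actual service): deltas are applied in commit order and the latest write wins per primary key.

```python
# Illustrative sketch of serving a snapshot from one full partition
# plus the incremental (delta) partitions written after it.
def read_snapshot(full_partition, delta_partitions):
    merged = {row["id"]: row for row in full_partition}
    for delta in delta_partitions:        # applied in commit order
        for row in delta:
            merged[row["id"]] = row       # upsert: delta overrides full
    return list(merged.values())

full = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
deltas = [[{"id": 2, "v": "b2"}], [{"id": 3, "v": "c"}]]
snapshot = read_snapshot(full, deltas)    # id 2 updated, id 3 inserted
```

This is why only ~30 days of hot data need to stay in Hudi: older history can be folded into a full partition and cold‑backed to Hive without changing the read contract.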
Compaction strategy includes:
Log‑file size threshold – when a log file exceeds the configured size, a minor compaction is triggered.
Time‑based full compaction – if a partition has not been fully compacted within 7 days, a full compaction runs automatically.
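The two triggers above amount to a simple planning rule. The sketch below is an assumption‑laden illustration (the names, the 256 MB limit, and the `plan_compaction` function are invented for clarity; the article only states that the size threshold is configurable and the full‑compaction interval is 7 days):

```python
from datetime import datetime, timedelta

# Illustrative thresholds; in practice these are table-level configs.
LOG_SIZE_LIMIT = 256 * 1024 * 1024           # bytes, configurable
FULL_COMPACTION_INTERVAL = timedelta(days=7) # per the article

def plan_compaction(log_file_bytes, last_full_compaction, now):
    # Time-based trigger takes priority: partitions not fully
    # compacted within 7 days get a full compaction.
    if now - last_full_compaction >= FULL_COMPACTION_INTERVAL:
        return "full"
    # Size-based trigger: oversized log files get a minor compaction.
    if log_file_bytes > LOG_SIZE_LIMIT:
        return "minor"
    return None
```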
AI Data Lake Upgrade
AI workloads now ingest data through real‑time Flink jobs directly into Hudi, using an event‑time timeline (instead of Hudi’s default process‑time) to guarantee ordered, replayable streams. Logical wide tables separate base (parquet) and delta (log) layers, enabling fast feature addition/removal without full table rewrites. Additional optimisations:
Soft‑merge algorithm reduces read‑amplification during merge‑on‑read.
Native record serialization lowers CPU overhead for write‑heavy workloads.
Non‑blocking writes avoid lock contention in high‑throughput streaming.
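A merge‑on‑read read path with event‑time ordering can be sketched as follows. This is a minimal model, not Hudi's soft‑merge algorithm itself (the `merge_on_read` function and record shape are assumptions): base rows are overlaid with delta log rows, and for each key the record with the latest event time wins, which is what makes replayed streams deterministic.

```python
# Illustrative sketch: merge base-file rows with delta-log rows,
# resolving conflicts per key by latest event time.
def merge_on_read(base_rows, log_rows):
    latest = {r["key"]: r for r in base_rows}
    for r in sorted(log_rows, key=lambda r: r["event_time"]):
        cur = latest.get(r["key"])
        if cur is None or r["event_time"] >= cur["event_time"]:
            latest[r["key"]] = r          # newer event wins
    return sorted(latest.values(), key=lambda r: r["key"])
```

Because the result depends only on event times, an offline batch read and a real‑time replay of the same data converge on identical records, which is the consistency property the training pipelines rely on.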
These changes deliver consistent training results across offline and real‑time data, up to 10× faster iteration cycles, and multi‑million‑dollar cost savings.
Key Hudi Features Implemented
Merge‑on‑read storage with time‑travel queries.
Table service with automated full and minor compaction.
Hybrid incremental/full partitioning.
Bucket‑level joins and dynamic bucket‑count configuration.
Event‑time timeline for ordered streaming consumption.
Logical wide tables (base + delta) for feature engineering.
Native record format to reduce serialization cost.
Future Directions
Kuaishou plans to extend the AIDataLake with vector search and multimodal data storage, broaden real‑time lake adoption for additional BI scenarios, and build a metadata‑as‑a‑service (table‑management service) to further decouple data production from consumption. Incremental processing pipelines will also be explored to improve scalability.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.