How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture
The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake and address latency, storage cost, and complexity in AI and BI workloads. It covers the evolution from mysql2hive to mysql2hudi 1.0 and 2.0, the resulting performance gains and cost savings, and the future roadmap.
1. Background and Pain Points
Kuaishou’s legacy ODS layer suffered from four major issues: data readiness was delayed, creating a risk of missed delivery deadlines (SLA breaches); storage cost was high because ODS retained massive amounts of untrimmed raw data; read patterns were diverse (full partitions, incremental partitions, snapshot reads); and merge‑on‑write computation was resource‑intensive during nightly peaks. In the BI domain, these problems led to redundant tables, complex processing logic, and low maintainability. In the AI domain, offline and real‑time training data were inconsistent, which caused divergent model performance and low training efficiency, and adding a feature was expensive because pipelines were duplicated on HDFS and Kafka.
2. Why Hudi?
To solve these pain points, Kuaishou introduced Apache Hudi, leveraging its merge‑on‑read, time‑travel, and unified incremental/full‑partition handling capabilities. Hudi provides a single storage format that supports both batch and streaming workloads, which keeps offline and real‑time training data consistent.
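To make merge‑on‑read and time‑travel concrete, here is a minimal toy model in plain Python. It is not Hudi's API or file layout: the `LogEntry` class, the `snapshot_read` function, and the instant format are hypothetical names chosen for illustration. The sketch only shows the idea that a reader merges a base snapshot with delta log entries at query time, and that bounding the merge by a commit instant yields a historical view of the same table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    instant: str            # commit instant as a sortable string (hypothetical format)
    key: str                # record key
    value: Optional[dict]   # new row image; None marks a delete


def snapshot_read(base, log, as_of=None):
    """Merge the base snapshot with log entries committed at or before `as_of`.

    With as_of=None the latest view is returned (merge-on-read);
    with an instant, the view as of that commit is returned (time travel).
    """
    merged = dict(base)
    for entry in sorted(log, key=lambda e: e.instant):
        if as_of is not None and entry.instant > as_of:
            break  # later commits are invisible to this reader
        if entry.value is None:
            merged.pop(entry.key, None)
        else:
            merged[entry.key] = entry.value
    return merged


# Illustrative data only.
base = {"u1": {"city": "Beijing"}, "u2": {"city": "Shanghai"}}
log = [
    LogEntry("20240101000500", "u1", {"city": "Hangzhou"}),
    LogEntry("20240101001000", "u3", {"city": "Chengdu"}),
    LogEntry("20240101001500", "u2", None),
]

early = snapshot_read(base, log, as_of="20240101000500")
latest = snapshot_read(base, log)
```

Both reads go against the same stored data; only the instant bound differs. That is the property that lets one logical table stand in for several daily full‑partition copies.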
3. BI Practice – Evolution of Ingestion
The ingestion path evolved through three stages:
mysql2hive: Data was flushed to a full‑partition table daily; incremental and full partitions co‑existed, causing high storage cost and slow nightly merges.
mysql2hudi 1.0: Adopted a non‑partitioned Hudi table. Data was written by Flink consuming from Kafka, producing a log file every 5 minutes. Time‑travel allowed a single logical table to serve multiple full‑partition reads, eliminating daily full‑partition duplication.
mysql2hudi 2.0: Introduced partitioned Hudi tables with mixed incremental and full partitions. Bucket numbers could be adjusted per table, enabling efficient handling of long‑lived data (30‑day retention in Hudi, with older data moved to Hive as cold backup). Full‑compaction and minor‑compaction policies (triggered when the log‑file count exceeds a threshold or after 7 days) merged incremental logs into compacted files, reducing file count and read latency.
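The compaction trigger described for mysql2hudi 2.0 (log‑file count above a threshold, or 7 days elapsed) can be sketched as a small predicate. This is an illustration under stated assumptions, not Kuaishou's implementation or a Hudi configuration: the function name `should_compact` and the threshold value `MAX_LOG_FILES = 20` are hypothetical, and only the 7‑day limit comes from the text above.

```python
from datetime import datetime, timedelta

MAX_LOG_FILES = 20            # hypothetical threshold; the article gives no number
MAX_AGE = timedelta(days=7)   # the 7-day limit mentioned in the article


def should_compact(log_file_count, last_compaction, now,
                   max_log_files=MAX_LOG_FILES, max_age=MAX_AGE):
    """Return True when either trigger fires: too many log files, or too long since the last compaction."""
    too_many_logs = log_file_count > max_log_files
    too_old = (now - last_compaction) >= max_age
    return too_many_logs or too_old


# Illustrative timestamps only.
now = datetime(2024, 1, 8, 0, 0)
by_count = should_compact(21, datetime(2024, 1, 7), now)   # count trigger
by_age = should_compact(5, datetime(2024, 1, 1), now)      # age trigger
neither = should_compact(5, datetime(2024, 1, 7), now)     # no trigger
```

Combining a count trigger with an age trigger bounds read cost in both directions: busy tables compact before readers have to merge too many logs, and quiet tables are still compacted within a week.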
Performance impact: processing time for a typical ODS load dropped from 5‑6 hours to under 5 minutes after adopting mysql2hudi 2.0.
4. AI Practice – Unified Storage Architecture
The AI pipeline was unified under an AIDataLake built on Hudi. Both offline and real‑time training now read from the same Hudi tables via a meta‑server that abstracts vectorized, stream‑batch‑unified reads. This eliminated divergent storage media (HDFS vs. Kafka) and reduced compute cost. Logical wide tables and schema‑auto‑evolution further accelerated feature iteration, achieving up to 10× faster iteration cycles.
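One way to picture schema auto‑evolution on a wide feature table is to project stored records onto the current schema at read time, so that adding a feature column does not require rewriting older rows. The sketch below is a generic illustration of that idea, not the AIDataLake or meta‑server interface: `read_with_schema`, the column names, and the default values are all hypothetical.

```python
def read_with_schema(records, schema):
    """Project each stored record onto `schema`, filling columns the record lacks with the schema default.

    `schema` maps column name -> default value used for rows written before the column existed.
    """
    return [
        {column: record.get(column, default) for column, default in schema.items()}
        for record in records
    ]


# Rows written before and after a new feature column was added (illustrative data).
stored = [
    {"user_id": 1, "clicks": 3},
    {"user_id": 2, "clicks": 5, "watch_time": 12.5},
]

# Current schema after adding the hypothetical `watch_time` feature.
schema_v2 = {"user_id": None, "clicks": 0, "watch_time": None}

rows = read_with_schema(stored, schema_v2)
```

Under this model, adding a feature means extending the schema and writing new rows; old rows are read back with the default value and are never rewritten.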
5. Core Technical Optimizations
Key optimizations include:
Native record serialization to cut serialization overhead.
Real‑time, non‑blocking writes with flexible bucket‑number redesign, allowing different bucket counts for base and delta partitions.
Event‑time‑based Hudi timeline (replacing processing time) to guarantee ordered consumption and flexible data overwrites.
Lock‑free commit path to avoid latency spikes in high‑throughput real‑time ingestion.
These changes together reduced overall storage cost by 50‑60% and saved tens of millions of RMB in operational expenses.
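Two of the optimizations listed above, per‑partition bucket counts and event‑time ordering, can be sketched in a few lines. This is a simplified illustration, not Kuaishou's or Hudi's actual code: the hash choice (CRC32), the bucket counts, and the function names `bucket_id` and `latest_by_event_time` are assumptions made for the example.

```python
import zlib

BASE_BUCKETS = 64   # hypothetical bucket count for large base partitions
DELTA_BUCKETS = 8   # hypothetical bucket count for small delta partitions


def bucket_id(record_key, num_buckets):
    """Map a record key to a bucket with a stable hash, so every writer agrees on placement."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def latest_by_event_time(records):
    """Keep, per key, the record with the greatest event time, regardless of arrival order."""
    latest = {}
    for record in records:
        current = latest.get(record["key"])
        if current is None or record["event_time"] >= current["event_time"]:
            latest[record["key"]] = record
    return latest


# The same key is placed independently in each partition type.
base_bucket = bucket_id("order-1001", BASE_BUCKETS)
delta_bucket = bucket_id("order-1001", DELTA_BUCKETS)

# The newer event arrives first; the late, older event must not overwrite it.
arrivals = [
    {"key": "order-1001", "event_time": 200, "status": "paid"},
    {"key": "order-1001", "event_time": 100, "status": "created"},
]
resolved = latest_by_event_time(arrivals)
```

Ordering by event time rather than arrival time is what keeps a late‑arriving older change from clobbering a newer one. Separate bucket counts let small delta partitions avoid the file‑count overhead of a layout sized for base data.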
6. Future Outlook
The roadmap focuses on three fronts: (1) extending AI capabilities to vector search and multimodal data storage; (2) deepening real‑time lake ingestion for more BI scenarios; (3) turning metadata management into services (table‑service, schema‑service) to make the data lake more cohesive and to support incremental data processing pipelines.
Conclusion
Kuaishou’s partnership with Hudi transformed its data lake from a fragmented ODS‑centric system into a unified, cost‑effective platform that supports consistent AI and BI workloads, dramatically improves latency, cuts storage cost, and provides a scalable foundation for future data‑driven innovations.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.