Applying Data Lake (Hudi) at Kuaishou: Architecture Evolution, Use Cases, and Practice
This article details Kuaishou's journey of adopting the Hudi data lake, covering business challenges, the migration from Hive to Hudi, architectural redesign, the promotion strategy, real‑world use cases such as CDC sync and batch‑stream integration, and key lessons learned for large‑scale data engineering.
The article shares Kuaishou's data‑lake (Hudi) application practice, reviewing the adoption process from a business perspective, highlighting efficiency gains, cost optimization, and close collaboration between product teams and the technical team.
Key business challenges include continuous growth of the data‑warehouse size leading to linear increases in storage and compute costs, complex governance and operation overhead, cross‑department collaboration delays, and discrepancies between real‑time and offline data that affect decision‑making.
To address these issues, Kuaishou migrated from Hive/Spark to Hudi after evaluating engines on functional richness, compatibility with the existing big‑data stack, and automation level. Hudi was chosen for its rich features, good integration, and lower operational cost. The new architecture adopts a wide‑table design, domain‑centric models, and SLA‑driven sub‑tasks that split large models into manageable pieces.
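To make the upsert‑centric design concrete, the sketch below shows the kind of write configuration a Hudi table of this sort typically uses. The option keys are standard Hudi Spark datasource options; the table name, field names, and chosen values are illustrative assumptions, not Kuaishou's actual configuration.

```python
# Illustrative Hudi write options for an upsert-style wide table.
# The option keys are standard Hudi datasource options; the values
# (table name, record key, precombine field) are hypothetical.
hudi_options = {
    "hoodie.table.name": "user_wide_table",                # hypothetical name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ", # near-real-time writes
    "hoodie.datasource.write.operation": "upsert",         # update-in-place by key
    "hoodie.datasource.write.recordkey.field": "user_id",  # dedup/merge key
    "hoodie.datasource.write.precombine.field": "event_ts",# latest record wins
    "hoodie.datasource.write.partitionpath.field": "dt",   # daily partitions
}

# With Spark, such a config would typically be applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

MERGE_ON_READ trades some read cost for fast, frequent writes, which matches the minute‑level latency goals described above; a COPY_ON_WRITE table would favor read‑heavy workloads instead.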
The promotion strategy follows a phased rollout: first validating critical scenarios that Hive and Spark cannot solve (minute‑level latency, full back‑fill, DAU click latency), then demonstrating universal applicability across business lines, and finally building a standardized toolchain and best‑practice repository to enable cross‑team reuse.
Practical use cases include CDC data synchronization that reduces latency from hours to minutes, batch‑stream integration for high‑frequency events such as the “red‑packet rain” where Hudi enables real‑time user updates, and warehouse optimization that consolidates dozens of models into three entity‑centric models, cutting storage, compute, and maintenance costs while improving N‑retention calculations.
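The N‑retention metric mentioned above can be sketched in a few lines: given per‑day sets of active user IDs, N‑day retention is the share of the day‑0 cohort still active on day N. The function name, data, and shapes here are illustrative, not Kuaishou's pipeline.

```python
from datetime import date, timedelta

def n_day_retention(active_by_day, cohort_day, n):
    """Fraction of users active on cohort_day who are also active n days later.

    active_by_day: dict mapping date -> set of active user ids (hypothetical shape).
    """
    cohort = active_by_day.get(cohort_day, set())
    if not cohort:
        return 0.0
    retained = cohort & active_by_day.get(cohort_day + timedelta(days=n), set())
    return len(retained) / len(cohort)

# Toy data: 4 users active on day 0, 2 of them return on day 1.
activity = {
    date(2023, 5, 1): {"u1", "u2", "u3", "u4"},
    date(2023, 5, 2): {"u1", "u3", "u5"},
}
print(n_day_retention(activity, date(2023, 5, 1), 1))  # 0.5
```

Consolidating dozens of models into a few entity‑centric tables helps here because the per‑user activity history needed for every N lives in one place instead of being re‑joined across many intermediate models.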
Reflections emphasize demand‑driven technology adoption, institutional support, breaking departmental silos, and leveraging collective wisdom. The experience demonstrates that Hudi can drive significant business innovation and cost savings, offering valuable insights for other organizations pursuing large‑scale data‑lake solutions.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.
