Apache Hudi Practice at Kuaishou: Solving Data Efficiency Challenges
This article details Kuaishou's adoption of Apache Hudi to address data scheduling, synchronization, and massive update inefficiencies, describing the pain points, evaluation of alternatives, architectural integration with Spark/Flink, implementation challenges, and the performance improvements achieved.
Kuaishou's data growth team, led by Jin Guowei, presented a three-part case study on how Apache Hudi was used to solve efficiency problems in their data pipeline.
The primary pain points included data scheduling, synchronization, and full‑scale data back‑filling, all of which suffered from low throughput and high resource waste.
Business requirements demanded result visibility close to that of operational databases while still leveraging big‑data analytics, which called for a solution combining real‑time ingestion with CRUD operations on large datasets.
After surveying industry options, Kuaishou selected Hudi based on its rich feature set, alignment with their pain points, automation level, tight integration with Flink, and active community support.
Hudi's architecture uses Spark/Flink to ingest upstream data into raw tables in a data lake, providing full CRUD capabilities that match Kuaishou's internal needs.
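To make the CRUD semantics concrete, here is a minimal in-memory sketch (not Kuaishou's or Hudi's actual implementation) of Hudi-style upsert behavior: each record carries a primary ("record") key, and a newer write for the same key replaces the existing row rather than appending a duplicate. The `upsert` helper and field names are illustrative.

```python
def upsert(table: dict, records: list, key: str = "id") -> dict:
    """Apply a batch of records to a dict-backed table; last write wins per key."""
    for rec in records:
        table[rec[key]] = rec
    return table

# A first batch inserts two rows; a second batch updates one of them in place.
table = {}
upsert(table, [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 5}])
upsert(table, [{"id": 1, "clicks": 12}])  # update, not a duplicate row
```

In a real Hudi table the same key-based replacement happens across Parquet file versions on the data lake rather than in memory, but the user-visible contract is the same.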
The write flow involves a rebalance step to avoid hotspot issues, data merging, deduplication, and discarding of irrelevant records, ultimately storing data in Parquet files with embedded metadata.
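The rebalance and deduplication steps above can be sketched in plain Python. This is a simplified illustration under assumed semantics, not Hudi's internal code: hot keys are spread across writer buckets by hashing, only the latest version of each key is kept, and keyless (irrelevant) records are dropped.

```python
import hashlib

def bucket_of(key, n_buckets: int) -> int:
    # Hash-based rebalance: spread writes for hot keys evenly across buckets.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % n_buckets

def dedupe(records: list, key: str = "id", ordering: str = "ts") -> list:
    # Keep only the latest version (by ordering field) of each key;
    # records without a key are treated as irrelevant and discarded.
    latest = {}
    for rec in records:
        if rec.get(key) is None:
            continue
        cur = latest.get(rec[key])
        if cur is None or rec[ordering] > cur[ordering]:
            latest[rec[key]] = rec
    return list(latest.values())

batch = [{"id": "a", "ts": 1}, {"id": "a", "ts": 3},
         {"id": None, "ts": 2}, {"id": "b", "ts": 2}]
deduped = dedupe(batch)  # keeps {"a", ts=3} and {"b", ts=2}
```

In Hudi terms, the ordering field plays the role of the precombine field, which decides which version of a record survives a merge.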
Key challenges addressed include the high latency of massive data updates, inefficient resource usage for partial updates, and the need to redesign the data model and write strategy (primary key, partitioning, and policies) to fit Hudi's paradigm.
By designing a suitable partitioning scheme and file size, Kuaishou mitigated update jitter and achieved efficient partial updates for back‑fill scenarios, establishing a reusable solution.
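One way such a partitioning scheme can be laid out (an assumed layout for illustration; the article does not specify Kuaishou's exact scheme) is a date partition for pruning plus a stable hash bucket, so a back‑fill for one day touches only a bounded set of files:

```python
import hashlib
from datetime import date

def hash_bucket(record_key: str, n_buckets: int) -> int:
    # Stable hash so the same key always lands in the same bucket.
    return int(hashlib.md5(record_key.encode()).hexdigest(), 16) % n_buckets

def partition_path(event_date: date, record_key: str, n_buckets: int = 16) -> str:
    # dt=... enables partition pruning; bucket=... bounds how many files
    # a partial update or back-fill must rewrite.
    bucket = hash_bucket(record_key, n_buckets)
    return f"dt={event_date.isoformat()}/bucket={bucket:02d}"

path = partition_path(date(2023, 5, 1), "user_42")
```

Keeping `n_buckets` sized so each bucket's Parquet files stay near the target file size is what limits the update jitter described above.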
Additional efforts focused on ensuring stable Hudi job execution, covering workflow design, timeliness, and accuracy.
Post‑implementation results showed significant improvements in timeliness, resource consumption, and overall effectiveness of the generalized Hudi‑based solution.