
How Kuaishou Boosted Data Efficiency with Apache Hudi: Real‑Time + Offline Solutions

This article explains how Kuaishou tackled late data scheduling, costly synchronization, and inefficient back‑fills by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, and step‑by‑step implementation to achieve fast, fresh, and scalable data processing.


1. Pain Points

Data scheduling starts late and compute cycles are long, so data is ready late.

Data synchronization must merge massive incremental volumes, a costly computation that pressures the SLA.

Repair back‑fills need only partial updates, yet the offline warehouse forces full batch rewrites, consuming resources and stretching processing cycles.

Data pain points diagram

2. Why Choose Hudi

Hudi offers rich functionality, fits well with Kuaishou’s existing architecture, provides a high degree of automation, integrates tightly with Flink, and has an active community. It enables real‑time computation with Flink and Kafka and supports CRUD operations on offline tables, meeting the “fast‑ready, fresh‑state” requirement.

Hudi advantages

3. How to Use Hudi

Key steps include:

Data shuffling to avoid hotspot skew.

Even partition distribution across nodes.

Shuffle to align partitions, preventing write‑lock contention.

Merge strategy to keep only needed records.

Index lookup to map records to files efficiently.

Discard strategy to skip already‑updated records, improving write throughput.
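The shuffle and discard steps above can be sketched roughly as follows. This is a minimal illustration, not Hudi's API: `bucket_for`, `discard_stale`, and the record fields are hypothetical names, and Hudi performs this routing and deduplication internally.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    # Stable hash-mod routing: the same primary key always maps to the
    # same bucket, so concurrent writers never contend for one file group.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def discard_stale(incoming: list, current_versions: dict) -> list:
    # Discard strategy: skip records whose version is not newer than the
    # stored one, cutting merge work and improving write throughput.
    kept = []
    for rec in incoming:
        if rec["version"] > current_versions.get(rec["key"], -1):
            kept.append(rec)
            current_versions[rec["key"]] = rec["version"]
    return kept
```

The mod-based routing is what prevents write-lock contention: two writers can never own the same bucket, so they never race on the same file group.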

Hudi stores data in Hive‑compatible partitions (based on HDFS). Proper primary‑key and partition design, combined with merge and index strategies and concurrency controls (mod‑based partitioning), reduce the data volume to process and maximize write efficiency.
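The storage layout and index lookup can be illustrated with a small sketch. The function names and the `file_groups` shape are hypothetical; Hudi builds and maintains its partition directories and record-level index itself.

```python
from datetime import date


def partition_path(table_root: str, dt: date, bucket: int) -> str:
    # Hive-compatible partition directory on HDFS, e.g.
    # /warehouse/events/dt=2024-01-01/bucket=3
    return f"{table_root}/dt={dt.isoformat()}/bucket={bucket}"


def build_key_index(file_groups: dict) -> dict:
    # Hypothetical hash index: map each record key to the file that holds
    # its latest version, so an update is routed straight to the right
    # file instead of scanning the whole partition.
    index = {}
    for file_id, keys in file_groups.items():
        for key in keys:
            index[key] = file_id
    return index
```

Because the layout is Hive-compatible, downstream offline jobs can read the same partitions without any change to their table definitions.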

Hudi storage layout

Monitoring uses a six‑stage pipeline to quickly locate latency or accuracy issues, and a probe compares a sample of business data with Row Tables to ensure >95% consistency.
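The consistency probe can be sketched as below, assuming both sides are exposed as key-to-row mappings; `consistency_probe` and its parameters are hypothetical names for illustration, not the actual monitoring component.

```python
import random


def consistency_probe(business_rows: dict, row_table: dict,
                      sample_size: int = 100, threshold: float = 0.95):
    # Hypothetical probe: sample business records, compare each against
    # the corresponding Row Table entry, and flag the pipeline when the
    # match rate falls below the consistency target (95% here).
    keys = random.sample(sorted(business_rows),
                         min(sample_size, len(business_rows)))
    matches = sum(1 for k in keys if row_table.get(k) == business_rows[k])
    rate = matches / len(keys)
    return rate, rate >= threshold
```

Sampling keeps the probe cheap enough to run continuously alongside the six-stage monitoring pipeline, while the threshold check turns a gradual drift into an explicit alert.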

Monitoring workflow
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Flink, Indexing, data lake, Hudi
Written by

Kuaishou Big Data

Technology sharing on Kuaishou Big Data, covering big‑data architectures (Hadoop, Spark, Flink, ClickHouse, etc.), data middle‑platform (development, management, services, analytics tools) and data warehouses. Also includes the latest tech updates, big‑data job listings, and information on meetups, talks, and conferences.
