How Kuaishou Boosted Data Efficiency with Apache Hudi: Real‑Time + Offline Solutions
This article explains how Kuaishou tackled late data scheduling, costly synchronization, and inefficient back‑fills by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, and step‑by‑step implementation to achieve fast, fresh, and scalable data processing.
1. Pain Points
Data scheduling starts late and compute cycles are long, delaying data readiness. Synchronization involves massive incremental volumes and costly merge computation, putting pressure on SLAs. Repair back-fills need only partial updates, but the offline warehouse forces full batch rewrites, wasting resources and extending cycles.
2. Why Choose Hudi
Hudi offers rich functionality, good integration with Kuaishou’s architecture, high automation, tight Flink integration, and an active community. It enables real‑time computation with Flink/Kafka and supports CRUD on offline tables, meeting the “fast‑ready, fresh‑state” requirement.
3. How to Use Hudi
Key steps include:
Data shuffling to avoid hotspot skew.
Even partition distribution across nodes.
Shuffle to align partitions, preventing write‑lock contention.
Merge strategy to keep only needed records.
Index lookup to map records to files efficiently.
Discard strategy to skip already‑updated records, improving write throughput.
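The steps above can be sketched as a single write path. The snippet below is a conceptual illustration, not Hudi's actual API: `write_batch`, the record shape, and the `version` field are hypothetical names chosen to show how shuffle, index lookup, discard, and merge fit together.

```python
from collections import defaultdict

def write_batch(incoming, table, num_buckets=4):
    """Sketch of the write path: shuffle by key hash, look up the
    current record via an index, discard stale updates, merge the rest.
    `table` stands in for the index + storage (a dict keyed by primary key)."""
    # 1. Shuffle: hash each primary key into a bucket so load spreads
    #    evenly across writers and no single bucket becomes a hotspot.
    buckets = defaultdict(list)
    for rec in incoming:
        buckets[hash(rec["key"]) % num_buckets].append(rec)

    written = 0
    for bucket_id, records in buckets.items():
        for rec in records:
            existing = table.get(rec["key"])           # 2. Index lookup
            if existing and existing["version"] >= rec["version"]:
                continue                               # 3. Discard stale record
            table[rec["key"]] = rec                    # 4. Merge: keep only the newest
            written += 1
    return written
```

Because records with the same key always land in the same bucket, each key is still processed in arrival order even though buckets are written independently.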
Hudi stores data in Hive-compatible partitions (based on HDFS). Proper primary-key and partition design, merge and index strategies, and concurrency controls (mod-based partitioning) reduce data volume and maximize write efficiency.
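The mod-based concurrency control can be illustrated as deriving the partition path from a stable hash of the primary key, so each concurrent writer owns a disjoint set of buckets and never contends for a write lock. The helper names below (`partition_path`, `hash_key`) are hypothetical, chosen only to demonstrate the idea.

```python
import hashlib

def hash_key(key: str) -> int:
    # Stable hash: Python's built-in hash() is salted per process,
    # so use md5 to get the same bucket across jobs and restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def partition_path(record_date: str, key: str, num_writers: int) -> str:
    """Hive-style partition path combining the record's date partition
    with a mod-based bucket of the primary key. Writer i handles only
    bucket i, so concurrent writers touch disjoint partitions."""
    bucket = hash_key(key) % num_writers
    return f"dt={record_date}/bucket={bucket}"
```

With four writers, writer `i` filters for `bucket == i` and can append without coordinating with the other three.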
Monitoring uses a six-stage pipeline to quickly locate latency or accuracy issues, and a probe compares a sample of business data against the raw source tables to ensure >95% consistency.
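The consistency probe could look like the following sketch. The function name, row shape, and thresholds are assumptions for illustration; only the sampling-and-compare pattern and the 95% threshold come from the text.

```python
import random

def consistency_probe(hudi_rows, source_rows, sample_size=100,
                      threshold=0.95, seed=0):
    """Sample keys from the source-of-truth table, compare each row
    with its Hudi copy, and flag the table when agreement drops
    below the threshold. Returns (ratio, is_consistent)."""
    rng = random.Random(seed)
    # Sort keys first so the sample is reproducible for a given seed.
    keys = rng.sample(sorted(source_rows), min(sample_size, len(source_rows)))
    matches = sum(1 for k in keys if hudi_rows.get(k) == source_rows[k])
    ratio = matches / len(keys)
    return ratio, ratio >= threshold
```

In practice such a probe would run on a schedule and page on-call only when `is_consistent` is false, keeping the check cheap relative to a full-table diff.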
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Kuaishou Big Data
Technology sharing on Kuaishou Big Data, covering big‑data architectures (Hadoop, Spark, Flink, ClickHouse, etc.), data middle‑platform (development, management, services, analytics tools) and data warehouses. Also includes the latest tech updates, big‑data job listings, and information on meetups, talks, and conferences.
