How Kuaishou Boosted Data Efficiency with Apache Hudi: Real‑Time + Offline Solutions
This article explains how Kuaishou tackled late data scheduling, costly synchronization, and inefficient back‑fills by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, and step‑by‑step implementation to achieve fast, fresh, and scalable data processing.
1. Pain Points
Data scheduling starts late and compute cycles are long, delaying data readiness. Synchronization involves massive incremental volumes and costly merge computation, putting pressure on SLAs. Repair back-fills need only partial updates, but the offline warehouse forces full batch rewrites, wasting resources and extending cycles.
2. Why Choose Hudi
Hudi offers rich functionality, good integration with Kuaishou’s architecture, high automation, tight Flink integration, and an active community. It enables real‑time computation with Flink/Kafka and supports CRUD on offline tables, meeting the “fast‑ready, fresh‑state” requirement.
3. How to Use Hudi
Key steps include:
Data shuffling to avoid hotspot skew.
Even partition distribution across nodes.
Shuffle to align partitions, preventing write‑lock contention.
Merge strategy to keep only needed records.
Index lookup to map records to files efficiently.
Discard strategy to skip already‑updated records, improving write throughput.
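The steps above can be sketched as a single write path. The snippet below is a conceptual illustration, not Hudi's actual API: `write_batch`, the record shape, and the `version` field are hypothetical names chosen to show how shuffle, index lookup, discard, and merge fit together.

```python
from collections import defaultdict

def write_batch(incoming, table, num_buckets=4):
    """Sketch of the write path: shuffle by key hash, look up the
    current record via an index, discard stale updates, merge the rest.
    `table` stands in for the index + storage (a dict keyed by primary key)."""
    # 1. Shuffle: hash each primary key into a bucket so load spreads
    #    evenly across writers and no single bucket becomes a hotspot.
    buckets = defaultdict(list)
    for rec in incoming:
        buckets[hash(rec["key"]) % num_buckets].append(rec)

    written = 0
    for bucket_id, records in buckets.items():
        for rec in records:
            existing = table.get(rec["key"])           # 2. Index lookup
            if existing and existing["version"] >= rec["version"]:
                continue                               # 3. Discard stale record
            table[rec["key"]] = rec                    # 4. Merge: keep only the newest
            written += 1
    return written
```

Because records with the same key always land in the same bucket, each key is still processed in arrival order even though buckets are written independently.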
Hudi stores data in Hive-compatible partitions (based on HDFS). Proper primary-key and partition design, merge and index strategies, and concurrency controls (mod-based partitioning) reduce data volume and maximize write efficiency.
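The mod-based concurrency control can be illustrated as deriving the partition path from a stable hash of the primary key, so each concurrent writer owns a disjoint set of buckets and never contends for a write lock. The helper names below (`partition_path`, `hash_key`) are hypothetical, chosen only to demonstrate the idea.

```python
import hashlib

def hash_key(key: str) -> int:
    # Stable hash: Python's built-in hash() is salted per process,
    # so use md5 to get the same bucket across jobs and restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def partition_path(record_date: str, key: str, num_writers: int) -> str:
    """Hive-style partition path combining the record's date partition
    with a mod-based bucket of the primary key. Writer i handles only
    bucket i, so concurrent writers touch disjoint partitions."""
    bucket = hash_key(key) % num_writers
    return f"dt={record_date}/bucket={bucket}"
```

With four writers, writer `i` filters for `bucket == i` and can append without coordinating with the other three.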
Monitoring uses a six-stage pipeline to quickly locate latency or accuracy issues, and a probe compares a sample of business data against the raw source tables to ensure >95% consistency.
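The consistency probe could look like the following sketch. The function name, row shape, and thresholds are assumptions for illustration; only the sampling-and-compare pattern and the 95% threshold come from the text.

```python
import random

def consistency_probe(hudi_rows, source_rows, sample_size=100,
                      threshold=0.95, seed=0):
    """Sample keys from the source-of-truth table, compare each row
    with its Hudi copy, and flag the table when agreement drops
    below the threshold. Returns (ratio, is_consistent)."""
    rng = random.Random(seed)
    # Sort keys first so the sample is reproducible for a given seed.
    keys = rng.sample(sorted(source_rows), min(sample_size, len(source_rows)))
    matches = sum(1 for k in keys if hudi_rows.get(k) == source_rows[k])
    ratio = matches / len(keys)
    return ratio, ratio >= threshold
```

In practice such a probe would run on a schedule and page on-call only when `is_consistent` is false, keeping the check cheap relative to a full-table diff.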
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Kuaishou Big Data
Technology sharing on Kuaishou Big Data, covering big‑data architectures (Hadoop, Spark, Flink, ClickHouse, etc.), data middle‑platform (development, management, services, analytics tools) and data warehouses. Also includes the latest tech updates, big‑data job listings, and information on meetups, talks, and conferences.
