Building a Real-Time Data Warehouse with Flink + Hudi at Shopee
Shopee replaced its hourly Hive pipeline with a hybrid Flink-Hudi real-time data warehouse. The new design groups Kafka topics by primary key, applies lightweight stream ETL, and uses partial-update MOR tables for multi-stream joins and COW tables for versioned batch snapshots, cutting latency from about 90 minutes to 2–30 minutes and reducing resource usage by roughly 40–54%.
This article introduces Shopee's real-time data warehouse solution built on Flink + Hudi, addressing the limitations of their traditional Hive-based hourly data pipeline. The solution combines stream processing for data acceleration with batch processing to ensure data consistency.
Background and Pain Points: The team faced challenges with their hourly pipeline: redundant computation across hours, extensive full-table JOIN operations, and data latency varying from minutes to tens of minutes. As data volumes grew rapidly, meeting the hourly SLA became increasingly difficult.
Architecture Design: The DataFlow architecture consists of six components:
1. Grouped Kafka Topics - multiple topics sharing the same primary key are combined for consumption by a single Flink job.
2. Generic Stream ETL - simple transformations such as projection, filtering, and mapping, performed without Flink State.
3. Partial Update Hudi Table - a custom PartialUpdateAvroPayload updates specific columns based on the primary key, effectively achieving a multi-stream JOIN.
4. Other Hudi and Hive tables serving as batch-processing data sources.
5. Periodic Batch Processing - computes complex metrics and cross-source derived indicators.
6. Multi-version Hudi Table - stores a snapshot of each batch result.
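The partial-update idea in component 3 can be sketched in plain Python. This is a minimal, hypothetical model of the merge semantics (function and field names like `merge_partial` and `_ts` are my own, not Shopee's code): each stream writes only its own columns for a primary key, null fields are left untouched, and an ordering field decides which record is newer.

```python
def merge_partial(current, incoming, ordering_field="_ts"):
    """Merge an incoming partial record into the current row.

    Non-null incoming fields overwrite current ones when the incoming
    record is at least as new (by the ordering field); null fields mean
    "leave this column alone" -- approximating a partial-update payload.
    """
    if current is None:
        return dict(incoming)
    newer = incoming.get(ordering_field, 0) >= current.get(ordering_field, 0)
    merged = dict(current)
    for k, v in incoming.items():
        if v is None:
            continue  # partial update: this stream does not own this column
        if k == ordering_field:
            merged[k] = max(current.get(ordering_field, 0), v)
        elif newer or merged.get(k) is None:
            merged[k] = v
    return merged

# Two streams with the same primary key contribute different columns.
table = {}
for rec in [
    {"user_id": 1, "orders": 7, "_ts": 100},                 # order stream
    {"user_id": 1, "last_login": "2023-05-01", "_ts": 101},  # login stream
]:
    table[rec["user_id"]] = merge_partial(table.get(rec["user_id"]), rec)

# The merged row now holds columns from both streams: a multi-stream
# JOIN achieved in the storage layer, with no Flink State.
print(table[1])
```

Because the merge happens at write/compaction time in the Hudi table rather than in Flink operator state, the JOIN cost does not grow with the number of input streams held in state.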
Key Technical Decisions: The solution uses MOR (Merge on Read) storage for the Partial Update Hudi tables to achieve lower write latency, and COW (Copy on Write) storage for the Multi-version Hudi tables. The team chose a hybrid stream-batch approach over fully real-time processing in order to handle 10+ stream JOINs cost-effectively and to keep dimension data consistent.
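The two table roles map to two different Hudi write configurations. The sketch below shows plausible option sets for each role; the keys are standard Hudi configs, but the values are illustrative and the payload class path mirrors the article's custom payload name rather than Shopee's actual class.

```python
# Partial-update table: MOR appends updates to log files, so writes are
# cheap and end-to-end latency stays low; readers merge on the fly.
partial_update_table = {
    "hoodie.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.payload.class":
        "com.example.PartialUpdateAvroPayload",  # hypothetical package path
}

# Multi-version snapshot table: COW rewrites base files on each batch,
# trading write cost for simple, fast columnar reads by batch jobs.
multi_version_table = {
    "hoodie.table.type": "COPY_ON_WRITE",
}
```

The split follows the workload: the streaming side is write-heavy and latency-sensitive (MOR), while the batch snapshot side is read-heavy and written only once per batch (COW).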
Implementation Details: The bootstrap process uses Bulk Insert with Clustering to manage small files; Flink Indexing then builds the file index after bootstrap. Key configuration parameters include hoodie.keep.min.commits, hoodie.cleaner.commits.retained, and the compaction settings.
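The retention and compaction parameters named above interact: the cleaner must retain fewer commits than the archiver keeps on the active timeline, or readers can lose visibility into files the timeline still references. A hedged sketch, with placeholder values rather than Shopee's production settings:

```python
# Illustrative Hudi write options; numeric values are placeholders.
hudi_write_options = {
    # Archival: keep between min and max commits on the active timeline.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
    # Cleaner: old file versions retained for this many commits, which
    # bounds how far back incremental and time-travel reads can go.
    "hoodie.cleaner.commits.retained": "10",
    # MOR compaction (Flink writer option): merge log files into base
    # files every N delta commits.
    "compaction.delta_commits": "5",
}

# Sanity check of the required ordering:
# cleaner.commits.retained < keep.min.commits <= keep.max.commits
retained = int(hudi_write_options["hoodie.cleaner.commits.retained"])
keep_min = int(hudi_write_options["hoodie.keep.min.commits"])
keep_max = int(hudi_write_options["hoodie.keep.max.commits"])
assert retained < keep_min <= keep_max
```

Tuning these knobs trades storage (more retained file versions) against how long downstream incremental consumers can lag before their read window is cleaned away.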
Results: For user dimension tables: ~40% resource reduction, latency reduced from ~90 minutes to ~2 minutes. For shop dimension tables: ~54% resource reduction, latency reduced from ~90 minutes to ~30 minutes.
Shopee Tech Team
How do you innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team explores cutting-edge technology concepts and applications with you.