Building a Real-Time Data Warehouse with Flink + Hudi at Shopee
Shopee replaced its hourly Hive pipeline with a hybrid Flink-Hudi real-time data warehouse. The new design groups Kafka topics by primary key, applies lightweight stream ETL, and uses partial-update MOR tables for multi-stream joins and COW tables for versioned batch snapshots, cutting latency from about 90 minutes to 2–30 minutes and reducing resource usage by roughly 40–54%.
This article introduces Shopee's real-time data warehouse solution built on Flink + Hudi, addressing the limitations of their traditional Hive-based hourly data pipeline. The solution combines stream processing for data acceleration with batch processing to ensure data consistency.
Background and Pain Points: The team faced challenges with their hourly pipeline: redundant computation across hours, extensive full-table JOIN operations, and data latency varying from minutes to tens of minutes. As data volumes grew rapidly, meeting the hourly SLA became increasingly difficult.
Architecture Design: The DataFlow architecture consists of six components:
1. Grouped Kafka Topics - multiple topics sharing the same primary key are combined for consumption by a single Flink job.
2. Generic Stream ETL - simple transformations such as projection, filtering, and mapping, performed without Flink State.
3. Partial Update Hudi Table - a custom PartialUpdateAvroPayload updates specific columns based on the primary key, effectively achieving a multi-stream JOIN.
4. Other Hudi and Hive tables serving as batch-processing data sources.
5. Periodic Batch Processing - computes complex metrics and cross-source derived indicators.
6. Multi-version Hudi Table - stores a snapshot of each batch result.
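The partial-update idea in component 3 can be sketched in plain Python. This is a minimal, hypothetical model of the merge semantics (function and field names like `merge_partial` and `_ts` are my own, not Shopee's code): each stream writes only its own columns for a primary key, null fields are left untouched, and an ordering field decides which record is newer.

```python
def merge_partial(current, incoming, ordering_field="_ts"):
    """Merge an incoming partial record into the current row.

    Non-null incoming fields overwrite current ones when the incoming
    record is at least as new (by the ordering field); null fields mean
    "leave this column alone" -- approximating a partial-update payload.
    """
    if current is None:
        return dict(incoming)
    newer = incoming.get(ordering_field, 0) >= current.get(ordering_field, 0)
    merged = dict(current)
    for k, v in incoming.items():
        if v is None:
            continue  # partial update: this stream does not own this column
        if k == ordering_field:
            merged[k] = max(current.get(ordering_field, 0), v)
        elif newer or merged.get(k) is None:
            merged[k] = v
    return merged

# Two streams with the same primary key contribute different columns.
table = {}
for rec in [
    {"user_id": 1, "orders": 7, "_ts": 100},                 # order stream
    {"user_id": 1, "last_login": "2023-05-01", "_ts": 101},  # login stream
]:
    table[rec["user_id"]] = merge_partial(table.get(rec["user_id"]), rec)

# The merged row now holds columns from both streams: a multi-stream
# JOIN achieved in the storage layer, with no Flink State.
print(table[1])
```

Because the merge happens at write/compaction time in the Hudi table rather than in Flink operator state, the JOIN cost does not grow with the number of input streams held in state.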
Key Technical Decisions: The solution uses MOR (Merge on Read) storage for the Partial Update Hudi tables to achieve lower write latency, and COW (Copy on Write) storage for the Multi-version Hudi tables. The team chose a hybrid stream-batch approach over fully real-time processing in order to handle 10+ stream JOINs cost-effectively and to keep dimension data consistent.
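The two table roles map to two different Hudi write configurations. The sketch below shows plausible option sets for each role; the keys are standard Hudi configs, but the values are illustrative and the payload class path mirrors the article's custom payload name rather than Shopee's actual class.

```python
# Partial-update table: MOR appends updates to log files, so writes are
# cheap and end-to-end latency stays low; readers merge on the fly.
partial_update_table = {
    "hoodie.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.payload.class":
        "com.example.PartialUpdateAvroPayload",  # hypothetical package path
}

# Multi-version snapshot table: COW rewrites base files on each batch,
# trading write cost for simple, fast columnar reads by batch jobs.
multi_version_table = {
    "hoodie.table.type": "COPY_ON_WRITE",
}
```

The split follows the workload: the streaming side is write-heavy and latency-sensitive (MOR), while the batch snapshot side is read-heavy and written only once per batch (COW).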
Implementation Details: The bootstrap process uses Bulk Insert with Clustering to manage small files; Flink Indexing then builds the file index after bootstrap. Key configuration parameters include hoodie.keep.min.commits, hoodie.cleaner.commits.retained, and the compaction settings.
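The retention and compaction parameters named above interact: the cleaner must retain fewer commits than the archiver keeps on the active timeline, or readers can lose visibility into files the timeline still references. A hedged sketch, with placeholder values rather than Shopee's production settings:

```python
# Illustrative Hudi write options; numeric values are placeholders.
hudi_write_options = {
    # Archival: keep between min and max commits on the active timeline.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
    # Cleaner: old file versions retained for this many commits, which
    # bounds how far back incremental and time-travel reads can go.
    "hoodie.cleaner.commits.retained": "10",
    # MOR compaction (Flink writer option): merge log files into base
    # files every N delta commits.
    "compaction.delta_commits": "5",
}

# Sanity check of the required ordering:
# cleaner.commits.retained < keep.min.commits <= keep.max.commits
retained = int(hudi_write_options["hoodie.cleaner.commits.retained"])
keep_min = int(hudi_write_options["hoodie.keep.min.commits"])
keep_max = int(hudi_write_options["hoodie.keep.max.commits"])
assert retained < keep_min <= keep_max
```

Tuning these knobs trades storage (more retained file versions) against how long downstream incremental consumers can lag before their read window is cleaned away.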
Results: For user dimension tables: ~40% resource reduction, latency reduced from ~90 minutes to ~2 minutes. For shop dimension tables: ~54% resource reduction, latency reduced from ~90 minutes to ~30 minutes.
Shopee Tech Team
How do you innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team explores cutting-edge technology concepts and applications with you.