How Shopee Built a Near‑Real‑Time Data Warehouse with Paimon and StarRocks
Shopee combined the Paimon data lake format with Flink and StarRocks to build a near-real-time data warehouse. The platform powers fast task diagnostics and a high-performance financial reconciliation system, and it cuts storage costs and latency through ODS-layer optimizations built on Paimon snapshots and branch tables.
Overview
Shopee’s real‑time data platform integrates Paimon and StarRocks to build a near‑real‑time data warehouse (DW). The solution supports a task‑diagnostics system and a financial reconciliation platform, delivering sub‑minute latency and substantial storage savings.
Key Use Cases
Task‑diagnostics real‑time platform that monitors back‑pressure, resource usage, and latency of Flink jobs.
Financial reconciliation system that replaces traditional Hive pipelines with a Paimon + StarRocks query engine for instant balance checks.
Architecture and Data Flow
Flink ingests binlog streams and writes them to an ODS layer in Paimon; the data is then transformed and merged into the DWS and DWM layers. StarRocks reads the wide Paimon tables directly and answers queries within seconds. To keep data compact, the pipeline relies on Paimon's Partial Update and Aggregation merge engines together with Flink Lookup Joins.
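The Partial Update merge engine is what lets several streams fill in different columns of one wide row. The following is a minimal Python sketch of that merge rule, not Paimon's implementation: for rows that share a primary key, later non-null values overwrite earlier ones, and nulls leave existing values untouched. The function name, column names, and sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Row = Dict[str, Optional[Any]]


def partial_update_merge(records: Iterable[Tuple[str, Row]]) -> Dict[str, Row]:
    """Merge rows per primary key; a later non-null field replaces the earlier value."""
    merged: Dict[str, Row] = {}
    for key, row in records:
        current = merged.setdefault(key, {})
        for column, value in row.items():
            if value is not None:
                current[column] = value          # non-null value wins
            else:
                current.setdefault(column, None)  # null never overwrites
    return merged


# Two hypothetical streams each contribute part of the same wide row.
events = [
    ("order-1", {"amount": 100, "status": None}),
    ("order-1", {"amount": None, "status": "PAID"}),
]
wide = partial_update_merge(events)
print(wide)
```

In this sketch the order of arrival decides which non-null value survives; a real pipeline would also need a rule for out-of-order records, which is left out here.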
ODS Layer Optimization
The ODS layer faces two challenges: producing daily partitions on time, and handling late-arriving data without exploding storage. Paimon's file layout (Snapshot, Manifest, and Data files on top of an LSM tree) lets partitions share immutable data files instead of copying them, which eliminates redundant copies and cuts the storage inflation from 187× to almost none.
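A rough way to see where the inflation comes from is to count files under the two storage models. The sketch below uses made-up inputs (they are not Shopee's measurements) and ignores compaction, which rewrites files in practice. It compares keeping a full copy of the table in every daily partition with writing each immutable file once and only referencing it afterwards.

```python
def full_copy_files(base_files: int, new_files_per_day: int, days: int) -> int:
    """Files stored when each daily partition keeps its own full copy of the table."""
    return sum(base_files + new_files_per_day * day for day in range(1, days + 1))


def shared_files(base_files: int, new_files_per_day: int, days: int) -> int:
    """Files stored when every immutable file is written once and referenced afterwards."""
    return base_files + new_files_per_day * days


# Illustrative inputs only: 1,000 initial files, 10 new files per day, 100 days.
copied = full_copy_files(1000, 10, 100)
shared = shared_files(1000, 10, 100)
inflation = copied / shared
print(f"full copies: {copied}, shared: {shared}, inflation: {inflation:.2f}x")
```

The gap grows with the number of days, because full copies repeat the slowly changing bulk of the table every day while the shared layout pays only for new files.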
Branch & Snapshot Features
Paimon 0.9 introduced Branch tables: logical copies of a snapshot's metadata, stored in a separate namespace. A branch can be read and written independently of the main table while sharing the underlying data files, which enables efficient day-cut partitions without full table rewrites.
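The idea of a branch as a metadata copy can be illustrated with a toy model. This is a conceptual sketch only, not Paimon's API or on-disk format: a snapshot is modelled as a set of file names, and the `Table` class and its methods are hypothetical. Creating a branch copies one snapshot's file list, after which the branch and the main table commit independently.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class Table:
    """Toy table: each snapshot is the set of data files visible at that point."""
    snapshots: List[FrozenSet[str]] = field(default_factory=list)
    branches: Dict[str, "Table"] = field(default_factory=dict)

    def commit(self, new_files: Iterable[str]) -> None:
        """Add a snapshot that references the previous files plus the new ones."""
        previous = self.snapshots[-1] if self.snapshots else frozenset()
        self.snapshots.append(previous | frozenset(new_files))

    def create_branch(self, name: str, snapshot_index: int = -1) -> "Table":
        """Copy only the snapshot's file list (metadata); no data file is duplicated."""
        branch = Table(snapshots=[self.snapshots[snapshot_index]])
        self.branches[name] = branch
        return branch


main = Table()
main.commit(["f1", "f2"])
branch = main.create_branch("2023-03-20_branch")
main.commit(["f3"])          # main keeps moving forward
branch.commit(["late-1"])    # the branch accepts its own writes

physical_files = main.snapshots[-1] | branch.snapshots[-1]
print(sorted(physical_files))
```

Both tables still reference `f1` and `f2`, yet each file exists once in the combined set, which is the sharing property the day-cut design depends on.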
Day‑Cut Implementation
Using the Branch feature, a Flink job creates a new branch table each day (e.g., 2023‑03‑20_branch) and locks the previous day’s files. Late data is detected by event time and back‑filled into the appropriate branch, all within a single streaming job.
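The routing decision inside such a job can be sketched as follows. This is an assumption-laden illustration in plain Python rather than Flink code: day boundaries are taken in UTC, the branch name follows the `<date>_branch` pattern from the example above, and `target_table` and `route` are hypothetical helpers.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Dict, List, Tuple

Event = Tuple[str, datetime]  # (payload, event time)


def target_table(event_time: datetime, current_day: date) -> str:
    """Return "main" for on-time events, otherwise the branch of the event's day."""
    event_day = event_time.astimezone(timezone.utc).date()
    if event_day >= current_day:
        return "main"
    return f"{event_day.isoformat()}_branch"


def route(events: List[Event], current_day: date) -> Dict[str, List[str]]:
    """Group payloads by the table they should be written to."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for payload, event_time in events:
        routed[target_table(event_time, current_day)].append(payload)
    return dict(routed)


today = date(2023, 3, 21)
events = [
    ("on-time", datetime(2023, 3, 21, 0, 5, tzinfo=timezone.utc)),
    ("late", datetime(2023, 3, 20, 23, 59, tzinfo=timezone.utc)),
]
routed = route(events, today)
print(routed)
```

Keying the decision on event time rather than arrival time is what allows one streaming job to serve both the current day and back-fills into earlier branches.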
Experimental Results
A 100‑day simulation compared Paimon‑based storage with Hive. After 80 days, Paimon saved over 90 % of space while maintaining a Flink checkpoint interval of 10 minutes. End‑to‑end latency stayed around 5 minutes, and StarRocks delivered sub‑second query responses.
Future Plans
Shopee will continue to adopt Paimon 1.x features (Branch, Tag‑to‑Partition, Append‑Only tables) to broaden use cases beyond key‑value upserts. The goal is to make Paimon the unified storage format for all data‑warehouse layers, further cutting storage costs and simplifying downstream analytics.
StarRocks official repository: https://github.com/StarRocks/StarRocks
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
