
Shopee Data System Challenges and Apache Hudi Practices

Shopee tackled its data-system bottlenecks by customizing Apache Hudi to provide unified stream-batch integration, efficient state-detail snapshots, and low-latency wide-table generation. Using CDC-based bootstrapping, COW/MOR tables, savepoints, and partial updates, the team cut latency to about ten minutes, lowered resource use, and contributed several enhancements back to the community.

Shopee Tech Team

The lakehouse is an important development direction in the big-data field: it offers a unified stream-batch processing model and enables new scenarios that address data timeliness, accuracy, and storage-cost issues.

Several mainstream open‑source lakehouse solutions are still evolving, and real‑world deployments often encounter missing features or unsupported capabilities. Shopee has customized the open‑source Apache Hudi to meet enterprise‑grade requirements and add new features for internal business needs.

Typical problems in Shopee’s data‑system construction:

During data integration from business databases, the same data often needs both stream and batch processing. The traditional approach of maintaining separate full-load and CDC pipelines leads to low full-load efficiency, high database load, and difficulty guaranteeing consistency between the two paths.

State‑table detail storage: Traditional batch snapshots lose the intermediate process information that is valuable for user behavior analysis, and storing full snapshots for data that changes only slightly each day wastes resources.

Wide‑table creation: Near‑real‑time wide tables built with batch processing suffer from high latency and excessive resource consumption.

Why Hudi was chosen:

Rich ecosystem support – Hudi provides strong streaming capabilities and integrates well with multiple big‑data engines (Flink, Spark, Presto).

Pluggability – Hudi works with Shopee’s existing compute engines and allows different indexing strategies to be chosen based on data importance.

Business‑fit – Most of Shopee’s datasets have unique identifiers, making Hudi’s design a natural match.

Practices in Shopee’s Hudi adoption:

Real‑time data integration

Shopee extracts change data from business databases using CDC‑like techniques, builds an ODS layer that supports both batch and near‑real‑time incremental processing, and performs an initial full‑load (Bootstrap) before applying CDC updates. Depending on user requirements, either COW or MOR tables are created.
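The bootstrap-then-CDC flow can be reduced to a minimal sketch: a one-off full load seeds the keyed table state, after which change events are applied as keyed upserts and deletes. The event shapes and field names below are illustrative, not Shopee's actual schema.

```python
# Minimal sketch of the bootstrap-then-CDC pattern: a full load seeds the
# table state, then CDC events are applied in order as keyed upserts/deletes.

def bootstrap(full_load_rows):
    """Build the initial table state from a one-off full load, keyed by id."""
    return {row["id"]: row for row in full_load_rows}

def apply_cdc(table, events):
    """Apply CDC events in order; creates/updates upsert, deletes remove."""
    for ev in events:
        if ev["op"] == "d":
            table.pop(ev["id"], None)
        else:  # "c" (create) and "u" (update) are both keyed upserts
            table[ev["id"]] = {k: v for k, v in ev.items() if k != "op"}
    return table

table = bootstrap([{"id": 1, "price": 10}, {"id": 2, "price": 20}])
apply_cdc(table, [
    {"op": "u", "id": 1, "price": 12},   # update row 1
    {"op": "c", "id": 3, "price": 30},   # insert row 3
    {"op": "d", "id": 2},                # delete row 2
])
```

Because every operation is keyed by the record's unique identifier, replaying the CDC stream on top of the bootstrap converges to the same state as a fresh full load, which is what lets one pipeline serve both batch and near-real-time consumers.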

Common issues and solutions:

When users configure datasets with a high change rate as COW tables, write amplification occurs. Monitoring is added to detect this misconfiguration, and MOR-based merge logic is used to support synchronous or asynchronous file compaction.

The default Bloom filter index can return false positives on existence checks, forcing extra file scans that degrade write performance on very large datasets; an HBase index, which provides exact key lookups, is employed instead.
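The Bloom-filter trade-off behind that issue can be shown with a deliberately undersized filter (sizes and hash counts here are toy values chosen to force saturation, not Hudi's defaults):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: fast membership hint with possible false positives."""
    def __init__(self, size=64, hashes=2):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

bloom, exact = BloomFilter(), set()
for k in range(100):
    bloom.add(k)
    exact.add(k)

# No false negatives: every written key is reported as possibly present.
assert all(bloom.might_contain(k) for k in exact)

# Packing 100 keys into 64 bits saturates the filter, so unseen keys also
# report "maybe present" -- each such hit costs a wasted file scan on write.
false_positives = sum(bloom.might_contain(k) for k in range(100, 200))
```

An exact index (here, the `exact` set standing in for an HBase lookup) answers definitively, which is why it removes the wasted scans at the cost of an external lookup per key.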

After switching to Hudi‑based real‑time pipelines, data visibility improves dramatically: about 80% of data becomes visible within 10 minutes and 100% within 15 minutes, while resource consumption drops significantly compared with batch pipelines.

Incremental view (snapshot view)

For scenarios requiring state‑detail, Shopee leverages Hudi’s Savepoint feature to periodically generate snapshots that are stored as partitions in the metadata management system. Users can query these snapshots via Flink, Spark, or Presto using standard SQL.

The relative space usage of the snapshot view is:

(1 + (t - 1) * p) / t

where p is the proportion of changing data and t is the number of retained time periods. Lower p yields greater storage savings, especially for long‑term data.
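A quick worked example of the formula, with assumed values of p and t:

```python
# Space ratio of the snapshot view versus storing t full snapshots:
# one base snapshot plus (t - 1) incremental deltas of relative size p.
def space_ratio(p, t):
    """p: fraction of data changing per period; t: retained time periods."""
    return (1 + (t - 1) * p) / t

# Example: 5% of the data changes per day, 30 days retained.
# Incremental snapshots then need only ~8% of the space of 30 full copies.
ratio = space_ratio(p=0.05, t=30)
```

As p approaches 1 the ratio approaches 1 (no savings, everything changes), and as p approaches 0 it approaches 1/t, matching the observation that slowly changing, long-retained data benefits most.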

Incremental computation

When building on Hudi MOR tables, Shopee supports batch, incremental, and near‑real‑time workloads simultaneously. Incremental joins and merges drastically reduce the amount of data processed.
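The saving from incremental joins can be sketched as follows: rather than re-joining the full tables every run, only the keys that changed since the last commit are joined and merged into the existing wide-table state. Table and column names are illustrative.

```python
# Sketch of an incremental join: only changed left-side rows are joined
# against the right side and merged into the persisted wide-table state,
# so the work per run scales with the change set, not the table size.

def incremental_join(wide_state, left_changes, right_lookup):
    """Join only the changed left-side rows; untouched keys are left as-is."""
    for key, left_row in left_changes.items():
        wide_state[key] = {**left_row, **right_lookup.get(key, {})}
    return wide_state

wide = {1: {"id": 1, "price": 10, "stock": 5}}        # state from last run
changes = {1: {"id": 1, "price": 12},                  # changed row
           2: {"id": 2, "price": 20}}                  # new row
right = {1: {"stock": 5}, 2: {"stock": 7}}             # dimension lookup
incremental_join(wide, changes, right)
```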

Example of a Merge‑Into statement (SQL):

MERGE INTO target_table t0
USING source_table s0
ON t0.id = s0.id
WHEN MATCHED THEN UPDATE SET
  t0.price = s0.price + 5,
  t0._ts = s0.ts;

Another generic Merge‑Into syntax:

MERGE INTO target_table_name [target_alias]
USING source_table_reference [source_alias]
ON merge_condition
[WHEN MATCHED [AND condition] THEN matched_action]
[WHEN NOT MATCHED [AND condition] THEN not_matched_action];

matched_action
{ DELETE |
  UPDATE SET * |
  UPDATE SET { column1 = value1 } [, ...] }

not_matched_action
{ INSERT * |
  INSERT (column1 [, ...]) VALUES (value1 [, ...]) }

Using Partial Update, only the changed columns are written to log files, and later merged to produce the wide table, reducing both storage and compute costs.
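The merge step of partial update can be reduced to a small sketch: each writer emits only its own columns as a log record, and compaction overlays the records on the base row to yield the full wide-table row. Column names and the None-as-absent convention are illustrative assumptions.

```python
# Sketch of partial-update merging: per-source log records carry only the
# columns that changed; compaction overlays them on the base row in commit
# order to produce the complete wide-table row.

def merge_partial_updates(base_row, log_records):
    """Overlay column-level log records onto the base row, in commit order."""
    merged = dict(base_row)
    for record in log_records:
        merged.update({k: v for k, v in record.items() if v is not None})
    return merged

base = {"id": 1, "price": 10, "stock": 5, "rating": None}
logs = [
    {"price": 12},     # log record from the price source
    {"rating": 4.8},   # log record from the review source
]
row = merge_partial_updates(base, logs)
```

Since each log record is a fraction of the full row, both the bytes written per commit and the data shuffled at merge time shrink accordingly.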

Community contributions

Shopee has contributed several features back to the Hudi community, including:

meta sync (RFC‑55)

snapshot view (RFC‑61)

partial update (HUDI‑3304)

FileSystemLocker (HUDI‑4065)

These contributions also involved fixing numerous bugs and improving overall stability.

Snapshot view details

Typical use cases:

Daily snapshots named compacted-YYYYMMDD for downstream reporting, with configurable retention.

Yearly archived snapshots yyyy-archived for data compression and compliance.

Pre‑production snapshots preprod‑xx for quality checks before release.

Key Hudi features used:

Time travel – query the table at a specific point in time.

Savepoint – protect a commit’s data from cleanup while allowing intermediate data to be purged.

Enhancements include extended savepoint metadata for tag/branch/lifecycle management and fine‑grained base‑file partitioning on Merge‑On‑Read tables using watermarks.
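The interaction between savepoints and cleaning can be sketched in a few lines: the cleaner purges commits outside the retention window, but any commit pinned by a savepoint is kept so the snapshot it anchors stays queryable. Commit identifiers and the retention policy here are illustrative.

```python
# Sketch of savepoint-aware cleaning: keep the newest `retained` commits
# plus every savepointed commit; everything else is eligible for cleanup.

def clean_commits(commits, retained, savepoints):
    """Return the commits that survive cleaning, oldest first."""
    keep = set(commits[-retained:]) | set(savepoints)
    return [c for c in commits if c in keep]

commits = ["c1", "c2", "c3", "c4", "c5"]   # oldest to newest
kept = clean_commits(commits, retained=2, savepoints={"c2"})
```

Here c1 and c3 are purged, while c2 survives outside the retention window purely because a savepoint (e.g. a daily snapshot) references it.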

Multi‑source partial update

Shopee implements multi‑source column updates via a custom Payload class that merges records with the same key from different sources. This reduces shuffle and state pressure on the compute layer.

Out-of-order and late events are handled by carrying multiple ordering values, one per source, and using an event-time (ordering-value) column so that newer data overwrites older data and late arrivals are discarded.
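The ordering-value rule can be sketched as follows: each source carries its own event-time ordering value, and a record only overwrites the stored columns when its ordering value is not older than the one last applied for that source. Field names are illustrative, not the actual Payload implementation.

```python
# Sketch of per-source ordering values: an incoming record is applied only
# if its event time is at least the last applied event time for that source,
# so late or out-of-order events cannot overwrite newer data.

def merge_with_ordering(current, incoming, source):
    """Apply an incoming record from one source if its event time is newer."""
    ts_field = f"_ts_{source}"
    if incoming["ts"] >= current.get(ts_field, 0):
        current.update(incoming["cols"])
        current[ts_field] = incoming["ts"]
    return current

row = {"id": 1}
merge_with_ordering(row, {"ts": 100, "cols": {"price": 12}}, "price_src")
# A late event with an older timestamp from the same source is ignored.
merge_with_ordering(row, {"ts": 90, "cols": {"price": 9}}, "price_src")
```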

Future work includes lock‑less multiple writers to enable concurrent multi‑job writes.

Summary and outlook

Shopee selected Apache Hudi as the core component of its lakehouse solution to address real-time integration, incremental view, and incremental computation needs, achieving low latency (about 10 minutes) while reducing compute and storage costs.

Upcoming improvements focus on supporting concurrent multi‑task writes and optimizing metadata/meta‑store performance by decoupling from HDFS.

Tags: real-time analytics, Big Data, data integration, lakehouse, Apache Hudi, Merge Into, incremental processing
Written by Shopee Tech Team

How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting-edge technology concepts and applications with you.