
Flink Real‑Time Data Warehouse Practices at Shopee Singapore Data Team

This article details Shopee Singapore Data Team’s implementation of a Flink‑based real‑time data warehouse, covering background challenges, layered architecture integrating Kafka, HBase, Druid, Hive, streaming pipelines, job management, monitoring, and future plans to expand Flink SQL support.

DataFunTalk

Shopee, a leading e‑commerce platform in Southeast Asia and Taiwan, processes billions of orders annually, requiring a robust real‑time data infrastructure to support diverse business services such as orders, logistics, payments, and digital products.

Rapid data growth and increasing real‑time analytics demands created three major challenges:

- Complex business dimensions requiring both detailed and aggregated queries.
- A platform architecture lacking efficient scheduling, resource management, and low‑latency processing.
- Technical limitations of Spark Structured Streaming: complex stateful processing, no exactly‑once guarantees, and high failure‑recovery costs.

The team designed a multi‑layered data‑warehouse architecture:

- Collection layer: ingests real‑time binlog and service logs into Kafka and HBase; offline data lands in HDFS.
- Storage layer: Kafka for streaming messages, HDFS for Hive tables, and HBase for dimension data.
- Compute layer: Spark, Flink, and Presto SQL.
- Scheduling layer: manages resources, tasks, and job orchestration.
- OLAP layer: time‑series data in Druid, aggregated reports in Phoenix/HBase, and multi‑dimensional indexes in Elasticsearch.
- Application layer: serves reports, services, and user‑profile analytics.

In the Flink + Druid pipeline, three streams process order events:

- The first stream parses binlog events, filters today's orders, deduplicates using ValueState, enriches records with HBase/Phoenix dimension data, and writes successful records to a downstream Kafka topic and Druid; failed enrichments are routed to a slow topic for later handling.
- The second stream synchronizes slave binlog tables into HBase for dimension joins, addressing latency and hotspot issues.
- The third stream backfills failed orders, reprocessing them until all dimensions resolve, using a custom window and trigger to batch retries and avoid endless loops.
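The first stream's per-record logic can be sketched in plain Python. All names and record shapes here are illustrative assumptions; in the real job the dedup flag lives in Flink's keyed ValueState and the dimension lookup hits HBase/Phoenix:

```python
# Sketch of stream 1: dedupe on order_id, enrich from a dimension lookup,
# route unresolved records to the slow topic for later backfill.

seen_orders = {}                                  # ValueState analogue: order_id -> True
dim_store = {"user_1": {"region": "SG"}}          # stand-in for the HBase dimension table

def process_order(event, emit_main, emit_slow):
    oid = event["order_id"]
    if seen_orders.get(oid):                      # duplicate binlog event: drop it
        return
    seen_orders[oid] = True
    dims = dim_store.get(event["user_id"])
    if dims is None:                              # dimension not yet synced: retry later
        emit_slow(event)
        return
    emit_main({**event, **dims})                  # enriched record -> Kafka topic + Druid

main_out, slow_out = [], []
process_order({"order_id": "o1", "user_id": "user_1"}, main_out.append, slow_out.append)
process_order({"order_id": "o1", "user_id": "user_1"}, main_out.append, slow_out.append)
process_order({"order_id": "o2", "user_id": "user_9"}, main_out.append, slow_out.append)
# main_out holds one enriched o1 record; the unresolved o2 lands in slow_out
```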

State management uses Flink's FsStateBackend with ValueState and a 24‑hour TTL, consuming roughly 2 GB of state at peak. Checkpointing runs every 10 seconds in exactly‑once mode, and Kafka's transactional exactly‑once producer ensures end‑to‑end consistency. Flink metrics expose HBase access latency, cache sizes, and consumer lag for monitoring.
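A minimal analogue of the 24‑hour TTL state, with a pluggable clock so expiry can be demonstrated without waiting a day. Flink configures this declaratively (via its state TTL support); this class is only an illustration of the behavior:

```python
import time

TTL_SECONDS = 24 * 3600

class TtlValueState:
    """Toy ValueState: entries are treated as absent once older than the TTL."""
    def __init__(self, clock=time.time):
        self._clock = clock
        self._store = {}                      # key -> (value, write_timestamp)

    def update(self, key, value):
        self._store[key] = (value, self._clock())

    def value(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        val, ts = entry
        if self._clock() - ts > TTL_SECONDS:
            del self._store[key]              # expired: behave as if never written
            return None
        return val

# Usage with a fake clock to show expiry:
now = [0.0]
state = TtlValueState(clock=lambda: now[0])
state.update("order_1", True)
assert state.value("order_1") is True
now[0] += 25 * 3600                           # advance past the 24-hour TTL
assert state.value("order_1") is None
```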

For logistics analysis, Flink integrates with Hive using retract streams and interval joins, maintaining seven‑day state in RocksDB and enriching with HBase‑derived user dimensions. Results are exported to HDFS and queried via Presto. The team plans to adopt Apache Hudi to reduce latency.
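The interval-join behavior can be approximated in plain Python. The real job keeps seven days of keyed state in RocksDB and joins within Flink; the tuple shapes and field names below are assumptions for illustration:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def interval_join(orders, shipments):
    """Join each shipment to orders created within the preceding seven days.

    orders/shipments: lists of (key, timestamp, payload) tuples.
    """
    by_key = {}
    for key, ts, payload in orders:
        by_key.setdefault(key, []).append((ts, payload))
    joined = []
    for key, ship_ts, ship in shipments:
        for order_ts, order in by_key.get(key, []):
            if order_ts <= ship_ts <= order_ts + WINDOW:
                joined.append((key, order, ship))
    return joined

t0 = datetime(2020, 1, 1)
orders = [("o1", t0, {"amount": 10})]
ships = [("o1", t0 + timedelta(days=3), {"status": "delivered"}),
         ("o1", t0 + timedelta(days=9), {"status": "returned"})]
# Only the shipment within seven days of the order joins.
result = interval_join(orders, ships)
```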

The existing Streaming SQL platform currently relies on Spark SQL, which lacks flexible window functions, robust state handling, async lookups, and two‑phase commit support. Consequently, the roadmap includes adding Flink SQL support to overcome these limitations and enable more complex real‑time analytics.

Job management is handled through a web UI that supports environment configuration (multiple Flink/Spark versions), task lifecycle operations (restart, stop, disable), checkpoint/savepoint recovery, resource tuning (memory, CPU), JAR dependency management, and HOCON‑based application configuration. Monitoring integrates Flink REST API status checks, automatic failure alerts via email, and future integration with Grafana/Prometheus.
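A hypothetical HOCON application config along these lines (every key here is illustrative, not the team's actual schema):

```hocon
job {
  name = "order-realtime-etl"
  engine = "flink"
  engine-version = "1.9"
  parallelism = 8
  checkpoint {
    interval = 10s
    mode = exactly-once
  }
  resources {
    taskmanager-memory = 4g
    cpu-cores = 2
  }
}
```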

Future plans focus on expanding Flink SQL adoption for unified batch‑stream processing, migrating existing Spark Structured Streaming jobs to Flink, and enhancing the Streaming SQL platform with Flink capabilities to improve performance and reduce development overhead.

Tags: real-time, big data, Flink, SQL, Streaming, Data Warehouse, Shopee
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
