How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi
This article explains how an e‑commerce company designed and implemented a modern data warehouse that combines batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.
1. Overview
The data warehouse, as defined by W.H. Inmon in 1990, stores large volumes of historical data from OLTP systems for analysis. It supports OLAP and data mining, underpins decision support systems (DSS) and executive information systems (EIS), and lets decision makers extract valuable information quickly.
2. Basic Data Warehouse Construction Goals
The e‑commerce platform generates terabytes of behavior logs and transaction data daily, and traditional OLTP engines cannot meet the resulting range of analytical needs. The basic goal is to integrate the various data sources, develop unified data models, and ensure data quality, following the “four unifications” standard: unified cleaning, unified naming, unified metrics, and unified specifications.
3. Industry Data Warehouse Evolution
3.1 Theory Development
(1) Early stage – 1970s MIT research on separating transaction and analysis layers.
(2) Exploration – 1980s DEC’s TA2 architecture defining data acquisition, access, catalog, and user services.
(3) Prototype – 1988 IBM’s Information Warehouse (VITAL) with 85 components.
(4) Consolidation – 1991 Bill Inmon’s book “Building the Data Warehouse” establishing the subject‑oriented, integrated, non‑volatile, time‑variant definition.
3.2 Architecture Evolution
(1) Traditional data warehouse architecture.
(2) Lambda architecture – three layers: Batch Layer (offline processing), Speed Layer (real‑time incremental processing), Serving Layer (merge results).
(3) Kappa architecture – simplified Lambda without batch layer, but with lower throughput for historical reprocessing.
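As a minimal sketch of how a Lambda serving layer can merge results, the example below combines a batch view with a speed view by summing counts per key. The function name `merge_views`, the SKU keys, and the counts are hypothetical and chosen only for illustration; a real serving layer would query stored views rather than in-memory dictionaries.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Serving layer: combine the batch view with the real-time increment."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts key by key
    return dict(merged)


# Hypothetical order counts per product.
batch_view = {"sku_1": 120, "sku_2": 45}  # computed offline up to the last batch cutoff
speed_view = {"sku_2": 3, "sku_3": 1}     # events that arrived after the cutoff

result = merge_views(batch_view, speed_view)
print(result)
```

The sketch assumes the two views cover disjoint time ranges, so summing them does not double count events.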
3.3 Common Internet Data Warehouse Architectures
Examples include Meituan’s offline and real‑time layers and Youzan’s Lambda‑based architecture using Hive/SparkSQL for batch and Flink for real‑time, both storing raw data in HDFS and serving layers in Hudi.
4. Xingsheng Youxuan Data Warehouse Architecture & Practice
4.1 Data Lake‑Based Lambda Architecture
Because business models evolve rapidly and Kappa handles historical reprocessing poorly, a pure Kappa architecture is unsuitable. The platform instead adopts a Lambda architecture: Spark batch jobs store data in Hudi, which supports upserts and low‑cost storage, and a Flink real‑time layer writes to Kafka and Hudi. Real‑time latency is on the order of seconds, and batch latency is about 5 minutes.
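To make the upsert path more concrete, here is a hedged sketch of the option map a Spark job might pass to the Hudi data source. The option keys are standard Hudi write configs; the table name `dwd_order_detail` and the fields `order_id`, `updated_at`, and `dt` are assumptions for illustration and do not come from the platform's actual schema.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build the option map for a Hudi upsert write from Spark."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two incoming rows share a key, the larger value of this field wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Hypothetical table and column names.
options = hudi_upsert_options("dwd_order_detail", "order_id", "updated_at", "dt")

# With PySpark and the Hudi bundle available, the write would look roughly like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

The commented write call is left unexecuted because it needs a Spark session and a target path, neither of which the article specifies.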
4.2 Model Architecture
The data warehouse follows the typical layered model: ODS (source layer), DWD (detail layer), DIM (dimension layer), DWS (summary layer), and ADS (application layer).
4.3 Data Integration
Data sources include relational business tables, behavior logs, and processed result sets. Relational data is captured from the binlog via Canal and Spark Streaming, then upserted into Hudi. Log data is ingested with the Flink filesystem connector, using file rolling and compaction strategies. Processed result sets are synchronized through a scheduling platform.
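The upsert semantics of the relational path can be illustrated with a small in-memory sketch. It is not the platform's pipeline: the event shape (`op`, `row`), the `id` key, and the `ts` ordering field are assumptions made for the example, and a dictionary stands in for the Hudi table.

```python
def apply_changes(table, events, key="id", ordering="ts"):
    """Apply change events to a keyed table, keeping the newest version per key."""
    for event in events:
        row = event["row"]
        k = row[key]
        current = table.get(k)
        # Skip events older than what is already stored (late or replayed data).
        if current is not None and current[ordering] > row[ordering]:
            continue
        if event["op"] == "DELETE":
            table.pop(k, None)
        else:  # INSERT and UPDATE are both treated as upserts
            table[k] = row
    return table


# Hypothetical binlog-style change events.
events = [
    {"op": "INSERT", "row": {"id": 1, "status": "created", "ts": 100}},
    {"op": "INSERT", "row": {"id": 2, "status": "created", "ts": 101}},
    {"op": "UPDATE", "row": {"id": 1, "status": "paid", "ts": 105}},
    {"op": "UPDATE", "row": {"id": 1, "status": "created", "ts": 100}},  # replayed duplicate
    {"op": "DELETE", "row": {"id": 2, "status": "created", "ts": 110}},
]
orders = apply_changes({}, events)
print(orders)
```

Treating inserts and updates identically is what makes replays harmless: applying the same event twice leaves the table unchanged.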
4.4 Data Standards
Data standards cover business terminology, naming conventions, data types, encoding rules, and ensure consistency, completeness, conformity, timeliness, uniqueness, and accuracy across the warehouse.
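Naming conventions are the easiest of these standards to check automatically. The sketch below assumes a hypothetical convention of `<layer>_<domain>_<entity>` in lower snake case, with the layer prefix drawn from the ODS/DWD/DIM/DWS/ADS model; the article does not state the platform's actual naming rule, so the pattern and the table names are illustrative only.

```python
import re

LAYERS = ("ods", "dwd", "dim", "dws", "ads")
# Assumed convention: layer prefix, then lower snake case segments.
TABLE_NAME = re.compile(r"^(%s)_[a-z][a-z0-9]*(_[a-z0-9]+)*$" % "|".join(LAYERS))


def check_table_name(name):
    """Return True if the table name follows the assumed convention."""
    return TABLE_NAME.match(name) is not None


# Hypothetical table names.
names = ["dwd_trade_order_detail", "DWS_trade_summary", "tmp_orders", "ads_sales_daily"]
invalid = [n for n in names if not check_table_name(n)]
print(invalid)
```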
4.5 Data Quality
Quality checks include integrity, conformity, consistency, accuracy, uniqueness, and timeliness. Rules are executed synchronously after ETL jobs or asynchronously via scheduled triggers, with alerts for failures. Core business tables have 100% rule coverage.
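A rough sketch of how such rules might run synchronously after an ETL job is shown below. Only three of the six dimensions are covered, and the rule names, sample rows, and 5-minute freshness threshold are assumptions for illustration rather than the platform's actual rule set; printing stands in for the alerting channel.

```python
def check_completeness(rows, column):
    """Every row has a non-null value in the column."""
    return all(r.get(column) is not None for r in rows)


def check_uniqueness(rows, column):
    """No value in the column appears more than once."""
    values = [r.get(column) for r in rows]
    return len(values) == len(set(values))


def check_timeliness(rows, column, now, max_lag_seconds):
    """The newest record is no older than max_lag_seconds."""
    return bool(rows) and now - max(r[column] for r in rows) <= max_lag_seconds


def run_rules(rows, rules):
    """Run each (name, check) pair and return the names of failed rules."""
    return [name for name, check in rules if not check(rows)]


# Hypothetical output of an ETL job.
rows = [
    {"order_id": 1, "amount": 30.0, "ts": 1000},
    {"order_id": 2, "amount": None, "ts": 1060},
    {"order_id": 2, "amount": 12.5, "ts": 1120},
]
rules = [
    ("amount_not_null", lambda rs: check_completeness(rs, "amount")),
    ("order_id_unique", lambda rs: check_uniqueness(rs, "order_id")),
    ("fresh_within_5_min", lambda rs: check_timeliness(rs, "ts", now=1200, max_lag_seconds=300)),
]
failed = run_rules(rows, rules)
for name in failed:
    print("ALERT: rule failed:", name)
```

Keeping each rule as a named callable means the same list can be run right after a job or from a scheduled trigger without changing the checks themselves.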
4.6 Data Services
After construction, the warehouse provides data services such as ad‑hoc queries, data synchronization, APIs, and reports via DataStudio and Lingxi BI, supporting various consumption patterns.
5. Summary
The Xingsheng Youxuan data warehouse adopts industry‑standard layered modeling while customizing a Flink + Spark + Hudi architecture to meet e‑commerce real‑time and batch requirements. It delivers real‑time visibility within seconds and batch latency of 5 to 10 minutes. Future enhancements will focus on usability, stability, and accuracy.
Xingsheng Youxuan Technology Community
Xingsheng Youxuan Technology Official Account
