How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Xingsheng Youxuan Technology Community

Oct 14, 2022

1. Overview

The data warehouse, as defined by W.H. Inmon in 1990, stores massive volumes of historical data from OLTP systems for analysis with OLAP and data mining. It supports decision support systems (DSS) and executive information systems (EIS), so that decision makers can extract valuable information quickly.

2. Basic Data Warehouse Construction Goals

The e‑commerce platform generates terabytes of behavior logs and transaction data daily. Traditional OLTP engines cannot meet diverse analytical needs. The basic goal is to integrate various data sources, develop unified data models, and ensure data quality, following the “four‑one” standards: unified cleaning, naming, metrics, and specifications.

3. Industry Data Warehouse Evolution

3.1 Theory Development

(1) Early stage – 1970s MIT research on separating transaction and analysis layers.

(2) Exploration – 1980s DEC’s TA2 architecture defining data acquisition, access, catalog, and user services.

(3) Prototype – 1988 IBM’s Information Warehouse (VITAL) with 85 components.

(4) Consolidation – 1991 Bill Inmon’s book “Building the Data Warehouse” establishing the subject‑oriented, integrated, non‑volatile, time‑variant definition.

3.2 Architecture Evolution

(1) Traditional data warehouse architecture.

(2) Lambda architecture – three layers: Batch Layer (offline processing), Speed Layer (real‑time incremental processing), Serving Layer (merge results).

(3) Kappa architecture – simplified Lambda without batch layer, but with lower throughput for historical reprocessing.
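
To make the Lambda serving layer concrete, here is a minimal sketch of merging a batch view with a real-time increment. The function name `serve`, the product keys, and the counts are hypothetical and chosen only for illustration. Real serving layers usually also handle late data and deduplication, which this sketch omits.

```python
from collections import Counter


def serve(batch_view, speed_view):
    """Merge a precomputed batch view with the real-time increment."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts key by key
    return dict(merged)


# Batch layer: order counts per product, complete up to the last batch run.
batch_view = {"sku-1": 120, "sku-2": 45}
# Speed layer: counts for events that arrived after that batch run.
speed_view = {"sku-2": 3, "sku-3": 1}

result = serve(batch_view, speed_view)
print(result)
```

The sketch assumes the two views cover disjoint time ranges, so the counts can simply be added.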

3.3 Common Internet Data Warehouse Architectures

Examples include Meituan’s offline and real‑time layers and Youzan’s Lambda‑based architecture using Hive/SparkSQL for batch and Flink for real‑time, both storing raw data in HDFS and serving layers in Hudi.

4. Xingsheng Youxuan Data Warehouse Architecture & Practice

4.1 Data Lake‑Based Lambda Architecture

Because the business model changes rapidly, a pure Kappa architecture is unsuitable. The platform instead adopts a Lambda architecture. Spark batch jobs store data in Hudi, which supports upserts and low‑cost storage, and a Flink real‑time layer writes to Kafka and Hudi. Real‑time data is available with a latency of seconds, and batch data with a latency of about 5 minutes.
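
As a rough illustration of what a Hudi upsert from a Spark batch job involves, the sketch below builds the writer options for Hudi's Spark datasource. The option keys are standard Hudi write configs. The table name `dwd_order_detail` and the column names are assumptions made for this example and do not come from the article.

```python
def hudi_upsert_options(table, record_key, precombine, partition):
    """Build writer options for a Hudi upsert through the Spark datasource."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        # Rows with the same record key are treated as the same record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When keys collide, the row with the larger precombine value wins.
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition,
    }


# Hypothetical table and columns, for illustration only.
opts = hudi_upsert_options("dwd_order_detail", "order_id", "updated_at", "dt")

# With a SparkSession and the Hudi bundle on the classpath, a job could write:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

The write call is left as a comment because it needs a running Spark session. The options dictionary itself is plain Python.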

4.2 Model Architecture

The data warehouse follows the typical layered model: ODS (source layer), DWD (detail layer), DIM (dimension layer), DWS (summary layer), and ADS (application layer).

4.3 Data Integration

Data sources include relational business tables, behavior logs, and processed result sets. Relational data is captured through binlog + Canal + Spark Streaming and upserted into Hudi. Log data is ingested with the Flink filesystem connector, using rolling and compaction strategies to control file sizes. Processed result sets are synchronized through a scheduling platform.
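
The binlog path can be pictured as replaying change events against a keyed table. The sketch below is a simplified model in plain Python, not the platform's actual job. The event shape (a `type` plus a list of row images in `data`) is loosely modeled on Canal's JSON messages and should be treated as an assumption, as should the `id` key and the sample rows.

```python
def apply_change_events(table, events, key="id"):
    """Apply binlog-style change events to a keyed table, in order."""
    for event in events:
        for row in event["data"]:
            if event["type"] == "DELETE":
                table.pop(row[key], None)
            elif event["type"] in ("INSERT", "UPDATE"):
                table[row[key]] = row  # upsert: insert or overwrite by key
    return table


# Hypothetical change stream for an orders table.
events = [
    {"type": "INSERT", "data": [{"id": 1, "status": "created"}]},
    {"type": "INSERT", "data": [{"id": 2, "status": "created"}]},
    {"type": "UPDATE", "data": [{"id": 1, "status": "paid"}]},
    {"type": "DELETE", "data": [{"id": 2, "status": "created"}]},
]

orders = apply_change_events({}, events)
print(orders)
```

Because inserts and updates both overwrite by key, replaying the same event twice leaves the table unchanged, which is one reason upsert‑capable storage suits change data capture.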

4.4 Data Standards

Data standards cover business terminology, naming conventions, data types, and encoding rules. They are intended to ensure consistency, completeness, conformity, timeliness, uniqueness, and accuracy across the warehouse.
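
A naming convention is easiest to enforce when it can be checked automatically. The sketch below assumes a hypothetical rule in which every table name starts with its layer prefix followed by lowercase words separated by underscores. The article does not state the actual convention, so the pattern is illustrative only.

```python
import re

LAYERS = ("ods", "dwd", "dim", "dws", "ads")

# Hypothetical rule: <layer>_<lowercase words separated by underscores>
TABLE_NAME = re.compile(
    r"^(%s)_[a-z][a-z0-9]*(_[a-z0-9]+)*$" % "|".join(LAYERS)
)


def check_table_name(name):
    """Return True if the table name follows the assumed convention."""
    return TABLE_NAME.match(name) is not None


for name in ("dwd_order_detail", "ads_sales_7d", "OrderDetail", "tmp_orders"):
    print(name, check_table_name(name))
```

A check like this could run when a table is created, so that non‑conforming names are rejected before anything depends on them.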

4.5 Data Quality

Quality checks include integrity, conformity, consistency, accuracy, uniqueness, and timeliness. Rules are executed synchronously after ETL jobs or asynchronously via scheduled triggers, with alerts for failures. Core business tables have 100% rule coverage.
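
The rule‑and‑alert flow can be sketched as a list of named checks run after a job finishes. The two checks below cover completeness and uniqueness only. The function names, the sample rows, and the use of a list as the alert sink are assumptions for this example, not the platform's implementation.

```python
def check_completeness(rows, column):
    """Every row has a non-null value in the column."""
    return all(row.get(column) is not None for row in rows)


def check_uniqueness(rows, column):
    """No value in the column appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def run_rules(rows, rules, alert):
    """Run each named rule and call alert(name) for every failure."""
    failed = []
    for name, rule in rules:
        if not rule(rows):
            failed.append(name)
            alert(name)
    return failed


# Hypothetical output of an ETL job: one null amount, one duplicated key.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]
rules = [
    ("amount_not_null", lambda r: check_completeness(r, "amount")),
    ("order_id_unique", lambda r: check_uniqueness(r, "order_id")),
]

alerts = []
failed = run_rules(rows, rules, alerts.append)
print(failed)
```

Calling `run_rules` right after the ETL step corresponds to the synchronous mode described above. The same function could equally be invoked from a scheduled trigger for the asynchronous mode.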

4.6 Data Services

After construction, the warehouse provides data services such as ad‑hoc queries, data synchronization, APIs, and reports via DataStudio and Lingxi BI, supporting various consumption patterns.

5. Summary

The Xingsheng Youxuan data warehouse adopts industry‑standard layered modeling while customizing a Flink + Spark + Hudi architecture to meet e‑commerce real‑time and batch requirements. Real‑time data becomes visible within seconds, and batch latency is 5‑10 minutes. Future enhancements will focus on usability, stability, and accuracy.
