The Evolution of Youzan’s Data Warehouse in a Big Data Environment
The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.
This article describes the development and standardization of Youzan's data warehouse (DW) within a big‑data ecosystem, outlining its definition, evolution, architecture, and operational practices.
It begins by defining a data warehouse as a core Business Intelligence system originally designed for management‑level decision support, and argues that in the era of Hadoop and Hive, a modern DW must also satisfy diverse business analytics needs.
The evolution is divided into three stages:
1. Chaos period – In the early days there was no layered architecture and no naming conventions: all tables lived in a single Hive database (st), causing table name conflicts, and there were no ETL tools, workflow concepts, scheduling platforms, data dictionaries, or lineage tracking. All processing was done with Python scripts and embedded SQL.
2. Construction period – Starting in 2016, Youzan introduced Airflow‑based scheduling and began formalizing the DW. The architecture was split into three layers: ODS (staging area), DW (presentation layer) and Data Mart (subject‑domain layer). The article details the responsibilities of each layer, naming rules (snake_case with domain prefixes), and the handling of table conflicts and business‑driven requirements.
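The naming rules described above (snake_case table names carrying a layer/domain prefix) can be sketched as a small validator. The prefix set and pattern below are assumptions for illustration; the source does not spell out Youzan's exact rules.

```python
import re

# Hypothetical layer prefixes modeled on the ODS/DW/Data Mart split;
# the real prefix list and separator rules are not given in the source.
LAYER_PREFIXES = ("ods", "dw", "dm")

TABLE_NAME_RE = re.compile(
    r"^(?:%s)_[a-z][a-z0-9]*(?:_[a-z0-9]+)*$" % "|".join(LAYER_PREFIXES)
)

def is_valid_table_name(name: str) -> bool:
    """Check snake_case naming with a required layer prefix, e.g. 'ods_trade_order'."""
    return TABLE_NAME_RE.fullmatch(name) is not None
```

A check like this, run in CI or at table-creation time, is one way such conventions stop being documentation and start being enforced; for example, `is_valid_table_name("ods_trade_order")` passes while a legacy `st`-style name does not.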
3. Maturity period – After stabilizing the DW, the focus shifted to efficiency and usability. Youzan refined the layer definitions (ODS, DWS, DWA, DIM, TEMP), introduced task priority levels (P1‑P5), and emphasized security (library, table, and field‑level permissions). It also discusses the migration from MapReduce to SparkSQL, the adoption of dimensional modeling (star schema) versus wide tables, and the introduction of a metadata system and indicator library to avoid duplicate calculations.
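The star-schema-versus-wide-table trade-off mentioned above can be illustrated with a toy example (all table and column names here are invented, not from the article): a fact table keyed to a dimension table and joined at query time, whereas a wide table would bake the dimension attributes in at build time.

```python
import sqlite3

# In-memory database standing in for the warehouse; SQLite is used only
# so the sketch is runnable, not because Youzan uses it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per product.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# Fact table: one row per order line, referencing the dimension by key.
cur.execute("CREATE TABLE fact_order (order_id INTEGER, product_id INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "electronics"), (2, "books")])
cur.executemany("INSERT INTO fact_order VALUES (?, ?, ?)",
                [(100, 1, 59.0), (101, 1, 41.0), (102, 2, 15.0)])

# Star-schema query: join fact to dimension at query time.
cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_order f JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""")
rows = cur.fetchall()  # -> [('books', 15.0), ('electronics', 100.0)]
```

The wide-table alternative would copy `category` into every `fact_order` row: queries get simpler and faster, but storage grows and any change to a dimension attribute forces a rebuild, which is the trade-off the article weighs against dimensional modeling.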
The article further covers naming conventions for tables, fields, tasks, and workflows, the design of permission models, and the evolution of the compute engine. It highlights the importance of standardization to improve stability, reduce redundant computation, and increase data value.
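A library/table/field-level permission model of the kind mentioned above can be sketched as nested scopes, where a grant at a coarser scope implies access to everything beneath it. The data structures and names below are hypothetical illustrations, not Youzan's implementation.

```python
# Hypothetical three-level grants: (database, table, field); None means "all".
grants = {
    "alice": {("dw", None, None)},             # whole dw database
    "bob":   {("dw", "dws_orders", None)},     # one table
    "carol": {("dw", "dws_orders", "amount")}, # one field
}

def can_read(user: str, db: str, table: str, field: str) -> bool:
    """A coarser grant (database or table) implies access to the finer scopes."""
    user_grants = grants.get(user, set())
    return (
        (db, None, None) in user_grants
        or (db, table, None) in user_grants
        or (db, table, field) in user_grants
    )
```

Under this sketch, `can_read("carol", "dw", "dws_orders", "amount")` holds while the same user is denied other fields of the table, which is the behavior field-level permissions are meant to provide.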
Finally, the author notes ongoing challenges such as real‑time DW capabilities, data governance, and the quantitative assessment of DW work, and invites readers to join the Youzan data platform team.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.