The Evolution of Youzan’s Data Warehouse in a Big Data Environment
The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.
This article describes the development and standardization of Youzan's data warehouse (DW) within a big‑data ecosystem, outlining its definition, evolution, architecture, and operational practices.
It begins by defining a data warehouse as a core Business Intelligence system originally designed for management‑level decision support, and argues that in the era of Hadoop and Hive, a modern DW must also satisfy diverse business analytics needs.
The evolution is divided into three stages:
1. Chaos period – In the early days there was no layered architecture and no naming conventions: all tables lived in a single Hive database (st), causing table name conflicts, and there were no ETL tools, workflow concepts, scheduling platforms, data dictionaries, or lineage tracking. All processing was done with Python scripts and embedded SQL.
2. Construction period – Starting in 2016, Youzan introduced Airflow‑based scheduling and began formalizing the DW. The architecture was split into three layers: ODS (staging area), DW (presentation layer) and Data Mart (subject‑domain layer). The article details the responsibilities of each layer, naming rules (snake_case with domain prefixes), and the handling of table conflicts and business‑driven requirements.
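The naming rules described above (snake_case table names carrying a layer/domain prefix) can be sketched as a small validator. The prefix set and pattern below are assumptions for illustration; the source does not spell out Youzan's exact rules.

```python
import re

# Hypothetical layer prefixes modeled on the ODS/DW/Data Mart split;
# the real prefix list and separator rules are not given in the source.
LAYER_PREFIXES = ("ods", "dw", "dm")

TABLE_NAME_RE = re.compile(
    r"^(?:%s)_[a-z][a-z0-9]*(?:_[a-z0-9]+)*$" % "|".join(LAYER_PREFIXES)
)

def is_valid_table_name(name: str) -> bool:
    """Check snake_case naming with a required layer prefix, e.g. 'ods_trade_order'."""
    return TABLE_NAME_RE.fullmatch(name) is not None
```

A check like this, run in CI or at table-creation time, is one way such conventions stop being documentation and start being enforced; for example, `is_valid_table_name("ods_trade_order")` passes while a legacy `st`-style name does not.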
3. Maturity period – After stabilizing the DW, the focus shifted to efficiency and usability. Youzan refined the layer definitions (ODS, DWS, DWA, DIM, TEMP), introduced task priority levels (P1‑P5), and emphasized security (library, table, and field‑level permissions). It also discusses the migration from MapReduce to SparkSQL, the adoption of dimensional modeling (star schema) versus wide tables, and the introduction of a metadata system and indicator library to avoid duplicate calculations.
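The star-schema-versus-wide-table trade-off mentioned above can be illustrated with a toy example (all table and column names here are invented, not from the article): a fact table keyed to a dimension table and joined at query time, whereas a wide table would bake the dimension attributes in at build time.

```python
import sqlite3

# In-memory database standing in for the warehouse; SQLite is used only
# so the sketch is runnable, not because Youzan uses it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per product.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# Fact table: one row per order line, referencing the dimension by key.
cur.execute("CREATE TABLE fact_order (order_id INTEGER, product_id INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "electronics"), (2, "books")])
cur.executemany("INSERT INTO fact_order VALUES (?, ?, ?)",
                [(100, 1, 59.0), (101, 1, 41.0), (102, 2, 15.0)])

# Star-schema query: join fact to dimension at query time.
cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_order f JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""")
rows = cur.fetchall()  # -> [('books', 15.0), ('electronics', 100.0)]
```

The wide-table alternative would copy `category` into every `fact_order` row: queries get simpler and faster, but storage grows and any change to a dimension attribute forces a rebuild, which is the trade-off the article weighs against dimensional modeling.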
The article further covers naming conventions for tables, fields, tasks, and workflows, the design of permission models, and the evolution of the compute engine. It highlights the importance of standardization to improve stability, reduce redundant computation, and increase data value.
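A library/table/field-level permission model of the kind mentioned above can be sketched as nested scopes, where a grant at a coarser scope implies access to everything beneath it. The data structures and names below are hypothetical illustrations, not Youzan's implementation.

```python
# Hypothetical three-level grants: (database, table, field); None means "all".
grants = {
    "alice": {("dw", None, None)},             # whole dw database
    "bob":   {("dw", "dws_orders", None)},     # one table
    "carol": {("dw", "dws_orders", "amount")}, # one field
}

def can_read(user: str, db: str, table: str, field: str) -> bool:
    """A coarser grant (database or table) implies access to the finer scopes."""
    user_grants = grants.get(user, set())
    return (
        (db, None, None) in user_grants
        or (db, table, None) in user_grants
        or (db, table, field) in user_grants
    )
```

Under this sketch, `can_read("carol", "dw", "dws_orders", "amount")` holds while the same user is denied other fields of the table, which is the behavior field-level permissions are meant to provide.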
Finally, the author notes ongoing challenges such as real‑time DW capabilities, data governance, and the quantitative assessment of DW work, and invites readers to join the Youzan data platform team.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.