Evolution and Optimization of Meituan Waimai Offline Data Warehouse: Architecture, ETL, Modeling, Governance, and Future Plans
This article details the historical development, architectural layers, ETL migration to Spark, data modeling standards, governance processes, resource optimization, security measures, and future roadmap of Meituan Waimai's offline data warehouse, illustrating how the team addressed scalability and efficiency challenges.
Meituan Waimai's data warehouse collects user, merchant, and operational data from various terminals, processes it uniformly, and supports reporting, analysis, and downstream applications.
Architecture Overview : The warehouse is divided into four layers: Data Source, Data Processing, Data Service, and Data Application. The Data Source layer ingests raw logs, business databases, corporate data, and third‑party data. The Data Processing layer uses Spark and Hive for offline workloads and Storm/Flink for real‑time streams, producing multiple data marts (headquarters, traffic, city, advertising, algorithm).
ETL on Spark : Since 2017, most Hive jobs have been migrated to Spark, saving over 20% of resources. Spark offers richer operators, in‑memory iteration, and resource reuse, all of which improve efficiency.
Data Warehouse Versions : V1.0 (pre‑2016) featured a four‑layer ODS/Detail/Aggregate/Theme structure but suffered from low development efficiency, inconsistent metrics, and high resource costs as the team grew. V2.0 introduced clearer layering (ODS, IDL, CDL, MDL, ADL), standardized processes, and split responsibilities between Data Application and Data Modeling groups to reduce duplication.
Modeling Standards : The warehouse adopts a multi‑level model: ODS (raw), IDL (integration), CDL (components), MDL (data marts), and ADL (applications). Modeling emphasizes identifying analysis objects, defining boundaries, enriching attributes, and creating reusable components for downstream use.
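One way to keep such a layered model honest is to check that no table reads from a layer above its own. The sketch below is only an illustration: the layer order comes from the article, but the `<layer>_` table-name prefix, the rule that a table may read only from its own or a lower layer, and every table name in `lineage` are assumptions made for the example, not Meituan's documented conventions.

```python
# Layer order taken from the article: ODS -> IDL -> CDL -> MDL -> ADL.
LAYERS = ["ods", "idl", "cdl", "mdl", "adl"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_of(table: str) -> str:
    """Return the layer encoded in a table name such as 'idl_order'.

    The '<layer>_' prefix convention is an assumption of this sketch.
    """
    prefix = table.split("_", 1)[0].lower()
    if prefix not in RANK:
        raise ValueError(f"unknown layer prefix in table name: {table!r}")
    return prefix


def find_violations(dependencies: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (table, upstream) pairs where a table reads from a higher layer."""
    violations = []
    for table, upstreams in dependencies.items():
        for upstream in upstreams:
            if RANK[layer_of(upstream)] > RANK[layer_of(table)]:
                violations.append((table, upstream))
    return violations


# Hypothetical lineage: each key reads from the tables in its list.
lineage = {
    "idl_order": ["ods_order_log"],
    "cdl_user_order_stats": ["idl_order"],
    "mdl_city_sales": ["cdl_user_order_stats", "idl_order"],
    "idl_merchant": ["mdl_city_sales"],  # reads "upwards", so it is flagged
}

print(find_violations(lineage))  # [('idl_merchant', 'mdl_city_sales')]
```

A check of this kind could run when a job is registered, so that a dependency pointing the wrong way is caught before it reaches production.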
Data Governance : A governance platform enforces data standardization, systematic implementation, and system integration. It includes data production tools, corporate infrastructure, metadata management, and service layers, supporting reporting, self‑service analytics, API services, and security compliance.
Resource Optimization : Resources are allocated per tenant (e.g., warehouse, advertising, algorithm). Optimization targets three areas: traffic (deleting unused ODS data), storage (ORC compression, lifecycle management), and compute (removing idle tasks, consolidating common pipelines).
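Lifecycle management and the cleanup of unused ODS data both reduce to comparing a date against a cutoff. The following is a minimal sketch under stated assumptions: the retention and idle thresholds, the table names, and the idea that a last-read date is available per table are all hypothetical, since the article does not describe how these signals are collected.

```python
from datetime import date, timedelta


def expired_partitions(partitions: list[date], retention_days: int, today: date) -> list[date]:
    """Return partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


def unused_tables(last_read: dict[str, date], idle_days: int, today: date) -> list[str]:
    """Return tables that nobody has read for more than `idle_days` days."""
    cutoff = today - timedelta(days=idle_days)
    return sorted(name for name, read_on in last_read.items() if read_on < cutoff)


today = date(2024, 1, 31)

# Hypothetical daily partitions of one table, with a 7-day retention policy.
partitions = [date(2024, 1, 1), date(2024, 1, 20), date(2024, 1, 30)]
print(expired_partitions(partitions, retention_days=7, today=today))
# [datetime.date(2024, 1, 1), datetime.date(2024, 1, 20)]

# Hypothetical last-read dates per ODS table, with a 90-day idle threshold.
last_read = {
    "ods_click_log": date(2023, 6, 1),
    "ods_order_log": date(2024, 1, 30),
}
print(unused_tables(last_read, idle_days=90, today=today))
# ['ods_click_log']
```

Both functions only report candidates; actually dropping data is deliberately left to a separate, reviewed step.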
Security : Data is classified (C1‑C4), with pre‑, during‑, and post‑processing controls such as masking, permission checks, SQL interception, and audit reporting.
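A classification scheme like this is typically enforced at read time by comparing the field's level with the caller's clearance and masking the value when the clearance is too low. The sketch below assumes that C4 is the most sensitive level and that masking keeps the first and last two characters; neither detail is stated in the article, and the function names are invented for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Classification levels named in the article; the ordering is assumed."""
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4  # assumed to be the most sensitive level


def mask(value: str, keep: int = 2) -> str:
    """Keep the first and last `keep` characters and star out the rest."""
    if len(value) <= 2 * keep:
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - 2 * keep) + value[-keep:]


def read_field(value: str, field_level: Level, user_clearance: Level) -> str:
    """Return the raw value if clearance suffices, otherwise a masked copy."""
    if user_clearance >= field_level:
        return value
    return mask(value)


phone = "13812345678"  # made-up sample value
print(read_field(phone, Level.C3, Level.C2))  # 13*******78
print(read_field(phone, Level.C3, Level.C4))  # 13812345678
```

In a real system the same comparison would also feed the audit trail, so that each masked or unmasked read can be reported afterwards.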
Future Plans : The roadmap focuses on expanding data coverage, improving efficiency via modeling tools, enhancing stability and quality, supporting business decisions, enabling data monetization, and advancing algorithmic feature delivery, all under a unified governance framework.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
