Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap
The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.
Introduction: Meituan Waimai collects user, merchant, and operational data from multiple terminals, processes it uniformly, and provides data services for reporting, analysis, and downstream applications.
Business Role of the Data Team: The data team supplies business data to the user and merchant sides, provides front‑end display data, delivers features to the advertising and algorithm teams, and supports city operations.
Overall Architecture: The warehouse is divided into four layers: Data Source, Data Processing, Data Service, and Data Application.
Data Source Layer: Ingests raw logs, business databases, corporate data, and third‑party data.
Data Processing Layer: Uses Spark and Hive for offline processing and Storm/Flink for real‑time streams, building data marts for headquarters, traffic, city teams, advertising, and algorithms.
Data Service Layer: Stores data in open‑source components (MySQL, HDFS, HBase, Kylin, Doris, Druid, ES, Tair) and provides query, API, and reporting services.
Data Application Layer: Supports dashboards, self‑service tools, and value‑added products.
ETL on Spark: Starting in 2017, the team migrated most Hive jobs to Spark, achieving resource savings of more than 20%. Spark's advantages include rich operators, in‑memory iteration, and resource reuse. A Spark SQL query passes through the following stages: parse → catalog lookup → logical plan → optimizer → physical plan → cluster execution.
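The stage order listed for Spark SQL can be illustrated with a toy pipeline in plain Python. This is only a sketch of the idea, not Spark's actual implementation: the catalog contents, the function names, and the single optimization rule (dropping an always-true `1 = 1` filter) are all assumptions made for illustration.

```python
import re

# Hypothetical catalog: table name -> known columns.
CATALOG = {"orders": {"order_id", "city_id", "amount"}}


def parse(sql):
    """Parse 'SELECT cols FROM table [WHERE pred]' into an unresolved plan."""
    m = re.fullmatch(
        r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+?))?\s*",
        sql,
        flags=re.IGNORECASE,
    )
    if m is None:
        raise ValueError("unsupported SQL")
    columns = [c.strip() for c in m.group(1).split(",")]
    return {"columns": columns, "table": m.group(2), "where": m.group(3)}


def analyze(plan):
    """Catalog lookup: check that the table and columns exist."""
    if plan["table"] not in CATALOG:
        raise KeyError(f"unknown table: {plan['table']}")
    missing = [c for c in plan["columns"] if c not in CATALOG[plan["table"]]]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return plan


def optimize(plan):
    """Toy optimizer rule: remove a filter that is always true."""
    if plan["where"] is not None and plan["where"].strip() == "1 = 1":
        return {**plan, "where": None}
    return plan


def physical(plan):
    """Turn the logical plan into an ordered list of physical operators."""
    steps = [f"Scan({plan['table']})"]
    if plan["where"]:
        steps.append(f"Filter({plan['where']})")
    steps.append(f"Project({', '.join(plan['columns'])})")
    return steps


def compile_query(sql):
    # parse -> catalog lookup -> optimizer -> physical plan
    return physical(optimize(analyze(parse(sql))))
```

In real Spark, the corresponding plans can be inspected with `DataFrame.explain(True)`, which prints the parsed, analyzed, and optimized logical plans followed by the physical plan.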
Data Warehouse V1.0: The early design (pre‑2016) had ODS, Detail, Aggregate, Theme, and Application layers, but as the team grew it suffered from low development efficiency, inconsistent metrics, and high resource costs.
Data Warehouse V2.0: Introduced a clearer division of labor (Data Application Group vs. Data Modeling Group), refined layer responsibilities (ODS, IDL, CDL, MDL, ADL), standardized modeling, and adopted Kylin/Doris for OLAP. Over time, however, the integration and component layers (IDL, CDL) converged while the mart and application layers (MDL, ADL) kept expanding, which made them harder to manage.
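One way to keep layer responsibilities enforceable is to lint table dependencies against the layer order. The sketch below assumes two things the source does not state: that table names carry their layer as a prefix (for example `idl_order_detail`), and that a table may read only from its own layer or a lower one. The table names are hypothetical.

```python
# Layer order as listed for V2.0, from rawest to most application-specific.
LAYERS = ["ODS", "IDL", "CDL", "MDL", "ADL"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_of(table):
    """Assumed naming convention: '<layer>_<rest>', e.g. 'idl_order_detail'."""
    prefix = table.split("_", 1)[0].upper()
    if prefix not in RANK:
        raise ValueError(f"table {table!r} has no recognized layer prefix")
    return prefix


def find_violations(dependencies):
    """Return (target, source) pairs where a table reads from a higher layer.

    dependencies maps each target table to the list of tables it reads from.
    """
    violations = []
    for target, sources in dependencies.items():
        for source in sources:
            if RANK[layer_of(source)] > RANK[layer_of(target)]:
                violations.append((target, source))
    return violations


deps = {
    "idl_order_detail": ["ods_order"],
    "cdl_order_daily": ["idl_order_detail"],
    "idl_bad_table": ["adl_city_report"],  # reads "upward": flagged
}
print(find_violations(deps))
```

A check like this could run before an ETL job is registered, so that layering mistakes are caught at development time rather than during later governance reviews.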
Data Warehouse V3.0: Manual development was replaced with three modeling tools: a foundational metadata tool that captures business processes and entity relationships, a self‑service query builder, and an application‑level modeling tool that composes components into final data products.
Data Governance
Development Process: Requirement analysis → technical design → data development → report/API development.
Standardization: Established indicator and dimension standards through a data‑standard committee.
System Integration: Built a governance platform comprising data production tools, corporate infrastructure, a metadata layer, and a data service layer.
Downstream Integration: Integrated with reporting systems, data marts, the Dolphin portal, anomaly analysis, CRM, the algorithm platform, data API services, and the corporate metadata platform.
Resource Optimization: Optimized data flow, storage, and compute by decommissioning invalid ODS tables, compressing logs, using ORC compression, applying lifecycle management, and streamlining ETL tasks.
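Lifecycle management usually comes down to a retention rule applied to date partitions. The following is a minimal sketch of that idea, assuming daily partitions named by ISO date and a fixed retention window in days. The function name and the retention value are illustrative, not taken from the source.

```python
from datetime import date, timedelta


def expired_partitions(partitions, retention_days, today):
    """Return partitions (ISO date strings) older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if date.fromisoformat(p) < cutoff)


partitions = ["2024-01-05", "2024-01-07", "2024-01-09"]
# With a 3-day retention on 2024-01-10, the cutoff is 2024-01-07,
# so only partitions strictly before that date are selected.
to_drop = expired_partitions(partitions, retention_days=3, today=date(2024, 1, 10))
print(to_drop)
```

The returned list could then feed whatever cleanup mechanism the warehouse uses, such as dropping the corresponding table partitions.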
Data Security: Implemented data masking, confidentiality levels (C1‑C4), permission controls, sensitive‑SQL interception, and audit reporting.
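Masking and confidentiality levels can work together: a reader whose clearance is below a column's level sees a masked value instead of the raw one. The sketch below assumes that C1 is the least sensitive level and C4 the most sensitive (the source does not specify the direction), and the masking rule that keeps the first three and last four characters is purely illustrative.

```python
# Assumption: levels are ordered from C1 (least sensitive) to C4 (most sensitive).
LEVEL_RANK = {"C1": 1, "C2": 2, "C3": 3, "C4": 4}


def mask(value, keep_head=3, keep_tail=4):
    """Hide the middle of a string; fully mask values too short to split."""
    if len(value) <= keep_head + keep_tail:
        return "*" * len(value)
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + "*" * hidden + value[-keep_tail:]


def read_column(value, column_level, user_clearance):
    """Return the raw value only if the user's clearance covers the column."""
    if LEVEL_RANK[user_clearance] >= LEVEL_RANK[column_level]:
        return value
    return mask(value)


print(read_column("13812345678", column_level="C3", user_clearance="C2"))
print(read_column("13812345678", column_level="C3", user_clearance="C4"))
```

Applying the rule at read time keeps a single stored copy of the data while letting different audiences see different levels of detail.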
Future Planning: Goals include supporting rapid business growth, delivering high‑efficiency, low‑cost data services, expanding data coverage, enhancing decision support, enabling data monetization, and improving algorithm efficiency. Implementation focuses on four areas: comprehensive data collection, higher efficiency through modeling tools, stronger capability (stability and quality), and advanced data management through standardized, systematic, and intelligent governance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.