
How Meituan Waimai Built and Evolved Its Massive Data Warehouse from V1 to V3

This article traces the evolution of Meituan Waimai's data warehouse: the business context, the four-layer platform architecture, Spark-based ETL, the successive V1.0, V2.0, and V3.0 redesigns, data governance practices, resource optimization tactics, security measures, and future plans.

dbaplus Community

Aug 31, 2021


Business Context and Data Needs

Meituan Waimai aggregates user‑terminal logs, merchant data, sales, advertising, and algorithmic features. After unified processing, the data serve theme reports, self‑service analytics, and downstream applications for user, merchant, sales, advertising, and algorithm teams.

Overall Architecture

The data platform is organized into four logical layers:

Data Source Layer: Ingests raw client logs, service logs, business databases, corporate data, and third-party feeds.

Data Processing Layer: Offline pipelines built with Spark and Hive; real-time streams with Storm/Flink. The layer produces data marts for headquarters, traffic analysis, city teams, advertising, and algorithm features.

Data Service Layer: Stores data using open-source components (MySQL, HDFS, HBase, Kylin, Doris, Druid, Elasticsearch, Tair) and exposes query APIs, reporting services, and data-service endpoints.

Data Application Layer: Powers theme reports, self-service tools, value-added products, and analytical applications.

ETL on Spark

Starting in 2017, the offline pipeline was migrated from Hive to Spark, which reduced resource consumption by more than 20%.

Key advantages of Spark:

Rich operator set enables complex business logic.

In‑memory iterative computation accelerates multi‑stage jobs.

Resource reuse across jobs lowers cluster demand.

Typical Spark SQL execution flow:

Client submits SQL → Parser builds AST → Catalog fetches metadata → Logical plan → Optimizer applies rules → Physical plan → Spark executors run the plan
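To make the stages above concrete, here is a minimal Python sketch of the analyze, optimize, and execute steps on a toy plan. It is not Spark's Catalyst optimizer: the `Scan`, `Filter`, and `Project` classes, the `CATALOG` dictionary, and the single filter-pushdown rule are all hypothetical, and the parser step is skipped by building the logical plan by hand.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical in-memory catalog: table name -> rows.
CATALOG: Dict[str, List[Row]] = {
    "orders": [
        {"order_id": 1, "city": "Beijing", "amount": 30},
        {"order_id": 2, "city": "Shanghai", "amount": 45},
        {"order_id": 3, "city": "Beijing", "amount": 12},
    ]
}


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    value: object


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


def analyze(plan):
    """Check that the scanned table exists (stands in for the catalog lookup)."""
    node = plan
    while not isinstance(node, Scan):
        node = node.child
    if node.table not in CATALOG:
        raise KeyError(f"unknown table: {node.table}")
    return plan


def optimize(plan):
    """Toy rule: push a Filter below a Project that keeps the filter column."""
    if isinstance(plan, Scan):
        return plan
    child = optimize(plan.child)
    if isinstance(plan, Filter) and isinstance(child, Project) and plan.column in child.columns:
        pushed = optimize(Filter(child.child, plan.column, plan.value))
        return Project(pushed, child.columns)
    return replace(plan, child=child)


def execute(plan) -> List[Row]:
    """Interpret the plan directly; a real engine would generate distributed tasks."""
    if isinstance(plan, Scan):
        return list(CATALOG[plan.table])
    rows = execute(plan.child)
    if isinstance(plan, Filter):
        return [r for r in rows if r[plan.column] == plan.value]
    return [{c: r[c] for c in plan.columns} for r in rows]  # Project


# Roughly: SELECT city, amount FROM orders, then filtered on city = 'Beijing'.
logical = Filter(Project(Scan("orders"), ("city", "amount")), "city", "Beijing")
optimized = optimize(analyze(logical))
result = execute(optimized)
```

The optimized plan filters before projecting, so fewer rows flow through the projection, while the result is the same as running the unoptimized plan.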

Data Warehouse V1.0

The early design (pre-2016) comprised five layers: ODS, Detail, Aggregate, Theme, and Application. It let a small team respond quickly, but as data volume and team size grew it encouraged "silo" development, which led to duplicated effort, inconsistent definitions, and high resource cost.

Data Warehouse V2.0 – Refactoring

To eliminate silos, V2.0 introduced a clearer layered model and split responsibilities between a Data Application Group (top‑down, application‑oriented) and a Data Modeling Group (bottom‑up, business‑oriented).

Standardized layers:

ODS – source ingestion.

IDL – integration, business‑process abstraction, schema shielding.

CDL – component layer, builds multi‑dimensional detail models and light aggregates.

MDL – data mart layer, creates wide tables and summary tables for analysis.

ADL – application layer, selects query engines based on workload (OLAP, latency, concurrency).
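One way a layered model like this can be kept honest is an automated dependency check. The Python sketch below assumes two things the article does not state: that tables follow a hypothetical `<layer>_<name>` naming convention, and that a table may read only from its own layer or a lower one.

```python
from typing import Dict, List, Tuple

# Layer order taken from the list above, lowest first.
LAYER_ORDER = ["ods", "idl", "cdl", "mdl", "adl"]
RANK = {name: i for i, name in enumerate(LAYER_ORDER)}


def layer_of(table: str) -> str:
    """Derive the layer from a hypothetical '<layer>_<name>' table name."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in RANK:
        raise ValueError(f"table {table!r} has no recognized layer prefix")
    return prefix


def find_violations(deps: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (table, upstream) pairs where a table reads from a higher layer."""
    bad = []
    for table, upstreams in deps.items():
        for upstream in upstreams:
            if RANK[layer_of(upstream)] > RANK[layer_of(table)]:
                bad.append((table, upstream))
    return bad


# Hypothetical lineage: table -> tables it reads from.
deps = {
    "idl_order": ["ods_order_binlog"],
    "cdl_order_detail": ["idl_order", "idl_merchant"],
    "mdl_city_daily": ["cdl_order_detail"],
    "idl_merchant": ["mdl_city_daily"],  # reads upward: flagged
}
violations = find_violations(deps)
```

A check like this could run in code review or in the scheduler before a job is registered; where exactly it fits is a design choice outside what the article describes.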

Data source handling:

Business DBs – binlog sync (full, incremental, snapshot).

Traffic logs – unified SDK, quality monitoring.

Corporate data – secure warehouse with permission workflow.

Third‑party data – standardized cleaning before ingestion.
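The relationship between a full snapshot and incremental binlog sync can be illustrated with a small Python sketch. The event shape `(op, primary_key, row)` and the function name `apply_increment` are assumptions made for illustration; the article does not describe the actual sync format.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Event = Tuple[str, int, Row]  # hypothetical shape: (operation, primary key, row image)


def apply_increment(snapshot: Dict[int, Row], events: Iterable[Event]) -> Dict[int, Row]:
    """Replay ordered change events on top of a snapshot to get the next snapshot."""
    merged = dict(snapshot)  # leave the previous snapshot untouched
    for op, pk, row in events:
        if op in ("insert", "update"):
            merged[pk] = row
        elif op == "delete":
            merged.pop(pk, None)
        else:
            raise ValueError(f"unsupported operation: {op}")
    return merged


yesterday = {
    1: {"merchant": "A", "status": "open"},
    2: {"merchant": "B", "status": "open"},
}
todays_events = [
    ("update", 2, {"merchant": "B", "status": "closed"}),
    ("insert", 3, {"merchant": "C", "status": "open"}),
    ("delete", 1, {}),
]
today = apply_increment(yesterday, todays_events)
```

Because the previous snapshot is copied rather than mutated, each day's state can be kept as a separate snapshot partition while only the change events are transferred.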

Data Warehouse V3.0 – Modeling Automation

V3.0 replaces manual table development with three modeling tools:

Base Modeling Tool: Captures business processes, table relationships, entities, and analysis objects in a metadata hub, and automatically generates reusable data components.

Self-Service Query Tool: Users select the metrics and dimensions they need; the system then builds a logical wide table, matches it to the best-fitting component model, and emits the corresponding SQL. Query patterns feed back into the modeling hub to prioritize new components.

Application Modeling Tool: Consumes data components, performs dimension joins and aggregations, and constructs composite metrics for downstream applications.

These tools improve consistency, reduce development effort, and enable rapid iteration.
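The self-service query flow (pick metrics and dimensions, match a component model, emit SQL) might look roughly like the Python sketch below. The `Component` class, the table names, and the matching rule are hypothetical: here "optimal" is assumed to mean the covering component with the fewest dimensions, and every metric is assumed to be additive so that `SUM` is valid.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Component:
    table: str
    dimensions: FrozenSet[str]
    metrics: FrozenSet[str]  # assumed additive


def pick_component(
    components: Sequence[Component], dims: Sequence[str], metrics: Sequence[str]
) -> Optional[Component]:
    """Choose the covering component with the fewest dimensions (coarsest grain)."""
    candidates = [
        c for c in components
        if set(dims) <= c.dimensions and set(metrics) <= c.metrics
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (len(c.dimensions), c.table))


def build_sql(component: Component, dims: Sequence[str], metrics: Sequence[str]) -> str:
    """Emit an aggregate query over the chosen component."""
    select: List[str] = list(dims) + [f"SUM({m}) AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select)} FROM {component.table}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    return sql


components = [
    Component("cdl_order_city_day", frozenset({"dt", "city"}), frozenset({"order_cnt", "gmv"})),
    Component(
        "cdl_order_city_merchant_day",
        frozenset({"dt", "city", "merchant_id"}),
        frozenset({"order_cnt", "gmv"}),
    ),
]
chosen = pick_component(components, ["dt", "city"], ["gmv"])
sql = build_sql(chosen, ["dt", "city"], ["gmv"])
```

A request that no component covers returns `None`; in the flow described above, such misses are the kind of query pattern that could be fed back to prioritize new components.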

Data Governance

The governance framework consists of three pillars:

Data Standardization: A cross-functional committee defines indicator and dimension standards to ensure unified definitions.

Systematic Standardization: Standards are embedded in a self-built governance platform that includes data production tools, a corporate foundation platform, a metadata layer, and a data-service layer.

System Integration: The platform connects to downstream systems such as reporting, data-market, Dolphin data portal, anomaly analysis, CRM, algorithm platform, persona tagging, API services, and the corporate metadata hub.

Resource Optimization

Resources are allocated per tenant (e.g., warehouse, advertising, algorithm). Optimization rules target three dimensions:

Traffic: Decommission unused ODS tables, compress and serialize logs, enforce lifecycle retention.

Storage: Use ORC compression, manage hot/cold data lifecycles, and tune file formats.

Computation: Shut down idle tasks, consolidate common ETL logic, and share components across teams.

Cost monitoring tracks offline/real‑time compute, storage, ODS ingestion, and log usage; alerts trigger corrective actions via the data‑ops platform.
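As an illustration of the lifecycle rules mentioned above, the following Python sketch classifies date partitions as hot, cold, or expired by age. The 30-day and 365-day thresholds and the function name are placeholders, not values from the article.

```python
from datetime import date
from typing import Dict, List


def classify_partition(
    partition_date: str, today: date, hot_days: int = 30, retention_days: int = 365
) -> str:
    """Classify a 'YYYY-MM-DD' partition by age; thresholds are illustrative only."""
    age = (today - date.fromisoformat(partition_date)).days
    if age <= hot_days:
        return "hot"
    if age <= retention_days:
        return "cold"
    return "expired"


def plan_lifecycle(partitions: List[str], today: date) -> Dict[str, List[str]]:
    """Group partitions by the action a lifecycle job might take on them."""
    plan: Dict[str, List[str]] = {"hot": [], "cold": [], "expired": []}
    for p in partitions:
        plan[classify_partition(p, today)].append(p)
    return plan


today = date(2021, 8, 31)
plan = plan_lifecycle(["2021-08-20", "2021-01-01", "2020-01-01"], today)
```

In such a scheme, hot partitions would stay on fast storage, cold ones could move to cheaper or more heavily compressed storage, and expired ones become candidates for deletion.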

Data Security

Security controls include data masking, confidentiality levels (C1‑C4), permission workflows, and audit trails. Governance is divided into three phases:

Pre-process: Mask sensitive fields and enforce access policies before data enters the warehouse.

In-process: Detect and block risky SQL; require security approval for privileged queries.

Post-process: Audit sensitive SQL, generate monthly reports, and monitor abnormal operations.
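A rough Python sketch of two of these controls, field masking and flagging SQL that touches sensitive columns, is shown below. The column-to-level mapping is hypothetical, and so is the assumption that C4 is the most sensitive level; the article names the levels C1 to C4 without defining them.

```python
import re
from typing import List

# Hypothetical classification of columns into confidentiality levels.
COLUMN_LEVELS = {"order_id": "C1", "city": "C1", "phone": "C3", "id_card": "C4"}
LEVEL_RANK = {"C1": 1, "C2": 2, "C3": 3, "C4": 4}  # assumes C4 is most sensitive


def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 characters; mask the rest."""
    if len(phone) < 8:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def sensitive_columns(sql: str, threshold: str = "C3") -> List[str]:
    """List referenced columns at or above the threshold level (naive token match)."""
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))
    return sorted(
        col for col, level in COLUMN_LEVELS.items()
        if col in tokens and LEVEL_RANK[level] >= LEVEL_RANK[threshold]
    )


def needs_approval(sql: str) -> bool:
    """Flag a query for security approval if it references sensitive columns."""
    return bool(sensitive_columns(sql))


masked = mask_phone("13812345678")
flagged = sensitive_columns("SELECT phone, city FROM orders WHERE id_card IS NOT NULL")
```

The token match is deliberately simple: it would miss `SELECT *` and column aliases, so a production check would more plausibly work from the parsed query plan and table metadata.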

Future Planning

Strategic goals focus on both business growth (order volume, revenue) and technical excellence (high‑efficiency, low‑cost data services). Key directions:

Expand data coverage to include Meituan‑Deal, Dianping, and external sources.

Increase automation through modeling tools to boost development efficiency.

Strengthen capabilities: stability, data quality, timeliness.

Support operational decision‑making, data‑driven product monetization, algorithm feature delivery, and industry influence.

Implementation will leverage shared tooling with the corporate foundation platform, deeper integration of data applications, and continued enhancement of data standards, governance, and intelligent data‑ops.
