Meituan Waimai Traffic Data Collection, Data Warehouse Construction, and Application Practices

This article details the history of Meituan Waimai's traffic data collection, the design and implementation of its large-scale data warehouse (the ODL, IDL, CDL, MDL, ADL, and DIM layers), attribution modeling, data governance, and practical applications for analytics and product development.

DataFunTalk

The presentation traces the evolution of Meituan Waimai's traffic data collection since 2015: simple PV/UV logging in the early days, a comprehensive server-side logging schema in 2016, and adoption of the group-wide client-side logging system in 2017. That system has three components: tracking-point management, a developer platform, and event analysis tools.

It explains the two main types of instrumentation—frontend (code, visual, and no‑code) and backend—highlighting the advantages of code‑based frontend instrumentation such as high customizability and precise data capture.

The article then outlines the end‑to‑end data pipeline: requirement → configuration → instrumentation → QA testing → log ingestion (real‑time via Nginx, Flume, Kafka) → real‑time processing → offline log tables, followed by heavy offline processing (deduplication, joins, labeling) to produce company‑wide data tables.
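
The offline steps named above (deduplication, joins, labeling) can be illustrated with a minimal sketch. It is not the production job: the field names (device_id, event_id, ts, page_id) and the page dimension lookup are assumptions made for the example, since the talk does not give the log schema.

```python
# Hypothetical deduplication key; the real log schema is not given in the source.
DEDUP_KEY = ("device_id", "event_id", "ts")


def deduplicate(events):
    """Keep the first occurrence of each event, identified by DEDUP_KEY."""
    seen = set()
    unique = []
    for event in events:
        key = tuple(event[field] for field in DEDUP_KEY)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def label_pages(events, page_dim):
    """Join each event to a page dimension (dict page_id -> page name)."""
    return [
        {**event, "page_name": page_dim.get(event["page_id"], "unknown")}
        for event in events
    ]
```

In a real warehouse these steps would run as batch SQL or Spark jobs over the offline log tables; the functions only show the shape of the logic.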

Key data‑warehouse layers are described:

ODL (Operational Data Layer): raw log landing, basic field cleaning, attribution, and common dimension construction.

IDL (Integrated Data Layer): domain‑level modeling, entity and behavior relationships, and dimension flattening.

CDL (Component Data Layer): analysis entity modeling and metric generation.

MDL (Mart Data Layer): aggregated tables for business use, such as merchant traffic wide tables.

ADL (App Data Layer): application‑specific data not shared externally.

DIM (Dimension Layer): environment and thematic dimensions for consistent analysis.

Construction principles emphasized include high cohesion & low coupling, sinking common processing logic to lower layers, and balancing cost with performance by limiting data redundancy and using view‑based storage where possible.

Dimension building is split into environment dimensions (e.g., device, OS, app) generated via proxy keys using UDFs, and thematic dimensions that map raw log identifiers to business concepts such as resource slots, enabling precise behavior tracking.
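
One way to picture a proxy key for the environment dimension is a deterministic hash over the normalized environment attributes. The sketch below assumes this hashing approach and a hypothetical attribute list; the source only says that proxy keys are generated with UDFs and does not describe how.

```python
import hashlib

# Hypothetical set of environment attributes; the source names device, OS, and app only.
ENV_FIELDS = ("device_model", "os", "os_version", "app", "app_version")


def env_proxy_key(row):
    """Return a deterministic proxy key for a row's environment attributes."""
    # Normalize so that "iOS" and "ios " map to the same dimension member.
    parts = [str(row.get(field) or "").strip().lower() for field in ENV_FIELDS]
    digest = hashlib.md5("\x01".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


def build_env_dimension(rows):
    """Collect one dimension record per distinct proxy key."""
    dim = {}
    for row in rows:
        key = env_proxy_key(row)
        dim.setdefault(key, {field: row.get(field) for field in ENV_FIELDS})
    return dim
```

With a key like this, fact tables carry a single column instead of every environment attribute, and the dimension table holds one row per distinct combination.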

Attribution is standardized as counting the occurrences of behavior B that follow from behavior A. It is implemented by adding a "linkage information" array field to each log row: Hive UDFs maintain a stack of prior actions, filter out page-back events, and store each row's causal predecessors.

Sample query for attribution (Hive-style; the table and column names are illustrative):

SELECT COUNT(1) FROM log WHERE behavior = 'B' AND array_contains(linkage_info, 'A');
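
The stack-based construction of the linkage information can be sketched in a few lines. This is a simplified illustration, not the Hive UDF from the talk: the event fields (name, kind) are hypothetical, every forward event is pushed onto the stack, and a page-back event simply pops the most recent entry.

```python
def add_linkage_info(session_events):
    """Annotate each forward event in one session with its causal predecessors.

    session_events: time-ordered dicts with hypothetical fields
    "name" (the behavior) and "kind" ("forward" or "back").
    """
    stack = []
    annotated = []
    for event in session_events:
        if event["kind"] == "back":
            # Leaving a page: it is no longer a predecessor of later events.
            if stack:
                stack.pop()
            continue  # page-back events themselves are filtered out
        annotated.append({**event, "linkage_info": list(stack)})
        stack.append(event["name"])
    return annotated


def count_attributed(rows, target, source):
    """Count rows where behavior == target and linkage_info contains source."""
    return sum(
        1 for row in rows
        if row["name"] == target and source in row["linkage_info"]
    )
```

For a session home, shop_list, shop_A, back, shop_B, order, the order row ends up with linkage information [home, shop_list, shop_B], so the order is attributed to shop_B and not to shop_A.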

The article compares the "linkage information" approach (space‑for‑time) with the "target‑event" approach (time‑for‑space), noting their complementary nature.

Data governance is addressed through lineage tracking, scoring how heavily each instrumentation point is used across the ODL/IDL/CDL/MDL/ADL layers, and systematically pruning low-value instrumentation points.
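
A usage score of this kind could be computed from lineage data roughly as follows. The weights, the threshold, and the lineage format are all assumptions made for illustration; the source does not give the actual scoring formula.

```python
# Hypothetical weights: a reference in a layer closer to applications counts for more.
LAYER_WEIGHTS = {"ODL": 1, "IDL": 2, "CDL": 3, "MDL": 4, "ADL": 5}


def score_points(lineage):
    """Score each instrumentation point by its downstream references.

    lineage: dict mapping point id -> list of (layer, table) references.
    """
    return {
        point: sum(LAYER_WEIGHTS.get(layer, 0) for layer, _table in refs)
        for point, refs in lineage.items()
    }


def prune_candidates(scores, threshold):
    """Return the points whose score falls below the threshold, sorted by id."""
    return sorted(point for point, score in scores.items() if score < threshold)
```

Points returned by prune_candidates would still need human review before removal, since a low score only indicates little observed downstream use.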

Finally, the article outlines three major application scenarios: OLAP analysis (event, page, resource slot, A/B testing), user‑behavior analysis (funnels, paths, retention, segmentation) primarily using Doris, and tag‑based data services for algorithms and marketing.
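
As an illustration of the funnel analysis mentioned above, the following sketch counts how many users reach each step of an ordered funnel. It is plain Python over in-memory tuples with hypothetical field names; the article states that the actual analysis runs primarily on Doris.

```python
def funnel_counts(events, steps):
    """Count users reaching each funnel step in order.

    events: iterable of (user_id, ts, action) tuples (hypothetical schema).
    steps: ordered list of action names that define the funnel.
    """
    by_user = {}
    for user, ts, action in events:
        by_user.setdefault(user, []).append((ts, action))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        reached = 0
        for _ts, action in sorted(user_events):
            # Advance only when the next expected step occurs.
            if reached < len(steps) and action == steps[reached]:
                reached += 1
        for i in range(reached):
            counts[i] += 1
    return counts
```

The sketch has no time-window constraint between steps; a production funnel would normally add one.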

Future directions include real‑time data‑warehouse development, enhanced monitoring with predictive algorithms, and a unified self‑service analytics platform.
