Design and Implementation of an A/B Evaluation System for Meituan Delivery
This article describes how Meituan's delivery team built a comprehensive A/B testing evaluation platform: why a robust assessment framework was needed, how the platform is organized into three functional modules, which statistical methods make experiment design reliable, and the practical implementation details that enable data-driven operational decisions.
On May 6, 2019, Meituan launched the new brand "Meituan Delivery" with a vision to complete one hundred million trustworthy deliveries daily, becoming an essential infrastructure for daily life. Today, the service supports over 4 million merchants, 400 million users, and more than 700,000 active couriers across 2,800+ cities.
The article begins by explaining why an evaluation system is needed and then details the thoughts and practices of Meituan Delivery's technical team in building an A/B evaluation framework, including how to establish a complete metric system and a scientific assessment method.
Instant delivery hinges on three elements: efficiency, cost, and experience, all of which are improved through fine-grained strategy iteration. Decisions are no longer made arbitrarily; they rely on data-driven feedback that indicates current performance and potential growth.
A/B experiments serve as a powerful tool for such iteration. By defining multiple versions of a strategy, assigning them to comparable groups, and collecting experience and business data, the best version can be identified and adopted.
1. A/B Platform Overview
The platform consists of three modules that correspond to the three stages of the A/B lifecycle: experiment configuration management, traffic splitting and logging, and online analysis.
The workflow is illustrated as a closed loop: hypothesis → define success metrics → conduct A/B experiment → analyze and learn → release → formulate new hypothesis.
2. Why Emphasize Evaluation System Construction
Traditional A/B platforms use simple hash‑based traffic splitting, assuming independent and identically distributed traffic. In delivery scenarios, traffic involves users, couriers, and merchants, making requests interdependent and heavily influenced by offline factors. Therefore, Meituan adopts multiple splitting strategies, including layered models and AA grouping, to ensure statistically indistinguishable control and treatment groups.
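The layered model mentioned above can be sketched as salted hashing: each experiment layer hashes the entity ID with its own salt, so bucket assignments in different layers are statistically independent and experiments can share the same traffic. This is a minimal illustration of the general technique, not Meituan's actual implementation; all names are hypothetical.

```python
import hashlib

def bucket(entity_id: str, layer_salt: str, num_buckets: int = 100) -> int:
    """Deterministically map an entity to a bucket within one layer.

    Salting the hash with the layer name makes bucket assignments
    across layers effectively independent (orthogonal layers).
    """
    digest = hashlib.md5(f"{layer_salt}:{entity_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign(entity_id: str, layer_salt: str, treatment_buckets: range) -> str:
    """Assign an entity to treatment or control in a given layer."""
    return "treatment" if bucket(entity_id, layer_salt) in treatment_buckets else "control"

# The same courier can land in different groups in two independent layers.
group_dispatch = assign("courier_42", "dispatch_layer", range(0, 50))
group_pricing = assign("courier_42", "pricing_layer", range(0, 50))
```

Because the assignment is a pure function of (salt, ID), the same entity always sees the same variant within a layer, which keeps experience consistent across sessions.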
Two main problems arise when relying on experimenters to define custom metrics: (1) lack of objectivity and potential bias toward supporting their hypothesis, and (2) misalignment with business goals, making results hard to adopt.
3. Building the A/B Evaluation System
The system addresses two core issues: a comprehensive, authoritative metric hierarchy (P0/P1 governance metrics and P2 exploratory metrics) and a scientific evaluation method based on hypothesis testing.
3.1 Authoritative Metric System
Governance metrics must be registered, reviewed, and produced by an independent data team to ensure authority and consistency. Exploratory metrics (P2) prioritize flexibility and rapid implementation.
Data integration combines experiment configuration, business data, and coloring data to enable both high‑level traffic metrics (PV, UV, conversion) and deep exploration of strategy impact.
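To make the role of coloring data concrete, here is a small sketch of computing PV, UV, and conversion rate from request logs tagged ("colored") with their experiment group. The record schema and field names are illustrative, not Meituan's.

```python
from collections import defaultdict

# Hypothetical colored log records: each request carries the experiment
# group that served it, enabling per-group traffic metrics.
logs = [
    {"user_id": "u1", "group": "control", "converted": False},
    {"user_id": "u1", "group": "control", "converted": True},
    {"user_id": "u2", "group": "treatment", "converted": True},
    {"user_id": "u3", "group": "treatment", "converted": False},
]

def traffic_metrics(records):
    pv = defaultdict(int)       # page views: one per record
    uv = defaultdict(set)       # unique visitors per group
    conversions = defaultdict(int)
    for r in records:
        g = r["group"]
        pv[g] += 1
        uv[g].add(r["user_id"])
        conversions[g] += r["converted"]
    return {g: {"pv": pv[g], "uv": len(uv[g]), "cvr": conversions[g] / pv[g]}
            for g in pv}

metrics = traffic_metrics(logs)
```

Joining these per-group aggregates with experiment configuration and business data is what enables both the high-level traffic view and deeper drill-downs.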
3.2 Scientific Evaluation Method
Statistical hypothesis testing (including Z‑test, T‑test, and chi‑square) is used to verify experiment hypotheses. The process controls Type I error (false positive) as the primary concern, employing P‑values to decide whether to reject the null hypothesis.
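As a worked example of this process, the following sketch runs a standard two-proportion Z-test on conversion counts from a control and a treatment group, then compares the two-sided p-value against a significance level to decide whether to reject the null hypothesis. The numbers are made up for illustration.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Z-test for a difference in conversion rates.

    Returns (z statistic, two-sided p-value) under the null
    hypothesis that the two rates are equal.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF, via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 100 conversions / 1000 requests; treatment: 130 / 1000.
z, p = two_proportion_ztest(100, 1000, 130, 1000)
significant = p < 0.05   # reject H0 at alpha = 0.05
```

Controlling Type I error means fixing alpha before the experiment; the p-value is only compared against it afterward, never tuned to reach significance.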
AA grouping ensures that pre‑experiment traffic is split into control and treatment groups with no statistically significant differences, using dynamic programming to minimize metric variance.
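The article states the platform uses dynamic programming to minimize metric variance between the AA groups; as a simpler illustration of the same balancing goal (not Meituan's algorithm), this greedy sketch sorts entities by a pre-experiment metric and assigns each to whichever group currently has the smaller metric total.

```python
def greedy_aa_split(entities):
    """Balance two groups on a pre-experiment metric.

    entities: list of (entity_id, pre_experiment_metric) pairs.
    Returns (group_a, group_b, sum_a, sum_b).
    """
    group_a, group_b, sum_a, sum_b = [], [], 0.0, 0.0
    # Assign largest-metric entities first so later, smaller ones
    # can even out any remaining imbalance.
    for eid, metric in sorted(entities, key=lambda e: -e[1]):
        if sum_a <= sum_b:
            group_a.append(eid)
            sum_a += metric
        else:
            group_b.append(eid)
            sum_b += metric
    return group_a, group_b, sum_a, sum_b

a, b, sa, sb = greedy_aa_split([("c1", 10), ("c2", 9), ("c3", 6), ("c4", 5)])
```

After splitting, a hypothesis test on the pre-experiment metric should show no significant difference between the two groups; only then is one of them given the new strategy.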
Post‑experiment evaluation generates authoritative reports that are flexible (column‑to‑row transformation), convenient (drill‑down from experiment to entity level), and based on both governance and exploratory metrics.
4. Technical Implementation
The core of the architecture is a stable, flexible data retrieval service that bridges upstream applications and the metric system. Offline modeling and metadata management build the authoritative metric pool, while the retrieval service supplies metrics to various application services.
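One way to picture the retrieval service is as a registry of metric definitions that upstream applications query by name, with governance metrics (P0/P1) registered by the independent data team and exploratory (P2) metrics added ad hoc. This sketch is entirely hypothetical; the class and field names are illustrative, not Meituan's interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class MetricDef:
    name: str
    priority: str                       # "P0" | "P1" | "P2"
    compute: Callable[[dict], float]    # how to derive the value from raw data

class MetricService:
    """Toy metric registry standing in for the retrieval service."""

    def __init__(self):
        self._registry: Dict[str, MetricDef] = {}

    def register(self, metric: MetricDef) -> None:
        self._registry[metric.name] = metric

    def fetch(self, name: str, experiment_data: dict) -> float:
        return self._registry[name].compute(experiment_data)

svc = MetricService()
svc.register(MetricDef("on_time_rate", "P0",
                       lambda d: d["on_time"] / d["orders"]))
value = svc.fetch("on_time_rate", {"on_time": 95, "orders": 100})
```

Centralizing definitions this way is what keeps a metric's meaning identical across every report that consumes it.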
In summary, A/B testing has become the "gold standard" for evaluating new product strategies in many internet companies. In Meituan Delivery, it is widely used for dispatch, pricing, capacity optimization, and ETA prediction. Future work includes building auxiliary tools to recommend traffic scale based on metric sensitivity, ensuring statistically meaningful experiments.
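The traffic-scale recommendation mentioned as future work typically reduces to a standard sample-size calculation: given a metric's variance (its sensitivity) and the minimum effect worth detecting, compute the per-group sample size n = 2(z_{1-α/2} + z_{1-β})² σ² / δ². A minimal sketch, with illustrative numbers:

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma2: float, delta: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size to detect an effect of size delta.

    Uses the two-sample normal approximation:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma2 / delta ** 2
    return math.ceil(n)

# e.g. detect a 2-percentage-point lift on a ~10% conversion rate,
# whose per-observation variance is roughly p * (1 - p) = 0.09.
n = sample_size_per_group(sigma2=0.10 * 0.90, delta=0.02)
```

A more sensitive metric (smaller variance, or larger acceptable delta) needs less traffic, which is exactly the trade-off such a recommendation tool would surface to experimenters.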