How to Build a Trustworthy A/B Testing Platform for Complex Fulfillment Scenarios
This article presents a comprehensive guide to designing, implementing, and analyzing a reliable A/B testing platform for Meituan's multi‑side fulfillment business, covering statistical pitfalls, experiment types, traffic‑splitting frameworks, automated analysis engines, and practical solutions for overflow effects, small samples, and fairness constraints.
Background
A/B testing has a century‑old statistical foundation, but building a large‑scale, reliable platform is difficult because real‑world experiments must handle overflow effects, small‑sample regimes, bias‑variance trade‑offs, and correct statistical inference. Expert knowledge of experimental design and statistics is therefore essential for trustworthy results.
Key technical challenges
Ensuring comparable groups, independence between groups, and sufficient sample size to avoid false negatives or inflated effects.
Correct variance estimation when the sampling mechanism deviates from i.i.d. assumptions.
Choosing appropriate tests for non‑normal or small samples (e.g., Fisher’s exact test, bootstrap, non‑parametric methods).
Aligning experimental units with analysis units to prevent under‑estimated variance and false positives; the sketch below illustrates this pitfall for a ratio metric.
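To make the last pitfall concrete, here is a minimal Python sketch (the function name and data shapes are ours, not the platform’s) of the standard delta‑method standard error for a ratio metric such as value per order, when randomization happens at the user level but orders are the analysis unit:

```python
import numpy as np

def ratio_metric_se(y_per_unit, n_per_unit):
    """Delta-method standard error for sum(y) / sum(n), where each
    experimental unit (e.g., a user) contributes a total y (order value)
    and a count n (orders). Treating individual orders as i.i.d. would
    typically understate this variance."""
    y = np.asarray(y_per_unit, dtype=float)
    n = np.asarray(n_per_unit, dtype=float)
    k = len(y)
    r = y.sum() / n.sum()            # the ratio metric itself
    cov = np.cov(y, n, ddof=1)       # 2x2 covariance of (y, n) across units
    # Var(r) ~= [Var(y) - 2 r Cov(y, n) + r^2 Var(n)] / (k * mean(n)^2)
    var_r = (cov[0, 0] - 2 * r * cov[0, 1] + r**2 * cov[1, 1]) / (k * n.mean()**2)
    return r, float(np.sqrt(var_r))
```

The naive order‑level standard error computed on the same data is usually smaller whenever a user’s order count and order values are correlated, which is exactly the false‑positive mechanism this challenge describes.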
Traffic‑splitting frameworks
Two frameworks are deployed to allocate traffic in the fulfillment platform:
Layer‑Domain Overlap Framework: Pre‑splits traffic into a fixed number of buckets (e.g., ten) and assigns experiments to layers and domains (a hash‑based bucketing sketch follows below). Works well for high‑volume, single‑side services but struggles with low‑volume, multi‑side scenarios because buckets cannot be evenly distributed and domain constraints reduce flexibility.
Constraint‑Based Framework: Experiments declare constraints (algorithm keys, scenarios, templates). The platform automatically detects conflicts and allocates traffic on demand, providing higher flexibility for limited traffic and heterogeneous regions.
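As a rough illustration of how the layer‑domain idea is commonly realized (a generic pattern, assumed here rather than taken from Meituan’s implementation), salting a hash with a per‑layer key gives each layer an independent, deterministic bucket assignment:

```python
import hashlib

NUM_BUCKETS = 10  # the fixed bucket count mentioned above

def bucket(unit_id: str, layer_salt: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a unit (user, merchant, region, ...) to a bucket.
    Salting by layer decorrelates assignments across layers, so experiments
    in different layers can overlap on the same traffic."""
    digest = hashlib.md5(f"{layer_salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

A domain then reserves a fixed subset of buckets for its experiments, which is precisely why low‑traffic, multi‑side scenarios run short: ten coarse buckets cannot be divided evenly among many small, heterogeneous regions.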
Experiment design automation
The platform receives a scenario description (experiment type, unit, grouping method, evaluation metric) and outputs a concrete experiment design, eliminating the need for users to manually select statistical methods. Design tools include:
Sample‑size and minimum detectable effect (MDE) calculators (a minimal sketch follows this list).
Homogeneity checks and covariate balance diagnostics.
Pre‑selection of variance‑reduction techniques (e.g., covariate adjustment, Delta method).
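As an example of the first tool, a minimal sample‑size/MDE calculator, assuming a two‑sided, two‑sample comparison with equal variances and a 50/50 split (the function names are illustrative):

```python
import math
from scipy.stats import norm

def n_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Units per arm needed to detect an absolute effect `delta`:
    n = 2 * sigma^2 * (z_{1 - alpha/2} + z_{power})^2 / delta^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (sigma * z / delta) ** 2)

def min_detectable_effect(sigma: float, n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Invert the same formula: the smallest absolute effect detectable
    with n units per arm at the given alpha and power."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return sigma * z * math.sqrt(2 / n)
```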
Analysis engine
The engine automatically selects variance‑estimation, significance‑testing, and effect‑size calculation methods based on the experiment’s grouping, data distribution, and the relationship between experimental and analysis units. The analysis pipeline consists of:
Data diagnostics to verify assumptions (e.g., independence, distribution shape).
Effect estimation using difference‑in‑differences or simple mean differences.
Variance reduction (covariate adjustment, Delta method, clustering adjustments).
Choice of parametric or non‑parametric tests: Welch’s t‑test for large samples with approximate normality, Fisher’s exact test for ultra‑small samples, bootstrap when normality is doubtful (see the dispatch sketch after this list).
Generation of a comprehensive report that includes guard‑rail checks, SRM validation, and diagnostic visualizations.
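One way such a dispatch could look in code; the sample‑size threshold and the bootstrap construction are illustrative assumptions, not the platform’s actual rules:

```python
import numpy as np
from scipy import stats

def choose_test_pvalue(a, b, binary=False, small_n=30, n_boot=10_000, seed=0):
    """Illustrative test selection: Fisher's exact for tiny binary samples,
    Welch's t-test for large-ish samples, bootstrap otherwise."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if binary and min(len(a), len(b)) < small_n:
        table = [[int(a.sum()), len(a) - int(a.sum())],
                 [int(b.sum()), len(b) - int(b.sum())]]
        return stats.fisher_exact(table)[1]
    if min(len(a), len(b)) >= small_n:
        return stats.ttest_ind(a, b, equal_var=False).pvalue  # Welch's t-test
    # Bootstrap under the null: recenter both arms on the pooled mean,
    # resample within each arm, and compare against the observed difference.
    rng = np.random.default_rng(seed)
    obs = a.mean() - b.mean()
    mu = np.concatenate([a, b]).mean()
    a0, b0 = a - a.mean() + mu, b - b.mean() + mu
    diffs = np.array([rng.choice(a0, len(a0)).mean() - rng.choice(b0, len(b0)).mean()
                      for _ in range(n_boot)])
    return float((np.abs(diffs) >= abs(obs)).mean())
```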
Quality control
Before releasing new experiment capabilities, extensive AA simulations are run. Hundreds of simulated AA experiments verify that p‑values follow a uniform distribution and that variance estimates are unbiased. Only after passing these checks are new methods promoted to production.
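A minimal version of such an AA calibration harness, assuming a caller‑supplied null data generator and test function (both names are hypothetical):

```python
import numpy as np
from scipy import stats

def aa_calibration(draw_sample, test, n_sims=500, seed=0):
    """Run repeated AA comparisons on null data and check calibration.
    `draw_sample(rng)` returns one simulated sample of the metric;
    `test(a, b)` returns a p-value."""
    rng = np.random.default_rng(seed)
    pvals = np.array([test(draw_sample(rng), draw_sample(rng)) for _ in range(n_sims)])
    ks_p = stats.kstest(pvals, "uniform").pvalue   # uniformity of p-values
    fpr = float((pvals < 0.05).mean())             # realized false-positive rate
    return ks_p, fpr

# Example: Welch's t-test should be well calibrated on i.i.d. skewed data
# once per-arm samples are large.
ks_p, fpr = aa_calibration(
    draw_sample=lambda rng: rng.lognormal(size=2_000),
    test=lambda a, b: stats.ttest_ind(a, b, equal_var=False).pvalue,
)
```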
Application to fulfillment (multi‑side markets)
In a three‑side marketplace (users, riders, merchants), experiments face:
Overflow effects: Changing a merchant’s delivery range shifts orders between control and treatment, so measured lift is inflated during the experiment and the apparent gain vanishes after full rollout.
Small sample sizes: Regional or city‑level experiments often have limited traffic, reducing statistical power.
Fairness constraints: Policies must be applied uniformly across participants, limiting random assignment.
The constraint‑based traffic framework, combined with automated design and analysis, mitigates these issues by:
Allowing experimenters to specify conflict‑avoidance rules, so overlapping strategies do not share users.
Choosing appropriate experimental units (e.g., region, city) and grouping methods (random rotation, quasi‑experiment) to balance sample size against overflow risk.
Automatically applying variance‑reduction (e.g., CUPED‑style covariate adjustment, sketched below) and correct test selection to maintain power despite limited data.
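The article names covariate adjustment without fixing an estimator; CUPED is a common concrete choice and takes only a few lines, assuming a pre‑experiment covariate (such as last month’s order volume) is available for each unit:

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED-style adjustment: remove the part of the metric y explained by
    a pre-experiment covariate x, which treatment cannot affect.
    theta = Cov(y, x) / Var(x)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())   # same expectation, lower variance
```

Because x predates the experiment, the adjusted metric keeps the same expected treatment effect while its variance shrinks by roughly a factor of 1 − ρ², where ρ is the correlation between y and x; in small regional experiments this is often the difference between a powered and an underpowered test.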
Future work
Planned extensions include opening the platform to a broader set of users, continuously refining statistical methods (linear models, Delta method, bootstrap), and adding newer causal‑inference techniques to further improve experiment reliability in complex multi‑side environments.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.