How to Build a Trustworthy A/B Testing Platform for Complex Multi‑Sided Marketplaces
This article explains how Meituan's fulfillment team designs, implements, and operates a reliable A/B testing platform for a multi‑sided marketplace. It covers statistical pitfalls, experiment types, traffic‑splitting frameworks, and automated analysis pipelines that keep results credible despite overflow effects, small samples, and fairness constraints.
Background
In the fulfillment (delivery) domain, statistical traps such as overflow effects and small‑sample bias can easily invalidate conclusions. A trustworthy A/B testing solution must therefore address both experiment design and analysis to guarantee reliable results.
A/B Testing Fundamentals
A/B testing compares two strategies (A and B) by exposing parallel user groups to each and measuring the difference. Because each user can be observed under only one strategy, the counterfactual outcome has to be approximated. This is typically done by building a control group whose feature means match those of the treatment group and then testing the observed difference for significance.
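As a minimal sketch of that comparison, the snippet below computes the difference in means between two groups and a two-sided p-value. It assumes independent observations and samples large enough for a normal approximation; the function name `compare_groups` and the sample data are illustrative and do not come from the platform.

```python
import math
from statistics import NormalDist, mean, variance

def compare_groups(control, treatment):
    """Difference in means with a large-sample (normal) two-sided p-value."""
    diff = mean(treatment) - mean(control)
    # Unpooled standard error: each group keeps its own sample variance.
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se, p_value

# Made-up metric values for illustration only.
control = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0] * 10
treatment = [11.0, 13.0, 12.0, 14.0, 10.0, 12.0, 11.0, 13.0] * 10

diff, se, p_value = compare_groups(control, treatment)
print(f"diff={diff:.3f} se={se:.3f} p={p_value:.2g}")
```

The normal approximation stands in for a t distribution here only to keep the sketch dependency-free; with small samples the two can differ noticeably.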
Experiment Types
Randomized Experiments – true random assignment; the industry gold standard.
Quasi‑Experiments – controlled assignment but not fully random (e.g., difference‑in‑differences designs).
Observational Studies – no control over assignment; used when fairness or technical constraints prevent randomization.
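To make the quasi-experimental case concrete, here is a rough sketch of a difference-in-differences estimate. It assumes the treated and comparison groups would have followed parallel trends without the intervention; the function name and the numbers are hypothetical.

```python
from statistics import mean

def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change

# Made-up daily order counts before and after a policy change.
treat_pre, treat_post = [100.0, 102.0, 98.0], [110.0, 112.0, 108.0]
ctrl_pre, ctrl_post = [90.0, 92.0, 88.0], [94.0, 96.0, 92.0]

effect = diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(f"estimated effect: {effect:.1f}")
```

Subtracting the comparison group's change removes shifts that hit both groups, such as weather or seasonality, which a plain before/after comparison would attribute to the policy.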
Key Statistical Pitfalls
Incorrect variance estimation when the sampling mechanism is non‑random, leading to false‑positive or false‑negative results.
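Choosing significance tests that ignore non‑normal data distributions. The common "sample size > 30" rule of thumb is adequate only when skewness is small (|skewness| < 1); for more skewed metrics, Kohavi et al. (2014) suggest roughly 355 × s² observations per variant, where s is the skewness coefficient. With fewer samples, the normal approximation behind the test can be poor.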
Overflow (spillover) effects, where a policy applied to one region influences neighboring regions and biases uplift estimates (see Figure 6).
Formulas for variance under independent vs. dependent sampling, and for relative‑uplift variance that treats the denominator as a random variable, are provided in the original article.
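The original article's formulas are not reproduced here. As a hedged illustration of the relative-uplift point, the sketch below applies the standard Delta-method approximation for a ratio of two independent sample means and compares it with a naive version that treats the control mean as a constant. Function names and data are hypothetical, and clustered (dependent) sampling would need a different variance formula.

```python
import math
from statistics import mean, variance

def relative_uplift_with_se(control, treatment):
    """Relative uplift mt/mc - 1 with Delta-method and naive standard errors."""
    mc, mt = mean(control), mean(treatment)
    var_mc = variance(control) / len(control)      # variance of the control mean
    var_mt = variance(treatment) / len(treatment)  # variance of the treatment mean
    uplift = mt / mc - 1.0
    # Delta method for a ratio of independent means:
    # Var(mt/mc) ~= Var(mt)/mc^2 + mt^2 * Var(mc)/mc^4
    var_delta = var_mt / mc**2 + (mt**2) * var_mc / mc**4
    # Naive version: divides the variance of (mt - mc) by a "fixed" mc^2.
    var_naive = (var_mt + var_mc) / mc**2
    return uplift, math.sqrt(var_delta), math.sqrt(var_naive)

# Made-up metric values for illustration only.
control = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0] * 10
treatment = [11.0, 13.0, 12.0, 14.0, 10.0, 12.0, 11.0, 13.0] * 10

uplift, se_delta, se_naive = relative_uplift_with_se(control, treatment)
print(f"uplift={uplift:.4f} se_delta={se_delta:.5f} se_naive={se_naive:.5f}")
```

When the treatment mean exceeds the control mean, the naive standard error is smaller than the Delta-method one, so confidence intervals built from it are too narrow.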
Traffic‑Splitting Frameworks
Two industry‑standard frameworks are compared:
Layer‑Domain Overlap Framework (Google, Microsoft, Facebook) – pre‑splits traffic into buckets (e.g., 10 equal parts) and assigns experiments to layers and domains. It enables high parallelism but requires large traffic volumes and rigid pre‑planning.
Constraint‑Based Framework (Uber, DoorDash) – experiments declare constraints; the platform automatically detects conflicts and prevents overlapping experiments that could interact.
Because fulfillment traffic is limited at the region level and business requirements evolve rapidly, the platform adopts the constraint‑based framework.
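The article does not describe how constraints are declared or checked, so the following is only a sketch of the general idea. It assumes each experiment declares the regions it occupies, its date window, and the strategy surfaces it touches; two experiments conflict when all three overlap. The `Experiment` class, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Experiment:
    name: str
    regions: frozenset   # region ids the experiment occupies
    start: date
    end: date            # inclusive
    touches: frozenset   # strategy surfaces that could interact

def conflicts(a, b):
    """Two experiments conflict if they share a region, a date, and a surface."""
    share_region = bool(a.regions & b.regions)
    overlap_time = a.start <= b.end and b.start <= a.end
    share_surface = bool(a.touches & b.touches)
    return share_region and overlap_time and share_surface

def find_conflicts(new, running):
    """Names of running experiments that the new one would collide with."""
    return [e.name for e in running if conflicts(new, e)]

range_v2 = Experiment("range_v2", frozenset({"bj-01", "bj-02"}),
                      date(2024, 3, 1), date(2024, 3, 14),
                      frozenset({"delivery_range"}))
dispatch_v5 = Experiment("dispatch_v5", frozenset({"bj-02"}),
                         date(2024, 3, 10), date(2024, 3, 20),
                         frozenset({"dispatch"}))
range_v3 = Experiment("range_v3", frozenset({"bj-02", "sh-01"}),
                      date(2024, 3, 10), date(2024, 3, 20),
                      frozenset({"delivery_range"}))

blocked_by = find_conflicts(range_v3, [range_v2, dispatch_v5])
print(blocked_by)
```

Under these assumed rules, experiments that share a region but touch unrelated surfaces can still run in parallel, which is the property that matters when region-level traffic is scarce.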
Designing Experiments for Fulfillment
Design must balance three goals:
Increase sample size by making experimental units small.
Reduce overflow by making units large enough to contain interacting entities.
Maintain fairness so that users are not disadvantaged by the experiment.
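Example: a delivery‑range experiment expands a merchant’s service area, pulling orders from neighboring merchants. This inflates the apparent uplift (Figure 6). A rotating experiment that alternates treatment assignment across days can mitigate overflow, because every unit in a region receives the same strategy at any given time. It trades off sample size against independence, however: each region‑day becomes one observation, and effects can carry over between adjacent periods.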
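A rotating assignment can be sketched as follows. This is an assumption about one simple way to do it, not the platform's scheme: a hash fixes which arm each region starts in, and the arm then flips every day. The function name and the `salt` parameter are hypothetical.

```python
import hashlib
from datetime import date, timedelta

def rotating_assignment(region_id, day, salt="exp-demo"):
    """Assign a whole region to one arm per day, alternating daily."""
    digest = hashlib.sha256(f"{salt}:{region_id}".encode()).digest()
    offset = digest[0] % 2  # hash decides the region's starting arm
    return "treatment" if (day.toordinal() + offset) % 2 == 0 else "control"

start = date(2024, 3, 1)
schedule = [rotating_assignment("region-17", start + timedelta(days=i))
            for i in range(6)]
print(schedule)
```

Because the assignment depends only on the region, the date, and the salt, it can be recomputed at analysis time without storing a schedule.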
Experiment Design Toolbox
Group‑selection utilities that suggest random, stratified, or rotating splits based on overflow risk.
Variance‑reduction modules (covariate adjustment, Delta method) that lower the minimum detectable effect (MDE).
Significance‑analysis engines that automatically pick t‑tests, Fisher exact tests, or bootstrap methods according to sample size and distribution.
Figures 14‑16 illustrate the UI for selecting templates, estimating MDE, and generating design reports.
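To show how variance reduction lowers the MDE, the sketch below uses the textbook formula for a two-sided, equal-allocation comparison of means. It assumes that covariate adjustment with a pre-experiment covariate of correlation `rho` shrinks the metric variance by a factor of (1 - rho²). The function name and parameter values are illustrative, not taken from the platform.

```python
import math
from statistics import NormalDist

def mde(sigma, n_per_group, alpha=0.05, power=0.8, rho=0.0):
    """Minimum detectable absolute difference in means for two equal groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Assumed effect of covariate adjustment: sigma^2 -> sigma^2 * (1 - rho^2).
    adjusted_sigma = sigma * math.sqrt(1 - rho**2)
    return (z_alpha + z_power) * adjusted_sigma * math.sqrt(2 / n_per_group)

baseline = mde(sigma=10.0, n_per_group=1000)
adjusted = mde(sigma=10.0, n_per_group=1000, rho=0.6)
print(f"MDE without adjustment: {baseline:.3f}")
print(f"MDE with rho=0.6:       {adjusted:.3f}")
```

With a fixed number of regions, lowering the MDE through adjustment is often the only available lever, since the sample size cannot be raised without shrinking the units and inviting overflow.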
Automated Analysis Engine
After data collection, the engine executes a five‑step pipeline:
Data diagnostics (e.g., SRM, homogeneity checks) to verify that experimental assumptions hold.
Effect estimation using the appropriate method (direct difference, difference‑in‑differences, etc.).
Variance calculation tailored to the sampling scheme (independent vs. clustered). For clustered data the engine applies the Delta method or simulation‑based variance estimation.
P‑value computation with the correct test:
Welch's t‑test for large samples where asymptotic normality is justified.
Fisher's exact test for samples < 30.
Bootstrap when the distribution deviates from normality.
Report generation with confidence intervals, guard‑rail checks, and diagnostic visualisations (Figures 17‑18).
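Two of these steps are easy to sketch: the sample ratio mismatch (SRM) diagnostic and the choice of test. The code below is an illustration under stated assumptions, not the engine's logic. The SRM check is a one-degree-of-freedom chi-square test on group sizes, and `choose_test` mirrors the three rules listed above with a hypothetical skewness cutoff of 1.

```python
import math

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square (1 df) p-value for observed vs. planned group sizes."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def choose_test(n_per_group, skewness, small_n=30, max_skew=1.0):
    """Pick a test name from sample size and skewness (thresholds assumed)."""
    if n_per_group < small_n:
        return "fisher_exact"
    if abs(skewness) > max_skew:
        return "bootstrap"
    return "welch_t"

print(f"balanced split p={srm_p_value(5000, 5000):.3f}")
print(f"skewed split p={srm_p_value(5000, 5300):.4f}")
print(choose_test(n_per_group=12, skewness=0.2))
print(choose_test(n_per_group=800, skewness=3.5))
print(choose_test(n_per_group=800, skewness=0.4))
```

A very small SRM p-value suggests the split itself is broken, in which case the effect estimate should not be trusted regardless of which test is chosen afterwards.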
Quality Control and Continuous Improvement
Before releasing a new experiment capability, data‑science engineers run hundreds of simulated A/A experiments. Because both groups receive the same strategy, the resulting p‑values should be approximately uniformly distributed between 0 and 1; any significant deviation triggers a rollback for further investigation (Figure 19).
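An A/A check of this kind can be sketched as below. The simulation draws both groups from the same distribution, collects p-values, and measures their distance from the uniform distribution with a Kolmogorov-Smirnov statistic. The data generator, the large-sample z-test, and the 1.36/√n cutoff (the asymptotic 5% critical value) are assumptions made for illustration; the platform's own procedure is not described in the article.

```python
import math
import random
from statistics import NormalDist, mean, variance

def aa_p_value(rng, n=200):
    """p-value of one simulated A/A test (both groups share a distribution)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(variance(a) / n + variance(b) / n)
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def ks_statistic_uniform(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(42)
p_values = [aa_p_value(rng) for _ in range(500)]
d = ks_statistic_uniform(p_values)
critical = 1.36 / math.sqrt(len(p_values))  # asymptotic 5% critical value
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"KS statistic={d:.4f}, critical={critical:.4f}")
print(f"share of p < 0.05: {false_positive_rate:.3f}")
```

A statistic well above the cutoff, or a false-positive share far from the nominal 5%, would point to a problem in the variance estimate or the test rather than in any strategy.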
Conclusion and Outlook
Thousands of fulfillment experiments have demonstrated that overflow, small samples, and fairness constraints are inseparable challenges. By codifying statistical best practices, providing a constraint‑based traffic splitter, and automating analysis, the platform enables rapid, data‑driven decision‑making while keeping error rates low. Future work includes opening these capabilities to a broader user base and integrating advanced causal‑inference models.
