Unlocking Randomized Experiments: Advanced Techniques to Boost Test Power
This comprehensive guide explores the fundamentals of randomized controlled experiments, discusses classic RCT designs and their limitations, and presents advanced methods such as CUPED variance reduction, stratified, paired, and covariate‑adaptive randomization, as well as spill‑over modeling and random saturation designs to improve experimental power and reliability.
Classic Randomized Controlled Experiments
Randomized controlled experiments (RCTs) are the most basic and reliable A/B testing method. Randomly splitting the population into treatment and control groups guarantees exchangeability: aside from the intervention, both groups share the same distribution of covariates, so the average treatment effect can be estimated as the difference between group means.
3.1.1 Limitations and Challenges
Fairness : Certain business scenarios require fairness to users or drivers, making pure random assignment infeasible.
Spill‑over Effects : Interaction between units (e.g., shared drivers) can bias results.
Small Sample Size : In many delivery scenarios only dozens to a few hundred units are available, making balance hard to achieve.
Business Impact : Large control allocations may affect live service performance, leading to extreme allocation ratios (e.g., 95:5) that reduce power.
Partial Triggering : Not all traffic receives the intervention, and the triggered flow may differ from the allocated flow.
3.1.2 Ordinary Random Assignment
Ordinary random assignment is implemented via a hash‑based Bernoulli experiment (e.g., MurmurHash3_32): each unit falls into the treatment group with probability p and into control with probability 1−p. Evaluation uses hypothesis testing, with the null hypothesis of equal group means. Two main evaluation methods are:
Delta Method : Uses asymptotic normality to compute a test statistic and p‑value (p < 0.05 indicates significance).
Bootstrap : Resamples the data to estimate the distribution of the statistic, suitable for small samples or non‑normal data.
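Hash‑based Bernoulli assignment can be sketched in a few lines of Python. The original system uses MurmurHash3_32; here the standard library's MD5 stands in, since the only requirement is a uniform, deterministic mapping from unit id to bucket. The salt and function names are illustrative.

```python
# Minimal sketch of deterministic hash-based Bernoulli assignment.
# MD5 stands in for MurmurHash3_32 (any uniform hash works for bucketing).
import hashlib

SALT = "expt-001"  # hypothetical experiment salt

def assign(unit_id: str, p: float) -> str:
    """Deterministically map unit_id to 'treatment' with probability p."""
    digest = hashlib.md5(f"{SALT}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # uniform bucket in [0, 10000)
    return "treatment" if bucket < p * 10_000 else "control"
```

Because assignment depends only on the unit id and the salt, a returning unit always lands in the same group, which is what makes hash bucketing preferable to per-request coin flips.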
3.1.3 Complete Random Assignment
When the total number of units N and the desired treatment size n_T are known, exactly n_T units are randomly selected for treatment, guaranteeing the allocation ratio even in very small samples (e.g., 95:5). This avoids the imbalance that can occur with ordinary random assignment in tiny cohorts.
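A minimal sketch of complete random assignment, drawing exactly n_T of the N units into treatment (function and parameter names are illustrative):

```python
# Complete random assignment: exactly n_treat units go to treatment,
# so the planned ratio holds even in tiny cohorts (e.g., 95:5 on 100 units).
import random

def complete_assignment(units, n_treat, seed=None):
    rng = random.Random(seed)
    treated = set(rng.sample(units, n_treat))
    return {u: ("treatment" if u in treated else "control") for u in units}
```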
3.1.4 Statistical Pitfalls
Allocation Mechanism Pitfall : Ignoring the actual assignment rule (e.g., deterministic ID‑based splits) leads to incorrect variance estimates.
Metric Definition Pitfall : Different metric types (continuous, ratio, sum) require different evaluation formulas.
Test Method Pitfall : Large samples favor Delta, while small or non‑normal samples need non‑parametric tests such as Bootstrap.
Multiple Comparison Pitfall : Testing many metrics inflates the family‑wise error rate; corrections are necessary.
Independence Pitfall : When the analysis unit differs from the randomization unit (e.g., orders within users), independence may be violated.
3.1.5 Special Metric Types
For sum‑type metrics the absolute lift is computed as the difference of total sums, and variance formulas are adjusted accordingly. ROI‑type metrics are defined as (Y_T‑Y_C)/(C_T‑C_C), with variance derived via Delta or Bootstrap.
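A bootstrap evaluation of the ROI‑type metric (Y_T−Y_C)/(C_T−C_C) can be sketched as follows; the percentile interval and all names are illustrative, not the original system's implementation.

```python
# Bootstrap confidence interval for an ROI-type metric: resample units
# within each group, recompute the ratio, and read off percentiles.
import random

def roi(y_t, c_t, y_c, c_c):
    """(Y_T - Y_C) / (C_T - C_C) from per-unit outcomes y and costs c."""
    return (sum(y_t) - sum(y_c)) / (sum(c_t) - sum(c_c))

def bootstrap_roi_ci(y_t, c_t, y_c, c_c, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    nt, nc = len(y_t), len(y_c)
    stats = []
    for _ in range(n_boot):
        it = [rng.randrange(nt) for _ in range(nt)]   # resample treatment units
        ic = [rng.randrange(nc) for _ in range(nc)]   # resample control units
        stats.append(roi([y_t[i] for i in it], [c_t[i] for i in it],
                         [y_c[i] for i in ic], [c_c[i] for i in ic]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling whole units (outcome and cost together) preserves the within-unit correlation that a delta-method formula would have to model explicitly.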
3.1.6 Supporting Functions
Sample Ratio Mismatch (SRM) Test : Checks whether the observed treatment‑control split matches the planned ratio. A chi‑square statistic is computed; a significant result indicates a violation of randomization.
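A minimal SRM check might look like the sketch below. It uses the one‑degree‑of‑freedom identity P(χ² > x) = erfc(√(x/2)) to avoid external dependencies; the function name is illustrative.

```python
# Sample Ratio Mismatch: chi-square goodness-of-fit test of the observed
# treatment/control counts against the planned split (1 degree of freedom).
import math

def srm_pvalue(n_treat: int, n_control: int, p_treat: float) -> float:
    n = n_treat + n_control
    expected = [n * p_treat, n * (1 - p_treat)]
    observed = [n_treat, n_control]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For chi-square with 1 df: P(chi2 > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))
```

A small p‑value (e.g., below 0.001, a common SRM threshold) means the observed split is inconsistent with the plan and the experiment's randomization should be audited before any metric is trusted.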
Minimum Detectable Effect (MDE) and Sample Size : For a two‑sided test on a continuous metric, MDE = (z_{1−α/2} + z_{1−β}) · σ · √(1/n_T + 1/n_C), where σ is the metric's standard deviation. The formulas differ for continuous and ratio metrics.
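The continuous-metric MDE formula above can be computed directly with the standard library's normal distribution (function and parameter names are illustrative):

```python
# MDE for a two-sided test on a continuous metric:
# (z_{1-alpha/2} + z_{1-beta}) * sigma * sqrt(1/n_T + 1/n_C)
import math
from statistics import NormalDist

def mde(sigma: float, n_t: int, n_c: int, alpha: float = 0.05,
        power: float = 0.8) -> float:
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    return (z_alpha + z_beta) * sigma * math.sqrt(1 / n_t + 1 / n_c)
```

The √(1/n_T + 1/n_C) term makes the familiar trade-off explicit: quadrupling both samples halves the smallest effect the experiment can detect.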
Improving Experiment Power
The most common way to increase power is to enlarge the sample, but variance‑reduction techniques can achieve the same effect with fewer users. CUPED (Controlled Experiment Using Pre‑Experiment Data) leverages a pre‑experiment covariate X that is highly correlated with the outcome Y. The adjusted estimator is Y_adj = Y − θ·(X − E[X]), where θ = Cov(Y,X)/Var(X). This shrinks the variance to (1 − ρ²)·Var(Y), where ρ is the correlation between Y and X, so the stronger the correlation, the larger the gain.
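A minimal CUPED sketch under these definitions, with the covariance and variance computed explicitly (names are illustrative):

```python
# CUPED adjustment: theta = Cov(Y, X) / Var(X),
# Y_adj = Y - theta * (X - mean(X)).
def cuped_adjust(y, x):
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov / var
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]
```

Note that subtracting θ·(X − E[X]) leaves the mean of Y unchanged, so the difference of adjusted group means still estimates the same treatment effect, just with less noise.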
CUPED for Continuous and Ratio Metrics
For continuous metrics, the adjusted metric is Y' = Y - θ·(X‑E[X]). For ratio metrics, three variants are proposed:
One‑coefficient adjustment : Same regression coefficient for numerator and denominator.
Two‑coefficient adjustment : Different coefficients for treatment and control groups.
Binary (two‑variable) adjustment : Separate regression for numerator and denominator.
Ensuring Homogeneity
Beyond simple randomization, several designs improve balance when sample size is limited or when stratification on key features is required.
Stratified Random Assignment
Units are first divided into strata based on attributes such as age, city size, or delivery capability. Within each stratum a completely random split is performed, guaranteeing balance on the stratifying variables and often reducing variance.
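Stratified assignment can be sketched as a complete randomization run independently inside each stratum (names are illustrative):

```python
# Stratified random assignment: shuffle within each stratum and treat the
# first round(n * p_treat) units, so the split ratio holds stratum by stratum.
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, p_treat=0.5, seed=None):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        n_t = round(len(members) * p_treat)
        for i, u in enumerate(members):
            assignment[u] = "treatment" if i < n_t else "control"
    return assignment
```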
Paired Random Assignment
Units are paired on key characteristics; one member of each pair is randomly assigned to treatment and the other to control. This method is especially useful for very small samples and for scenarios where external factors (e.g., geography) heavily influence outcomes.
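One simple pairing scheme sorts units by a matching covariate, pairs neighbours, and flips a coin within each pair. The sketch below assumes an even number of units; a leftover unit would need separate handling. Names are illustrative.

```python
# Paired random assignment: adjacent units in covariate order form a pair;
# one member of each pair is randomly treated, the other serves as control.
import random

def paired_assignment(units, covariate, seed=None):
    rng = random.Random(seed)
    ordered = sorted(units, key=covariate)
    assignment = {}
    for a, b in zip(ordered[::2], ordered[1::2]):
        if rng.random() < 0.5:
            a, b = b, a
        assignment[a] = "treatment"
        assignment[b] = "control"
    return assignment
```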
Covariate‑Adaptive Randomization
Assignments are made sequentially, each step choosing the group that minimizes an imbalance measure (e.g., Mahalanobis distance) of selected covariates. Two main schemes are used:
Fully sequential : Each unit is placed based on the current imbalance of both possible assignments.
Paired sequential : Units are processed in pairs, and the assignment that yields the smaller projected imbalance is chosen.
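A simplified sketch of the fully sequential scheme follows. It uses the Euclidean distance between group covariate means as a stand-in for the Mahalanobis distance, and caps the size gap so the groups cannot drift apart; production designs typically also use a biased coin toward the better choice rather than this deterministic rule.

```python
# Fully sequential covariate-adaptive assignment: each arriving unit goes
# to whichever group leaves the smaller covariate imbalance, subject to a
# cap on the group-size difference.
import math

def imbalance(group_a, group_b):
    """Euclidean distance between group covariate means (0 if a group is empty)."""
    if not group_a or not group_b:
        return 0.0
    dim = len(group_a[0])
    mean_a = [sum(x[d] for x in group_a) / len(group_a) for d in range(dim)]
    mean_b = [sum(x[d] for x in group_b) / len(group_b) for d in range(dim)]
    return math.dist(mean_a, mean_b)

def sequential_assign(covariates, max_gap=2):
    treat, control, labels = [], [], []
    for x in covariates:
        if len(treat) - len(control) >= max_gap:
            choice = "control"                       # force: treat too large
        elif len(control) - len(treat) >= max_gap:
            choice = "treatment"                     # force: control too large
        elif imbalance(treat + [x], control) <= imbalance(treat, control + [x]):
            choice = "treatment"
        else:
            choice = "control"
        (treat if choice == "treatment" else control).append(x)
        labels.append(choice)
    return labels
```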
Addressing Spill‑over Effects
When treatment units affect control units (e.g., shared drivers across regions), the Stable Unit Treatment Value Assumption (SUTVA) is violated. Two approaches are described:
Region Spill‑in/Spill‑out Model
Metrics such as spill‑in weight, spill‑in intensity, spill‑out weight, and spill‑out intensity quantify the flow of orders and drivers between experimental and control regions. The model incorporates these intensities into the outcome equation, allowing the direct treatment effect to be separated from spill‑over contributions.
Randomized Saturation Design
Units are grouped into clusters, each assigned a saturation level (the proportion of treated units). By varying saturation across clusters, the design isolates the pure treatment effect from spill‑over effects, which are assumed to occur only within clusters.
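The mechanics can be sketched as a two-stage draw: each cluster first draws a saturation level, then exactly that share of its units is treated via complete assignment within the cluster. The level set and names are illustrative.

```python
# Randomized saturation design: cluster-level saturation draw, then
# complete random assignment within each cluster at that saturation.
import random

def saturation_design(clusters, levels=(0.0, 0.5, 1.0), seed=None):
    """clusters: dict cluster_id -> list of units.
    Returns cluster_id -> (saturation, {unit: treated?})."""
    rng = random.Random(seed)
    plan = {}
    for cid, units in clusters.items():
        s = rng.choice(levels)
        n_t = round(len(units) * s)
        treated = set(rng.sample(units, n_t))
        plan[cid] = (s, {u: u in treated for u in units})
    return plan
```

Comparing treated units across clusters with different saturations separates the direct effect from within-cluster spill-over; the 0% and 100% clusters anchor the two extremes.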
Future Directions
Trigger‑based analysis and CACE (Complier Average Causal Effect) will be used to handle partial compliance and hidden treatment exposure. New variance‑reduction methods such as CUPAC, MLRATE, and STATE extend CUPED by incorporating machine‑learning predictions and distribution‑aware adjustments. Re‑randomization techniques that repeat assignment until covariate balance criteria are met are also being explored to further increase experimental power.
Finally, advanced spill‑over mitigation using Markov decision processes and other causal inference tools is planned to handle scenarios where physical isolation of clusters is impossible.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
