Unlocking AB Testing: Core Statistical Principles Behind Reliable Experiments
This article explains the statistical foundations of AB testing, covering the Rubin causal model, SUTVA and randomization assumptions, parameter and confidence‑interval estimation, hypothesis‑testing procedures, and essential limit theorems such as the law of large numbers and the central limit theorem.
2.1 Experimental Foundations
AB testing originates from the Rubin causal model (also known as the potential‑outcome framework). In an idealized scenario, two parallel universes exist: one where every user receives the treatment (strategy B) and another where every user receives the control (strategy A). The individual causal effect is defined as the difference in outcomes for the same user across the two worlds, and the average treatment effect (ATE) is the mean of these differences.
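In the standard potential‑outcome notation, with Y_i(1) and Y_i(0) denoting user i's outcome under treatment and control respectively, these two quantities are:

$$\tau_i = Y_i(1) - Y_i(0), \qquad \text{ATE} = \mathbb{E}\big[Y_i(1) - Y_i(0)\big] = \mathbb{E}[\tau_i]$$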
In reality, a single user experiences only one of the two strategies. Experiments therefore randomly split users into a treatment group and a control group, so that the two groups are statistically identical in expectation. Under this random assignment, the observed average outcomes of the two groups serve as proxies for the parallel‑world averages.
Two key assumptions are required for unbiased causal inference:
Stable Unit Treatment Value Assumption (SUTVA): the outcome of any user is unaffected by the treatment assignments of other users (no interference or spillover). Violations occur when, for example, a ride‑hailing app’s surge‑price experiment changes driver availability, indirectly affecting the control group.
Randomization: users are assigned to treatment or control purely by chance, independent of their behavior or characteristics. Non‑random selection (e.g., patients self‑selecting medication) breaks this assumption and biases the observed group means; the sketch after this list illustrates the resulting bias.
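As a minimal sketch of why randomization matters (simulated data with an invented outcome model and a constant true effect of 0.5; none of these numbers come from the article), the example below contrasts a randomized difference‑in‑means estimate with a self‑selected one:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical user covariate that drives both self-selection and outcomes.
engagement = rng.normal(0, 1, n)

# Potential outcomes: the true individual effect is a constant 0.5.
y0 = 2.0 * engagement + rng.normal(0, 1, n)   # outcome under control
y1 = y0 + 0.5                                 # outcome under treatment
true_ate = np.mean(y1 - y0)                   # = 0.5 by construction

# Randomized assignment: a coin flip, independent of user characteristics.
random_t = rng.random(n) < 0.5
est_random = y1[random_t].mean() - y0[~random_t].mean()

# Self-selection: highly engaged users opt into the treatment.
selected_t = engagement > 0
est_selected = y1[selected_t].mean() - y0[~selected_t].mean()

print(f"true ATE               : {true_ate:.3f}")      # 0.500
print(f"randomized estimate    : {est_random:.3f}")    # close to 0.5
print(f"self-selected estimate : {est_selected:.3f}")  # badly biased
```

Because highly engaged users opt in, the self‑selected contrast mixes the treatment effect with a pre‑existing difference between the groups, while the randomized contrast recovers the true effect.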
2.2 Statistical Foundations
2.2.1 Parameter Estimation
Parameter estimation uses sample data to infer unknown population parameters. Two main categories exist:
Point estimation: produces a single numeric estimate (e.g., a defect rate estimated as a/n, the number of defective items over the number inspected). Good point estimators are unbiased (their expected value equals the true parameter) and have low variance; the mean‑squared error (MSE) decomposes into variance plus squared bias.
Interval estimation: provides a range (confidence interval) that likely contains the true parameter. A 95% confidence interval [a, b] means that, under repeated sampling, 95% of such intervals would cover the true value. When the confidence level is fixed, shorter intervals are preferred. Both the MSE decomposition and the standard interval are written out after this list.
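Written out as standard identities, with $\hat{\theta}$ an estimator of a true parameter $\theta$:

$$\mathrm{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta}-\theta)^2\big] = \mathrm{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2$$

and the familiar normal‑approximation 95% confidence interval for a population mean, based on a sample mean $\bar{x}$, sample standard deviation $s$, and sample size $n$:

$$\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$$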
In AB testing, the ATE is typically reported as a point estimate: under randomization the difference‑in‑means estimator is unbiased and consistent, and its variance shrinks at a rate of 1/n.
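For the standard difference‑in‑means ATE estimator with group sizes $n_A$, $n_B$ and outcome variances $\sigma_A^2$, $\sigma_B^2$, the variance takes the familiar form

$$\mathrm{Var}(\hat{\tau}) = \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B},$$

which is the 1/n shrinkage referred to above when both groups grow proportionally.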
2.2.2 Hypothesis Testing
Hypothesis testing evaluates whether observed differences could arise by chance under a null hypothesis (H₀). The standard workflow includes:
1. Formulate H₀ (no effect) and the alternative hypothesis H₁ (effect exists, often two‑sided).
2. Choose a significance level α (commonly 0.05), the tolerated probability of a Type I error (falsely rejecting a true H₀).
3. Construct an appropriate test statistic (e.g., a two‑sample t‑statistic for comparing conversion rates).
4. Determine the rejection region or compute the p‑value, the probability of observing a statistic at least as extreme as the observed one, assuming H₀ is true.
5. Decide: reject H₀ if p ≤ α (supporting H₁); otherwise, fail to reject H₀.
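A compact illustration of steps 3–5 (a sketch on simulated conversion data; Welch’s unequal‑variance t‑test from scipy stands in for whichever test a given platform actually runs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated per-user conversions: control converts at 10%, treatment at 11%.
control = rng.binomial(1, 0.10, size=50_000)
treatment = rng.binomial(1, 0.11, size=50_000)

# Step 3: two-sample t-statistic (Welch's t-test, unequal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Steps 4-5: compare the p-value with the significance level alpha.
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: the lift is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```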
In practice, AB experiments often use the two‑sample t‑test; for more complex metrics whose variance lacks a simple closed form (e.g., ratios of user‑level sums), the variance is estimated via methods such as the Delta method, the bootstrap, or the jackknife. A bootstrap sketch follows.
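A minimal bootstrap sketch for a ratio metric (hypothetical clicks‑per‑view data; the percentile interval shown is only one of several common bootstrap variants):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-user data for a ratio metric: clicks per page view.
views = rng.poisson(5, size=10_000) + 1
clicks = rng.binomial(views, 0.2)

def ratio(idx):
    """Metric on a resampled set of users: total clicks / total views."""
    return clicks[idx].sum() / views[idx].sum()

# Resample users with replacement and recompute the metric each time.
n = len(views)
boot = np.array([ratio(rng.integers(0, n, n)) for _ in range(2_000)])

print(f"point estimate    : {clicks.sum() / views.sum():.4f}")
print(f"bootstrap SE      : {boot.std(ddof=1):.5f}")
print(f"95% percentile CI : [{np.percentile(boot, 2.5):.4f}, "
      f"{np.percentile(boot, 97.5):.4f}]")
```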
2.2.3 Limit Theorems
Limit theorems justify the approximations used in hypothesis testing and confidence intervals; formal statements follow this list.
Strong Law of Large Numbers: the sample mean converges almost surely to the population mean as the sample size approaches infinity.
Central Limit Theorem (Lindeberg–Lévy): the standardized sample mean converges in distribution to a standard normal distribution, enabling normal‑based inference even when the underlying data are not normal.
Delta Method: approximates the distribution of a smooth function of an estimator by linearizing it with a first‑order Taylor expansion, yielding asymptotic normality for transformed parameters such as ratio metrics.
Slutsky’s Theorem: if one sequence of random variables converges in distribution and another converges in probability to a constant, then their sums, products, and quotients converge accordingly; this is what justifies plugging consistent variance estimates into test statistics.
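Stated formally, in their standard textbook forms, for i.i.d. random variables $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$:

$$
\begin{aligned}
&\textbf{SLLN:} && \bar{X}_n \xrightarrow{\text{a.s.}} \mu \\
&\textbf{CLT:} && \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0,1) \\
&\textbf{Delta method:} && \sqrt{n}\,\big(g(\bar{X}_n) - g(\mu)\big) \xrightarrow{d} N\big(0,\,[g'(\mu)]^2\sigma^2\big), \quad g'(\mu) \neq 0 \\
&\textbf{Slutsky:} && X_n \xrightarrow{d} X,\; Y_n \xrightarrow{p} c \;\Longrightarrow\; X_n + Y_n \xrightarrow{d} X + c,\;\; X_n Y_n \xrightarrow{d} cX
\end{aligned}
$$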
2.3 Common Experiment Terminology
Key terms include treatment group, control group, lift, statistical power, confidence level, p‑value, and many others that are essential for interpreting AB test results.