
Statistical Foundations for A/B Testing: Populations, Samples, Confidence Intervals, and the Central Limit Theorem

This article explains the essential statistical concepts—populations, samples, sampling error, confidence intervals, the Central Limit Theorem, and normal distribution—that underpin A/B testing, showing how they enable reliable hypothesis evaluation, accurate impact prediction, and data‑driven decision making for product experiments.

Didi Tech

Rapid and effective A/B experiments are essential for scaling business growth and improving user experience, and the machinery that makes them work is statistics. This article introduces the key statistical concepts needed when using Apollo for A/B testing, helping readers design experiments, interpret results, and make data-driven decisions. The content is split into two parts; this is the first part.

Why is statistics crucial for A/B testing? An A/B test is fundamentally a statistical hypothesis test. It states a hypothesis about the relationship between a treatment group and a control group, computes statistics from the collected data, and determines whether the observed difference is statistically significant. The goal is not just to evaluate a small sample of users but to predict the impact of a new solution when it is rolled out to the entire user base. Statistics provides the tools that let us estimate these unseen outcomes and quantify how uncertain the estimates are.

Because statistics can infer the whole population from limited data, it enables parallel testing of many ideas, dramatically increasing testing efficiency. Even if most experiments do not improve conversion, they still validate ideas and prevent costly failures.

Consider two machine-learning models. We randomly assign 5,000 users to each model and observe conversion rates of 40% and 41% after one week. Although 41% > 40%, the 10,000 sampled users may not perfectly represent the entire population, so part of the observed gap may be sampling error rather than a real difference between the models. Statistics helps quantify how trustworthy this estimate is.
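To make "how trustworthy" concrete, the sketch below computes the observed difference for the two models together with an approximate 95% margin. The unpooled standard-error formula and the 1.96 multiplier are common textbook choices assumed here, not something the article specifies, and the function name is hypothetical.

```python
import math

def diff_with_margin(conv_a, n_a, conv_b, n_b, z=1.96):
    """Return (difference, margin) for two conversion rates.

    Assumes independent groups and the normal approximation
    (unpooled standard error); z=1.96 corresponds to ~95% confidence.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_b - p_a, z * se

# The article's example: 5,000 users per model, 40% vs 41% conversion.
diff, margin = diff_with_margin(2000, 5000, 2050, 5000)
print(f"difference = {diff:.4f}, margin = {margin:.4f}")
print(f"interval   = [{diff - margin:.4f}, {diff + margin:.4f}]")
```

Under these assumptions the margin is close to two percentage points, so the interval around the one-point difference includes zero. At this sample size, a week of data does not by itself separate the two models.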

Key Terminology

Population (总体): The entire set of objects of interest (e.g., all users). If the experiment targets 10% of users, the remaining 90% together with the 10% form the population.

Sample (样本): A subset of the population used for the experiment (e.g., the 10% of users).

Sample Size (样本量): The total number of observations in the sample.

Sample Statistic (样本统计量): A quantity computed from the sample. In A/B testing, this usually refers to the difference between treatment and control conversion rates (p₂ - p₁).

Sampling (抽样): The method of selecting a representative subset from the population, such as random sampling.

Distribution (分布): The probability distribution of a random variable. For example, the outcome of rolling a die follows a discrete uniform distribution.

Normal Distribution (正态分布): Also known as the Gaussian distribution, it is a symmetric bell-shaped curve. About 68.3% of data lie within ±1σ of the mean, 95.4% within ±2σ, and 99.7% within ±3σ. Almost exactly 95% lie within ±1.96σ, which is why 1.96 appears in 95% confidence intervals.

Bernoulli Distribution (伯努利分布): A distribution with only two possible outcomes (0 or 1), suitable for yes/no questions such as “Did the coin land heads up?” or “Did the user convert?”

Central Limit Theorem (中心极限定理): As the sample size (the number of independent observations in each sample) increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the original population distribution, provided that distribution has finite variance. This theorem underpins significance testing and confidence-interval calculations.
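The coverage figures quoted for the normal distribution in the list above can be checked with the error function from Python's standard library. This is a small verification sketch; the helper name is hypothetical.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"within ±{k}σ: {normal_coverage(k):.4f}")
```

The printed values round to the familiar 68-95-99.7 figures, with ±1.96σ giving almost exactly 95%.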

When data follow a normal distribution, we can compute confidence intervals or p-values directly. Even when the data are not normal, the CLT allows us to approximate these quantities, provided the sample is large enough.
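A small simulation can illustrate the Central Limit Theorem. The sketch below draws from an exponential distribution, which is strongly skewed, and checks that the means of repeated samples behave as the theorem predicts. The choice of distribution, the sample size of 50, and the 5,000 repetitions are assumptions made for illustration only.

```python
import math
import random
import statistics

def sample_means(n, repetitions, rng):
    """Means of `repetitions` samples, each of n draws from Exponential(1)."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(repetitions)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50
means = sample_means(n, 5000, rng)

# Exponential(1) has mean 1 and standard deviation 1, so the CLT predicts
# sample means centred on 1 with standard deviation 1 / sqrt(n).
predicted_sd = 1 / math.sqrt(n)
observed_sd = statistics.stdev(means)
within = sum(abs(m - 1) <= 1.96 * predicted_sd for m in means) / len(means)

print(f"predicted sd = {predicted_sd:.4f}, observed sd = {observed_sd:.4f}")
print(f"share of means within ±1.96 sd: {within:.3f}")
```

Although individual draws are far from normal, the share of sample means within ±1.96 standard deviations should come out close to 95%.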

Understanding confidence intervals and sampling error is the next step. Suppose 10,000 out of 100,000 users in the population convert, a 10% conversion rate. In an ideal scenario, a sample represents the population perfectly and also shows a 10% conversion rate. In practice, the conversion rate observed in a sample differs from the population value (in an extreme case it could come out as 2% or 20%), and this gap is the sampling error.

Sampling error measures the discrepancy between the sample statistic and the true population parameter. Larger sampling error means less accurate estimates, so we need additional metrics—confidence intervals and p‑values—to assess reliability.
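Sampling error is easy to see by drawing a few random samples from a known population. The sketch below builds the population from the example above (100,000 users, 10,000 of whom convert) and compares each sample's conversion rate with the true 10%. The sample size of 1,000 and the number of samples are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population matching the text: 100,000 users, 10,000 converters.
population = [1] * 10_000 + [0] * 90_000
true_rate = sum(population) / len(population)

def sample_rate(sample_size):
    """Conversion rate in one simple random sample drawn without replacement."""
    return sum(rng.sample(population, sample_size)) / sample_size

rates = [sample_rate(1_000) for _ in range(5)]
errors = [rate - true_rate for rate in rates]

for rate, err in zip(rates, errors):
    print(f"sample rate = {rate:.3f}, sampling error = {err:+.3f}")
```

Each sample lands near 10% but rarely exactly on it. Larger samples shrink the typical error without removing it.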

How to Interpret a Confidence Interval

Suppose the control group conversion is 40% and the treatment group is 41% after 7 days. Rather than stating “the treatment is 1 percentage point higher,” a more precise statement is: “the treatment conversion is 0.8 to 1.2 percentage points higher (1 ± 0.2) with 95% confidence.” The ±0.2 is the margin of error that accounts for sampling error, and it defines the confidence interval. The width of this margin depends on the sample size and on the conversion rates themselves.

A 95% confidence level means that if we repeated the experiment 100 times, about 95 of the resulting intervals would contain the true difference. The confidence level is chosen before the experiment (commonly 95%) and influences required sample size and significance testing.
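The repeated-experiment reading of a confidence level can be checked by simulation. For simplicity, the sketch below uses a single conversion rate rather than a difference between two groups. The true rate of 40%, the 1,000 users per experiment, and the normal-approximation (Wald) interval are all assumptions made for illustration.

```python
import math
import random

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def coverage(true_p, n, repetitions, rng):
    """Share of repeated experiments whose interval contains true_p."""
    hits = 0
    for _ in range(repetitions):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repetitions

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical setup: true conversion rate 40%, 1,000 users per experiment.
share = coverage(true_p=0.4, n=1000, repetitions=2000, rng=rng)
print(f"intervals containing the true rate: {share:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true rate or does not, so the 95% describes the procedure rather than one particular interval.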

Visual Explanation

Imagine estimating the proportion of people who like oranges. If the true proportion is 50% (p = 0.5) and we repeatedly draw samples of n = 100 people, the sample proportion p̂ will vary from sample to sample. By the normal approximation, about 95% of the p̂ values fall within ±1.96σ of the true proportion p, where σ = √[p(1-p)/n]. Turning this around, for about 95% of samples the interval p̂ ± 1.96σ contains p, so the interval gives a range in which we are 95% confident the true proportion lies. In practice p is unknown, so σ is estimated by substituting p̂ for p.
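Plugging the orange example into the formula gives concrete numbers. The sketch below evaluates σ = √[p(1-p)/n] for the sample size of 100 used in the text. The larger sample sizes are added here only to show how the range narrows.

```python
import math

def proportion_sd(p, n):
    """Standard deviation of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.5  # true proportion from the orange example
for n in (100, 400, 10_000):
    sd = proportion_sd(p, n)
    low, high = p - 1.96 * sd, p + 1.96 * sd
    print(f"n = {n:>6}: sd = {sd:.4f}, 95% range = [{low:.3f}, {high:.3f}]")
```

Quadrupling the sample size halves the standard deviation, which is why tighter intervals require disproportionately more data.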

When a 95% confidence interval does not contain a hypothesized value (e.g., 0.5), we reject that hypothesis at the 5% significance level, which amounts to a significance test.
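The interval-based decision rule can be sketched in a few lines. The survey counts below are hypothetical, and the interval uses the normal approximation with the standard deviation estimated from the sample proportion, which is an assumption rather than a method the article prescribes.

```python
import math

def rejects(successes, n, hypothesized_p, z=1.96):
    """True if the ~95% interval around the sample proportion excludes hypothesized_p."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return not (p_hat - margin <= hypothesized_p <= p_hat + margin)

# Hypothetical survey results, 100 respondents each.
print(rejects(55, 100, 0.5))  # interval is about [0.45, 0.65]
print(rejects(62, 100, 0.5))  # interval is about [0.52, 0.72]
```

With 55 of 100 respondents liking oranges, the interval still contains 0.5, so the hypothesis stands. With 62 of 100 the interval excludes 0.5 and the hypothesis is rejected.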

The next article will cover hypothesis testing, significance, p‑values, and statistical power in detail.

Reference

(Source: Dawson B, Trapp RG: Basic & Clinical Biostatistics, 4th Edition)

