
Understanding Hypothesis Testing and Statistical Significance in A/B Experiments

The article explains hypothesis testing in A/B experiments, describing null and alternative hypotheses, type I and II errors, p‑values, statistical significance versus practical impact, confidence intervals, statistical power, sample‑size planning, and a checklist for interpreting results responsibly.

Didi Tech

This article continues the series on A/B experiment statistics, focusing on hypothesis testing, type I and II errors, statistical significance, and statistical power.

Hypothesis testing is based on the principle of falsification: first propose a null hypothesis (H₀) about a population parameter, then use sample data to decide whether to reject it. In A/B testing there are typically two hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁ or Hₐ).

The goal of hypothesis testing is to decide whether the evidence is strong enough to reject the null hypothesis. The process is analogous to a jury trial: assume the defendant is innocent (the null hypothesis) and collect evidence; only if the evidence is sufficiently strong is the presumption of innocence rejected.

Example: suppose the null hypothesis is that 50% of people like oranges. An experiment yields a confidence interval of [80% ± 1.96·0.04], which does not contain 0.5. Because the interval excludes the null value with 95% confidence, we reject H₀.
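The arithmetic behind this example can be checked with a short script. The 80-of-100 survey figures are the article's hypothetical numbers:

```python
import math

# Hypothetical survey from the article: 80 of 100 respondents like oranges,
# and the null hypothesis is that the true proportion is 0.5.
p_hat, n, p_null = 0.80, 100, 0.50

# Standard deviation of a Bernoulli proportion: sqrt(p_hat * (1 - p_hat) / n)
se = math.sqrt(p_hat * (1 - p_hat) / n)   # 0.04

# 95% confidence interval: p_hat ± 1.96 * se
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se

# Reject H0 when the null value falls outside the interval
reject_h0 = not (low <= p_null <= high)
print(round(low, 4), round(high, 4), reject_h0)  # 0.7216 0.8784 True
```

Since 0.5 lies well below the lower bound of 0.7216, the null is rejected at the 95% level.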

In A/B experiments the quantity of interest is the difference in conversion rates between the treatment and control groups (p₂ − p₁). The null hypothesis is p₂ − p₁ = 0 (no difference); the alternative is p₂ − p₁ ≠ 0 (a difference). Rejecting the null requires statistical significance.

Four possible outcomes arise, leading to two kinds of errors:

• Type I error (α): rejecting a true null hypothesis.
• Type II error (β): failing to reject a false null hypothesis.

Type I Error and Statistical Significance – A Type I error means the treatment appears to be an improvement when in fact there is no real difference. Controlling the probability of this error (α) is essential; industry commonly sets α = 5%, meaning we accept a 5% chance of a false positive. The p‑value is the probability of observing data at least as extreme as the current result under the null hypothesis. If p‑value < α, we reject H₀ and claim statistical significance.

What is a p‑value? Consider a fair coin (the null hypothesis). The probability of getting heads n times in a row is (0.5)ⁿ. Observing an outcome with a very small probability under the null (e.g., 0.5⁵ ≈ 0.03) leads us to doubt the fairness of the coin and reject the null.
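The coin intuition is a one-liner:

```python
# Probability of getting heads n times in a row with a fair coin (the null)
def heads_in_a_row_prob(n: int) -> float:
    return 0.5 ** n

p = heads_in_a_row_prob(5)   # 0.03125, roughly 3%
alpha = 0.05
# So unlikely under the null that we doubt the coin is fair
print(p, p < alpha)          # 0.03125 True
```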

For A/B tests, the p‑value is the probability of seeing the observed difference (or a larger one) if the two groups truly have the same conversion rate. A small p‑value indicates that the observed difference is unlikely under the null, prompting rejection of H₀.
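A minimal sketch of this computation, using a pooled two-proportion z-test; the conversion counts (500/10000 control vs. 570/10000 treatment) are made up for illustration:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p2 - p1 = 0, using a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: control converts 500/10000, treatment 570/10000
z, p_value = two_proportion_z_test(500, 10000, 570, 10000)
print(round(z, 2), round(p_value, 4))  # z ≈ 2.2, p-value below 0.05
```

Here the p-value falls under α = 0.05, so the observed 0.7-point lift would be declared statistically significant.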

Statistical significance is not the same as practical significance. An experiment may show a statistically significant lift of 0.1% with p‑value = 0.001, yet the business impact could be negligible. Decisions should consider both statistical results and practical effect size.

Early in an experiment, statistical significance can fluctuate due to the novelty effect: users initially react strongly to a new change, inflating the apparent lift. As the novelty wears off, significance may decline and eventually stabilize. Sufficient sample size and experiment duration mitigate this effect.

Statistical significance can also be judged by confidence intervals: if the 95% confidence interval for p₂ − p₁ does not contain 0, the result is statistically significant.
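The confidence-interval check can be sketched as follows, reusing the same illustrative counts (this version uses the unpooled standard error, as is conventional for intervals):

```python
import math

def diff_ci_95(x1, n1, x2, n2):
    """95% confidence interval for p2 - p1 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - 1.96 * se, diff + 1.96 * se

# Made-up counts: 500/10000 control vs 570/10000 treatment
low, high = diff_ci_95(500, 10000, 570, 10000)
significant = not (low <= 0 <= high)   # interval excludes 0 -> significant
print(round(low, 4), round(high, 4), significant)
```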

Type II Error and Statistical Power – A Type II error occurs when the null hypothesis is false but we fail to reject it. Statistical power is the probability of correctly rejecting a false null (1 − β). A common target is power ≥ 80%, which keeps β ≤ 20%.

Detecting small true lifts requires larger sample sizes. By specifying a desired power (e.g., 80%) and an expected minimum detectable effect (MDE), one can calculate the minimum required sample size and compare it with the actual sample size achieved.
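Assuming a two-sided α = 0.05 and 80% power, the standard normal-approximation sample-size formula can be sketched like this; the 5% baseline and 1-point MDE are illustrative:

```python
import math

def min_sample_size_per_group(p_base, mde):
    """Approximate per-group sample size for a two-sided two-proportion test
    at alpha = 0.05 and power = 0.80, via the normal approximation:
        n = (z_alpha/2 + z_beta)^2 * (var_base + var_alt) / mde^2
    """
    z_alpha = 1.96   # two-sided alpha = 0.05
    z_beta = 0.84    # power = 0.80
    p_alt = p_base + mde
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * var / mde ** 2)

# Illustrative: 5% baseline conversion, detect a 1-point absolute lift
print(min_sample_size_per_group(0.05, 0.01))  # roughly 8,000+ users per group
```

Halving the MDE roughly quadruples the required sample size, which is why detecting small lifts is expensive.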

Checklist for Interpreting A/B Results

1. Sample size adequacy: Estimate the minimum sample size based on the target power and MDE, then verify that the experiment meets this requirement.

2. Observe actual lift: Compare the cumulative metric lift of the treatment versus control over the experiment period.

3. Check statistical significance: At the end of the experiment, ensure p‑value < 0.05 or the confidence interval excludes 0. Also monitor p‑value stability over time.

4. Combine lift and significance with business expectations: Even if the result is statistically significant, assess whether the observed lift meets the pre‑defined business threshold (MDE) and consider rollout costs.
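The checklist above can be condensed into one hypothetical helper; all names and numbers here are illustrative, not prescribed by the article:

```python
def interpret_result(n_per_group, n_required, lift, mde, p_value, alpha=0.05):
    """Hypothetical sketch of the four-step interpretation checklist."""
    checks = {
        "sample_size_ok": n_per_group >= n_required,     # step 1
        "statistically_significant": p_value < alpha,    # step 3
        "meets_business_threshold": lift >= mde,         # steps 2 and 4
    }
    checks["ship"] = all(checks.values())
    return checks

# Illustrative numbers only
print(interpret_result(10000, 8146, lift=0.007, mde=0.005, p_value=0.028))
```

A result ships only when all three checks pass; failing any one of them (too little data, no significance, or a lift below the business threshold) blocks the rollout.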

This series has covered the statistical foundations of A/B testing—from sampling and the central limit theorem to significance and power—and aims to help practitioners make data‑driven decisions.

Correction Note : In the previous article, the standard deviation formula was misstated. The correct standard deviation for a Bernoulli proportion is sqrt(p·(1 − p)/n). For the orange example with p = 0.8 and n = 100, the standard deviation is sqrt(0.8·0.2/100) = 0.04, leading to a 95% confidence interval of [80% ± 1.96·0.04].

Tags: A/B testing, confidence interval, hypothesis testing, type I error, type II error, statistical power, statistical significance
Written by Didi Tech, the official Didi technology account.