Master AB Testing: Hypothesis Testing and Minimum Sample Size Made Simple
This article explains the statistical foundations of AB experiments—hypothesis testing and minimum sample size calculation—showing how to determine whether observed differences are real, how to control type‑I and type‑II errors, and how to plan experiments with sufficient power.
Introduction
AB testing is a powerful tool for data‑driven decision making. This article introduces the statistical foundations—hypothesis testing and minimum sample size—explaining how to draw rigorous conclusions after an experiment and how to plan experiments with enough power.
Scope of Discussion
The focus is on the most common hypothesis‑testing methods for ID‑based random split experiments: independent‑sample t‑test and independent‑sample z‑test.
When These Methods Do Not Apply
Coarse‑grained traffic allocation such as time‑slice or city‑group rotation (requires bootstrap).
Joint testing of multiple groups (requires ANOVA or chi‑square).
Mismatch between split unit and analysis unit (requires delta method or bootstrap).
Violations of the SUTVA assumption (treatment interference).
Hypothesis Testing Process
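After collecting AB data, hypothesis testing determines whether the observed difference is merely random variation or a statistically significant effect. The basic workflow is: state H0 and H1, fix the significance level α, compute the test statistic from the two groups, obtain the p‑value (or compare the statistic with the rejection region), and decide whether to reject H0.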
Key Concepts
Null hypothesis (H0): the two variants have equal means (or proportions). Alternative hypothesis (H1): the means (or proportions) differ.
In AB testing, H0 typically states that the treatment and control groups share the same population mean; H1 states that they differ.
Type I and Type II Errors
A Type I error is rejecting a true null hypothesis (a false positive); its probability is α, the significance level. A Type II error is failing to reject a false null hypothesis (a false negative); its probability is β. α is usually set to 0.05, and the desired power (1‑β) is often 0.80.
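The two error rates can be made concrete with a small simulation. The sketch below is illustrative only: the function name `rejection_rate`, the normally distributed metric with unit variance, the group size of 100, and the seed are all assumptions made for the example, not part of the original article. With no true difference, the share of rejections approximates the Type I error rate; with a true difference, it approximates the power.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def rejection_rate(true_diff, n=100, sims=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which a two-sided z-test rejects H0.

    Assumption for illustration: both groups are N(mean, 1) with n units each.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        se = sqrt(variance(control) / n + variance(treatment) / n)
        z = (fmean(treatment) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims


# H0 true: every rejection is a Type I error, so the rate should sit near alpha.
type_1_rate = rejection_rate(true_diff=0.0)
# H0 false: the rejection rate is the power; 1 - power is the Type II error rate.
power = rejection_rate(true_diff=0.4)
print(f"Type I rate ~ {type_1_rate:.3f}, power ~ {power:.3f}")
```

Under these assumed settings the first number should land close to 0.05 and the second close to 0.80, which is why a true difference of 0.4 standard deviations was chosen for the example.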
Rejection Region and p‑value
The rejection region contains values of the test statistic that lead to rejecting H0. The p‑value is the probability of observing a result at least as extreme as the one obtained, assuming H0 is true. If p < α, H0 is rejected.
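As a minimal sketch of the decision rule, assuming the test statistic is approximately standard normal under H0, the p‑value and the rejection region can be computed with Python's standard library. The helper names `two_sided_p_value` and `reject_h0` are illustrative, not taken from the article.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, i.e. the two-sided p-value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def reject_h0(z: float, alpha: float = 0.05) -> bool:
    """Reject H0 when the p-value falls below the significance level."""
    return two_sided_p_value(z) < alpha


alpha = 0.05
# Boundary of the two-sided rejection region: reject when |z| exceeds this value.
critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

print(round(critical, 2), reject_h0(2.5), reject_h0(1.0))
```

Comparing |z| with the critical value and comparing the p‑value with α are two views of the same rule, so they always lead to the same decision.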
Two‑Sample Mean Test
Typical scenarios: an order‑ID split comparing average order price, or a user‑ID split comparing average completed orders per user. The test statistic is z = (mean₁ − mean₂) / SE, where SE = sqrt(s₁²/n₁ + s₂²/n₂) is the standard error of the difference. Under H0 and with large samples, z approximately follows a standard normal distribution.
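The two‑sample mean test can be sketched as follows, using the unpooled standard error sqrt(s₁²/n₁ + s₂²/n₂) and a normal approximation. The function name `two_sample_mean_z_test` and the tiny input lists are hypothetical; real experiments need far larger samples for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def two_sample_mean_z_test(treatment, control):
    """Return (z, two-sided p-value) for the difference in means.

    Uses sample variances and an unpooled standard error; assumes both
    samples are large enough for the normal approximation.
    """
    mean_diff = fmean(treatment) - fmean(control)
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    z = mean_diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Toy numbers chosen only so the arithmetic is easy to check by hand:
# means 16 and 14, both sample variances 10, so SE = sqrt(10/5 + 10/5) = 2.
z, p_value = two_sample_mean_z_test([12, 14, 16, 18, 20], [10, 12, 14, 16, 18])
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With small samples, the same statistic would be compared against a t distribution instead, which is the independent‑sample t‑test mentioned in the scope section.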
Two‑Sample Proportion Test
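Typical scenarios: an order‑ID split comparing cancellation rate, or a user‑ID split comparing conversion rate. The test statistic is the difference in sample proportions divided by its standard error. Under H0 the standard error is usually computed from the pooled proportion across both groups, and the statistic is approximately standard normal for large samples.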
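A minimal sketch of the two‑sample proportion test, assuming the common pooled‑variance form of the z statistic. The function name `two_proportion_z_test` and the counts below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x_t, n_t, x_c, n_c):
    """Return (z, two-sided p-value) for p_treatment - p_control.

    x_* are event counts (e.g. cancelled orders), n_* are group sizes.
    The standard error uses the pooled proportion, which is valid under H0.
    """
    p_t = x_t / n_t
    p_c = x_c / n_c
    p_pool = (x_t + x_c) / (n_t + n_c)
    se = sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_t + 1.0 / n_c))
    z = (p_t - p_c) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 34.0% vs 35.0% cancellation with 10,000 orders per group.
z, p_value = two_proportion_z_test(3400, 10_000, 3500, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

In this made‑up case a full percentage point of difference is still not significant at α = 0.05, which illustrates why sample size has to be planned before the experiment starts.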
One‑Sided vs Two‑Sided Tests
Two‑sided tests are most common because the direction of the effect is often unknown. If business logic strongly predicts a positive lift, a one‑sided test may be chosen.
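The practical difference between the two choices shows up in how the p‑value is computed from the same statistic. The sketch below assumes a standard normal test statistic; the function name `p_values` and the example value z = 1.8 are illustrative.

```python
from statistics import NormalDist


def p_values(z: float) -> dict:
    """p-values for one test statistic under the three usual alternatives."""
    norm = NormalDist()
    return {
        "two_sided": 2.0 * (1.0 - norm.cdf(abs(z))),  # H1: means differ
        "greater": 1.0 - norm.cdf(z),                 # H1: treatment > control
        "less": norm.cdf(z),                          # H1: treatment < control
    }


p = p_values(1.8)
alpha = 0.05
# For a positive z, the "greater" p-value is half the two-sided one, so the
# same data can be significant one-sided but not two-sided.
print({name: round(value, 4) for name, value in p.items()})
print(p["greater"] < alpha, p["two_sided"] < alpha)
```

Because the one‑sided p‑value is smaller whenever the effect points in the predicted direction, the direction should be fixed before looking at the data.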
Minimum Sample Size Calculation
To limit Type II error, determine a minimum sample size before launch so that the experiment has enough power to detect the smallest effect of interest, the minimum detectable effect (MDE). The inputs are α, the desired power, the baseline value (and variance) of the metric, and the expected lift.
Statistical Power and MDE
Power (1‑β) is the probability of correctly detecting a true effect. MDE is the smallest effect size that can be detected with the chosen α, power, total sample size, and allocation ratio. A smaller MDE means a more sensitive experiment.
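One common normal‑approximation formula ties these quantities together for a mean metric: MDE = (z₁₋α/₂ + z_power) · σ · sqrt(1/n₁ + 1/n₂). The sketch below implements that formula under the assumption of a known, common standard deviation σ; the function name `mde_for_means` and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_for_means(sigma, n_treatment, n_control, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable with the given power.

    Assumes a two-sided z-test and a common, known standard deviation sigma.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)
    z_power = norm.inv_cdf(power)
    se = sigma * sqrt(1.0 / n_treatment + 1.0 / n_control)
    return (z_alpha + z_power) * se


# Same total sample (2,000 units), different allocation ratios.
balanced = mde_for_means(sigma=10.0, n_treatment=1000, n_control=1000)
unbalanced = mde_for_means(sigma=10.0, n_treatment=500, n_control=1500)
print(f"50/50 split MDE = {balanced:.3f}, 25/75 split MDE = {unbalanced:.3f}")
```

For a fixed total sample, an even split gives the smallest MDE, which is why the allocation ratio appears alongside α, power, and sample size in the definition.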
Practical Workflow Example
Set α = 0.05 and power = 80%.
Identify the baseline metric (e.g., cancellation rate ≈ 35%).
Estimate the expected lift (e.g., reduce cancellation by 0.1–0.5 percentage points).
Calculate the required sample size for each city or region based on daily order volume.
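The steps above can be sketched with the standard normal‑approximation formula for comparing two proportions, n per group = (z₁₋α/₂ + z_power)² · [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)². The function names `sample_size_per_group` and `days_to_run`, and the daily order volume of 50,000, are assumptions made for illustration; tools such as G*Power may use slightly different approximations and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_baseline, abs_lift, alpha=0.05, power=0.80):
    """Units needed in each group for a two-sided two-proportion z-test.

    abs_lift is the absolute change in the rate (negative for a reduction).
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)
    z_power = norm.inv_cdf(power)
    p_new = p_baseline + abs_lift
    var_sum = p_baseline * (1.0 - p_baseline) + p_new * (1.0 - p_new)
    return ceil((z_alpha + z_power) ** 2 * var_sum / abs_lift ** 2)


def days_to_run(n_per_group, daily_orders):
    """Days needed to collect both groups, assuming all traffic is enrolled 50/50."""
    return ceil(2 * n_per_group / daily_orders)


# Baseline cancellation rate 35%; expected reduction 0.5 or 0.1 percentage points.
n_large_effect = sample_size_per_group(0.35, -0.005)
n_small_effect = sample_size_per_group(0.35, -0.001)
# Hypothetical city with 50,000 orders per day.
print(n_large_effect, days_to_run(n_large_effect, 50_000))
print(n_small_effect, days_to_run(n_small_effect, 50_000))
```

Under these assumptions, detecting a 0.5 percentage point reduction needs on the order of 140,000 orders per group, while a 0.1 point reduction needs roughly 25 times as many, since the required sample grows with the inverse square of the effect.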
Sample Size Tools (G*Power)
G*Power can be used for a priori calculations (determine required n), compromise calculations (balance α and β), criterion calculations (find α for given n), post‑hoc power analysis, and sensitivity analysis (compute MDE).
Summary
The article covered essential statistical concepts for AB experiments, including hypothesis testing, type I/II errors, significance level, power, and minimum sample size calculation. Understanding these basics enables data‑driven product decisions and prepares readers for more advanced topics in the AB testing white‑paper series.
