Master AB Testing: Hypothesis Testing and Minimum Sample Size Made Simple

This article explains the statistical foundations of AB experiments—hypothesis testing and minimum sample size calculation—showing how to determine whether observed differences are real, how to control type‑I and type‑II errors, and how to plan experiments with sufficient power.

Huolala Tech

Nov 24, 2023

Introduction

AB testing is a powerful tool for data‑driven decision making. This article introduces the statistical foundations—hypothesis testing and minimum sample size—explaining how to draw rigorous conclusions after an experiment and how to plan experiments with enough power.

Scope of Discussion

The focus is on the most common hypothesis‑testing methods for ID‑based random split experiments: independent‑sample t‑test and independent‑sample z‑test.

When These Methods Do Not Apply

Coarse‑grained traffic allocation such as time‑slice or city‑group rotation (requires bootstrap).

Joint testing of multiple groups (requires ANOVA or chi‑square).

Mismatch between split unit and analysis unit (requires delta method or bootstrap).

Violations of the SUTVA assumption (treatment interference).

Hypothesis Testing Process

After collecting AB data, hypothesis testing determines whether the observed difference is merely random variation or a statistically significant effect. The basic workflow is illustrated below.

Key Concepts

Null hypothesis (H0): the treatment and control groups share the same population mean (or proportion).

Alternative hypothesis (H1): the population means (or proportions) of the two groups differ.

Type I and Type II Errors

A Type I error is rejecting a true null hypothesis (a false positive); its probability is α, the significance level. A Type II error is failing to reject a false null hypothesis (a false negative); its probability is β. The significance level α is usually set to 0.05, while the desired power (1‑β) is often 0.80.
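One way to make the two error rates concrete is to simulate many experiments and count how often H0 is rejected. The sketch below is illustrative only: the normally distributed metric, the group size of 200, the true lift of 0.28 standard deviations, and the helper names are all assumptions chosen so that the theoretical power is close to 0.80.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejects_h0(control, treatment, alpha=0.05):
    """Two-sided large-sample z-test on the difference in means."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = fmean(control), fmean(treatment)
    v_c = sum((x - m_c) ** 2 for x in control) / (n_c - 1)
    v_t = sum((x - m_t) ** 2 for x in treatment) / (n_t - 1)
    z = (m_t - m_c) / sqrt(v_c / n_c + v_t / n_t)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def rejection_rate(true_lift, n=200, runs=1000, seed=0):
    """Share of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(runs):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_lift, 1.0) for _ in range(n)]
        rejected += rejects_h0(control, treatment)
    return rejected / runs


# H0 is true (no lift): every rejection is a Type I error, so the rate should sit near alpha.
type_1_rate = rejection_rate(true_lift=0.0)

# H0 is false (assumed lift of 0.28 sd): rejections are correct detections, so the rate is the power.
power = rejection_rate(true_lift=0.28)
type_2_rate = 1 - power

print(f"Type I rate ~ {type_1_rate:.3f}, power ~ {power:.3f}, Type II rate ~ {type_2_rate:.3f}")
```

Because the rates are estimated from a finite number of simulated runs, they fluctuate around 0.05 and 0.80 rather than matching them exactly.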

Rejection Region and p‑value

The rejection region contains values of the test statistic that lead to rejecting H0. The p‑value is the probability of observing a result at least as extreme as the one obtained, assuming H0 is true. If p < α, H0 is rejected.
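For a test statistic that is standard normal under H0, the rejection region and the p‑value give the same decision. A minimal sketch, assuming a two‑sided test and a hypothetical observed statistic of z = 2.1 (both function names are illustrative):

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Probability, under H0, of a standard-normal statistic at least as extreme as z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def critical_value(alpha: float = 0.05) -> float:
    """Boundary of the two-sided rejection region: reject H0 when |z| exceeds it."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


z = 2.1        # hypothetical observed test statistic
alpha = 0.05

p = two_sided_p_value(z)
reject_by_p_value = p < alpha
reject_by_region = abs(z) > critical_value(alpha)

print(f"p-value = {p:.4f}, critical value = {critical_value(alpha):.3f}")
print(f"reject H0: {reject_by_p_value} (p-value rule), {reject_by_region} (rejection-region rule)")
```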

Two‑Sample Mean Test

Typical scenarios: an order‑ID split to compare average order price, or a user‑ID split to compare average completed orders per user. The test statistic is z = (mean₁ − mean₂) / SE, where SE is the standard error of the difference in means. Under H0 and with large samples, z approximately follows a standard normal distribution.
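The mean test can be sketched from summary statistics alone. The version below assumes large samples and uses the unpooled standard error sqrt(s₁²/n₁ + s₂²/n₂); the function name and the order‑price figures are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_mean_z_test(mean_t, var_t, n_t, mean_c, var_c, n_c):
    """Large-sample z-test for a difference in means (two-sided).

    var_t and var_c are sample variances; the standard error is unpooled.
    Returns (z, p_value).
    """
    se = sqrt(var_t / n_t + var_c / n_c)
    z = (mean_t - mean_c) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: average order price per group, variance 400, 10,000 orders each.
z, p_value = two_sample_mean_z_test(
    mean_t=102.0, var_t=400.0, n_t=10_000,
    mean_c=101.0, var_c=400.0, n_c=10_000,
)
print(f"z = {z:.3f}, p-value = {p_value:.5f}")
```

With small samples, an independent‑sample t‑test would replace the normal reference distribution with a t distribution.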

Two‑Sample Proportion Test

Typical scenarios: order‑ID split to compare cancellation rate, or user‑ID split to compare conversion rate. The test statistic uses the difference of proportions divided by its standard error.
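A minimal sketch of the proportion test, assuming large samples and the common choice of a pooled proportion for the standard error under H0. The function name and the cancellation counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_proportion_z_test(x_t, n_t, x_c, n_c):
    """Two-sided z-test for a difference in proportions.

    x_t, x_c: number of "successes" (e.g. cancelled orders) per group.
    n_t, n_c: number of units per group.
    Returns (z, p_value).
    """
    p_t = x_t / n_t
    p_c = x_c / n_c
    p_pool = (x_t + x_c) / (n_t + n_c)  # shared proportion assumed under H0
    se = sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_t + 1.0 / n_c))
    z = (p_t - p_c) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 3,400 of 10,000 treatment orders cancelled vs 3,500 of 10,000 control orders.
z, p_value = two_sample_proportion_z_test(x_t=3_400, n_t=10_000, x_c=3_500, n_c=10_000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

In this made‑up example, a one‑percentage‑point drop on 10,000 orders per group is not significant at α = 0.05, which is why sample size planning matters for small lifts.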

One‑Sided vs Two‑Sided Tests

Two‑sided tests are most common because the direction of the effect is often unknown. If business logic strongly predicts a positive lift, a one‑sided test may be chosen, provided the direction is fixed before the data are examined.
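The practical difference between the two choices shows up in the p‑value. The sketch below assumes a standard‑normal test statistic and a hypothetical observed value of z = 1.8; the dictionary keys are illustrative labels, not terms from the article.

```python
from statistics import NormalDist


def p_values(z: float) -> dict:
    """p-values of a standard-normal test statistic under three alternatives."""
    cdf = NormalDist().cdf
    return {
        "two_sided": 2.0 * (1.0 - cdf(abs(z))),  # H1: the means differ
        "greater": 1.0 - cdf(z),                 # H1: treatment mean is larger
        "less": cdf(z),                          # H1: treatment mean is smaller
    }


result = p_values(1.8)  # hypothetical observed statistic
for alternative, p in result.items():
    print(f"{alternative:>9}: p = {p:.4f}")
```

For a positive z, the one‑sided p‑value in the predicted direction is half the two‑sided one, so the same data can be significant under a one‑sided test and not under a two‑sided test.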

Minimum Sample Size Calculation

To limit the Type II error rate, determine a minimum sample size before the experiment so that it has enough power to detect the expected effect, expressed as the MDE (Minimum Detectable Effect). The workflow consists of setting α and the desired power, estimating the baseline metric, and specifying the expected lift.

Statistical Power and MDE

Power (1‑β) is the probability of correctly detecting a true effect. MDE is the smallest effect size that can be detected with the chosen α, power, total sample size, and allocation ratio. A smaller MDE means a more sensitive experiment.
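The relationship between MDE, sample size, and allocation ratio can be sketched with the usual normal approximation for a proportion metric: MDE ≈ (z₁₋α/₂ + z₁₋β) × SE. The function below is an illustration under that approximation, using the baseline variance for both groups; its name, parameters, and the traffic figures are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mde_for_proportion(baseline, n_total, treat_share=0.5, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a proportion metric in a two-sided test."""
    n_t = n_total * treat_share
    n_c = n_total - n_t
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    se = sqrt(baseline * (1.0 - baseline) * (1.0 / n_t + 1.0 / n_c))
    return (z_alpha + z_beta) * se


# Hypothetical traffic: 200,000 orders in total, baseline rate 35%.
mde_balanced = mde_for_proportion(0.35, 200_000, treat_share=0.5)
mde_unbalanced = mde_for_proportion(0.35, 200_000, treat_share=0.1)

print(f"50/50 split: MDE ~ {mde_balanced * 100:.2f} percentage points")
print(f"10/90 split: MDE ~ {mde_unbalanced * 100:.2f} percentage points")
```

With the total sample size held fixed, an uneven split raises the MDE, so a balanced allocation gives the most sensitive experiment.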

Practical Workflow Example

Set α = 0.05 and power = 80%.

Identify the baseline metric (e.g., cancellation rate ≈ 35%).

Estimate the expected lift (e.g., reduce cancellation by 0.1–0.5 percentage points).

Calculate the required sample size, then compare it with the daily order volume of each city or region to see how long the experiment must run there.
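The steps above can be sketched with the standard normal‑approximation formula for comparing two proportions, n per group = (z₁₋α/₂ + z₁₋β)² × [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)². This is one common formula, not necessarily the one the authors used; the function name and the daily volume of 20,000 orders are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, change, alpha=0.05, power=0.80):
    """Units needed in each group to detect an absolute change in a proportion.

    change is signed: -0.005 means a drop of 0.5 percentage points.
    Assumes a two-sided test and equal group sizes.
    """
    p1 = baseline
    p2 = baseline + change
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / change ** 2)


# Baseline cancellation rate of 35%, expected reduction between 0.5 and 0.1 points.
n_half_point = sample_size_per_group(0.35, -0.005)
n_tenth_point = sample_size_per_group(0.35, -0.001)

# Hypothetical city with 20,000 orders per day, split evenly between the two groups.
daily_orders = 20_000
days_needed = ceil(2 * n_half_point / daily_orders)

print(f"0.5-point reduction: {n_half_point:,} orders per group")
print(f"0.1-point reduction: {n_tenth_point:,} orders per group")
print(f"Days needed for the 0.5-point case: {days_needed}")
```

Because the required sample size scales with 1 / change², detecting a lift five times smaller takes roughly twenty‑five times as many orders.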

Sample Size Tools (G*Power)

G*Power can be used for a priori calculations (determine required n), compromise calculations (balance α and β), criterion calculations (find α for given n), post‑hoc power analysis, and sensitivity analysis (compute MDE).

Summary

The article covered essential statistical concepts for AB experiments, including hypothesis testing, type I/II errors, significance level, power, and minimum sample size calculation. Understanding these basics enables data‑driven product decisions and prepares readers for more advanced topics in the AB testing white‑paper series.
