
AA Testing and Rerandomization Techniques for Reliable AB Experiments

The article outlines how AA testing and rerandomization can detect and correct non‑uniform traffic splits in short‑term AB experiments, detailing three solutions—AA tests, seed‑based rerandomization, and retrospective AA analysis—along with theoretical guarantees, empirical error‑rate reductions, and remaining challenges for long‑term or clustered designs.

Didi Tech

Background: With the increasing demand for fine-grained product operations, AB experiments are widely used to decide whether to launch new features or models. While AB tests usually provide trustworthy conclusions, their results can be called into question by Sample Ratio Mismatch (SRM) or AA problems. The AA problem asks whether traffic is split uniformly between the groups of a given AB experiment.
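As a quick aside, SRM itself can be flagged with a chi-square goodness-of-fit test on the observed group sizes. A minimal sketch with hypothetical counts (an intended 50/50 split is assumed):

```python
# Sketch: Sample Ratio Mismatch (SRM) check. Group sizes are illustrative
# placeholders, and a 50/50 intended split is assumed.
from scipy.stats import chisquare

treatment_n, control_n = 50_620, 49_380       # observed group sizes (hypothetical)
total = treatment_n + control_n
expected = [total / 2, total / 2]             # expected counts under a 50/50 split

stat, p_value = chisquare([treatment_n, control_n], f_exp=expected)
# A very small p-value (e.g. < 0.001) flags a likely SRM.
print(f"chi2 = {stat:.2f}, p = {p_value:.4g}")
```

With these counts the test rejects decisively, illustrating that even a ~0.6-point deviation from 50/50 is detectable at this sample size.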

Scope of AB Experiments Considered: (1) Short-term (≤ 1 month) randomized experiments that split users into groups; long-term hold-out, order-split, and time-slice experiments are excluded. (2) Only one treatment group and one control group are discussed (the analysis extends to multiple groups).

Definition of Uniform Allocation: Uniform allocation means that, if the experiment were not run, the two groups would show no significant difference in core metrics.

Why Uniform Allocation Matters: Uneven traffic can produce conclusions that contradict the true effect, because the observed difference between groups then reflects the pre-existing imbalance rather than the strategy's effectiveness.

Three Main Solutions to the AA Problem:

1. AA Test: Run an AA test in which the same strategy is applied to both groups. This reproduces the entire experiment workflow and helps uncover implementation traps. A checklist of common AA-test issues is provided, along with a tip that AA tests cannot detect problems that appear only in true AB experiments (e.g., network effects).

2. Rerandomization: Use historical data to evaluate many candidate hash seeds and select the one that produces the most balanced split. The steps are: extract user IDs and core metrics from a period before the planned experiment; generate many random hash seeds; for each seed, assign users to groups and compute the absolute t-statistic for each metric; take the maximum absolute t-statistic per seed as its representative value; and choose the seed with the smallest representative value. Rerandomization reduces the probability of an uneven split but cannot eliminate it entirely. Important considerations are population overlap, metric correlation, and the use of CUPED for variance reduction.
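The seed-selection loop above can be sketched as follows. This is a minimal illustration with synthetic data, not the production implementation; the function names and the MD5-based bucketing are assumptions:

```python
# Sketch of seed-based rerandomization: among many candidate hash seeds,
# pick the one whose worst-case |t| across core metrics is smallest.
import hashlib
import numpy as np
from scipy.stats import ttest_ind

def split_by_seed(user_ids, seed):
    """Deterministically hash each user ID together with the seed into group 0 or 1."""
    def bucket(uid):
        h = hashlib.md5(f"{seed}:{uid}".encode()).hexdigest()
        return int(h, 16) % 2
    return np.array([bucket(u) for u in user_ids])

def best_seed(user_ids, metrics, candidate_seeds):
    """Return the seed minimizing the maximum absolute t-statistic over metrics."""
    best, best_stat = None, np.inf
    for seed in candidate_seeds:
        groups = split_by_seed(user_ids, seed)
        worst = max(
            abs(ttest_ind(m[groups == 0], m[groups == 1]).statistic)
            for m in metrics
        )
        if worst < best_stat:
            best, best_stat = seed, worst
    return best, best_stat

# Usage with synthetic pre-experiment data:
rng = np.random.default_rng(0)
users = np.arange(5_000)
metrics = [rng.normal(size=5_000), rng.exponential(size=5_000)]
seed, worst_t = best_seed(users, metrics, candidate_seeds=range(20))
```

Hashing the user ID together with the seed (rather than reshuffling) keeps the assignment deterministic and reproducible at serving time, which is why seed selection can be done entirely offline.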

3. Retrospective AA Analysis: After the experiment, examine the same users' performance during the period before the experiment (the AA period). This provides an additional check on the reliability of the AB result. The procedure is: replace missing values with zero; perform a two-sample t-test on the metric between the experimental groups; and use the resulting p-value as an indicator of allocation uniformity. A significant difference in the retrospective AA analysis casts doubt on the AB conclusion.
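The retrospective procedure is short enough to sketch directly. The function name is hypothetical, and the inputs are assumed to be the AA-period metric values for the users who later landed in each group:

```python
# Sketch of a retrospective AA analysis: t-test the pre-experiment (AA-period)
# metric between the two experimental groups, imputing missing values as zero.
import numpy as np
from scipy.stats import ttest_ind

def retrospective_aa(pre_metric_treatment, pre_metric_control):
    """Return the AA-period p-value; a small value casts doubt on the AB result."""
    t = np.nan_to_num(np.asarray(pre_metric_treatment, dtype=float), nan=0.0)
    c = np.nan_to_num(np.asarray(pre_metric_control, dtype=float), nan=0.0)
    _, p_value = ttest_ind(t, c)
    return p_value

# Usage with synthetic data (both groups drawn from the same distribution):
rng = np.random.default_rng(1)
p = retrospective_aa(rng.normal(size=5_000), rng.normal(size=5_000))
```

Zero imputation matches the procedure described above; note that it implicitly treats "no activity in the AA period" as a metric value of zero, which is reasonable for volume-type metrics but worth revisiting for ratio metrics.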

Theoretical Explanations:

• The p‑value distribution from a proper AA test should be uniform because the null hypothesis is true; this is illustrated with a coin‑flip analogy and can be verified with goodness‑of‑fit tests such as the Kolmogorov‑Smirnov test.
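The uniformity check can be sketched with simulated AA tests on synthetic data; the constants (500 simulations, 200 users per group) are illustrative:

```python
# Sketch: under the null (both groups identical), p-values from repeated AA
# tests should be Uniform(0, 1); verify with a Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ttest_ind, kstest

rng = np.random.default_rng(42)
p_values = []
for _ in range(500):                   # 500 simulated AA tests
    a = rng.normal(size=200)           # both groups receive the same "strategy"
    b = rng.normal(size=200)
    p_values.append(ttest_ind(a, b).pvalue)

ks_stat, ks_p = kstest(p_values, "uniform")   # H0: p-values ~ Uniform(0, 1)
# A small ks_p suggests the AA pipeline (or the metric's test statistic) is suspect.
```

This is the coin-flip analogy made concrete: each simulated AA test is a fair coin, and the KS test checks that the resulting p-values land evenly across [0, 1].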

• Three failure patterns when the p‑value histogram is not uniform: (1) skewed distribution, (2) one or more peaks (outliers), (3) large gaps (discrete values). Each pattern is described with visual examples and statistical implications.

• Formal theorems (Theorem 1–5) describe how rerandomization aligns covariates, improves core metric balance, and affects variance estimation. Theorems show that using CUPED after rerandomization yields an unbiased variance estimate, whereas a naïve t‑test would over‑estimate variance.
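The CUPED adjustment referenced in the theorems can be sketched generically (this is the standard covariate-adjustment form with synthetic data, not a reproduction of the article's theorems):

```python
# Sketch of CUPED variance reduction: adjust the in-experiment metric Y with a
# pre-experiment covariate X using theta = cov(X, Y) / var(X). Data are synthetic.
import numpy as np

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)): same mean as Y, lower variance
    whenever X and Y are correlated."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(7)
x = rng.normal(size=10_000)                        # pre-experiment metric
y = 0.8 * x + rng.normal(scale=0.6, size=10_000)   # correlated in-experiment metric
y_adj = cuped_adjust(y, x)
# Variance drops by roughly the squared correlation between X and Y.
```

Because rerandomization deliberately selects balanced covariates, plain variance formulas no longer reflect the restricted randomization; per the theorems above, pairing rerandomization with a CUPED-style adjustment restores an unbiased variance estimate.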

Real-World Validation: Experiments across business scenarios show that rerandomization can reduce the type-I error rate from the nominal 5% to as low as 0.3% when population overlap and metric correlation are high. In scenarios with missing AA data or low overlap, the benefit diminishes.

Open Issues:

Long-term experiments, where highly correlated historical data are scarce.

Low population overlap, where adaptive traffic allocation may help.

Clustered (stratified) experiments, where rerandomization may not be optimal and computational efficiency becomes a concern.

References:

Patterns of Trustworthy Experimentation: Pre‑Experiment Stage

p‑Values for Your p‑Values: Validating Metric Trustworthiness by Simulated A/A Tests

Rerandomization and Regression Adjustment

A Randomization‑Based Theory for Preliminary Testing of Covariate Balance in Controlled Trials

Tags: AB testing, statistical analysis, experiment design, AA testing, CUPED, rerandomization, variance estimation
Written by Didi Tech (official Didi technology account).