
AA Testing and Rerandomization Techniques for Reliable AB Experiments

The article outlines how AA testing and rerandomization can detect and correct non‑uniform traffic splits in short‑term AB experiments, detailing three solutions—AA tests, seed‑based rerandomization, and retrospective AA analysis—along with theoretical guarantees, empirical error‑rate reductions, and remaining challenges for long‑term or clustered designs.

Didi Tech

Background: With the increasing demand for fine-grained product operations, AB experiments are widely used to decide whether to launch new features or models. While AB tests usually provide trustworthy conclusions, their results can be called into question by Sample Ratio Mismatch (SRM) or AA problems. The AA problem asks whether traffic is split uniformly between the groups of a given AB experiment.
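As a quick aside, SRM itself can be flagged with a chi-square goodness-of-fit test on the observed group sizes. A minimal sketch with hypothetical counts (an intended 50/50 split is assumed):

```python
# Sketch: Sample Ratio Mismatch (SRM) check. Group sizes are illustrative
# placeholders, and a 50/50 intended split is assumed.
from scipy.stats import chisquare

treatment_n, control_n = 50_620, 49_380       # observed group sizes (hypothetical)
total = treatment_n + control_n
expected = [total / 2, total / 2]             # expected counts under a 50/50 split

stat, p_value = chisquare([treatment_n, control_n], f_exp=expected)
# A very small p-value (e.g. < 0.001) flags a likely SRM.
print(f"chi2 = {stat:.2f}, p = {p_value:.4g}")
```

With these counts the test rejects decisively, illustrating that even a ~0.6-point deviation from 50/50 is detectable at this sample size.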

Scope of AB Experiments Considered: (1) Short-term (≤ 1 month) randomized experiments that split users into groups; long-term hold-out, order-split, and time-slice experiments are excluded. (2) Only one treatment group and one control group are discussed (the analysis extends to multiple groups).

Definition of Uniform Allocation: Uniform allocation means that, if the experiment were not run, the two groups would show no significant difference in core metrics.

Why Uniform Allocation Matters: Uneven traffic can produce conclusions that contradict the true effect, because the observed difference between groups then reflects the pre-existing imbalance rather than the strategy's effectiveness.

Three Main Solutions to the AA Problem:

1. AA Test: Run an AA test in which the same strategy is applied to both groups. This reproduces the entire experiment workflow and helps uncover implementation traps. A checklist of common AA-test issues is provided, along with a tip that AA tests cannot detect problems that appear only in true AB experiments (e.g., network effects).

2. Rerandomization: Use historical data to evaluate many candidate hash seeds and select the one that produces the most balanced split. The steps are: extract user IDs and core metrics from a period before the planned experiment; generate many random hash seeds; for each seed, assign users to groups and compute the absolute t-statistic for each metric; take the maximum absolute t-statistic per seed as its representative value; and choose the seed with the smallest representative value. Rerandomization reduces the probability of an uneven split but cannot eliminate it entirely. Important considerations are population overlap, metric correlation, and the use of CUPED for variance reduction.
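The seed-selection loop above can be sketched as follows. This is a minimal illustration with synthetic data, not the production implementation; the function names and the MD5-based bucketing are assumptions:

```python
# Sketch of seed-based rerandomization: among many candidate hash seeds,
# pick the one whose worst-case |t| across core metrics is smallest.
import hashlib
import numpy as np
from scipy.stats import ttest_ind

def split_by_seed(user_ids, seed):
    """Deterministically hash each user ID together with the seed into group 0 or 1."""
    def bucket(uid):
        h = hashlib.md5(f"{seed}:{uid}".encode()).hexdigest()
        return int(h, 16) % 2
    return np.array([bucket(u) for u in user_ids])

def best_seed(user_ids, metrics, candidate_seeds):
    """Return the seed minimizing the maximum absolute t-statistic over metrics."""
    best, best_stat = None, np.inf
    for seed in candidate_seeds:
        groups = split_by_seed(user_ids, seed)
        worst = max(
            abs(ttest_ind(m[groups == 0], m[groups == 1]).statistic)
            for m in metrics
        )
        if worst < best_stat:
            best, best_stat = seed, worst
    return best, best_stat

# Usage with synthetic pre-experiment data:
rng = np.random.default_rng(0)
users = np.arange(5_000)
metrics = [rng.normal(size=5_000), rng.exponential(size=5_000)]
seed, worst_t = best_seed(users, metrics, candidate_seeds=range(20))
```

Hashing the user ID together with the seed (rather than reshuffling) keeps the assignment deterministic and reproducible at serving time, which is why seed selection can be done entirely offline.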

3. Retrospective AA Analysis: After the experiment, examine the same users' performance during the period before the experiment (the AA period). This provides an additional check on the reliability of the AB result. The procedure is: replace missing values with zero; perform a two-sample t-test on the metric between the experimental groups; and use the resulting p-value as an indicator of allocation uniformity. A significant difference in the retrospective AA analysis casts doubt on the AB conclusion.
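The retrospective procedure is short enough to sketch directly. The function name is hypothetical, and the inputs are assumed to be the AA-period metric values for the users who later landed in each group:

```python
# Sketch of a retrospective AA analysis: t-test the pre-experiment (AA-period)
# metric between the two experimental groups, imputing missing values as zero.
import numpy as np
from scipy.stats import ttest_ind

def retrospective_aa(pre_metric_treatment, pre_metric_control):
    """Return the AA-period p-value; a small value casts doubt on the AB result."""
    t = np.nan_to_num(np.asarray(pre_metric_treatment, dtype=float), nan=0.0)
    c = np.nan_to_num(np.asarray(pre_metric_control, dtype=float), nan=0.0)
    _, p_value = ttest_ind(t, c)
    return p_value

# Usage with synthetic data (both groups drawn from the same distribution):
rng = np.random.default_rng(1)
p = retrospective_aa(rng.normal(size=5_000), rng.normal(size=5_000))
```

Zero imputation matches the procedure described above; note that it implicitly treats "no activity in the AA period" as a metric value of zero, which is reasonable for volume-type metrics but worth revisiting for ratio metrics.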

Theoretical Explanations:

• The p‑value distribution from a proper AA test should be uniform because the null hypothesis is true; this is illustrated with a coin‑flip analogy and can be verified with goodness‑of‑fit tests such as the Kolmogorov‑Smirnov test.
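The uniformity check can be sketched with simulated AA tests on synthetic data; the constants (500 simulations, 200 users per group) are illustrative:

```python
# Sketch: under the null (both groups identical), p-values from repeated AA
# tests should be Uniform(0, 1); verify with a Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ttest_ind, kstest

rng = np.random.default_rng(42)
p_values = []
for _ in range(500):                   # 500 simulated AA tests
    a = rng.normal(size=200)           # both groups receive the same "strategy"
    b = rng.normal(size=200)
    p_values.append(ttest_ind(a, b).pvalue)

ks_stat, ks_p = kstest(p_values, "uniform")   # H0: p-values ~ Uniform(0, 1)
# A small ks_p suggests the AA pipeline (or the metric's test statistic) is suspect.
```

This is the coin-flip analogy made concrete: each simulated AA test is a fair coin, and the KS test checks that the resulting p-values land evenly across [0, 1].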

• Three failure patterns when the p‑value histogram is not uniform: (1) skewed distribution, (2) one or more peaks (outliers), (3) large gaps (discrete values). Each pattern is described with visual examples and statistical implications.

• Formal theorems (Theorem 1–5) describe how rerandomization aligns covariates, improves core metric balance, and affects variance estimation. Theorems show that using CUPED after rerandomization yields an unbiased variance estimate, whereas a naïve t‑test would over‑estimate variance.
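The CUPED adjustment referenced in the theorems can be sketched generically (this is the standard covariate-adjustment form with synthetic data, not a reproduction of the article's theorems):

```python
# Sketch of CUPED variance reduction: adjust the in-experiment metric Y with a
# pre-experiment covariate X using theta = cov(X, Y) / var(X). Data are synthetic.
import numpy as np

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)): same mean as Y, lower variance
    whenever X and Y are correlated."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(7)
x = rng.normal(size=10_000)                        # pre-experiment metric
y = 0.8 * x + rng.normal(scale=0.6, size=10_000)   # correlated in-experiment metric
y_adj = cuped_adjust(y, x)
# Variance drops by roughly the squared correlation between X and Y.
```

Because rerandomization deliberately selects balanced covariates, plain variance formulas no longer reflect the restricted randomization; per the theorems above, pairing rerandomization with a CUPED-style adjustment restores an unbiased variance estimate.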

Real-World Validation: Experiments across business scenarios show that rerandomization can reduce the type-I error rate from the nominal 5% to as low as 0.3% when population overlap and metric correlation are high. In scenarios with missing AA data or low overlap, the benefit diminishes.

Open Issues:

Long-term experiments, where highly correlated historical data are scarce.

Low population overlap, where adaptive traffic allocation may help.

Clustered (stratified) experiments, where rerandomization may not be optimal and computational efficiency becomes a concern.

References:

Patterns of Trustworthy Experimentation: Pre‑Experiment Stage

p‑Values for Your p‑Values: Validating Metric Trustworthiness by Simulated A/A Tests

Rerandomization and Regression Adjustment

A Randomization‑Based Theory for Preliminary Testing of Covariate Balance in Controlled Trials

Tags: AB testing, statistical analysis, experiment design, AA testing, CUPED, rerandomization, variance estimation
Written by Didi Tech (official Didi technology account).