
Handling Outliers in Internet A/B Experiments: Concepts, Methods, and Practical Recommendations

The article explains why outliers destabilize internet A/B tests, outlines their causes, compares trimming and winsorizing, presents lightweight detection (e.g., kurtosis) and risk‑control strategies, and offers practical recommendations for bias‑aware removal and variance‑reduction techniques to improve experimental precision.

JD Retail Technology

Background – Practitioners often encounter unstable experiment traffic, large fluctuations in historical metrics after random grouping, results that change dramatically once a few special users are removed, and inconsistent metric-filtering rules across business scenarios. These issues are typically caused by outliers in A/B experiments.

Conceptual Analysis – From an academic perspective, an outlier is a sample that deviates markedly from the rest of the data. No universal definition exists; the definition varies by domain, purpose, and data characteristics. The classic 3‑sigma rule (values beyond mean ± 3·standard deviation) assumes a normal distribution, which is unsuitable for the heavy‑tailed, power‑law metrics common in internet services.
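To illustrate why the rule breaks down, here is a minimal sketch (all data simulated, thresholds illustrative) comparing 3-sigma flagging on a roughly normal metric versus a heavy-tailed, lognormal one:

```python
import numpy as np

rng = np.random.default_rng(0)

def three_sigma_outliers(x):
    """Flag values outside mean ± 3·std (implicitly assumes normality)."""
    mu, sigma = x.mean(), x.std()
    return (x < mu - 3 * sigma) | (x > mu + 3 * sigma)

normal = rng.normal(0, 1, 100_000)      # well-behaved metric
heavy  = rng.lognormal(0, 2, 100_000)   # power-law-like metric (e.g., per-user GMV)

print(three_sigma_outliers(normal).mean())  # close to the textbook 0.27%
print(heavy.mean() - 3 * heavy.std())       # lower bound is negative — meaningless
print(three_sigma_outliers(heavy).mean())   # flag rate driven by a few extremes
```

For the lognormal metric the 3-sigma lower bound is negative, which is impossible for a non-negative metric, and the upper threshold is itself inflated by the extreme values it is supposed to catch, so the rule flags an unstable and unreliable fraction of users.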

Root Causes of Outliers

Measurement errors during data collection (instrument error).

Individual variability within a population (sampling randomness).

Data fraud or cheating (e.g., fake orders).

Mixed sample sources (e.g., B‑end users in a consumer app).

Why Remove Outliers in A/B Experiments?

Small groups of extreme users can break the uniform random split, causing imbalance between treatment and control groups.

Extreme metric values inflate variance, increase the minimum detectable effect (MDE), and drown true effects in noise.

Limitations of Outlier Removal

Statistical outlier removal cannot identify outliers defined by business logic (e.g., users flagged by risk rules) or those caused by metric-calculation errors.

Potential bias: removing extreme points may discard valuable information, requiring larger sample sizes or variance‑reduction techniques such as ANCOVA or CUPED.
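As a sketch of the variance-reduction idea, here is CUPED in its simplest form on simulated data (the pre-experiment covariate, its relationship to the outcome, and the seed are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: x = pre-experiment metric, y = in-experiment metric.
n = 50_000
x = rng.lognormal(0, 1, n)             # pre-period covariate
y = 0.8 * x + rng.lognormal(0, 1, n)   # outcome correlated with the covariate

theta = np.cov(x, y)[0, 1] / x.var(ddof=1)  # CUPED coefficient cov(X,Y)/var(X)
y_cuped = y - theta * (x - x.mean())        # adjusted metric; same mean as y

print(y.var(), y_cuped.var())  # variance shrinks by roughly a factor of (1 - rho^2)
```

The adjustment leaves the metric's mean (and hence the treatment-effect estimate) unchanged while shrinking its variance in proportion to the squared correlation between covariate and outcome.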

Traditional Statistical Methods: Trim & Winsorize

Both methods originate from classic survey analysis to mitigate the influence of extreme values.

Winsorizing (or tail‑capping): replace values beyond a chosen percentile with the percentile value.

Trimming (or tail‑removal): discard values beyond a chosen percentile.
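A minimal one-sided implementation of both operations on the upper tail (the 99th-percentile cutoff is an assumed example; production platforms may cap both tails or choose other percentiles):

```python
import numpy as np

def winsorize(x, upper_pct=99):
    """Cap values above the chosen percentile at the percentile value."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap)

def trim(x, upper_pct=99):
    """Discard values above the chosen percentile."""
    cap = np.percentile(x, upper_pct)
    return x[x <= cap]

rng = np.random.default_rng(2)
x = rng.lognormal(0, 2, 100_000)  # heavy-tailed metric

# Winsorizing keeps every user (capped values stay in the sample);
# trimming drops the top 1% of users outright.
print(x.mean(), winsorize(x).mean(), trim(x).mean())
```

On a heavy right tail both operations pull the mean down, but trimming moves it further from the untouched estimate, consistent with the bias comparison below.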

Empirical comparison shows that trimming often yields larger bias in mean estimation, while winsorizing provides greater variance reduction for a comparable bias, making it a more robust choice when preserving sample information is important.

When the dataset contains many dirty or fraudulent users, trimming can effectively reduce their impact. In the absence of strong business signals, winsorizing offers a conservative, information‑preserving alternative.

Risk‑Control Model Application

Example: on 2024‑09‑30 a control group showed an abnormal surge in average session duration. Investigation revealed cheating users; after risk‑team removal, the anomaly disappeared. Server‑side reporting can also prevent such users from entering experiments.

Some Outlier Detection Methods

A lightweight approach suitable for experiment platforms: if kurtosis exceeds a threshold, flag the top x % extreme values as outliers. Kurtosis measures tail heaviness; high kurtosis indicates occasional extreme deviations.
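A sketch of such a lightweight rule, using a numpy-only kurtosis estimate; the threshold and tail percentage here are illustrative, not the platform's actual values:

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3

def flag_if_heavy_tailed(x, kurt_threshold=10.0, top_pct=0.5):
    """If excess kurtosis exceeds the threshold, flag the top `top_pct`%
    of values as outliers; otherwise flag nothing."""
    if excess_kurtosis(x) <= kurt_threshold:
        return np.zeros(len(x), dtype=bool)   # distribution looks tame
    return x > np.percentile(x, 100 - top_pct)

rng = np.random.default_rng(3)
tame  = rng.normal(0, 1, 100_000)     # excess kurtosis near 0 -> nothing flagged
heavy = rng.lognormal(0, 2, 100_000)  # very high kurtosis -> top 0.5% flagged
print(flag_if_heavy_tailed(tame).sum(), flag_if_heavy_tailed(heavy).mean())
```

The kurtosis gate is what makes this cheap and safe: well-behaved metrics pass through untouched, and only demonstrably heavy-tailed ones have their extreme tail flagged.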

When simple Z‑score methods fail, more sophisticated techniques (e.g., robust statistical models) can be employed. A brief overview of several detection algorithms is illustrated in the following diagram.

Appendix

Relationship between experiment variance and precision: with group means ȳ_t and ȳ_c, sample variances s_t² and s_c², and sample sizes n_t and n_c, the test statistic is

t = (ȳ_t − ȳ_c) / sqrt(s_t²/n_t + s_c²/n_c)

where the denominator is sqrt(var(ATE)), the standard error of the average treatment effect. Larger variance reduces the t-statistic, making a true effect harder to detect; thus reducing var(ATE) via variance-reduction techniques or larger samples improves experimental precision.
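A small simulation of this relationship; the data, the +5% effect, and the 99th-percentile cap are invented for illustration:

```python
import numpy as np

def t_stat(treat, control):
    """t = ATE / sqrt(var(ATE)), with var(ATE) = s_t^2/n_t + s_c^2/n_c."""
    ate = treat.mean() - control.mean()
    var_ate = treat.var(ddof=1) / len(treat) + control.var(ddof=1) / len(control)
    return ate / np.sqrt(var_ate)

rng = np.random.default_rng(4)
control = rng.lognormal(0, 2, 20_000)         # heavy-tailed metric
treat   = rng.lognormal(0, 2, 20_000) * 1.05  # simulated +5% true effect

cap = np.percentile(np.concatenate([treat, control]), 99)
print(t_stat(treat, control))  # raw: large var(ATE), small t
print(t_stat(np.minimum(treat, cap), np.minimum(control, cap)))  # winsorized: var(ATE) shrinks
```

Capping both groups at a common percentile typically cuts var(ATE) by an order of magnitude on a metric this heavy-tailed, which is exactly the mechanism by which winsorizing sharpens detection.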

About Us

Stone (试金石, "touchstone") is JD Retail's unified A/B testing platform, providing data‑driven experiment design and analysis to enable reliable product and service optimization. The team is hiring for data science, data engineering, front‑end, back‑end, product, and other roles; interested candidates may email [email protected].

Tags: Big Data · A/B testing · statistical methods · experiment design · outlier detection · trim · winsorize
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
