
Handling Outliers in Internet A/B Experiments: Concepts, Methods, and Practical Recommendations

The article explains why outliers destabilize internet A/B tests, outlines their causes, compares trimming and winsorizing, presents lightweight detection (e.g., kurtosis) and risk‑control strategies, and offers practical recommendations for bias‑aware removal and variance‑reduction techniques to improve experimental precision.

JD Retail Technology

Background – Practitioners often encounter unstable experiment traffic, large fluctuations in historical metrics after random grouping, results that change dramatically once a few special users are removed, and inconsistent metric-filtering rules across business scenarios. These issues are typically caused by outliers in A/B experiments.

Conceptual Analysis – From an academic perspective, an outlier is a sample that deviates markedly from the rest of the data. No universal definition exists; the definition varies by domain, purpose, and data characteristics. The classic 3‑sigma rule (values beyond mean ± 3·standard deviation) assumes a normal distribution, which is unsuitable for the heavy‑tailed, power‑law metrics common in internet services.
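To illustrate why the rule breaks down, here is a minimal sketch (all data simulated, thresholds illustrative) comparing 3-sigma flagging on a roughly normal metric versus a heavy-tailed, lognormal one:

```python
import numpy as np

rng = np.random.default_rng(0)

def three_sigma_outliers(x):
    """Flag values outside mean ± 3·std (implicitly assumes normality)."""
    mu, sigma = x.mean(), x.std()
    return (x < mu - 3 * sigma) | (x > mu + 3 * sigma)

normal = rng.normal(0, 1, 100_000)      # well-behaved metric
heavy  = rng.lognormal(0, 2, 100_000)   # power-law-like metric (e.g., per-user GMV)

print(three_sigma_outliers(normal).mean())  # close to the textbook 0.27%
print(heavy.mean() - 3 * heavy.std())       # lower bound is negative — meaningless
print(three_sigma_outliers(heavy).mean())   # flag rate driven by a few extremes
```

For the lognormal metric the 3-sigma lower bound is negative, which is impossible for a non-negative metric, and the upper threshold is itself inflated by the extreme values it is supposed to catch, so the rule flags an unstable and unreliable fraction of users.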

Root Causes of Outliers

Measurement errors during data collection (instrument error).

Individual variability within a population (sampling randomness).

Data fraud or cheating (e.g., fake orders).

Mixed sample sources (e.g., B‑end users in a consumer app).

Why Remove Outliers in A/B Experiments?

Small groups of extreme users can break the uniform random split, causing imbalance between treatment and control groups.

Extreme metric values inflate variance, increase the minimum detectable effect (MDE), and drown true effects in noise.

Limitations of Outlier Removal

Statistical outlier removal cannot identify outliers defined by business logic (e.g., users flagged by risk rules) or those caused by metric-calculation errors.

Potential bias: removing extreme points may discard valuable information, requiring larger sample sizes or variance‑reduction techniques such as ANCOVA or CUPED.
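As a sketch of the variance-reduction idea, here is CUPED in its simplest form on simulated data (the pre-experiment covariate, its relationship to the outcome, and the seed are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: x = pre-experiment metric, y = in-experiment metric.
n = 50_000
x = rng.lognormal(0, 1, n)             # pre-period covariate
y = 0.8 * x + rng.lognormal(0, 1, n)   # outcome correlated with the covariate

theta = np.cov(x, y)[0, 1] / x.var(ddof=1)  # CUPED coefficient cov(X,Y)/var(X)
y_cuped = y - theta * (x - x.mean())        # adjusted metric; same mean as y

print(y.var(), y_cuped.var())  # variance shrinks by roughly a factor of (1 - rho^2)
```

The adjustment leaves the metric's mean (and hence the treatment-effect estimate) unchanged while shrinking its variance in proportion to the squared correlation between covariate and outcome.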

Traditional Statistical Methods: Trim & Winsorize

Both methods originate from classic survey analysis to mitigate the influence of extreme values.

Winsorizing (or tail‑capping): replace values beyond a chosen percentile with the percentile value.

Trimming (or tail‑removal): discard values beyond a chosen percentile.
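A minimal one-sided implementation of both operations on the upper tail (the 99th-percentile cutoff is an assumed example; production platforms may cap both tails or choose other percentiles):

```python
import numpy as np

def winsorize(x, upper_pct=99):
    """Cap values above the chosen percentile at the percentile value."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap)

def trim(x, upper_pct=99):
    """Discard values above the chosen percentile."""
    cap = np.percentile(x, upper_pct)
    return x[x <= cap]

rng = np.random.default_rng(2)
x = rng.lognormal(0, 2, 100_000)  # heavy-tailed metric

# Winsorizing keeps every user (capped values stay in the sample);
# trimming drops the top 1% of users outright.
print(x.mean(), winsorize(x).mean(), trim(x).mean())
```

On a heavy right tail both operations pull the mean down, but trimming moves it further from the untouched estimate, consistent with the bias comparison below.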

Empirical comparison shows that trimming often yields larger bias in mean estimation, while winsorizing provides greater variance reduction for a comparable bias, making it a more robust choice when preserving sample information is important.

When the dataset contains many dirty or fraudulent users, trimming can effectively reduce their impact. In the absence of strong business signals, winsorizing offers a conservative, information‑preserving alternative.

Risk‑Control Model Application

Example: on 2024‑09‑30 a control group showed an abnormal surge in average session duration. Investigation revealed cheating users; after risk‑team removal, the anomaly disappeared. Server‑side reporting can also prevent such users from entering experiments.

Some Outlier Detection Methods

A lightweight approach suitable for experiment platforms: if kurtosis exceeds a threshold, flag the top x % extreme values as outliers. Kurtosis measures tail heaviness; high kurtosis indicates occasional extreme deviations.
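A sketch of such a lightweight rule, using a numpy-only kurtosis estimate; the threshold and tail percentage here are illustrative, not the platform's actual values:

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3

def flag_if_heavy_tailed(x, kurt_threshold=10.0, top_pct=0.5):
    """If excess kurtosis exceeds the threshold, flag the top `top_pct`%
    of values as outliers; otherwise flag nothing."""
    if excess_kurtosis(x) <= kurt_threshold:
        return np.zeros(len(x), dtype=bool)   # distribution looks tame
    return x > np.percentile(x, 100 - top_pct)

rng = np.random.default_rng(3)
tame  = rng.normal(0, 1, 100_000)     # excess kurtosis near 0 -> nothing flagged
heavy = rng.lognormal(0, 2, 100_000)  # very high kurtosis -> top 0.5% flagged
print(flag_if_heavy_tailed(tame).sum(), flag_if_heavy_tailed(heavy).mean())
```

The kurtosis gate is what makes this cheap and safe: well-behaved metrics pass through untouched, and only demonstrably heavy-tailed ones have their extreme tail flagged.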

When simple Z‑score methods fail, more sophisticated techniques (e.g., robust statistical models) can be employed. A brief overview of several detection algorithms is illustrated in the following diagram.

Appendix

Relationship between experiment variance and precision: with group means ȳ_t and ȳ_c, sample variances s_t² and s_c², and sample sizes n_t and n_c, the test statistic is

t = (ȳ_t − ȳ_c) / sqrt(s_t²/n_t + s_c²/n_c)

where the denominator is sqrt(var(ATE)), the standard error of the average treatment effect. Larger variance reduces the t-statistic, making a true effect harder to detect; thus reducing var(ATE) via variance-reduction techniques or larger samples improves experimental precision.
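A small simulation of this relationship; the data, the +5% effect, and the 99th-percentile cap are invented for illustration:

```python
import numpy as np

def t_stat(treat, control):
    """t = ATE / sqrt(var(ATE)), with var(ATE) = s_t^2/n_t + s_c^2/n_c."""
    ate = treat.mean() - control.mean()
    var_ate = treat.var(ddof=1) / len(treat) + control.var(ddof=1) / len(control)
    return ate / np.sqrt(var_ate)

rng = np.random.default_rng(4)
control = rng.lognormal(0, 2, 20_000)         # heavy-tailed metric
treat   = rng.lognormal(0, 2, 20_000) * 1.05  # simulated +5% true effect

cap = np.percentile(np.concatenate([treat, control]), 99)
print(t_stat(treat, control))  # raw: large var(ATE), small t
print(t_stat(np.minimum(treat, cap), np.minimum(control, cap)))  # winsorized: var(ATE) shrinks
```

Capping both groups at a common percentile typically cuts var(ATE) by an order of magnitude on a metric this heavy-tailed, which is exactly the mechanism by which winsorizing sharpens detection.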

About Us

Stone (试金石, "touchstone") is JD Retail's unified A/B testing platform, providing data‑driven experiment design and analysis to enable reliable product and service optimization. The team is hiring for data science, data engineering, front‑end, back‑end, product, and other roles; interested candidates may email [email protected].

Tags: Big Data · A/B testing · statistical methods · experiment design · outlier detection · trim · winsorize
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
