
How to Tackle Outliers in Internet A/B Experiments: Methods, Pitfalls, and Practical Tips

This article explores why outliers appear in large‑scale internet A/B tests, explains their impact on experiment precision, compares traditional trim and winsorize techniques, reviews a range of statistical and machine‑learning detection methods, and offers practical recommendations for handling them in product experiments.

JD Tech Talk

Background

Experimenters often encounter unstable traffic allocation, large fluctuations in historical metrics after multiple splits, and inconsistent filtering rules across business scenarios, leading to unexpected results or even sign reversals when special users are removed.

These issues are typically caused by outliers in the data.

We will discuss how to handle outliers in internet A/B testing scenarios.

Concept Analysis

From a strict academic perspective, there is no universal definition of an outlier; different fields adopt various definitions and detection methods based on purpose and data characteristics. Generally, an outlier is a sample that differs markedly from the rest of the dataset.

The classic 3‑sigma rule treats any observation beyond three standard deviations as an outlier, assuming a normal distribution. However, many internet metrics follow a power‑law distribution, making the 3‑sigma rule inappropriate (e.g., the top 1% of JD users generate most of the GMV, which is not a meaningful outlier).
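A minimal sketch with simulated data illustrates the point: on normally distributed data the 3-sigma rule flags roughly the intended 0.3% of points, but on heavy-tailed spend data (a lognormal stand-in for a power-law metric) it flags a much larger group of users who are simply high spenders, not errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal data: the 3-sigma rule flags roughly 0.3% of points, as intended.
normal = rng.normal(loc=100, scale=10, size=100_000)
flag_normal = np.abs(normal - normal.mean()) > 3 * normal.std()

# Heavy-tailed spend data: lognormal as a stand-in for a power-law metric.
spend = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)
flag_spend = np.abs(spend - spend.mean()) > 3 * spend.std()

print(f"normal flagged:    {flag_normal.mean():.4%}")
print(f"lognormal flagged: {flag_spend.mean():.4%}")
print(f"share of total spend in flagged users: "
      f"{spend[flag_spend].sum() / spend.sum():.1%}")
```

The flagged lognormal observations typically account for a double-digit share of total spend, which is exactly the "top 1% of users generate most of the GMV" situation where discarding them would bias the experiment.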

Future research should focus on large‑sample, power‑law scenarios similar to internet A/B experiments, drawing on statistical methods from sociology, economics, and emerging algorithmic approaches for outlier detection and treatment.

Basic Causes of Outliers

Measurement errors during data collection (instrument error)

Individual differences within the population (sampling randomness)

Data fraud or cheating (e.g., fake orders)

Samples drawn from heterogeneous user groups (e.g., B‑side users in JD App)

1. Why Do A/B Experiments Need Outlier Treatment?

A small number of abnormal users can cause uneven traffic allocation between treatment and control groups.

Abnormal users often have extreme metric values, inflating overall variance, reducing experiment precision, and increasing the minimum detectable effect (MDE).
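The variance-to-MDE link can be sketched with simulated data. The `mde` helper below uses the standard two-sample approximation MDE ≈ (z₁₋α/₂ + z_power)·√(2σ²/n); the "whale" user counts and values are illustrative assumptions, not figures from the article.

```python
import numpy as np
from scipy.stats import norm

def mde(sigma, n, alpha=0.05, power=0.8):
    # Two-sample MDE approximation: (z_{1-alpha/2} + z_power) * sqrt(2*sigma^2/n)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * np.sqrt(2 * sigma**2 / n)

rng = np.random.default_rng(1)
base = rng.gamma(shape=2.0, scale=50.0, size=100_000)    # typical users
whales = rng.uniform(50_000, 100_000, size=100)          # 0.1% extreme users
mixed = np.concatenate([base, whales])

print("MDE without extremes:", mde(base.std(), base.size))
print("MDE with extremes:   ", mde(mixed.std(), mixed.size))
```

Adding just 0.1% extreme users inflates the sample standard deviation by more than an order of magnitude here, and the MDE inflates proportionally: the experiment can only detect far larger effects at the same sample size.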

2. Limitations of Outlier Removal

Statistical removal cannot identify outliers defined by business logic, nor errors in metric calculation.

Potential bias: removing abnormal users may also discard valid data, requiring larger sample sizes or variance‑reduction techniques such as ANCOVA or CUPED.
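CUPED, mentioned above as a variance-reduction alternative, adjusts the in-experiment metric Y using a pre-experiment covariate X via Y' = Y − θ(X − E[X]) with θ = cov(X, Y)/var(X). A minimal sketch with simulated pre/post metrics (the correlation strength is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
pre = rng.gamma(2.0, 50.0, size=n)               # pre-experiment metric X
post = 0.8 * pre + rng.normal(0, 30, size=n)     # in-experiment metric Y

# CUPED: theta = cov(X, Y) / var(X); subtract the predictable part of Y.
theta = np.cov(pre, post)[0, 1] / pre.var()
adjusted = post - theta * (pre - pre.mean())

print("variance reduction:", 1 - adjusted.var() / post.var())
```

The adjustment leaves the mean (and hence the treatment-effect estimate) unchanged while removing the variance explained by the covariate, roughly a fraction ρ² for correlation ρ between X and Y.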

Traditional Statistical Methods – Trim & Winsorize for Subsidy Experiments

1. What Are Trim and Winsorize?

Both originate from classic survey analysis. Winsorizing replaces values beyond a chosen percentile with the percentile value, while trimming discards those extreme values.
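The two techniques can be sketched on simulated heavy-tailed data, here capping or dropping the top 1% (the cutoff and distribution are illustrative assumptions):

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(3)
x = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# Winsorize: cap the top 1% at the 99th percentile (lower tail untouched here).
w = winsorize(x, limits=(0, 0.01))

# Trim: drop the top 1% entirely.
t = np.sort(x)[: int(len(x) * 0.99)]

print(f"raw mean        {x.mean():9.2f}  var {x.var():13.2f}")
print(f"winsorized mean {w.mean():9.2f}  var {w.var():13.2f}")
print(f"trimmed mean    {t.mean():9.2f}  var {t.var():13.2f}")
```

As the next subsection discusses, trimming pulls the mean down further (the capped values still contribute under winsorizing), while both sharply cut the variance relative to the raw data.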

2. Data Performance

At the same cutoff percentile, trimming lowers the sample mean more than winsorizing, while winsorizing achieves greater variance reduction for a comparable mean bias.

Comparison chart

Winsorize shows a better trade‑off between bias (horizontal axis) and variance reduction (vertical axis).

Bias‑variance trade‑off

The methods affect Type I and Type II errors as illustrated below.

Error impact

3. Principle and Effect Comparison

If most samples are normal but dispersed, prefer Winsorizing to retain information.

If many dirty or cheating users exist, trimming can effectively suppress their influence.

When business input is limited, Winsorizing offers a conservative outlier‑handling approach.

4. Practical Recommendations

Tailor outlier filters to each metric and scenario, removing fraudulent or noisy data.

For key metrics, perform a one‑time analysis and, if effective, embed the procedure into the experimentation platform.

Risk‑Control Model Application

Example: On 2024‑09‑30, the control group showed an abnormal increase in average duration, distorting experiment observation.

Abnormal duration spike

Investigation revealed cheating users; after removing them via the risk‑control team, the anomaly disappeared. Server‑side reporting can also reduce exposure to abnormal users.

Introduction to Some Outlier Detection Methods

Common lightweight approach for experimentation platforms: when kurtosis exceeds a threshold, treat the top x% extreme values as outliers, defining the range via quantiles.

Kurtosis measures tail heaviness; high kurtosis indicates variance driven by rare extreme deviations. Formula: γ₂ = μ₄ / σ⁴ − 3, where μ₄ is the fourth moment.
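The kurtosis-gated quantile rule described above can be sketched as follows; the threshold value and the top-1% cutoff are illustrative assumptions, since the article does not specify the platform's actual settings.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(4)
normal = rng.normal(size=100_000)
heavy = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

# scipy's kurtosis defaults to Fisher's definition: excess kurtosis mu4/sigma^4 - 3.
print("normal excess kurtosis:   ", kurtosis(normal))   # near 0
print("lognormal excess kurtosis:", kurtosis(heavy))    # large -> heavy tail

KURTOSIS_THRESHOLD = 10  # hypothetical platform threshold
capped = heavy
if kurtosis(heavy) > KURTOSIS_THRESHOLD:
    cap = np.quantile(heavy, 0.99)        # treat the top 1% as the outlier range
    capped = np.minimum(heavy, cap)
    print("excess kurtosis after capping:", kurtosis(capped))
```

Capping at the quantile is the winsorize variant of this rule; a platform could equally drop the flagged tail (trim) depending on the metric.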

When simple Z‑score fails, more sophisticated methods can be employed:

| Method | Basic Principle | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Boxplot | 1.5 × IQR threshold | Intuitive, easy | Poor adaptivity for some distributions |
| Z-score | Standardize data, apply 3-sigma or Chebyshev | Simple, intuitive | Sensitive to outliers; assumptions on distribution |
| Grubbs | Tests extreme values against the mean | Reduces variance impact | Limited to small samples |
| Variance contribution | Find subset minimizing variance-based loss | Simple, intuitive | NP-hard, requires heuristics |
| EM algorithm | Fit mixture model, use density to flag outliers | Handles threshold selection | Strong priors needed, heavy computation |
| Median absolute deviation (MAD) | Use median of absolute deviations from the median | Robust to outliers | Poor adaptivity |
| ODIN/LOF/LOCI | Distance- or density-based detection | Adaptive to local density | Computationally expensive |
| Histogram/Parzen density | Estimate probability density around points | No prior distribution needed | Bandwidth selection critical, costly |
| Regression residuals | Fit prior model, use residuals | Intuitive when model is reliable | Requires strong prior; residuals affected by outliers |
| PCA | Project to lower dimensions; large residual norm indicates outlier | Handles high-dimensional data | Choosing k and kernel is non-trivial |
| Matrix factorization | Low-rank approximation; large residuals flagged | Handles missing data | Limited to linear relationships |
| Autoencoder | Neural-net reconstruction error | Non-linear, unsupervised | High computation |
| One-Class SVM | Soft-margin separation of the normal class | Kernel-based non-linearity | Expensive kernel computation |
| Isolation Forest | Tree depth as anomaly score | Efficient; can use kurtosis for feature selection | Threshold selection difficult |
| Robust Random Cut Forest | Improved Isolation Forest with range-based feature sampling | More efficient feature selection | Threshold selection difficult |
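Of the methods above, Isolation Forest is among the easiest to try off the shelf. A minimal sketch with scikit-learn on simulated per-user features (the feature choices, anomaly pattern, and contamination rate are all illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# Per-user features: [order count, total spend]; mostly normal users
# plus a small group of anomalous ones (e.g. fake-order bursts).
normal_users = np.column_stack([
    rng.poisson(5, size=2_000),
    rng.gamma(2.0, 50.0, size=2_000),
])
fraud_users = np.column_stack([
    rng.poisson(200, size=20),
    rng.uniform(10_000, 50_000, size=20),
])
X = np.vstack([normal_users, fraud_users])

# contamination sets the expected outlier share -- the "threshold
# selection" difficulty noted in the table.
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = outlier, 1 = inlier
print("flagged:", int((labels == -1).sum()), "of", len(X))
```

The `contamination` parameter is exactly where the table's "threshold selection difficult" caveat bites: set it too low and fraud slips through; too high and valid heavy-tail users are discarded.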

Appendix

The relationship between experiment volatility and precision: using group means, variances, and sample sizes, the test statistic can be expressed as follows.

t = (X̄_T − X̄_C) / sqrt(s_T²/n_T + s_C²/n_C) = E(ATE) / sqrt(var(ATE))

When variance (var(ATE)) increases while the effect (E(ATE)) stays constant, the T‑statistic decreases, making the result less likely to be significant. Reducing var(ATE) by adjusting sample variance or size improves precision.
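The relationship can be verified numerically; the group means, variances, and sample sizes below are made-up numbers for illustration.

```python
import numpy as np

def t_stat(mean_t, var_t, n_t, mean_c, var_c, n_c):
    # t = estimated ATE / its standard error
    ate = mean_t - mean_c
    se = np.sqrt(var_t / n_t + var_c / n_c)
    return ate / se

# Same effect (ATE = 1) in both cases; quadrupling the variance halves t.
print(t_stat(101, 400, 10_000, 100, 400, 10_000))
print(t_stat(101, 1600, 10_000, 100, 1600, 10_000))
```

Since t scales as 1/sqrt(var), halving the per-user variance (e.g. via winsorizing or CUPED) has the same effect on precision as doubling the sample size.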

Variance impact
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.