
How to Tackle Outliers in Internet A/B Experiments: Methods & Best Practices

This article explores why outliers destabilize online A/B tests, explains their statistical definitions, compares trimming and winsorizing techniques, reviews classic and machine‑learning detection methods, and offers practical guidance for applying these approaches to improve experiment reliability.

JD Cloud Developers

Background

When running online experiments, practitioners often encounter unstable traffic splits, large fluctuations in historical metrics after multiple splits, and results that deviate sharply from expectations, especially after removing a few special users. Different business scenarios also apply different metric‑filtering rules, making it unclear which metrics to trust.

These issues are typically caused by outliers in the experiment data.

Concept Analysis

From a strict academic perspective, there is no single definition of an outlier; it varies across domains based on purpose and data characteristics. Generally, an outlier is a sample that differs markedly from the rest of the dataset. The classic 3‑sigma rule flags any point beyond three standard deviations as an outlier, but this assumes a normal (or near‑normal) distribution, which poorly fits the power‑law metrics common in internet companies (e.g., the top 1% of JD users generate disproportionate GMV).
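As an illustration (a minimal sketch, not from the original article; the Pareto parameters are arbitrary), the snippet below applies the 3-sigma rule to a normal sample and to a heavy-tailed sample. On normal data it flags the expected ~0.3% of points; on power-law data the standard deviation itself is driven by the tail, so the rule becomes unstable.

```python
import numpy as np

def three_sigma_outliers(x: np.ndarray) -> np.ndarray:
    """Boolean mask of points more than 3 standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma

rng = np.random.default_rng(42)
normal = rng.normal(loc=100, scale=10, size=100_000)
heavy = rng.pareto(a=1.5, size=100_000)  # power-law-like, e.g. per-user GMV

print(three_sigma_outliers(normal).mean())  # ~0.0027, as theory predicts
print(three_sigma_outliers(heavy).mean())   # erratic: the std is tail-dominated
```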

The discussion below therefore focuses on large‑sample, power‑law scenarios typical of internet A/B tests, drawing on statistical methods from sociology and economics as well as emerging algorithmic outlier‑detection techniques.

Basic Causes of Outliers

Measurement errors during data collection (instrument error).

Individual differences within the population (sampling randomness).

Data fraud or cheating (e.g., fake orders).

Samples sourced from heterogeneous user groups (e.g., B‑side users in the JD app).

Why Handle Outliers in A/B Experiments?

1. Need for Outlier Processing

A small number of abnormal users can prevent perfectly balanced traffic splits, leading to uneven allocation between treatment and control groups.

Abnormal users often have extreme metric values, inflating overall variance, reducing experiment precision, and increasing the minimum detectable effect (MDE).
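To make the MDE point concrete, here is a hedged sketch using the standard two‑sample formula MDE = (z₁₋α/₂ + z_power) · √(2σ²/n); the sample sizes and standard deviations are illustrative, not from the source.

```python
from scipy.stats import norm

def mde(sigma: float, n_per_group: int,
        alpha: float = 0.05, power: float = 0.8) -> float:
    """Minimum detectable effect for a two-sample test on means:
    (z_{1-alpha/2} + z_{power}) * sqrt(2 * sigma^2 / n)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * (2 * sigma**2 / n_per_group) ** 0.5

# A handful of extreme users doubling the metric's std doubles the MDE:
print(mde(sigma=10.0, n_per_group=50_000))  # ~0.18
print(mde(sigma=20.0, n_per_group=50_000))  # ~0.35
```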

2. Limitations of Outlier Removal

Inability to identify business‑logic‑defined outliers or metric‑calculation errors.

Potential bias introduced by discarding valid samples, which may require larger sample sizes or variance‑reduction techniques such as ANCOVA or CUPED.
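CUPED is mentioned only in passing above; as a minimal sketch of the idea (variable names and the simulated data are illustrative), it adjusts the experiment metric with a pre‑experiment covariate, removing variance without discarding any users:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: y_adj = y - theta * (x_pre - mean(x_pre)), with
    theta = cov(x_pre, y) / var(x_pre). The mean of y is preserved,
    but variance explained by the pre-period covariate is removed."""
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x_pre = rng.gamma(shape=2.0, scale=50.0, size=100_000)  # pre-period spend
y = 0.8 * x_pre + rng.normal(0, 30, size=100_000)       # in-experiment spend

y_adj = cuped_adjust(y, x_pre)
print(np.var(y), np.var(y_adj))  # adjusted variance is far smaller
```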

Traditional Statistical Methods: Trim & Winsorize

1. What Are Trim and Winsorize?

Both methods originate from early 20th‑century survey analysis to address small sample sizes and data collection errors.

Winsorizing: Replace values beyond a chosen percentile with the percentile value itself.

Trimming: Discard values beyond a chosen percentile.
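Both operations have off‑the‑shelf SciPy helpers; here is a minimal sketch (the 1% cut level and the simulated heavy‑tailed metric are illustrative choices):

```python
import numpy as np
from scipy.stats import trimboth
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(7)
spend = rng.pareto(a=2.0, size=100_000) * 100  # heavy-tailed per-user metric

w = winsorize(spend, limits=(0.01, 0.01))  # clamp both 1% tails to the cut values
t = trimboth(spend, proportiontocut=0.01)  # drop both 1% tails entirely

for name, x in [("raw", spend), ("winsorized", w), ("trimmed", t)]:
    print(f"{name:>10}: mean={np.mean(x):8.2f}  std={np.std(x):8.2f}")
```

As the output shows, trimming moves the mean and standard deviation more than winsorizing at the same cutoff, which is the pattern discussed in the next subsection.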

2. Data Performance

At the same percentile cutoff, trimming generally has a larger impact on the sample mean and standard deviation, while winsorizing offers a better trade‑off between bias (mean‑estimation error) and variance reduction.

3. Principle & Effect Comparison

If the sample is mostly clean but dispersed, prefer winsorizing to retain information.

If the sample contains many dirty or fraudulent users, trimming can effectively reduce their impact.

When business input is limited, winsorizing provides a conservative default for outlier handling.

4. Business Recommendations

Tailor outlier handling to each metric and scenario, removing fraudulent or dirty data whenever possible.

Perform a one‑off analysis for key metrics and, once a robust solution is identified, embed it into the experimentation platform.

Risk‑Model Application

Risk models can help detect abnormal duration metrics. For example, an experiment on 2024‑09‑30 showed an unexpected increase in average session length in the control group. Investigation revealed cheating users; after removing them, the metric stabilized.

Server‑side reporting can also mitigate the entry of abnormal users into experiments.

Introduction to Outlier Detection Methods

Common lightweight approaches for experimentation platforms include flagging the top x% of extreme values when kurtosis exceeds a threshold, thereby defining an outlier range based on quantiles (a sketch follows the formula below).

Kurtosis (γ₂) = μ₄ / σ⁴, where μ₄ is the fourth central moment and σ the standard deviation; a normal distribution scores 3 on this measure, so larger values indicate heavier tails.
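A hedged sketch of that lightweight rule (the kurtosis threshold of 9 and the top‑1% cut are illustrative knobs, not values from the source):

```python
import numpy as np
from scipy.stats import kurtosis

def flag_if_heavy_tailed(x: np.ndarray,
                         kurtosis_threshold: float = 9.0,
                         top_pct: float = 0.01) -> np.ndarray:
    """Flag the top `top_pct` of values by quantile, but only when the
    distribution is heavy-tailed (kurtosis mu_4 / sigma^4 above threshold)."""
    g2 = kurtosis(x, fisher=False)  # mu_4 / sigma^4; a normal scores ~3
    if g2 <= kurtosis_threshold:
        return np.zeros(len(x), dtype=bool)  # distribution looks tame
    return x > np.quantile(x, 1 - top_pct)

rng = np.random.default_rng(1)
gmv = rng.pareto(a=1.8, size=100_000) * 100
print(flag_if_heavy_tailed(gmv).sum(), "values flagged")  # ~1% of the sample
```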

When simple methods like Z‑score perform poorly, more sophisticated techniques can be employed:

Statistical & Probabilistic Models

Boxplot (1.5 × IQR rule) – intuitive but may fail for certain distributions.

Z‑score – works for normal data; extensions rely on Chebyshev or Hoeffding bounds, or on Mahalanobis distance, for other cases.

Grubbs test – suitable for small samples.

Variance‑contribution based selection – computationally intensive.

EM algorithm – requires strong prior assumptions.

Median absolute deviation (MAD) – robust but not adaptive (see the sketch after this list).
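For reference, a minimal sketch (not from the article) of two of the rules above, the 1.5 × IQR boxplot rule and a MAD‑based robust z‑score; the 0.6745 constant rescales MAD to estimate σ under normality, and the 3.5 cutoff is a common but arbitrary choice:

```python
import numpy as np

def iqr_outliers(x: np.ndarray) -> np.ndarray:
    """Boxplot rule: flag points beyond 1.5 * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def mad_outliers(x: np.ndarray, z: float = 3.5) -> np.ndarray:
    """MAD rule: flag points whose robust z-score
    0.6745 * |x - median| / MAD exceeds z."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * np.abs(x - med) / mad > z

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 10_000), [50.0, -40.0]])  # two extremes
print(iqr_outliers(x).sum(), mad_outliers(x).sum())
```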

Machine‑Learning‑Based Methods

Distance/Density methods (ODIN, LOF, LOCI) – adaptive but costly.

Histogram/Parzen window density estimation – simple, no prior distribution needed.

Regression‑residual based detection – depends on accurate prior models.

PCA – handles high‑dimensional data, but choosing k is non‑trivial.

Matrix factorization – addresses missing data, limited to linear cases.

Autoencoder – non‑linear, self‑supervised, computationally heavy.

One‑Class SVM – kernel‑based, computationally heavy.

Isolation Forest & Robust Random Cut Forest – efficient unsupervised tree‑based methods, though threshold selection can be challenging (see the sketch below).
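As a hedged sketch of the last item using scikit‑learn's IsolationForest (the features, cluster parameters, and contamination value are all illustrative; in practice choosing contamination is exactly the threshold‑selection difficulty noted above):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# Per-user features, e.g. (order count, avg session seconds) -- illustrative.
normal_users = rng.normal(loc=[5, 300], scale=[2, 60], size=(10_000, 2))
bot_users = rng.normal(loc=[200, 10], scale=[20, 5], size=(50, 2))
X = np.vstack([normal_users, bot_users])

clf = IsolationForest(n_estimators=100, contamination=0.005, random_state=0)
labels = clf.fit_predict(X)  # -1 = outlier, 1 = inlier
print((labels == -1).sum(), "users flagged")
```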

Appendix

The relationship between experiment volatility and precision can be expressed using the variance of the average treatment effect (var(ATE)) and its expected value (E(ATE)). Larger var(ATE) reduces the t‑statistic, making results less likely to be statistically significant.
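In standard notation (a hedged reconstruction, assuming a two‑sample comparison with per‑group variances σ_t², σ_c² and group sizes n_t, n_c):

```latex
t = \frac{\widehat{\mathrm{ATE}}}{\sqrt{\operatorname{var}(\widehat{\mathrm{ATE}})}},
\qquad
\operatorname{var}(\widehat{\mathrm{ATE}}) = \frac{\sigma_t^{2}}{n_t} + \frac{\sigma_c^{2}}{n_c}
```

Any inflation of the per‑group variances by extreme users therefore shrinks |t| directly, which is why outlier handling and variance reduction raise experiment precision.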


Tags: statistics, A/B testing, experimental design, outlier detection, winsorize, trimming
Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.