How to Tackle Outliers in Internet A/B Experiments: Methods, Pitfalls, and Practical Tips
This article explores why outliers appear in large‑scale internet A/B tests, explains their impact on experiment precision, compares traditional trim and winsorize techniques, reviews a range of statistical and machine‑learning detection methods, and offers practical recommendations for handling them in product experiments.
Background
Experimenters often encounter unstable traffic allocation, large fluctuations in historical metrics after multiple splits, and inconsistent filtering rules across business scenarios, leading to unexpected results or even sign reversals when special users are removed.
These issues are typically caused by outliers in the data.
We will discuss how to handle outliers in internet A/B testing scenarios.
Concept Analysis
From a strict academic perspective, there is no universal definition of an outlier; different fields adopt various definitions and detection methods based on purpose and data characteristics. Generally, an outlier is a sample that differs markedly from the rest of the dataset.
The classic 3‑sigma rule treats any observation more than three standard deviations from the mean as an outlier, assuming a normal distribution. However, many internet metrics follow a power‑law distribution, making the 3‑sigma rule inappropriate (e.g., the top 1% of JD users generate most of the GMV, yet they are legitimate heavy users, not meaningful outliers).
The discussion below therefore focuses on large‑sample, power‑law scenarios typical of internet A/B experiments, drawing on statistical methods from sociology and economics as well as emerging algorithmic approaches to outlier detection and treatment.
Basic Causes of Outliers
Measurement errors during data collection (instrument error)
Individual differences within the population (sampling randomness)
Data fraud or cheating (e.g., fake orders)
Samples drawn from heterogeneous user groups (e.g., B‑side users in JD App)
1. Why Do A/B Experiments Need Outlier Treatment?
A small number of abnormal users can cause uneven traffic allocation between treatment and control groups.
Abnormal users often have extreme metric values, inflating overall variance, reducing experiment precision, and increasing the minimum detectable effect (MDE).
2. Limitations of Outlier Removal
Inability to identify business‑logic defined outliers or metric calculation errors.
Potential bias: removing abnormal users may also discard valid data, requiring larger sample sizes or variance‑reduction techniques such as ANCOVA or CUPED.
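As an illustration of the CUPED adjustment mentioned above, here is a minimal pure‑Python sketch. The data and the choice of a pre‑experiment version of the same metric as the covariate are illustrative assumptions, not the platform's actual implementation:

```python
from statistics import mean, pvariance

def cuped_adjust(y, x):
    """CUPED adjustment: y_cv = y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x) and x a pre-experiment covariate."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    theta = cov / pvariance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

# A metric strongly correlated with its pre-experiment value: the
# adjustment removes the predictable spread but keeps the mean intact.
pre  = [10, 12, 9, 14, 11, 13, 8, 15]
post = [11, 13, 10, 15, 12, 14, 9, 16]
adjusted = cuped_adjust(post, pre)
```

The stronger the correlation between the covariate and the metric, the larger the variance reduction; the adjustment never changes the group mean, so the effect estimate is preserved.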
Traditional Statistical Methods – Trim & Winsorize for Subsidy Experiments
1. What Are Trim and Winsorize?
Both originate from classic survey analysis. Winsorizing replaces values beyond a chosen percentile with the percentile value, while trimming discards those extreme values.
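A minimal sketch of the two operations in pure Python; the quantile handling in `_cutoff` and the example data are illustrative choices, not a prescribed implementation:

```python
def _cutoff(values, pct):
    """Empirical (1 - pct) quantile used as the upper cutoff."""
    s = sorted(values)
    return s[min(len(s) - 1, round(len(s) * (1 - pct)))]

def winsorize(values, pct=0.01):
    """Cap values above the (1 - pct) quantile at the quantile value."""
    cut = _cutoff(values, pct)
    return [min(v, cut) for v in values]

def trim(values, pct=0.01):
    """Drop values above the (1 - pct) quantile entirely."""
    cut = _cutoff(values, pct)
    return [v for v in values if v <= cut]

# 99 ordinary orders plus one extreme "whale" order.
orders = [1, 2, 3] * 33 + [500]
capped  = winsorize(orders, pct=0.05)
trimmed = trim(orders, pct=0.05)
```

On this toy data, trimming discards the whale order entirely while winsorizing keeps it at the capped value, which is why trimming moves the mean further at the same percentile.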
2. Data Performance
At the same percentile, trimming shifts the sample mean more than winsorizing, while winsorizing removes more variance for a comparable mean bias.
Plotting bias against variance reduction, winsorizing therefore generally offers the better trade‑off.
The two methods also shift Type I and Type II error rates in different ways.
3. Principle and Effect Comparison
If most samples are normal but dispersed, prefer Winsorizing to retain information.
If many dirty or cheating users exist, trimming can effectively suppress their influence.
When business input is limited, Winsorizing offers a conservative outlier‑handling approach.
4. Practical Recommendations
Tailor outlier filters to each metric and scenario, removing fraudulent or noisy data.
For key metrics, perform a one‑time analysis and, if effective, embed the procedure into the experimentation platform.
Risk‑Control Model Application
Example: On 2024‑09‑30, the control group showed an abnormal increase in average duration, distorting experiment observation.
Investigation revealed cheating users; after removing them via the risk‑control team, the anomaly disappeared. Server‑side reporting can also reduce exposure to abnormal users.
Introduction to Some Outlier Detection Methods
A common lightweight approach on experimentation platforms: when a metric's kurtosis exceeds a threshold, treat the top x% of extreme values as outliers, with the cutoff defined via quantiles.
Kurtosis measures tail heaviness; high kurtosis indicates that variance is driven by rare extreme deviations. Formula: γ₂ = μ₄ / σ⁴ − 3, where μ₄ is the fourth central moment and σ the standard deviation; the −3 offset makes a normal distribution score zero.
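The kurtosis‑gated capping described above can be sketched as follows; the threshold of 3 and the 2% cap are illustrative assumptions, not platform defaults:

```python
from statistics import mean

def excess_kurtosis(values):
    """γ₂ = μ₄ / σ⁴ − 3, using population moments; ≈ 0 for normal data."""
    m, n = mean(values), len(values)
    mu2 = sum((v - m) ** 2 for v in values) / n
    mu4 = sum((v - m) ** 4 for v in values) / n
    return mu4 / mu2 ** 2 - 3

def cap_if_heavy_tailed(values, kurt_threshold=3.0, pct=0.02):
    """Intervene only when the tail is heavy: if excess kurtosis exceeds
    the threshold, cap the top pct of values at the empirical quantile."""
    if excess_kurtosis(values) <= kurt_threshold:
        return list(values)
    s = sorted(values)
    cut = s[min(len(s) - 1, round(len(s) * (1 - pct)))]
    return [min(v, cut) for v in values]

smooth = list(range(100))            # light tail: left untouched
heavy  = list(range(99)) + [10_000]  # one extreme value: triggers capping
capped = cap_if_heavy_tailed(heavy)
```

Gating on kurtosis means well‑behaved metrics pass through unchanged, so the treatment only pays its bias cost where heavy tails actually inflate variance.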
When simple Z‑score fails, more sophisticated methods can be employed:
| Method | Basic Principle | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Boxplot | Flag points beyond 1.5 × IQR from the quartiles | Intuitive, easy | Poor adaptivity for some distributions |
| Z‑score | Standardize data, apply 3‑sigma or Chebyshev bounds | Simple, intuitive | Sensitive to outliers; distributional assumptions |
| Grubbs test | Test the most extreme value against the mean | Reduces variance impact | Limited to small samples |
| Variance contribution | Find the subset minimizing a variance‑based loss | Simple, intuitive | NP‑hard, requires heuristics |
| EM algorithm | Fit a mixture model, flag low‑density points | Handles threshold selection | Strong priors needed, heavy computation |
| Median absolute deviation (MAD) | Score by deviation from the median of absolute deviations | Robust to outliers | Poor adaptivity |
| ODIN/LOF/LOCI | Distance‑ or density‑based detection | Adaptive to local density | Computationally expensive |
| Histogram/Parzen density | Estimate probability density around points | No prior distribution needed | Bandwidth selection critical, costly |
| Regression residuals | Fit a prior model, flag large residuals | Intuitive when the model is reliable | Requires a strong prior; residuals themselves affected by outliers |
| PCA | Project to lower dimensions; a large residual norm indicates an outlier | Handles high‑dimensional data | Choosing k (and kernel, for kernel PCA) is non‑trivial |
| Matrix factorization | Low‑rank approximation, flag large residuals | Handles missing data | Limited to linear relationships |
| Autoencoder | Score by neural‑net reconstruction error | Non‑linear, unsupervised | High computation |
| One‑Class SVM | Soft‑margin separation of the normal class | Kernel‑based non‑linearity | Expensive kernel computation |
| Isolation Forest | Tree depth as anomaly score | Efficient; can use kurtosis for feature selection | Threshold selection difficult |
| Robust Random Cut Forest | Isolation Forest variant with range‑based feature sampling | More efficient feature selection | Threshold selection difficult |
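As one worked example from the table, the median‑absolute‑deviation detector is simple to implement. The 3.5 cutoff and the 1.4826 consistency factor are conventional choices, not values prescribed by this article, and the sample data are made up:

```python
from statistics import median

def mad_outliers(values, cutoff=3.5, scale=1.4826):
    """Flag points whose robust z-score |x - median| / (scale * MAD)
    exceeds the cutoff; the 1.4826 factor makes MAD comparable to the
    standard deviation under normality."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # degenerate case: more than half the values identical
        return []
    return [v for v in values if abs(v - med) / (scale * mad) > cutoff]

durations = [10, 11, 9, 10, 12, 10, 11, 500]  # one runaway session
```

Because both the center and the spread are medians, a single extreme value cannot mask itself by inflating the scale, which is exactly the weakness of the plain Z‑score listed above.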
Appendix
The relationship between experiment volatility and precision: with group means x̄_t and x̄_c, sample variances s_t² and s_c², and sample sizes n_t and n_c, the test statistic is T = (x̄_t − x̄_c) / √(s_t²/n_t + s_c²/n_c).
When var(ATE) increases while the effect E(ATE) stays constant, the T‑statistic shrinks and the result is less likely to be significant. Reducing var(ATE), by lowering sample variance or increasing sample size, improves precision.
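A small numeric illustration of this relationship, with made‑up data: two experiment pairs share the same average treatment effect, and only the extra spread in the noisier pair shrinks the t‑statistic.

```python
from math import sqrt
from statistics import mean, variance

def t_stat(treatment, control):
    """Two-sample t-statistic:
    (mean_t - mean_c) / sqrt(s_t^2/n_t + s_c^2/n_c)."""
    return (mean(treatment) - mean(control)) / sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )

# Identical average treatment effect (+1) in both pairs; only the
# within-group variance differs.
tight_t, tight_c = [11, 12, 13, 11, 12, 13], [10, 11, 12, 10, 11, 12]
wide_t,  wide_c  = [6, 12, 18, 5, 13, 18],  [5, 11, 17, 4, 12, 17]
```

This is why outlier treatment and variance reduction improve precision even when they leave the effect estimate itself untouched.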
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.