Metric Anomaly Detection and Diagnosis Practices at NetEase Yanxuan
This article presents NetEase Yanxuan's end‑to‑end approach for automatically detecting and diagnosing metric anomalies in e‑commerce, covering background motivation, three types of anomalies, statistical detection frameworks (GESD, volatility, Mann‑Kendall), post‑processing, contribution‑decomposition methods, dimension‑explosion challenges, optimization techniques, and a brief Q&A.
Introduction
Metrics are critical for business health; rapid, accurate anomaly detection helps identify and resolve issues promptly. NetEase Yanxuan shares its practice of building an automated, user-independent, universal, timely, and accurate metric anomaly detection and diagnosis system.
Background
With fast-moving e-commerce logic and a growing number of diverse metrics, manual threshold setting is error-prone and costly. The goal is an automated solution that requires no manual rule definition, works across varied metric distributions, operates at day- and hour-level granularity, and provides precise, proactive alerts.
Metric Anomaly Detection
Absolute value anomalies: outliers that deviate from the inherent distribution.
Volatility anomalies: sudden spikes or drops in period‑over‑period changes.
Trend anomalies: long‑term upward or downward shifts indicating potential risks.
The detection framework is unsupervised and statistical. Absolute value anomalies use the GESD (generalized extreme Studentized deviate) test: at each iteration it computes a statistic R_i, the largest absolute deviation from the sample mean in units of the sample standard deviation, removes that most extreme sample, and compares R_i to a critical value λ_i derived from the t-distribution. Volatility anomalies are found by locating inflection points in the volatility curve via second-order derivatives. Trend anomalies employ the non-parametric Mann-Kendall test, which calculates the statistic S and converts it to a Z-score to assess significance.
Post-processing steps reduce false alarms: (1) filtering out anomalies that are artifacts of the data, where a fluctuation in the previous period distorts the current period-over-period comparison, and (2) suppressing alerts during known large-scale promotions, where anomalies are expected.
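The two post-processing rules could be sketched roughly as follows. This is one possible reading, not the system's actual logic: the alert dictionary layout, the "volatility" label, the one-day look-back, and the promotion calendar are all hypothetical.

```python
from datetime import date, timedelta


def in_promotion(day, promo_windows):
    """True if day falls inside any (start, end) promotion window."""
    return any(start <= day <= end for start, end in promo_windows)


def post_process(alerts, promo_windows, flagged_days):
    """Drop alerts that are expected or are echoes of an earlier anomaly.

    alerts: list of dicts with hypothetical keys 'day' and 'kind'.
    flagged_days: set of days already flagged as absolute value anomalies.
    """
    kept = []
    for alert in alerts:
        day = alert["day"]
        # Rule 2: anomalies during known promotions are expected
        if in_promotion(day, promo_windows):
            continue
        # Rule 1 (assumed reading): a volatility alert right after an
        # anomalous day is likely just the metric returning to normal
        previous_day = day - timedelta(days=1)
        if alert["kind"] == "volatility" and previous_day in flagged_days:
            continue
        kept.append(alert)
    return kept
```

The look-back would follow the metric's granularity, so an hour-level metric would compare against the previous hour instead of the previous day.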
Metric Anomaly Diagnosis
Deterministic inference: clear, white‑box conclusions.
Probabilistic inference: machine‑learning regression, SHAP values, or Bayesian networks (limited interpretability).
Speculative inference: expert judgment (out of scope).
Diagnosis is organized into these three layers (deterministic, probabilistic, and speculative), each matched with appropriate methods.
Contribution Decomposition
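Three formulas decompose the change in a target metric Y (e.g., GMV) into contributions from its sub-metrics X_i, depending on how Y is composed: additive (Y is a sum; contribution ΔX_i / Y₀), multiplicative (Y is a product; LMDI, the logarithmic mean Divisia index method), and divisional (Y is a ratio; separating volatility and structural effects). In each case the contributions sum to the total change in Y, so the decomposition is MECE (mutually exclusive, collectively exhaustive) and enables precise root-cause attribution.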
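A minimal sketch of the three decompositions follows. The additive and LMDI forms are the standard ones; the ratio split into a rate effect and a structure effect is one common formulation and may differ from the exact formula used in the talk. All function and key names are illustrative.

```python
import math


def additive_contributions(x0, x1):
    """Y = sum(X_i): contribution of X_i to Y's growth rate is dX_i / Y0."""
    y0 = sum(x0.values())
    return {k: (x1[k] - x0[k]) / y0 for k in x0}


def log_mean(a, b):
    """Logarithmic mean L(a, b) = (a - b) / (ln a - ln b), with L(a, a) = a."""
    return a if a == b else (a - b) / (math.log(a) - math.log(b))


def lmdi_contributions(x0, x1):
    """Y = prod(X_i), all factors positive: LMDI share of Y's growth rate."""
    y0 = math.prod(x0.values())
    y1 = math.prod(x1.values())
    weight = log_mean(y1, y0)
    return {k: weight * math.log(x1[k] / x0[k]) / y0 for k in x0}


def ratio_contributions(num0, den0, num1, den1):
    """Y = sum(num) / sum(den): per-segment split of the absolute change in Y.

    'rate' is the change in the segment's own ratio at its old weight;
    'structure' is the change in the segment's share of the denominator.
    """
    d0, d1 = sum(den0.values()), sum(den1.values())
    out = {}
    for k in num0:
        w0, w1 = den0[k] / d0, den1[k] / d1
        r0, r1 = num0[k] / den0[k], num1[k] / den1[k]
        out[k] = {"rate": w0 * (r1 - r0), "structure": (w1 - w0) * r1}
    return out
```

For example, with GMV modeled as visitors × conversion rate × average order value, lmdi_contributions returns one share per factor, and the shares add up to the relative change in GMV. In the ratio case, a conversion rate can fall even when every channel's own rate is unchanged, purely because traffic shifted toward a lower-converting channel; that shows up entirely in the structure terms.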
Dimension‑Explosion Problem & Optimizations
Generating intermediate tables for every possible dimension combination leads to exponential storage and compute costs. Optimizations include:
Aggregating contributions on‑the‑fly instead of materializing all intermediate tables.
Pruning dimension combinations using business‑driven limits and hierarchical grouping.
Ranking dimensions by a Gini‑coefficient‑based metric to select the most informative splits.
These steps reduce space complexity dramatically while preserving diagnostic accuracy.
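One way to read the Gini-based ranking above: for each dimension, compute the Gini coefficient of the absolute contributions of its values, and prefer dimensions where the contribution is concentrated in a few values. The sketch below follows that reading; the scoring rule and the input layout are assumptions, not the exact metric used at Yanxuan.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = evenly spread)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def rank_dimensions(contributions_by_dimension):
    """Sort dimensions by how concentrated their contributions are.

    contributions_by_dimension maps a dimension name to a dict of
    {dimension value: contribution to the metric change}.
    """
    scores = {
        dim: gini([abs(c) for c in contribs.values()])
        for dim, contribs in contributions_by_dimension.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A dimension whose values all contribute about equally scores near 0 and explains little, while a dimension where a single value carries most of the change scores high and is a natural first split.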
QA
Q1: Accuracy is evaluated via deterministic diagnosis validation and business-level bad-case collection.
Q2: Additive and multiplicative decompositions can be mixed using a greedy search that selects the next best dimension based on contribution drop, with practical preference for additive methods in the NetEase Yanxuan scenario.
Overall, the presented system demonstrates a scalable, statistically grounded solution for metric anomaly detection and root‑cause analysis in large‑scale e‑commerce environments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.