
Hybrid Learning Beats Thresholds: Anomaly Detection for Millions of KPI Curves

This article recounts the author's work, from 2017 onward, on building an intelligent operations platform at Tencent. It covers the main obstacles (legacy manual thresholds, a shortage of AIOps talent, and the lack of a mature framework) and explains how a two-stage hybrid of unsupervised and supervised models was devised to detect anomalies automatically across millions of KPI time series, supporting scalable root-cause analysis and cost optimization.

Efficient Ops

Preface

The author first engaged with operations projects in August 2017, encountering business pain points such as Monitor time‑series anomaly detection, Hubble root‑cause analysis, ROOT system source analysis, fault troubleshooting, and cost optimization.

With a shortage of AIOps personnel and little academic research on these techniques, applying machine learning to operations became a major challenge.

Intelligent operations was only beginning to emerge in 2017; before that, most operations work was manual or DevOps-based. The team building an in-house intelligent operations system faced three core difficulties: heavy legacy baggage, a shortage of AIOps talent, and the absence of a mature framework.

One Person Handles a Million Curves

No ready-made external technology was available, so the team relied on in-house development, working closely with business operations and development staff. The first intelligent-ops project was Hubble's multidimensional drill-down analysis, which pinpoints the cause of a success-rate drop by breaking the metric down along dimensions such as carrier, province, and device.

Research yielded several reference papers and resulted in a document titled “Exploration of Root‑Cause Analysis”. The same methodology can be applied to BI scenarios like diagnosing DAU or revenue declines.
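The drill-down idea can be illustrated with a small sketch. This is not Hubble's actual algorithm, which the article does not describe; the function name `drill_down`, the record layout, and the "excess failures" score are all assumptions made for illustration. The sketch compares each dimension value's success rate before and after a drop and ranks the slices by how many additional failures they account for.

```python
from collections import defaultdict

def drill_down(records, dimensions):
    """Rank (dimension, value) slices by the extra failures they contribute.

    records: dicts holding the dimension keys plus 'period'
             ('before' or 'after'), 'total', and 'success'.
    The scoring rule is a hypothetical stand-in, not the source's method.
    """
    # Sum [total, success] per (dimension, value) and per period.
    agg = defaultdict(lambda: {"before": [0, 0], "after": [0, 0]})
    for r in records:
        for dim in dimensions:
            cell = agg[(dim, r[dim])][r["period"]]
            cell[0] += r["total"]
            cell[1] += r["success"]

    ranked = []
    for (dim, value), periods in agg.items():
        (t0, s0), (t1, s1) = periods["before"], periods["after"]
        if t0 == 0 or t1 == 0:
            continue  # no baseline or no current traffic for this slice
        rate_before, rate_after = s0 / t0, s1 / t1
        # Failures in the "after" period beyond what the old rate predicts.
        excess_failures = (rate_before - rate_after) * t1
        ranked.append((dim, value, rate_before, rate_after, excess_failures))
    ranked.sort(key=lambda row: row[4], reverse=True)
    return ranked

# Made-up sample: carrier "B" degrades in both provinces.
rows = [
    ("A", "P1", "before", 1000, 990), ("A", "P1", "after", 1000, 985),
    ("B", "P1", "before", 1000, 990), ("B", "P1", "after", 1000, 700),
    ("A", "P2", "before", 1000, 990), ("A", "P2", "after", 1000, 988),
    ("B", "P2", "before", 1000, 990), ("B", "P2", "after", 1000, 710),
]
keys = ("carrier", "province", "period", "total", "success")
records = [dict(zip(keys, row)) for row in rows]
ranking = drill_down(records, ["carrier", "province"])
for dim, value, before, after, excess in ranking:
    print(f"{dim}={value}: {before:.3f} -> {after:.3f}, excess failures {excess:.0f}")
```

On this sample the carrier slice "B" ranks first, ahead of either province, which matches the intuition that the drop is a carrier problem rather than a regional one. The same ranking could be applied to a BI metric such as DAU by swapping the success rate for the metric of interest.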

Monitor’s original anomaly detection relied on manually set thresholds (maximum, minimum, volatility), leading to low accuracy, insufficient coverage, and high labor cost.
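A minimal sketch of this kind of rule-based detection is shown here, assuming that "volatility" means the relative change from the previous point. The function name `threshold_alerts` and all parameter values are hypothetical and are not taken from Monitor.

```python
def threshold_alerts(values, upper, lower, max_change):
    """Return indices that break a maximum, minimum, or volatility rule."""
    alerts = []
    for i, v in enumerate(values):
        out_of_range = v > upper or v < lower
        # Volatility rule: relative change versus the previous point.
        jump = False
        if i > 0 and values[i - 1] != 0:
            jump = abs(v - values[i - 1]) / abs(values[i - 1]) > max_change
        if out_of_range or jump:
            alerts.append(i)
    return alerts

# Two curves with the same shape but different scales (made-up data).
curve_a = [100, 102, 98, 101, 150, 100]
curve_b = [1000, 1020, 980, 1010, 1500, 1000]

# Thresholds tuned by hand for curve_a.
alerts_a = threshold_alerts(curve_a, upper=120, lower=80, max_change=0.5)
alerts_b = threshold_alerts(curve_b, upper=120, lower=80, max_change=0.5)
print(alerts_a)  # only the spike at index 4
print(alerts_b)  # every point, because the scale differs
```

Thresholds that work for one curve flag every point of another, so each curve needs its own hand-tuned values. That per-curve tuning is where the labor cost and coverage gaps come from.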

Manually configuring thresholds for millions of curves is impractical, and even with weekly on-call rotations, issues were still missed. The algorithms commonly considered as alternatives included ARIMA, RNN/LSTM, and Facebook's Prophet.

These algorithms typically model a single, relatively stable series, which does not suit the highly diverse KPI curves in practice.

After extensive research, a hybrid solution was proposed: an unsupervised first layer screens the curves and flags most of the suspect points, and a supervised second layer then re-examines those candidates to improve precision and recall.

Because the curves vary widely, no single model fits them all. An ensemble approach was adopted instead: the outputs of multiple models serve as features for a final classifier, which makes it possible for one person to handle a million curves without setting manual thresholds.
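The two-stage structure can be sketched as follows. The article does not specify which detectors or classifier were used, so everything here is an assumption: three simple statistical scores stand in for the "multiple models", and a 1-nearest-neighbour rule stands in for the final classifier purely to keep the sketch dependency-free. All function names and data are hypothetical.

```python
import math
import statistics

def detector_features(values, i, window=4):
    """Outputs of three simple unsupervised detectors for point i (i >= window)."""
    history = values[i - window:i]
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    sigma_score = abs(values[i] - mean) / (std + 1e-9)    # n-sigma style score
    jump = abs(values[i] - values[i - 1])                  # first difference
    rel_dev = abs(values[i] - mean) / (abs(mean) + 1e-9)   # relative deviation
    return [sigma_score, jump, rel_dev]

def stage_one(values, window=4, loose_sigma=2.0):
    """Unsupervised layer: a loose rule that keeps only suspicious points."""
    return [i for i in range(window, len(values))
            if detector_features(values, i, window)[0] > loose_sigma]

def build_examples(values, labels, window=4):
    """Turn operator labels {index: 0 or 1} into (features, label) pairs."""
    return [(detector_features(values, i, window), y) for i, y in labels.items()]

def stage_two(features, examples):
    """Supervised layer: label of the nearest labelled example (1-NN stand-in).

    Features are left unscaled for brevity; a real system would normalise them.
    """
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

def detect(values, examples, window=4):
    """Run both layers and return the indices confirmed as anomalies."""
    return [i for i in stage_one(values, window)
            if stage_two(detector_features(values, i, window), examples) == 1]

# A labelled historical curve: index 4 is harmless wobble, index 8 is a real spike.
history_curve = [50, 50, 51, 50, 52, 51, 50, 51, 90, 50]
examples = build_examples(history_curve, {4: 0, 8: 1})

# A new, unlabelled curve on a different scale.
new_curve = [100, 100, 101, 100, 102, 101, 100, 101, 160, 100]
candidates = stage_one(new_curve)
confirmed = detect(new_curve, examples)
print(candidates, confirmed)
```

On this toy data the first layer flags both the small wobble at index 4 and the spike at index 8, and the second layer keeps only the spike. The point of the split is that the loose unsupervised rule needs no per-curve configuration, while the labelled examples teach the second layer which of its alerts matter.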

What Is a Time Series

A time series is a sequence of statistical measurements ordered by time. Forecasting predicts future values based on observed trends, while anomaly detection identifies points that deviate significantly from normal patterns.

The author prepared a slide deck covering models such as the Moving Average, the Exponentially Weighted Moving Average, and control chart theory, along with the Opprentice system (Random Forest) for supervised anomaly detection and feature extraction with tsfresh.
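For readers unfamiliar with the simpler models in that list, here is a minimal sketch of a moving average, an exponentially weighted moving average, and a control-chart check. The function names, the smoothing factor, and the choice of a 3-sigma limit are illustrative assumptions and are not taken from the author's deck.

```python
import statistics

def moving_average(values, window):
    """Mean of each run of `window` consecutive points."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def ewma(values, alpha):
    """Exponentially weighted moving average.

    s_0 = x_0, then s_t = alpha * x_t + (1 - alpha) * s_(t-1).
    A larger alpha reacts faster to recent points.
    """
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def control_chart_outliers(values, baseline, k=3.0):
    """Indices of points outside mean +/- k * std of a baseline sample."""
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]

print(moving_average([1, 2, 3, 4, 5], window=3))   # [2.0, 3.0, 4.0]
print(ewma([10, 20, 30], alpha=0.5))               # [10, 15.0, 22.5]

baseline = [10, 12, 8, 10, 11, 9]
print(control_chart_outliers([10, 11, 25, 9, 13], baseline))  # [2]
```

Each of these produces a score or a flag from the curve alone, with no labels, which is why models of this kind are natural candidates for an unsupervised first layer.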

The author will present these techniques at the GOPS Global Operations Conference, covering time‑series anomaly detection, multidimensional root‑cause analysis, and ROOT system source analysis.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
