
How Tencent Leverages AI to Simplify Massive-Scale Service Monitoring and Root‑Cause Analysis

Tencent's SNG social platform team tackles billion‑scale traffic by integrating AI‑driven anomaly detection, multi‑dimensional monitoring, and decision‑tree based root‑cause analysis, turning complex backend architectures and massive alert volumes into streamlined, actionable insights for faster issue resolution.


Massive Business Challenges

Internet services compete on performance, reliability, and speed of delivery. Tencent's SNG social platform now serves billions of users and sustains success rates above 99.9%, but it faces a historically complex backend architecture, enormous alert volumes, and the need for precise, threshold-free alarm handling.

AI Response

The operations team adopted AI techniques to address these challenges, building a system called “ZhiYan” that provides KPI anomaly detection, intelligent multi‑dimensional root‑cause analysis, and alarm correlation to improve developer and ops efficiency.

ZhiYun Monitoring System

Before introducing the AI components, it helps to outline the existing ZhiYun monitoring framework.

Monitoring Types

Active monitoring – instrumentation in components or business code reports data to a central system (e.g., host status, module call monitoring, full‑link tracing).

Passive monitoring – no instrumentation; external probes simulate client requests and report results (e.g., return‑code monitoring, H5 speed tests).

Bypass monitoring – monitors without touching the service, such as public sentiment monitoring.

Data Sources

The system ingests data from three sources: multi-dimensional analysis (ZhiYun Hubble), module call analysis (ZhiYun ModCall), and business life-dead point monitoring (ZhiYun DLP).

Multi‑Dimensional Analysis

Client‑side logical instrumentation reports data to a unified interface, which is aggregated into multi‑dimensional statistics covering:

Basic dimensions such as province, ISP, network type, platform (Android/iOS), client version.

Business‑specific dimensions (e.g., live‑stream host vs. audience, command IDs).

Metric data such as return codes and latency.

Module Call Chain Analysis

The call chain monitors product, region, module, caller, callee, and interface, tracking success rate (based on return codes) and latency. Alerts are generated when a success‑rate threshold is breached for five consecutive minutes.
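The "breached for five consecutive minutes" rule described above can be sketched as a simple run-length check over per-minute success rates. Function and variable names here are illustrative, not from the ZhiYun codebase:

```python
def consecutive_breach_alert(success_rates, threshold=0.999, window=5):
    """Return the minute indices at which an alert fires: the success
    rate has stayed below `threshold` for `window` consecutive minutes."""
    alerts = []
    run = 0  # length of the current run of sub-threshold minutes
    for i, rate in enumerate(success_rates):
        run = run + 1 if rate < threshold else 0
        if run >= window:
            alerts.append(i)
    return alerts

# Five straight minutes below 99.9% trigger an alert at minute index 6.
rates = [0.9995, 0.9991, 0.998, 0.997, 0.998, 0.996, 0.995, 0.9992]
print(consecutive_breach_alert(rates))  # [6]
```

Requiring a full consecutive window trades a few minutes of detection latency for far fewer false alarms on momentary dips.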

Business Life‑Dead Point Monitoring (DLP)

DLP monitors critical business indicators; alerts must be resolved immediately. A DLP metric can involve multiple module calls and other monitoring data. The system uses a 3‑sigma strategy (no fixed threshold) to flag anomalies and applies a half‑hour convergence window.

Technical Implementation

The anomaly detection module applies a 3‑sigma variant with a sliding‑window (5‑minute) average to smooth spikes. An alarm is raised after three consecutive abnormal points, with further convergence to reduce noise.
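A minimal sketch of this detector, assuming a fixed historical baseline (mean and standard deviation) rather than ZhiYun's exact baseline computation, which the article does not detail:

```python
import statistics

def smooth(series, window=5):
    """Sliding-window (trailing) mean to damp momentary spikes."""
    return [statistics.mean(series[max(0, i - window + 1): i + 1])
            for i in range(len(series))]

def three_sigma_alarms(series, history_mean, history_std, consecutive=3):
    """Raise an alarm after `consecutive` smoothed points fall outside
    the historical mean ± 3σ band; fire once per abnormal episode."""
    smoothed = smooth(series)
    lo = history_mean - 3 * history_std
    hi = history_mean + 3 * history_std
    alarms, run = [], 0
    for i, value in enumerate(smoothed):
        run = run + 1 if (value < lo or value > hi) else 0
        if run == consecutive:
            alarms.append(i)
    return alarms
```

For example, with a baseline of mean 100 and σ 1, a sustained drop to 90 crosses the 3σ lower bound only after the smoothed value catches up, and the alarm fires on the third consecutive abnormal point; the smoothing plus the three-point rule is what suppresses one-off spikes.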

Root‑cause analysis uses decision‑tree models. After feature engineering (one‑hot encoding of categorical dimensions), the tree selects splits that maximize Gini impurity reduction, directly indicating the dimension where anomalies concentrate.

Feature engineering steps include one‑hot encoding of command IDs, expanding eight original dimensions to thousands of binary features. Positive and negative samples are balanced (1:1) to avoid bias.

Data sampling restores original counts (e.g., city‑level user counts) before feeding the model.
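The feature-engineering and training steps above can be sketched with pandas and scikit-learn. The column names and the downsampling scheme are assumptions for illustration; the article specifies only one-hot encoding, 1:1 class balancing, and a Gini-based tree:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def fit_rca_tree(df, label_col="is_abnormal"):
    """Fit a Gini decision tree on one-hot encoded request dimensions.
    `df` holds one row per (restored) request with categorical columns
    such as province, isp, version, cmd_id, and a 0/1 abnormality label."""
    X = pd.get_dummies(df.drop(columns=[label_col]))  # expand dims to binary features
    y = df[label_col]
    # Balance positive and negative samples 1:1 by downsampling the majority class.
    n = y.value_counts().min()
    idx = pd.concat([df[y == c].sample(n, random_state=0)
                     for c in y.unique()]).index
    tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
    tree.fit(X.loc[idx], y.loc[idx])
    return tree, X.columns
```

Because each one-hot column corresponds to a single dimension value, the feature chosen at each split directly names a candidate root-cause dimension (e.g., a specific command ID).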

Root Cause Analysis

The decision tree outputs node statistics (negative/positive sample counts). Three metrics are derived:

Abnormal aggregation rate = negative / (positive + negative) for a node.

Abnormal detection rate = negative / total negative samples.

Weighted F1‑like score combining the two.

Nodes with highest scores are traced back to the root to produce a root‑cause path.
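The three node metrics can be computed directly from per-node sample counts. The `beta` parameter and the dictionary-based interface are assumptions; the article describes only a weighted F1-like combination of the two rates:

```python
def score_nodes(node_stats, total_negative, beta=1.0):
    """Score tree nodes from their (negative, positive) sample counts.
    Combines the abnormal aggregation rate (precision-like) and the
    abnormal detection rate (recall-like) into an F-style score, and
    returns the best-scoring node plus all scores."""
    scores = {}
    for node_id, (neg, pos) in node_stats.items():
        agg = neg / (neg + pos) if neg + pos else 0.0        # aggregation rate
        det = neg / total_negative if total_negative else 0.0  # detection rate
        denom = beta**2 * agg + det
        scores[node_id] = (1 + beta**2) * agg * det / denom if denom else 0.0
    return max(scores, key=scores.get), scores
```

A node where 90 of 100 abnormal samples land, and which contains few normal samples, scores near 0.9 and becomes the starting point for tracing the root-cause path back to the tree root.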

Association Analysis

Two aspects are covered:

Correlation between DLP alerts and other alarms.

Correlation between regular module‑call alerts.

Association rules are derived from historical alarm data. For module-call alerts, a rule A→B is created when A and B co-occur at least 30 times and the confidence exceeds 80%. For DLP alerts, the support threshold is lowered to ≥3 because DLP alarms are rare; the resulting rules often reach confidence near 100%.

When a real‑time alarm matches a rule, an alarm chain is constructed, helping operators visualize the propagation path.

Summary

The AI‑driven system monitors KPI curves, detects a 0.24 % success‑rate drop, pinpoints the anomaly to disk‑related modules via decision‑tree root‑cause analysis, and correlates alarms to isolate the underlying issue (e.g., a cloud‑storage 2.0 fetch problem). This end‑to‑end pipeline shortens incident‑resolution time across client, access, and logic layers.

Tags: Monitoring, AI, Operations, Anomaly Detection, Decision Tree, Root Cause Analysis
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you throughout your operations career as we grow together.
