How Tencent Leverages AI to Simplify Massive-Scale Service Monitoring and Root‑Cause Analysis
Tencent's SNG social platform team tackles billion‑scale traffic by integrating AI‑driven anomaly detection, multi‑dimensional monitoring, and decision‑tree based root‑cause analysis, turning complex backend architectures and massive alert volumes into streamlined, actionable insights for faster issue resolution.
Massive Business Challenges
Internet services demand extreme performance, strong user reputation, and rapid iteration. Tencent's SNG social platform now serves billions of users and sustains success rates above 99.9%, but it faces a historically complex backend architecture, huge alert volumes, and the need for precise, threshold-free alarm handling.
AI Response
The operations team adopted AI techniques to address these challenges, building a system called “ZhiYan” that provides KPI anomaly detection, intelligent multi‑dimensional root‑cause analysis, and alarm correlation to improve developer and ops efficiency.
ZhiYun Monitoring System
Before introducing the new system, it is worth reviewing the existing ZhiYun monitoring framework.
Monitoring Types
Active monitoring – instrumentation in components or business code reports data to a central system (e.g., host status, module call monitoring, full‑link tracing).
Passive monitoring – no instrumentation; external probes simulate client requests and report results (e.g., return‑code monitoring, H5 speed tests).
Bypass monitoring – monitors without touching the service, such as public sentiment monitoring.
Data Sources
The system ingests data from ZhiYun's multi-dimensional analysis (ZhiYun Hubble), module call analysis (ZhiYun ModCall), and business life-or-death point monitoring (ZhiYun DLP).
Multi‑Dimensional Analysis
Client‑side logical instrumentation reports data to a unified interface, which is aggregated into multi‑dimensional statistics covering:
Basic dimensions such as province, ISP, network type, platform (Android/iOS), client version.
Business‑specific dimensions (e.g., live‑stream host vs. audience, command IDs).
Metric data such as return codes and latency.
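The aggregation described above can be sketched as follows. This is a minimal illustration, not Tencent's actual schema: all field names (`province`, `isp`, `cmd_id`, `ret_code`, and so on) are hypothetical stand-ins for the dimensions and metrics the article lists.

```python
from collections import defaultdict

# Hypothetical raw reports from client-side instrumentation; the field
# names are illustrative, not the real reporting protocol.
reports = [
    {"province": "Guangdong", "isp": "CMCC", "platform": "Android",
     "version": "7.1.0", "cmd_id": 1001, "ret_code": 0, "latency_ms": 82},
    {"province": "Guangdong", "isp": "CMCC", "platform": "Android",
     "version": "7.1.0", "cmd_id": 1001, "ret_code": -103, "latency_ms": 950},
]

# Aggregate into per-dimension-combination statistics: request count,
# failure count (non-zero return code), and summed latency for averaging.
stats = defaultdict(lambda: {"requests": 0, "failures": 0, "latency_sum": 0})
for r in reports:
    key = (r["province"], r["isp"], r["platform"], r["version"], r["cmd_id"])
    s = stats[key]
    s["requests"] += 1
    s["failures"] += int(r["ret_code"] != 0)
    s["latency_sum"] += r["latency_ms"]
```

Each key is one combination of dimension values; the multi-dimensional analysis later slices these statistics along any subset of the dimensions.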
Module Call Chain Analysis
The call chain monitors product, region, module, caller, callee, and interface, tracking success rate (based on return codes) and latency. Alerts are generated when a success‑rate threshold is breached for five consecutive minutes.
Business Life-or-Death Point Monitoring (DLP)
DLP monitors critical business indicators; alerts must be resolved immediately. A DLP metric can involve multiple module calls and other monitoring data. The system uses a 3‑sigma strategy (no fixed threshold) to flag anomalies and applies a half‑hour convergence window.
Technical Implementation
The anomaly detection module applies a 3‑sigma variant with a sliding‑window (5‑minute) average to smooth spikes. An alarm is raised after three consecutive abnormal points, with further convergence to reduce noise.
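A minimal sketch of that detection logic follows. The history length, smoothing window, and sigma multiplier are assumptions for illustration; only the 5-minute smoothing, the 3-sigma comparison, and the three-consecutive-points rule come from the article.

```python
import statistics

def detect_anomalies(series, history_len=60, smooth=5, k=3, run=3):
    """3-sigma variant: smooth each point with a sliding 5-minute average,
    compare it against mean +/- k*stddev of the preceding history, and
    raise an alarm only after `run` consecutive abnormal points.
    history_len and k are illustrative parameter choices."""
    alarms = []
    consecutive = 0
    for i in range(history_len, len(series)):
        history = series[i - history_len:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        # Sliding-window average smooths out one-point spikes.
        smoothed = statistics.mean(series[max(0, i - smooth + 1):i + 1])
        abnormal = abs(smoothed - mu) > k * sigma
        consecutive = consecutive + 1 if abnormal else 0
        if consecutive >= run:
            alarms.append(i)
    return alarms
```

A flat series produces no alarms; a sustained drop trips the detector only after the required run of abnormal points, which is the convergence the article describes.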
Root‑cause analysis uses decision‑tree models. After feature engineering (one‑hot encoding of categorical dimensions), the tree selects splits that maximize Gini impurity reduction, directly indicating the dimension where anomalies concentrate.
Feature engineering steps include one‑hot encoding of command IDs, expanding eight original dimensions to thousands of binary features. Positive and negative samples are balanced (1:1) to avoid bias.
Because the input arrives pre-aggregated, sampling expands each record back to its original count (e.g., city-level user counts) before the data is fed to the model.
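The core of the tree-building step is choosing the one-hot feature whose split most reduces Gini impurity. A pure-Python sketch of that selection (the feature names are hypothetical examples of the encoded dimensions):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = anomalous sample)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1 - p) * (1 - p)

def best_split(samples, labels):
    """Pick the one-hot feature whose split yields the largest Gini
    impurity reduction. `samples` is a list of dicts mapping binary
    feature names (e.g. 'cmd_id=1001') to 0/1 values."""
    parent = gini(labels)
    best_feature, best_reduction = None, 0.0
    for f in samples[0]:
        left = [y for x, y in zip(samples, labels) if x[f] == 1]
        right = [y for x, y in zip(samples, labels) if x[f] == 0]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - weighted > best_reduction:
            best_feature, best_reduction = f, parent - weighted
    return best_feature, best_reduction
```

If all anomalous samples share `cmd_id=1001` while provinces are mixed, the split on `cmd_id=1001` wins, which is exactly how the tree surfaces the dimension where anomalies concentrate. In practice this is what a library decision tree (e.g. scikit-learn with `criterion="gini"`) does at every node.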
Root Cause Analysis
The decision tree outputs node statistics (negative/positive sample counts). Three metrics are derived:
Abnormal aggregation rate = negative / (positive + negative) for a node.
Abnormal detection rate = negative / total negative samples.
Weighted F1‑like score combining the two.
Nodes with highest scores are traced back to the root to produce a root‑cause path.
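The three metrics above can be computed per node as follows. The F-beta-style combination is a plausible reconstruction of the "weighted F1-like score"; the article does not give the exact formula, so `beta` is an assumed weighting parameter.

```python
def node_scores(nodes, total_negative, beta=1.0):
    """Score tree nodes by how well anomalies concentrate in them.
    `nodes` maps node id -> (negative_count, positive_count), where
    'negative' means anomalous, following the article's convention.
    The F-beta combination is an assumed reconstruction."""
    scores = {}
    for node_id, (neg, pos) in nodes.items():
        aggregation = neg / (neg + pos) if neg + pos else 0.0      # node purity
        detection = neg / total_negative if total_negative else 0.0  # anomaly coverage
        if aggregation + detection == 0:
            scores[node_id] = 0.0
        else:
            scores[node_id] = ((1 + beta**2) * aggregation * detection
                               / (beta**2 * aggregation + detection))
    return scores
```

A node that is both pure (mostly anomalous samples) and covers most anomalies scores highest; walking from that node back to the root yields the root-cause path.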
Association Analysis
Two aspects are covered:
Correlation between DLP alerts and other alarms.
Correlation between regular module‑call alerts.
Association rules are derived from historical alarm data. For module-call alerts, a rule A→B is created when A and B co-occur at least 30 times and the confidence exceeds 80%. For DLP alerts, the support threshold is lowered to 3 because such alerts are rare; their confidence is often near 100%.
When a real‑time alarm matches a rule, an alarm chain is constructed, helping operators visualize the propagation path.
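A minimal sketch of the rule mining, assuming historical alarms have already been grouped into time-window batches (the windowing itself is not described in the article and is an assumption here). The support and confidence thresholds match the module-call figures; for rare DLP alerts, `min_support` would be lowered to 3.

```python
from collections import Counter
from itertools import combinations

def mine_rules(alarm_batches, min_support=30, min_confidence=0.8):
    """Mine A -> B rules from historical alarm data.
    Each batch is the set of alarm types seen in one time window.
    Returns {(A, B): confidence} for rules meeting both thresholds."""
    single = Counter()   # how often each alarm type occurs
    pair = Counter()     # how often an ordered pair co-occurs
    for batch in alarm_batches:
        for a in batch:
            single[a] += 1
        for a, b in combinations(sorted(batch), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    rules = {}
    for (a, b), count in pair.items():
        # support: co-occurrence count; confidence: P(B | A)
        if count >= min_support and count / single[a] > min_confidence:
            rules[(a, b)] = count / single[a]
    return rules
```

At alert time, matching an incoming alarm against these rules is what lets the system chain related alarms into a propagation path.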
Summary
The AI-driven system monitors KPI curves, detects a 0.24% success-rate drop, pinpoints the anomaly to disk-related modules via decision-tree root-cause analysis, and correlates alarms to isolate the underlying issue (e.g., a cloud-storage 2.0 fetch problem). This end-to-end pipeline shortens incident-resolution time across client, access, and logic layers.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.