
How Tencent Revolutionized Monitoring: From IDC Crises to AI‑Driven AIOps

This talk by Tencent’s monitoring R&D lead outlines a decade of evolution in large‑scale monitoring, covering real‑world incident cases, the three drivers behind architectural upgrades, the implementation of a three‑dimensional monitoring framework, and the application of AI‑powered AIOps for precise, rapid anomaly detection.

Efficient Ops

1. IDC Anomaly Case

On the weekend of February 10, 2018, a rack power outage hit a Shenzhen data center while most staff were asleep. A temperature anomaly was reported at 7:20 am, and the fault was traced to an air‑conditioning failure. Because the alerts arrived promptly, the operations team mobilized within ten minutes, assessed the impact across three regions, started migrating business traffic, and fully restored services by 7:40 am. The incident shows how much recovery depends on timely alerts and accurate data.

2. Three Driving Forces

The first driver was the rapid growth of managed nodes: from 10,000 network elements to 60,000 servers, and eventually 200,000 nodes, which required a complete architectural overhaul.

The second driver was the shift from private IDC to hybrid and public cloud environments, where IP alone could no longer uniquely identify resources, prompting a redesign of the monitoring data model.
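A minimal sketch of why IP alone stops working as an identifier once private IDC and cloud networks are mixed. The `ResourceKey` fields and the sample values are hypothetical illustrations, not the data model used in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceKey:
    """Hypothetical composite identity for a monitored resource."""
    provider: str     # e.g. "private-idc" or a public cloud name
    region: str
    network_id: str   # VPC or network segment; private IPs can repeat across these
    instance_id: str  # provider-assigned ID that survives an IP change


def index_by_ip(resources):
    """Group resource keys by IP address to expose IPs shared by several resources."""
    by_ip = {}
    for key, ip in resources:
        by_ip.setdefault(ip, []).append(key)
    return by_ip


# Sample inventory (made-up values): two different machines share 10.0.0.5.
resources = [
    (ResourceKey("private-idc", "shenzhen", "idc-net-1", "srv-001"), "10.0.0.5"),
    (ResourceKey("public-cloud", "guangzhou", "vpc-a", "ins-abc"), "10.0.0.5"),
    (ResourceKey("public-cloud", "guangzhou", "vpc-b", "ins-def"), "10.0.0.6"),
]

collisions = {ip: keys for ip, keys in index_by_ip(resources).items() if len(keys) > 1}
print(sorted(collisions))  # IPs that map to more than one resource
```

Keying metrics on the composite value rather than on the IP keeps the two `10.0.0.5` machines separate in storage and in alert routing.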

The third driver was the need for precise, centralized fault localization across millions of servers, leading to a micro‑service based monitoring platform that supports both passive and active data collection, multi‑dimensional metrics, and high‑throughput ingestion.

3. Three‑Dimensional Monitoring Solution

The solution combines traditional server/network monitoring, data‑layer performance metrics, and user‑side monitoring (including H5, HTTP response times, DNS latency, and sentiment analysis). Active probing (synthetic tests) is deployed in each IDC to measure CGI response times across regions.
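The active probing idea can be sketched as a small timing loop. This is an illustration under assumptions: the `probe` function, its parameters, and the example URL are hypothetical, and `fetch` is injected so the sketch runs without network access (in practice it could wrap a real HTTP client call).

```python
import statistics
import time


def probe(fetch, url, attempts=3, clock=time.perf_counter):
    """Call fetch(url) several times and summarize latency and success rate."""
    latencies_ms = []
    failures = 0
    for _ in range(attempts):
        start = clock()
        try:
            fetch(url)
        except Exception:
            # Any error counts as a failed probe; its latency is not recorded.
            failures += 1
            continue
        latencies_ms.append((clock() - start) * 1000.0)
    return {
        "url": url,
        "success_rate": (attempts - failures) / attempts,
        "median_ms": statistics.median(latencies_ms) if latencies_ms else None,
    }


# Usage with a stub instead of a real HTTP call, so the sketch runs offline.
result = probe(lambda url: None, "http://cgi.example.invalid/health")
print(result["success_rate"], result["median_ms"])
```

Running the same probe from each IDC and tagging the result with the source region gives the cross-region view of CGI response time described above.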

Server‑side monitoring uses both passive collection and active probing via SNMP/IPMI. To minimize the performance impact on monitored hosts, reporting goes through shared memory updated with atomic operations, which lets the reporting API handle up to 70 million calls per second. Multi‑dimensional aggregation cuts traffic by half and sustains 800,000 events per second.
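To illustrate how aggregating by dimension reduces traffic, here is a rough sketch that merges raw reports sharing the same metric and dimensions before they leave the host. It uses an ordinary in-process dictionary, not the shared-memory and atomic-operation design from the talk, and the `LocalAggregator` class, metric name, and dimension names are all hypothetical.

```python
from collections import defaultdict


class LocalAggregator:
    """Merge raw reports that share a metric and dimensions before sending them on."""

    def __init__(self):
        self._buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})
        self.raw_reports = 0

    def report(self, metric, value, **dimensions):
        # The key is order-independent: region=..., app=... equals app=..., region=...
        key = (metric, tuple(sorted(dimensions.items())))
        bucket = self._buckets[key]
        bucket["count"] += 1
        bucket["sum"] += value
        self.raw_reports += 1

    def flush(self):
        """Return one aggregated record per (metric, dimensions) and reset state."""
        records = [
            {"metric": metric, "dimensions": dict(dims),
             "count": bucket["count"], "sum": bucket["sum"]}
            for (metric, dims), bucket in self._buckets.items()
        ]
        self._buckets.clear()
        self.raw_reports = 0
        return records


agg = LocalAggregator()
for latency in (12.0, 15.0, 11.0):
    agg.report("cgi_latency_ms", latency, region="shenzhen", app="qq")
agg.report("cgi_latency_ms", 40.0, region="shanghai", app="qq")

raw = agg.raw_reports
records = agg.flush()
print(raw, "raw reports ->", len(records), "aggregated records")
```

The reduction depends entirely on how often reports repeat the same dimension combination within a flush interval, so the halving quoted above should be read as a property of that workload rather than of the technique in general.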

Data storage employs three models: a high‑performance TSDB for massive KPI metrics, a multi‑dimensional OLAP‑TSDB for complex queries, and a log storage engine. The TSDB provides millisecond‑level responses for millions of requests per second.

4. Intelligent Monitoring Scenarios (AIOps)

Smart monitoring emphasizes three principles: full coverage, precision (eliminating false alarms), and speed. Traditional threshold‑based alerts are inaccurate, hard to maintain, and prone to causing alarm fatigue. Unsupervised algorithms are introduced to detect anomalies, and they are combined with temporal correlation and root‑cause analysis.
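The talk summary does not name the algorithms used. As a stand-in, the sketch below shows one common unsupervised approach, a trailing-window median with median absolute deviation (MAD), which adapts to each metric's normal range instead of relying on a hand-set threshold. The function name, window size, threshold, and sample series are assumptions chosen for illustration.

```python
import statistics


def mad_anomalies(series, window=12, threshold=3.5):
    """Return indices whose value deviates strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        median = statistics.median(history)
        mad = statistics.median([abs(x - median) for x in history])
        # 1.4826 scales MAD to be comparable to a standard deviation for
        # normally distributed data; the floor avoids dividing by zero.
        scale = max(1.4826 * mad, 1e-9)
        if abs(series[i] - median) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Made-up success-rate samples (percent) with one sharp dip at index 13.
success_rate = [99.1, 99.0, 99.2, 99.1, 99.0, 99.1, 99.2, 99.0,
                99.1, 99.1, 99.2, 99.0, 99.1, 97.0, 99.1]
print(mad_anomalies(success_rate))  # [13]
```

A static threshold such as "alert below 95%" would miss this dip entirely, while the windowed score flags it because the series normally moves by only about 0.1 points.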

When micro‑service call graphs are unavailable, periodic packet captures combined with domain knowledge and AI techniques reconstruct service dependencies, enabling rapid identification of the underlying cause of alerts.
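A simplified sketch of the dependency idea: build a caller-to-callee map from observed flows, then keep only the alerting services whose own dependencies are healthy. It assumes the captured packets have already been resolved to service names, which is the hard part the talk attributes to domain knowledge and AI techniques and which is not reproduced here. The function names and service names are hypothetical.

```python
from collections import defaultdict


def build_dependencies(flows):
    """Map each caller to the set of services it was observed calling."""
    deps = defaultdict(set)
    for caller, callee in flows:
        deps[caller].add(callee)
    return deps


def likely_root_causes(deps, alerting):
    """Return alerting services none of whose direct dependencies are alerting."""
    alerting = set(alerting)
    return sorted(s for s in alerting if not (deps.get(s, set()) & alerting))


# Made-up (caller, callee) pairs, as if derived from periodic packet captures.
flows = [
    ("web", "login"),
    ("web", "feed"),
    ("login", "userdb"),
    ("feed", "cache"),
]
deps = build_dependencies(flows)
print(likely_root_causes(deps, {"web", "login", "userdb"}))  # ['userdb']
```

Here `web` and `login` are alerting only because something they depend on is alerting, so the alert storm collapses to a single candidate.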

Case studies show that even a 2‑percentage‑point drop in success rate can be detected and traced to specific dimensions (e.g., mobile QQ), with automated analysis guiding developers to relevant logs for swift resolution.
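One way to picture the drill-down is to compare each dimension value's failures against what its baseline failure rate would predict, then rank by the excess. This is an illustrative sketch, not the analysis method from the talk. The function name and all counts are invented, apart from the "mobile QQ" dimension value mentioned above.

```python
def rank_dimension_values(before, after):
    """Rank dimension values by failures in excess of their baseline rate.

    Both arguments map a dimension value to (total_requests, failed_requests).
    """
    ranked = []
    for value, (total_after, fail_after) in after.items():
        total_before, fail_before = before.get(value, (0, 0))
        baseline_rate = fail_before / total_before if total_before else 0.0
        expected_failures = baseline_rate * total_after
        ranked.append((value, fail_after - expected_failures))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented counts: overall success falls from 99% to 97%, a 2-point drop.
before = {"mobile_qq": (1000, 10), "pc_qq": (1000, 10), "web": (1000, 10)}
after = {"mobile_qq": (1000, 70), "pc_qq": (1000, 10), "web": (1000, 10)}

ranked = rank_dimension_values(before, after)
print(ranked[0])  # the dimension value contributing the most extra failures
```

With these numbers the entire drop is attributable to `mobile_qq`, which is the kind of pointer that lets a developer go straight to the relevant logs.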

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
