How AIOps Transforms DevOps: Real-World Cases from Tencent

This article explores the emerging field of AIOps, comparing rule‑based operations with AI‑driven approaches, outlining a five‑level AIOps maturity model, and presenting several Tencent case studies that demonstrate cost reduction, quality improvement, root‑cause analysis, and automated scaling.

Efficient Ops

Aug 19, 2018

1. From an NLP story

From the 1950s through the 1960s, researchers tried to make computers understand natural language with hand-written parsing rules, with limited success. From the 1970s, statistical methods took over: IBM's statistical speech recognition, and later Google's statistical machine translation, outperformed the rule-based systems that preceded them.

The speaker draws a parallel between the evolution of NLP and operations: as services grow, logs and metrics explode, making rule‑based operations increasingly inadequate.

Rules are easy to understand but cannot keep up with massive, complex data. AIOps augments DevOps by applying AI to the rule‑based part, turning static if‑else logic into learnable models.

2. From API to "learnware"

"Learnware" (学件, literally "learning component") is a concept from Prof. Zhou Zhihua. It refers to a model that keeps learning from data while the algorithm stays open and the data stays private.

These components can be trained on proprietary data, then shared without exposing raw data, enabling safe, evolvable AI solutions.
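
As a rough illustration of "train on private data, share only the model", the sketch below fits a simple metric baseline and exports nothing but its learned parameters. The class name `BaselineComponent`, its fields, and the JSON payload format are all hypothetical; the talk does not describe a concrete interface.

```python
import json
from statistics import mean, stdev


class BaselineComponent:
    """Hypothetical learnable component: learns a metric baseline and
    exposes only its parameters, never the data it was trained on."""

    def __init__(self, mu=0.0, sigma=1.0, n_seen=0):
        self.mu, self.sigma, self.n_seen = mu, sigma, n_seen

    def fit(self, private_values):
        # Raw values are read here and are not stored on the object.
        self.mu = mean(private_values)
        self.sigma = stdev(private_values)
        self.n_seen = len(private_values)
        return self

    def score(self, value):
        # Distance from the learned baseline, in standard deviations.
        return abs(value - self.mu) / self.sigma if self.sigma else 0.0

    def export(self):
        # Only learned parameters leave the owner's environment.
        return json.dumps({"mu": self.mu, "sigma": self.sigma, "n_seen": self.n_seen})

    @classmethod
    def load(cls, payload):
        return cls(**json.loads(payload))


# Made-up private data held by the component's owner.
private_latencies_ms = [20, 22, 19, 21, 18]
payload = BaselineComponent().fit(private_latencies_ms).export()

# Another team loads the shared parameters and scores its own data.
shared = BaselineComponent.load(payload)
print(shared.score(30) > 3)  # True: 30 ms is far outside the learned baseline
```

The receiving side can refit or extend the component on its own data, which is the "evolvable" part of the idea.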

3. Practice case studies

3.1 Cost – Intelligent memory cooling

Memory-resident KV stores become expensive as data grows. Tencent extracted dozens of features per data type and trained logistic-regression and random-forest classifiers to separate hot data from cold. The models identified ~90% of data suitable for cold storage; moving that data to disk did not affect latency or success rate and yielded an 8-10× efficiency gain.
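
A minimal sketch of the classification step, covering logistic regression only and written in plain Python so it runs without dependencies. The two features (recency and read rate), their scaling, and the training rows are invented for illustration; the talk mentions dozens of features per data type but does not list them.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on log loss."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def cold_probability(w, b, x):
    """Estimated probability that a key with features x is cold."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical features per key, both scaled to roughly [0, 1]:
# (days since last access / 30, reads per day / 100)
rows = [(0.03, 0.90), (0.10, 0.60), (0.07, 0.75), (0.20, 0.40),
        (0.90, 0.01), (0.70, 0.02), (1.00, 0.00), (0.60, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = cold, 0 = hot

w, b = train_logistic(rows, labels)
print(cold_probability(w, b, (0.95, 0.01)) > 0.5)  # True: stale, rarely read
print(cold_probability(w, b, (0.05, 0.80)) > 0.5)  # False: recent, often read
```

In practice the decision threshold would be tuned so that moving a key to disk does not hurt latency, which is the constraint the case study reports meeting.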

3.2 Quality – Unified monitoring without thresholds

Static threshold-based alerts fail for metrics whose normal level keeps changing. Tencent applied a three-step pipeline: (1) statistical 3-sigma detection, (2) unsupervised isolation-forest anomaly scoring, and (3) supervised labeling to refine alerts. This approach covered more than 120,000 monitoring views across more than 100,000 devices, achieving full-coverage anomaly detection.
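
The first step of the pipeline, 3-sigma detection, can be sketched with the standard library alone. The function name, the split into a `history` baseline and a `recent` window, and the sample values are assumptions made for this example; the isolation-forest and supervised steps are not shown.

```python
from statistics import mean, stdev


def three_sigma_anomalies(history, recent, k=3.0):
    """Return (index, value) pairs from `recent` that lie more than
    k standard deviations from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any different value is anomalous.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [(i, v) for i, v in enumerate(recent) if abs(v - mu) > k * sigma]


# Made-up request counts per minute.
history = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
recent = [101, 99, 140, 100]

print(three_sigma_anomalies(history, recent))  # [(2, 140)]
```

The baseline is computed from a separate history window on purpose: a large spike inside the same window would inflate the standard deviation and could hide itself.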

3.3 Quality – ROOT intelligent root‑cause analysis

The system first builds a service-call graph and uses DBSCAN to cluster tightly coupled modules. It then applies frequent-itemset mining and correlation analysis to isolate the probable root causes of an incident. The team is also experimenting with Bayesian ranking.
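
To make the frequent-itemset step concrete, the sketch below counts which modules alert together across past incidents. It uses brute-force counting rather than an optimized algorithm such as Apriori or FP-Growth, and the module names and incident data are invented; the talk does not say which mining algorithm ROOT uses.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for every itemset of up to max_size items
    that appears in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {k: v for k, v in counts.items() if v >= min_support}


# Each set lists the (hypothetical) modules that alerted during one incident.
incidents = [
    {"db-proxy", "order-api", "cart-api"},
    {"db-proxy", "order-api"},
    {"db-proxy", "order-api", "search-api"},
    {"cache", "search-api"},
    {"db-proxy", "cart-api"},
]

result = frequent_itemsets(incidents, min_support=3)
print(result)
# {('db-proxy',): 4, ('order-api',): 3, ('db-proxy', 'order-api'): 3}
```

Here `order-api` never alerts without `db-proxy`, which makes `db-proxy` the first candidate to inspect; correlation analysis on the metrics would then confirm or reject it.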

3.4 Efficiency – Automatic scaling (Weaving Cloud)

Weaving Cloud automates a 20-step resource-provisioning workflow. When CPU usage exceeds 75%, the system triggers auto-scaling and adds new instances. It then balances load across instances with a model whose loss function is minimized by gradient descent, improving capacity utilization by ~22%.
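
The trigger itself is simple arithmetic, sketched below. Only the 75% threshold comes from the talk; the 60% post-scaling target, the function name, and the assumption that load spreads evenly over identical instances are illustrative. The learned load-balancing step is not modeled here.

```python
import math

CPU_THRESHOLD = 75.0  # scale-out trigger from the talk, in percent


def instances_to_add(current_instances, avg_cpu_percent, target_cpu_percent=60.0):
    """Return how many instances to add so that average CPU falls to the
    target, assuming load redistributes evenly across identical instances."""
    if avg_cpu_percent <= CPU_THRESHOLD:
        return 0  # below the trigger: no scaling action
    total_load = current_instances * avg_cpu_percent
    needed = math.ceil(total_load / target_cpu_percent)
    return max(needed - current_instances, 0)


print(instances_to_add(10, 90.0))  # 5: 900 load units / 60 = 15 instances
print(instances_to_add(10, 70.0))  # 0: under the 75% trigger
```

Keeping the target below the trigger leaves headroom, so the cluster does not cross the threshold again right after scaling.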

4. Reflection and outlook

AIOps is still nascent; the goal is to package learnable components as reusable modules, standardize data formats, and foster community‑wide sharing. The speaker envisions public AIOps components, shared standards, and a future where AIOps becomes a trusted, intelligent assistant for IT operations.

Note: This article is compiled from Zhao Jianchun’s presentation at DOIS 2018 Beijing.
