How AIOps Can Transform IT Operations: A Practical Roadmap from Tsinghua’s Pei Dan
This article outlines the challenges of modern IT operations, explains why traditional methods fall short, and presents a detailed AIOps roadmap—including anomaly detection, localization, root‑cause analysis, and prediction—drawn from Professor Pei Dan’s research and real‑world examples.
Overview
Machine‑learning‑driven intelligent operations (AIOps) has become a major trend in the operations field. On January 11, Professor Pei Dan of Tsinghua University visited 360 to share practical experience with implementing AIOps.
1. Goals and Significance of Operations
Operations will play an increasingly critical role. The number of devices worldwide is projected to reach 50 to 100 billion by 2020, and those devices support services across the Internet, finance, IoT, smart manufacturing, telecom, power grids, and government. Operations must keep these services reliable, fast, efficient, and secure, which directly affects revenue and cost.
2. Pain Points
The main pain points for operators are unexpected failures: detecting them, containing them, repairing them, and preventing them.
3. Sources of Pain
Complex, rapidly evolving hardware and software ecosystems—such as large‑scale network topologies, cloud‑center updates, micro‑service call graphs, and continuous integration/DevOps practices—create a flood of incidents that human‑only analysis cannot keep up with.
4. Solution – AIOps
By leveraging massive monitoring data and machine‑learning techniques, AIOps aims to automate decision‑making for incident detection, mitigation, root‑cause analysis, and prediction. The envisioned workflow includes data collection, an AIOps engine that suggests actions, expert validation, and automated script execution for mitigation, repair, or capacity scaling.
The engine’s modules—anomaly detection, anomaly localization, root‑cause analysis, and anomaly prediction—work together to reduce human workload and improve response speed.
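The workflow above (engine suggestion, expert validation, script execution) can be sketched as a small human-in-the-loop cycle. This is only an illustration: the `suggest` rule, the thresholds, the `checkout-service` name, and the `SCRIPTS` table are hypothetical placeholders, not part of Professor Pei's system, and a real engine would use trained models rather than fixed rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Suggestion:
    action: str  # name of a predefined remediation script
    target: str  # component the action applies to
    reason: str  # short explanation shown to the operator


def suggest(metrics: Dict[str, float]) -> Optional[Suggestion]:
    """Stand-in for the AIOps engine: fixed rules instead of a trained model."""
    if metrics.get("error_rate", 0.0) > 0.05:
        return Suggestion("rollback", "checkout-service", "error rate above 5%")
    if metrics.get("cpu_util", 0.0) > 0.90:
        return Suggestion("scale_out", "checkout-service", "CPU above 90%")
    return None


def run_cycle(
    metrics: Dict[str, float],
    approve: Callable[[Suggestion], bool],
    scripts: Dict[str, Callable[[str], str]],
) -> str:
    """One pass: collect metrics, get a suggestion, ask an expert, run a script."""
    suggestion = suggest(metrics)
    if suggestion is None:
        return "no action"
    if not approve(suggestion):  # expert validation gate
        return "rejected by operator"
    return scripts[suggestion.action](suggestion.target)


# Hypothetical remediation scripts; real ones would call deployment tooling.
SCRIPTS: Dict[str, Callable[[str], str]] = {
    "rollback": lambda target: f"rolled back {target}",
    "scale_out": lambda target: f"scaled out {target}",
}
```

Calling `run_cycle({"error_rate": 0.2}, lambda s: True, SCRIPTS)` returns `"rolled back checkout-service"`. Keeping the approval step as an explicit callback makes it easy to start with a human approving every action and relax that gate later for actions that have proven safe.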
5. AIOps Deployment Status
Many organizations remain cautious about deploying AIOps because applying off‑the‑shelf machine‑learning models as black boxes often fails. The challenges include a lack of labeled data, choosing suitable algorithms for millions of KPIs, and model drift after software changes.
Professor Zhang Bo of Tsinghua emphasizes that AI excels when data is abundant, the problem is well‑defined, and the domain is narrow. Therefore, AIOps must decompose complex operational problems into AI‑friendly sub‑tasks.
6. Methodology – “Butcher’s Knife” Approach
Complex problems are broken down into well‑defined pieces that AI can solve. For anomaly detection, a multi‑stage pipeline is proposed: start with unsupervised detection, then a lightweight UI for operators to label cases, followed by supervised refinement, transfer‑learning‑based adaptation for pattern shifts, and finally KPI clustering to assign appropriate algorithms at scale.
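As a rough illustration of the first stage only (unsupervised detection), the sketch below flags points in a KPI series that deviate strongly from a rolling median. The article does not name a specific algorithm, so the median/MAD scoring, the `window` size, and the `threshold` value are assumptions chosen for simplicity; the later stages (labeling, supervised refinement, transfer learning, clustering) are not shown.

```python
from statistics import median
from typing import List, Sequence


def mad_anomalies(
    values: Sequence[float], window: int = 20, threshold: float = 3.5
) -> List[int]:
    """Return indices whose value deviates strongly from the recent median.

    Each point is scored against the previous `window` points using the
    median absolute deviation (MAD), which is robust to earlier spikes.
    """
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = list(values[i - window:i])
        center = median(history)
        mad = median([abs(v - center) for v in history])
        # 1.4826 rescales MAD to be comparable to a standard deviation
        # for normally distributed data; guard against a zero MAD.
        scale = 1.4826 * mad if mad > 0 else 1e-9
        score = abs(values[i] - center) / scale
        if score > threshold:
            anomalies.append(i)
    return anomalies


# A KPI that alternates between 10 and 11, with one spike at index 25.
kpi = [10.0 if i % 2 == 0 else 11.0 for i in range(30)]
kpi[25] = 50.0
print(mad_anomalies(kpi))  # [25]
```

In the pipeline described above, the points flagged here would be shown to operators for labeling, and those labels would then drive the supervised refinement, for example by tuning `threshold` per KPI.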
7. Fault Localization, Repair, and Prevention
After detection, the system localizes the fault to a granularity sufficient for predefined remediation scripts (e.g., rollback, scaling, traffic shifting). Root‑cause analysis builds a fault tree using call‑graph and configuration data, then prunes it with AI‑driven correlation. Prediction modules forecast performance bottlenecks, capacity limits, and potential failures, enabling proactive actions.
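The root‑cause step can be sketched as a walk over the call graph that keeps only dependencies whose KPI moves together with the symptom's KPI. This is a simplified stand-in: plain Pearson correlation replaces whatever AI‑driven correlation the real system uses, configuration data is ignored, and the service names, KPI values, and `min_corr` cutoff are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def prune_candidates(
    symptom: str,
    call_graph: Dict[str, List[str]],
    kpis: Dict[str, Sequence[float]],
    min_corr: float = 0.8,
) -> List[str]:
    """Return dependencies of `symptom` whose KPI correlates with the symptom.

    A subtree is explored only if its root correlates, so uncorrelated
    branches of the fault tree are pruned early.
    """
    kept: List[str] = []
    seen = set()
    stack = list(call_graph.get(symptom, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if abs(pearson(kpis[symptom], kpis[node])) >= min_corr:
            kept.append(node)
            stack.extend(call_graph.get(node, []))
    return sorted(kept)


# Hypothetical services: "web" calls "api" and "cache"; "api" calls "db".
call_graph = {"web": ["api", "cache"], "api": ["db"], "cache": [], "db": []}
kpis = {
    "web": [1, 1, 1, 5, 6, 5],    # latency jumps at the fourth sample
    "api": [2, 2, 2, 9, 10, 9],   # follows the same jump
    "cache": [3, 4, 3, 4, 3, 4],  # unrelated oscillation
    "db": [1, 1, 1, 4, 5, 4],     # follows the same jump
}
print(prune_candidates("web", call_graph, kpis))  # ['api', 'db']
```

The surviving candidates would then be mapped to predefined remediation scripts such as rollback, scaling, or traffic shifting. Correlation alone does not establish causation, so the output is best read as a shortlist for further analysis rather than a verdict.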
8. Summary
The presented AIOps roadmap, built on years of research and collaboration between academia and industry, calls for a community‑driven challenge platform to gather industrial data and algorithmic expertise, accelerating the practical adoption of AIOps worldwide.
360 Zhihui Cloud Developer
