How Tencent Scaled AIOps: Building a Unified Operations PaaS Platform

Tencent’s integrated AIOps platform, ZhiYun, demonstrates how a unified operations PaaS—leveraging CMDB-driven object modeling, layered monitoring, and AI-powered data banking—can transform massive-scale service management from manual, fragmented processes into automated, business‑value‑focused operations across hundreds of thousands of devices.

Efficient Ops

Preface

Mention Tencent’s operations team and the word “massive” immediately comes to mind. Ever since former CTO Zhang Zhidong proposed the “Massive Operations Methodology” in 2004, it has guided the team in building systematic operational capabilities that support a wide range of products.

The integrated AIOps platform “ZhiYun” has evolved from managing a few thousand devices to over 200,000, illustrating the transition from traditional operations to intelligent operations.

Note: Tencent ZhiYun Community Edition (Lite) will be released soon.

Building a General Operations PaaS Platform

Before AI could be combined with operations, many routine problems had to be solved. Tencent adopts a business‑value‑oriented operations philosophy, classifying operation objects into layers such as network, device, system, component, business, and user.

Each operation object contains configuration, business, monitoring, and tool‑association attributes that evolve throughout its lifecycle.

Standardizing these objects enables a CMDB to record, consume, and update them, forming an online experience repository that guides both human operators and tools.
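As an illustration of the idea above, here is a minimal sketch of a standardized operation object and a CMDB that records, consumes, and updates it. The class names, field names, and the `lifecycle_state` values are assumptions made for this example; the article does not describe ZhiYun's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The object layers named in the article."""
    NETWORK = "network"
    DEVICE = "device"
    SYSTEM = "system"
    COMPONENT = "component"
    BUSINESS = "business"
    USER = "user"


@dataclass
class OperationObject:
    """Hypothetical operation object with the four attribute groups."""
    object_id: str
    layer: Layer
    configuration: dict = field(default_factory=dict)
    business: dict = field(default_factory=dict)
    monitoring: dict = field(default_factory=dict)
    tools: list = field(default_factory=list)
    lifecycle_state: str = "provisioned"  # assumed state name


class CMDB:
    """In-memory stand-in for a CMDB: record, consume, update."""

    def __init__(self):
        self._objects = {}

    def record(self, obj):
        self._objects[obj.object_id] = obj

    def consume(self, object_id):
        return self._objects[object_id]

    def update(self, object_id, **changes):
        obj = self._objects[object_id]
        for name, value in changes.items():
            # Reject attributes outside the standardized model.
            if not hasattr(obj, name):
                raise AttributeError(f"unknown attribute: {name}")
            setattr(obj, name, value)
        return obj
```

Rejecting unknown attributes in `update` is one simple way to keep every object inside the standard model, so that tools reading the CMDB can rely on a fixed set of fields.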

Ensuring CMDB data consistency with operational tools and monitoring systems is critical. Tencent solves this by managing the full lifecycle of operation objects with standardized tools and processes.

Controlling CMDB read/write scenarios guarantees data consistency, traceability, and auditability, linking operational changes to monitoring alerts.
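One way to picture controlled writes is a store in which every change goes through a single method that also appends an audit record, so an alert can later be matched against recent changes to the same object. The sketch below assumes this design; `AuditedCMDB`, `ChangeRecord`, and `changes_near` are hypothetical names, not part of any Tencent API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """One audited write: who changed what, when, and from which value."""
    timestamp: float
    actor: str
    object_id: str
    attribute: str
    old_value: object
    new_value: object


class AuditedCMDB:
    """Hypothetical store where every write is logged."""

    def __init__(self):
        self._data = {}
        self._log = []

    def write(self, timestamp, actor, object_id, attribute, value):
        attrs = self._data.setdefault(object_id, {})
        record = ChangeRecord(
            timestamp, actor, object_id, attribute, attrs.get(attribute), value
        )
        attrs[attribute] = value
        self._log.append(record)
        return record

    def read(self, object_id, attribute):
        return self._data[object_id][attribute]

    def changes_near(self, object_id, alert_time, window_seconds=600.0):
        """Changes to this object made within the window before an alert."""
        return [
            r for r in self._log
            if r.object_id == object_id
            and 0 <= alert_time - r.timestamp <= window_seconds
        ]
```

Because writes cannot bypass `write`, the log is complete by construction, and `changes_near` gives an operator the candidate changes to inspect when an alert fires.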

All control operations are abstracted as “resource‑transfer‑execution”, forming the basis for tool platform design.

The platform tool‑chains enable serial execution of multiple tools, supporting complex operational scenarios.
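The article does not define the three steps of “resource‑transfer‑execution” in detail. The sketch below assumes one plausible reading: pick the target resources, transfer a payload to each, then execute it, with a tool‑chain running such tools one after another and stopping at the first failure. All names here (`Tool`, `run_chain`, the callback fields) are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """A tool expressed as three assumed steps."""
    name: str
    select_resources: Callable[[], List[str]]  # "resource": which targets
    transfer: Callable[[str], str]             # "transfer": deliver a payload
    execute: Callable[[str, str], bool]        # "execution": run it, True on success

    def run(self):
        results = {}
        for target in self.select_resources():
            payload = self.transfer(target)
            results[target] = self.execute(target, payload)
        return results


def run_chain(tools):
    """Run tools serially; stop after the first tool with any failed target."""
    history = []
    for tool in tools:
        results = tool.run()
        history.append((tool.name, results))
        if not all(results.values()):
            break
    return history
```

Stopping on the first failure is a design choice for this example; a real chain might instead retry, roll back, or continue on a subset of targets.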

By modeling operations around CMDB data and tool‑chains, repetitive tasks become standardized, reducing reliance on individual expertise.

Constructing a Three‑Dimensional Monitoring System

Operations at Tencent are treated as “technical operations”, encompassing data collection, analysis, and alerting. Monitoring is divided into three dimensions:

Monitoring – coverage, status feedback, metric measurement.

Alerting – timeliness, accuracy, correlation, reach.

Operations – RCA, incident management, reporting, assessment.

Monitoring data is categorized into low‑level (infrastructure) and high‑level (business) indicators. High‑level metrics directly reflect service availability, while low‑level metrics often generate noise.

By consolidating low‑level metrics into high‑level ones, the system improves signal‑to‑noise ratio and aligns alerts with business impact.
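A rough sketch of such consolidation, under the assumption that the CMDB can map each host to the business service it supports: host‑level alerts are rolled up per service, and a service‑level alert is raised only when a large enough share of its hosts is unhealthy. The function name and the 50% default threshold are assumptions for illustration.

```python
from collections import defaultdict


def consolidate(host_to_service, unhealthy_hosts, min_unhealthy_ratio=0.5):
    """Roll host-level alerts up into service-level alerts.

    Returns {service: unhealthy_ratio} for services at or above the threshold.
    """
    total = defaultdict(int)
    unhealthy = defaultdict(int)
    for host, service in host_to_service.items():
        total[service] += 1
        if host in unhealthy_hosts:
            unhealthy[service] += 1
    return {
        service: unhealthy[service] / total[service]
        for service in total
        if unhealthy[service] / total[service] >= min_unhealthy_ratio
    }
```

For example, with four hosts behind a "pay" service and two behind "chat", one unhealthy "pay" host (25%) raises nothing, while two unhealthy "chat" hosts (100%) produce a single service‑level alert instead of two host‑level ones.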

The integrated AIOps platform unifies monitoring and alerting, leveraging CMDB relationships to provide a holistic view.

Exploring AI‑Driven Operations Scenarios

After establishing quality and efficiency foundations, Tencent applied AI to operations. AIOps needs large volumes of data: labeled samples for supervised learning and raw, unlabeled data for unsupervised learning. Data silos across heterogeneous monitoring systems made that data hard to gather.

The “Data Bank” centralizes and preprocesses diverse monitoring data, offering generic pipelines such as regex parsing, translation, statistics, and numerical computation, as well as plugin extensibility.
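The generic pipeline stages can be pictured as small functions chained together, with each stage acting as a plugin. The sketch below is an assumed illustration: the log format, the `CODE_NAMES` table, and the stage names are invented for the example and do not describe the Data Bank's real interfaces.

```python
import re
from statistics import mean

# Assumed log line format, for illustration only.
LINE = re.compile(r"(?P<host>\S+) code=(?P<code>\d+) latency_ms=(?P<latency>\d+)")
CODE_NAMES = {"0": "ok", "1": "timeout"}  # hypothetical translation table


def parse(lines):
    """Regex parsing: keep only lines that match, as dicts of fields."""
    for line in lines:
        match = LINE.search(line)
        if match:
            yield match.groupdict()


def translate(records):
    """Translation: map raw codes to names and convert latency to int."""
    for r in records:
        yield {**r,
               "status": CODE_NAMES.get(r["code"], "unknown"),
               "latency": int(r["latency"])}


def summarize(records):
    """Statistics: count, mean latency, and number of non-ok records."""
    records = list(records)
    return {
        "count": len(records),
        "mean_latency_ms": mean(r["latency"] for r in records) if records else None,
        "errors": sum(r["status"] != "ok" for r in records),
    }


def run_pipeline(lines, stages):
    """Apply each stage to the output of the previous one."""
    data = lines
    for stage in stages:
        data = stage(data)
    return data
```

Calling `run_pipeline(lines, [parse, translate, summarize])` runs the three stages in order; adding a new stage means writing one more function with the same shape and adding it to the list.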

It also provides analytical capabilities like OLAP drill‑down, Gaussian analysis, and clustering for time‑series data.

Using this foundation, the “monitor” system processes 3 million time‑series points per second, combining statistical (3‑Sigma) and unsupervised (Isolation Forest) algorithms to flag anomalies, then refines models with human‑labeled samples.

This approach enables threshold‑free, second‑level monitoring of massive metrics.
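To make the statistical half concrete, here is a minimal 3‑Sigma detector over a rolling window, written with the standard library only. It is a sketch, not the production detector: the window size, the minimum sample count, and the choice to keep flagged points out of the baseline are assumptions, and the Isolation Forest stage is not shown.

```python
from collections import deque
from statistics import mean, pstdev


class ThreeSigmaDetector:
    """Flag a point that lies more than 3 standard deviations from the
    mean of recent history. No per-metric threshold is configured."""

    def __init__(self, window=60, min_samples=10):
        self.history = deque(maxlen=window)
        self.min_samples = min_samples

    def observe(self, value):
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            is_anomaly = abs(value - mu) > 3 * sigma
        # Assumed policy: anomalies do not contaminate the baseline.
        if not is_anomaly:
            self.history.append(value)
        return is_anomaly
```

Keeping anomalies out of the history protects the baseline from spikes, but it also means a genuine, permanent level shift would keep being flagged; a real system would need a way to accept the new level, for instance through the human labeling step the article describes.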

The detection framework consists of offline training (statistical + unsupervised, human review, supervised training) and online prediction (model loading, real‑time inference, continuous human correction), with an A/B testing module for model rollout.

Offline: statistical and unsupervised algorithms generate candidate anomalies; humans label them; features are extracted; supervised models are trained.

Online: deployed models evaluate live data; false positives are fed back for retraining; A/B testing selects the best model for global deployment.
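The model selection step can be sketched as scoring each candidate against human‑labeled samples and promoting the best one. This simplified version scores all candidates on the same labeled set rather than splitting live traffic, and it uses F1 as the metric; both choices, and the function names, are assumptions made for the example.

```python
def f1_score(predictions, labels):
    """F1 from boolean predictions and boolean human labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(p and l for p, l in pairs)
    fp = sum(p and not l for p, l in pairs)
    fn = sum(l and not p for p, l in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_model(candidates, samples, labels):
    """Score every candidate model and return (best_name, all_scores).

    candidates maps a name to a callable that returns True for an anomaly.
    """
    scores = {
        name: f1_score([model(x) for x in samples], labels)
        for name, model in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Corrected false positives from the online side would simply become new entries in `samples` and `labels`, so the same comparison can be rerun after each retraining round.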

Beyond anomaly detection, AI techniques such as decision trees, Apriori/FP‑Growth, NLP, and reinforcement learning are explored for root‑cause analysis, alert convergence, complaint detection, and performance tuning.
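As a small illustration of how frequent‑itemset mining could support alert convergence, the sketch below finds pairs of alerts that often fire in the same time window. It covers only the first two levels of an Apriori‑style search (frequent single alerts, then frequent pairs), not full Apriori or FP‑Growth, and the alert names and support threshold are made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_alert_pairs(windows, min_support=0.5):
    """Return {(alert_a, alert_b): support} for pairs co-occurring in at
    least min_support of the time windows.

    windows is a list of lists, one list of alert names per time window.
    """
    n = len(windows)
    if n == 0:
        return {}
    # Level 1: alerts that are frequent on their own.
    item_counts = Counter(a for w in windows for a in set(w))
    frequent_items = {a for a, c in item_counts.items() if c / n >= min_support}
    # Level 2: only pairs of frequent alerts can themselves be frequent.
    pair_counts = Counter()
    for w in windows:
        kept = sorted(set(w) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}
```

Pairs that pass the threshold are candidates for merging into a single notification, since they tend to describe the same underlying incident.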

The integrated AIOps platform, with its CMDB, automation, monitoring, and data‑bank capabilities, abstracts massive operational scenarios into reusable tools and data models, making Tencent’s standardized operations applicable to traditional enterprises.

Conclusion

The convergence of AI and operations offers new solutions to long‑standing operational challenges, accelerating the adoption of intelligent operations across enterprises. Tencent’s ZhiYun platform aims to share its accumulated methodologies, technologies, and data models with cloud customers, inviting them to join the business‑value‑driven operations movement.
