How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, speakers described how AIOps can transform large‑scale operations. The talks covered data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

360 Zhihui Cloud Developer

As operations grow rapidly in scale and complexity, traditional methods can no longer meet the needs of modern system management. Maturing AI technology has given rise to intelligent operations (AIOps), which brings significant changes and opportunities to the industry.

Introduction

On September 22, the 360 Internet Technology Training Camp (Session 18) hosted four talks on AIOps. The first, “AIOps in 360 – You Can Deploy Quickly,” described 360’s intelligent operations framework and suggested alternatives for its individual components, aiming to help small and medium companies start their AIOps journey. The second, “AI‑Driven Fault Self‑Healing Based on StackStorm,” demonstrated concrete scenarios in which prediction, anomaly detection, and correlation analysis models were integrated to enable automated remediation.

Knowledge‑Graph‑Based CMDB

Xiao Yunpeng of Yixin presented a next‑generation intelligent CMDB built on knowledge graphs. Algorithms can analyze massive data to generate operational rules, but many rules for fixed scenarios can be derived directly from expert experience or existing command‑call relationships, which reduces the need for large‑scale data mining.

Log‑Based Intelligent Operations

Du Weipu from LogEasy shared practices on intelligent operations and security using log big data. For third‑party SaaS providers, integrating agents can be challenging due to security concerns, so LogEasy focuses on large‑scale log ingestion, applying NLP keyword analysis and time‑series anomaly detection to derive insights.

360 AIOps Implementation

Since early 2018, 360 has built an end‑to‑end AIOps pipeline: operational big data → AI Center → Alert Self‑Healing → Operations Dashboard. Common high‑frequency scenarios such as resource reclamation, alarm noise reduction, and alarm correlation are addressed with AI models (classification, time‑series forecasting, anomaly detection, and root‑cause analysis).
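
As a rough illustration of alarm noise reduction, the sketch below folds alerts from the same host into one group when they arrive close together in time. It is not 360's correlation model: the `Alert` fields, the per‑host grouping rule, and the 60‑second window are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    host: str          # hypothetical field: machine that raised the alert
    metric: str        # hypothetical field: what was alerted on
    timestamp: float   # seconds since epoch


def group_alerts(alerts: List[Alert], window_s: float = 60.0) -> List[List[Alert]]:
    """Group alerts per host; an alert joins the host's latest group if it
    arrives within window_s of that group's most recent alert."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}  # host -> most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.host)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_s:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.host] = group
    return groups


alerts = [
    Alert("db1", "io_wait", 0.0),
    Alert("db1", "disk_latency", 30.0),
    Alert("web1", "cpu", 40.0),
    Alert("db1", "io_wait", 200.0),
]
groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "notifications")
```

Sending one notification per group instead of one per alert is the simplest form of noise reduction; a learned correlation model would replace the fixed host‑and‑window rule.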

The architecture mirrors a human body: monitoring data are the eyes, the AI Center is the brain, self‑healing actions are the hands/feet, and the dashboard is the face. This holistic view enables detection, analysis, and automated remediation.

Operations Dashboard

The dashboard visualizes resource reclamation costs, efficiency gains, and backbone network metrics, and pushes real‑time notifications for large‑scale alarms. It supports interactive features such as zoom and gesture control, along with usage analytics, encouraging creative extensions beyond traditional monitoring.

System Architecture

Data collection relies on custom agents that gather hardware metrics, logs, process information, and external network quality metrics, storing each in a suitable back end (Elasticsearch, MongoDB, InfluxDB). A lightweight gateway aggregates data, performs minimal processing, and forwards alerts. It is designed as a stateless web service so that it scales easily.
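
To make the stateless gateway idea concrete, here is a minimal sketch of a routing step written as a pure function. The article names the back ends but does not say which data goes where, so the `BACKEND_BY_KIND` mapping, the `kind` and `alert` record fields, and the `route` function are hypothetical.

```python
from typing import Any, Dict

# Hypothetical mapping from record kind to storage back end (an assumption;
# the source does not specify the actual assignment).
BACKEND_BY_KIND = {
    "log": "elasticsearch",
    "process": "mongodb",
    "hardware": "influxdb",
    "network_quality": "influxdb",
}


def route(record: Dict[str, Any]) -> Dict[str, Any]:
    """Decide where a record goes. The result depends only on the record
    itself, so any gateway instance can handle any request."""
    kind = record.get("kind")
    backend = BACKEND_BY_KIND.get(kind)
    if backend is None:
        raise ValueError(f"unknown record kind: {kind!r}")
    return {
        "backend": backend,
        "forward_alert": bool(record.get("alert", False)),
        "payload": record,
    }


decision = route({"kind": "network_quality", "link": "bj-sh", "loss_pct": 12.5, "alert": True})
print(decision["backend"], decision["forward_alert"])
```

Because the function keeps no state between calls, more gateway instances can be added behind a load balancer without any coordination between them.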

Core Models

Three primary models are deployed: time‑series forecasting for resource reclamation, time‑series anomaly detection for external network quality, and alarm correlation for I/O alerts. For example, anomaly detection replaces fixed thresholds with dynamic, per‑link thresholds generated by clustering and ensemble models (Isolation Forest, EWMA + 3σ, etc.).
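
The EWMA + 3σ component can be sketched in a few lines. This is only one member of the ensemble mentioned above, not the clustering or Isolation Forest parts, and the smoothing factor `alpha`, the `warmup` length, and the sample latency values are assumptions chosen for illustration. In a per‑link setup, one such detector would run for each link.

```python
import math
from typing import Iterable, List, Tuple


def ewma_anomalies(
    values: Iterable[float],
    alpha: float = 0.3,   # assumed smoothing factor
    k: float = 3.0,       # band width in standard deviations (the "3 sigma")
    warmup: int = 5,      # assumed number of points before flagging starts
) -> List[Tuple[int, float]]:
    """Flag points that fall outside mean ± k·std, where mean and variance
    are exponentially weighted and updated as new data arrives."""
    mean = None
    var = 0.0
    flagged: List[Tuple[int, float]] = []
    for i, x in enumerate(values):
        if mean is None:
            mean = x  # seed the average with the first observation
            continue
        deviation = x - mean
        if i >= warmup and abs(deviation) > k * math.sqrt(var):
            flagged.append((i, x))
            continue  # skip the update so the outlier does not widen the band
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return flagged


latency_ms = [20, 21, 19, 20, 22, 20, 21, 95, 20, 19]
print(ewma_anomalies(latency_ms))  # flags the spike at index 7
```

The threshold here moves with the data: a link that normally sits at 20 ms and one that normally sits at 200 ms each get a band derived from their own recent history, which is what replaces the fixed threshold.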

Processing 100,000 data points per minute requires a pipeline that loads models in ~10 s, processes each point within 100 ms, and distributes load across a 10‑node high‑performance cluster using batch map‑reduce semantics.
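
A back‑of‑the‑envelope check shows what these numbers imply. The sketch assumes that 100 ms is the time one worker spends on one point and that work is spread evenly across nodes; neither assumption is stated in the source.

```python
import math
from typing import Tuple


def required_concurrency(
    points_per_minute: int, per_point_latency_s: float, nodes: int
) -> Tuple[float, int, int]:
    """Return (arrival rate per second, points in flight, workers per node).
    Points in flight = arrival rate × time per point (Little's law)."""
    arrival_rate = points_per_minute / 60.0
    in_flight = math.ceil(arrival_rate * per_point_latency_s)
    per_node = math.ceil(in_flight / nodes)
    return arrival_rate, in_flight, per_node


rate, in_flight, per_node = required_concurrency(100_000, 0.100, 10)
print(f"{rate:.0f} points/s, {in_flight} in flight, {per_node} workers per node")
# prints "1667 points/s, 167 in flight, 17 workers per node"
```

Under these assumptions, each of the 10 nodes needs roughly 17 concurrent workers to keep up, which is why batching and keeping models loaded in memory matter more than the raw per‑point latency.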

Labeling Platform

Model outputs are continuously refined through a labeling platform where operators confirm or reject detected anomalies, feeding feedback back to offline training pipelines to improve model accuracy.
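
One simple way to use this feedback is to track how often operators confirm what the model flagged. The sketch below assumes each review is a confirm‑or‑reject decision on one detected anomaly; the `Review` type and `summarize_reviews` function are hypothetical names, not part of 360's platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Review:
    anomaly_id: str   # hypothetical identifier of a detected anomaly
    confirmed: bool   # True if the operator agreed it was a real anomaly


def summarize_reviews(reviews: List[Review]) -> Dict[str, Union[List[str], Optional[float]]]:
    """Split reviews into confirmed and rejected detections and compute
    precision = confirmed / all reviewed detections."""
    confirmed = [r.anomaly_id for r in reviews if r.confirmed]
    rejected = [r.anomaly_id for r in reviews if not r.confirmed]
    precision = len(confirmed) / len(reviews) if reviews else None
    return {"confirmed": confirmed, "rejected": rejected, "precision": precision}


summary = summarize_reviews([
    Review("a1", True),
    Review("a2", True),
    Review("a3", False),
    Review("a4", True),
])
print(summary["precision"])  # 0.75
```

Rejected detections are the false positives; feeding them back into offline training as labeled negatives is what lets the next model version avoid repeating them.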

Self‑Healing Platform

Self‑healing actions target high‑frequency scenarios such as machine downtime, fault ticketing, and process restarts. Built on a customized StackStorm framework, atomic actions are composed into workflows, allowing users to assemble new remediation procedures via a graphical interface.
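
The idea of composing atomic actions into a workflow can be sketched in plain Python. This does not use the StackStorm API; the action functions, the `Context` dictionary, and `run_workflow` are hypothetical stand‑ins, and the actions only modify an in‑memory dictionary rather than touching a real process.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Action = Callable[[Context], bool]  # an atomic action returns True on success


def stop_process(ctx: Context) -> bool:
    ctx["running"] = False  # simulated; a real action would stop the service
    return True


def start_process(ctx: Context) -> bool:
    ctx["running"] = True   # simulated; a real action would start the service
    return True


def verify_running(ctx: Context) -> bool:
    return bool(ctx.get("running"))


def run_workflow(steps: List[Tuple[str, Action]], ctx: Context) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.
    Returns (overall success, names of the steps that completed)."""
    completed: List[str] = []
    for name, action in steps:
        if not action(ctx):
            return False, completed
        completed.append(name)
    return True, completed


restart_workflow = [
    ("stop", stop_process),
    ("start", start_process),
    ("verify", verify_running),
]
ok, done = run_workflow(restart_workflow, {"process": "nginx", "running": False})
print(ok, done)
```

Because each action has the same signature, a new remediation procedure is just a different ordered list of existing actions, which is the property a graphical workflow editor relies on.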

Summary

The complete pipeline (operational big data → AI Center → Labeling Platform → Alert Self‑Healing → Operations Dashboard) constitutes 360’s internal AIOps framework. The team aims to modularize each component for broader adoption across the organization and potentially to offer it as a service to external teams.

Tags: Monitoring, Big Data, Machine Learning, Operations, AIOps, Self-healing
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
