
How 360 Built an AI‑Powered Ops System to Cut Costs and Boost Efficiency

360's AI-ops team shares a year-long journey of turning massive operational data into intelligent solutions. The article covers the background, the team's AIOps philosophy, and practical modules (capacity forecasting, host classification, resource reclamation, smart MySQL scheduling, anomaly detection, alarm reduction, and root-cause analysis), all aimed at improving cost, efficiency, and reliability.

dbaplus Community

Background

Rapid growth of internet services creates continuously evolving architectures, demanding 24/7 reliability that is infeasible without automation. 360 therefore built an AIOps “machine brain” to reduce manual operations, lower workload, and improve efficiency.

360’s View on AIOps

AIOps scenarios include anomaly detection, root‑cause analysis, self‑healing, and capacity forecasting. 360 classifies AIOps goals into three dimensions:

Cost reduction: AI-driven resource saving and intelligent scheduling.

Efficiency improvement: AI-based problem discovery, analysis, and automated resolution.

Stability enhancement: Proactive prevention of incidents.

Successful AIOps projects require three roles working together: operations engineers, operations developers, and machine‑learning engineers.

AIOps Foundations

Data Accumulation

Before launching AIOps, a large volume of heterogeneous data must be collected: machine‑level metrics, network flow, logs, and process information. A dedicated big‑data team spent over two years gathering and normalizing this data to support downstream analysis and model training.

Capacity Forecasting

Historical time‑series data enable prediction of key monitoring items. Series are categorized by volatility (stable vs. volatile) and periodicity (periodic vs. non‑periodic), requiring different forecasting models. Several approaches were evaluated, including a custom periodicity‑detection model and standard statistical or machine‑learning methods. The selected models will be open‑sourced.
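The periodic vs. non-periodic split can be illustrated with a small autocorrelation check. This is only a sketch under stated assumptions, not 360's custom periodicity-detection model: the function names, the `threshold` of 0.5, and the lag search range are all hypothetical choices made for illustration.

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    mu = mean(series)
    var = sum((x - mu) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series has no periodic structure to measure
    cov = sum((series[i] - mu) * (series[i + lag] - mu)
              for i in range(len(series) - lag))
    return cov / var

def detect_period(series, min_lag=2, max_lag=None, threshold=0.5):
    """Return the lag with the highest autocorrelation above `threshold`.

    Returns None when no lag qualifies, i.e. the series is treated as
    non-periodic. `threshold` is an assumed cut-off, not a value from the article.
    """
    max_lag = max_lag or len(series) // 2
    best_lag, best_score = None, threshold
    for lag in range(min_lag, max_lag + 1):
        score = autocorrelation(series, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

A series for which `detect_period` returns a lag would be routed to a seasonal forecasting model, while a `None` result would send it to a non-periodic one.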

Host Classification

Classification tasks such as identifying idle hosts or categorizing machines (CPU‑intensive, disk‑intensive, memory‑intensive) are solved with standard ML classifiers (e.g., SVM, decision trees) after feature engineering.

AIOps Projects

Resource Reclamation

The system forecasts five key metrics (CPU usage, memory usage, network traffic, disk usage, connection count), extracts features from them, and classifies each host as idle or in use. Owners of hosts identified as idle are notified so the machines can be reclaimed.
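The feature-extraction step might look roughly like the sketch below. The metric names, the choice of mean and peak as features, and the `looks_idle` threshold rule are assumptions for illustration; in particular, the threshold rule is only a stand-in for the trained classifier the article describes, and all values are assumed to be normalized to the range 0..1.

```python
from statistics import mean

# Hypothetical short names for the five metrics listed in the article.
METRICS = ("cpu", "mem", "net", "disk", "conn")

def extract_features(samples):
    """Flatten per-metric series into mean and peak features.

    `samples` maps each metric name to a list of normalized readings.
    """
    features = {}
    for name in METRICS:
        series = samples[name]
        features[f"{name}_mean"] = mean(series)
        features[f"{name}_max"] = max(series)
    return features

def looks_idle(features, max_peak=0.10):
    """Stand-in for the trained classifier: every metric's peak stays below `max_peak`."""
    return all(features[f"{name}_max"] < max_peak for name in METRICS)
```

In a real pipeline the feature dictionary would be fed to a trained model rather than a fixed threshold, and hosts flagged as idle would go to their owners for confirmation.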

Negative‑sample scarcity was mitigated by combining manually labeled data with user‑generated labels.

MySQL Intelligent Scheduling

Instances are classified by a BP neural network (7 input features → 4 categories: low‑consumption, compute‑oriented, storage‑oriented, hybrid). Hosts are classified by a decision‑tree model using five host metrics. The scheduler respects constraints:

Minimize migration count.

Avoid primary‑node switches.

Limit the number of primary instances per host (≤5) and the total number of instances per host.

Prevent same‑port instances on a single host.

Exclude black‑listed machines.
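The hard constraints in this list can be expressed as a placement check, sketched below. The `Instance` and `Host` types, the `can_place` function, and the `MAX_INSTANCES` value are hypothetical; the article gives only the limit of five primaries per host. The two optimization goals (fewest migrations, no primary switches) belong to the scheduler's search and are not covered here.

```python
from typing import NamedTuple, Tuple

class Instance(NamedTuple):
    port: int
    is_primary: bool

class Host(NamedTuple):
    name: str
    instances: Tuple[Instance, ...] = ()

MAX_PRIMARIES = 5    # limit stated in the article
MAX_INSTANCES = 20   # assumed value; the article gives no figure

def can_place(instance, host, blacklist=frozenset()):
    """Return True if `instance` may be migrated onto `host`."""
    if host.name in blacklist:
        return False  # black-listed machines are excluded
    if len(host.instances) >= MAX_INSTANCES:
        return False  # total-instance cap reached
    if any(i.port == instance.port for i in host.instances):
        return False  # no two instances on the same port
    primaries = sum(i.is_primary for i in host.instances)
    if instance.is_primary and primaries >= MAX_PRIMARIES:
        return False  # primary cap reached
    return True
```

A scheduler could call `can_place` to filter candidate hosts before ranking the remaining ones by how few migrations the resulting plan needs.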

In a test on one data center, the scheduler brought the number of migrations down to 45 and made 14 of 30 high-load machines usable.

Anomaly Detection

For LVS traffic, several detectors are combined: a 3σ statistical rule, curve fitting, CNN/RNN models, and isolation forest. Their outputs feed a voting ensemble, and an anomaly is flagged when a majority of models agree.
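The 3σ rule and the majority vote are simple enough to sketch directly. The function names below are hypothetical, and the other detectors (curve fitting, CNN/RNN, isolation forest) are represented only by their boolean votes. Treating a tie as "not anomalous" is an assumption; the article does not say how ties are handled.

```python
from statistics import mean, pstdev

def three_sigma(history, value):
    """Flag `value` if it lies more than three standard deviations from the mean of `history`."""
    mu, sigma = mean(history), pstdev(history)
    return abs(value - mu) > 3 * sigma

def majority_vote(votes):
    """True only when strictly more than half of the detectors voted 'anomaly'."""
    return sum(votes) > len(votes) / 2

def is_anomaly(history, value, other_votes=()):
    """Combine the 3-sigma vote with votes supplied by the other detectors."""
    votes = [three_sigma(history, value), *other_votes]
    return majority_vote(votes)
```

For example, with `history = [100, 102, 98, 101, 99]` the 3σ rule flags a reading of 150, but `is_anomaly` reports an anomaly only if at least one of two other detectors agrees.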

Online evaluation over six months achieved >95% detection accuracy.

Alarm Convergence

Historical alarm analysis using the Apriori algorithm mined frequent itemsets and generated rules of the form A→B. When an upstream alarm A fires, downstream alarm B can be suppressed, dramatically reducing alarm volume.
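A reduced version of this rule mining, limited to pairs of alarms, is sketched below. It is an illustration rather than 360's implementation: the function name, the support and confidence thresholds, and the idea of grouping alarms into fixed time windows are assumptions. Note that confidence captures co-occurrence, not causality, so mined rules would still need review before being used for suppression.

```python
from collections import Counter
from itertools import combinations

def mine_rules(windows, min_support=0.3, min_confidence=0.8):
    """Mine A -> B rules from alarm co-occurrence.

    `windows` is a list of sets, each holding the alarm names that fired
    in the same time window. Returns (lhs, rhs, confidence) tuples.
    """
    n = len(windows)
    item_counts = Counter(alarm for w in windows for alarm in w)
    # Apriori pruning: a pair can only be frequent if both of its items are.
    frequent = {a for a, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for w in windows:
        for pair in combinations(sorted(w & frequent), 2):
            pair_counts[pair] += 1
    rules = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules
```

Given a rule such as `("switch_down", "host_down", 1.0)`, a `host_down` alarm could be suppressed while `switch_down` is active.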

The rule base, combined with business‑level rating, cut alarm volume by 60‑80% over six months.

Root‑Cause Analysis of Alarm Events

Six major alarm categories (host alive, disk usage, disk read‑only, CPU idle, memory usage, disk I/O) were analyzed. For each alarm, the SIGKDD 2014 method was used to correlate events with time‑series metrics, selecting the top‑k features by information‑gain ratio, then classifying with XGBoost.
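The feature-ranking step can be illustrated with a small information-gain-ratio calculation. This sketch covers only the ranking, assumes each metric has already been discretized into bins such as "hi" and "lo", and uses hypothetical function names; the event-correlation step and the XGBoost classifier are not shown.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's own entropy."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0

def top_k(features, labels, k):
    """Names of the k features with the highest gain ratio."""
    scores = {name: gain_ratio(vals, labels) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `labels` would mark whether the alarm fired in each sample, and `features` would map metric names such as `cpu.idle` to their binned values over the same samples.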

Example: the top‑5 metrics correlated with a host‑alive alarm are cpu.idle, net.if.total.bits.sum, mem.memused.percent, mem.swapused.percent, and ss.closed, narrowing the troubleshooting scope.

Experience and Summary

After nearly a year of effort, 360 achieved notable results in several single‑point applications. Future work includes:

Alarm‑level root‑cause localization.

Open‑sourcing core components (capacity forecasting, anomaly detection, alarm correlation).

Building an operations chatbot to close the loop from detection to resolution.

These initiatives aim to continuously reduce operational costs, improve efficiency, and enhance system stability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
