AIOps Practices at 360: Cost Reduction, Efficiency Gains, and Intelligent Operations
This article presents 360's AIOps project, detailing how AI-driven capacity forecasting, host classification, resource recycling, intelligent MySQL scheduling, anomaly detection, alarm convergence, and root‑cause analysis have saved millions, improved efficiency, and paved the way for a fully automated operations workflow.
At the beginning of the year, the 360 Operations Development team launched an AIOps project that has already saved the company 35 million yuan by reducing operational costs and improving efficiency.
The presentation, delivered by machine‑learning engineer Ji Xinpǔ in the 168th dbaplus online session, covers background, 360's thinking on AIOps, practical solutions, and lessons learned.
Background
With the explosive growth of internet hardware and software, operations staff are expected to maintain 24/7 reliability, which is practically impossible without automation. The team explores whether a "machine brain" can replace human operators by applying AI algorithms to historical data.
360's Thoughts on AIOps
AIOps scenarios include anomaly detection, root‑cause analysis, self‑healing, and capacity prediction. 360 categorizes AIOps into three focus areas: cost, efficiency, and stability, requiring collaboration among operations engineers, ops developers, and machine‑learning engineers.
AIOps Practice Solutions
1. Foundations
Data Accumulation – Over two years, big‑data engineers collected machine‑level, network, log, and process data to build a solid foundation for analysis and model training.
2. Capacity Forecasting
Historical data is used to predict key metrics (CPU, memory, network, disk, connection count). Different time‑series patterns require different forecasting models, including a custom periodicity detection model.
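The article does not describe how the custom periodicity detection model works. As a rough illustration only, one common way to check whether a metric is periodic is to look for the first strong peak in its autocorrelation function. The sketch below makes that assumption; the function names, the 0.5 threshold, and the synthetic hourly_cpu series are all hypothetical and are not taken from 360's implementation.

```python
import math


def autocorrelation(series, lag):
    """Biased sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a flat series carries no periodic signal
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def detect_period(series, threshold=0.5):
    """Return the first lag that is a local autocorrelation peak above
    `threshold`, or None if no such lag exists (assumed heuristic)."""
    max_lag = len(series) // 2
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 1)]
    for lag in range(1, max_lag):
        if acf[lag] > threshold and acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None


# Synthetic example: ten days of hourly CPU usage with a daily cycle.
hourly_cpu = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
print(detect_period(hourly_cpu))
```

A detected period could then steer the choice of forecasting model, for example a seasonal model for periodic metrics and a trend model otherwise; that routing is likewise an assumption rather than something the source spells out.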
3. Host Classification
Machine‑learning classifiers (SVM, decision trees, etc.) are used to identify idle hosts and categorize machines (CPU‑intensive, storage‑intensive, etc.) based on monitoring features.
4. Projects
Resource Recycling – Forecasts the five key indicators (CPU, memory, network, disk, connection count), classifies idle machines, and notifies their owners so the resources can be reclaimed, which significantly improved utilization.
MySQL Intelligent Scheduling – Uses a BP neural network to classify instances (low‑cost, compute‑type, storage‑type, hybrid) and a decision‑tree model to classify hosts, then schedules instances while respecting constraints such as migration count, master‑instance limits, and blacklist avoidance.
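The constraint-checking part of such a scheduler can be sketched as a simple greedy pass. This is only an illustration under stated assumptions: the Instance and Host types, the rule that an instance should sit on a host of the same kind, and the first-fit target choice are all hypothetical. The classifier outputs (BP neural network, decision tree) are taken as given inputs here, and the article does not say which placement algorithm 360 actually uses.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    kind: str        # label assumed to come from the instance classifier
    is_master: bool
    host: str        # name of the host the instance currently runs on


@dataclass
class Host:
    name: str
    kind: str        # label assumed to come from the host classifier
    max_masters: int
    masters: int = 0


def plan_migrations(instances, hosts, blacklist, max_migrations):
    """Greedy first-fit plan: move mismatched instances to matching hosts
    while respecting the migration cap, master limits, and the blacklist."""
    masters = {h.name: h.masters for h in hosts.values()}  # local working copy
    plan = []
    for inst in instances:
        if len(plan) >= max_migrations:
            break  # constraint: bounded number of migrations
        if hosts[inst.host].kind == inst.kind:
            continue  # already on a matching host
        for target in hosts.values():
            if target.name in blacklist or target.kind != inst.kind:
                continue  # constraint: avoid blacklisted hosts
            if inst.is_master and masters[target.name] >= target.max_masters:
                continue  # constraint: master-instance limit per host
            plan.append((inst.name, inst.host, target.name))
            if inst.is_master:
                masters[target.name] += 1
                masters[inst.host] -= 1
            break
    return plan


hosts = {
    "h1": Host("h1", "storage", max_masters=2, masters=2),
    "h2": Host("h2", "compute", max_masters=2),
    "h3": Host("h3", "compute", max_masters=1),
}
instances = [
    Instance("db1", "compute", True, "h1"),
    Instance("db2", "compute", True, "h1"),
    Instance("db3", "compute", False, "h1"),
]
plan = plan_migrations(instances, hosts, blacklist={"h2"}, max_migrations=2)
print(plan)
```

In this toy run db2 stays put because the only eligible host has already reached its master limit, which shows how the constraints can leave some instances unscheduled.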
5. Anomaly Detection
Multiple algorithms (3σ rule, curve fitting, CNN/RNN, isolation forest, statistical methods) are combined; an alarm is raised only if a majority of models flag an anomaly. The solution achieved >95% accuracy on LVS traffic data.
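The majority-vote idea can be sketched with three lightweight statistical detectors. Only the 3σ rule is named in the article; the MAD and IQR detectors below are stand-ins chosen to keep the example dependency-free, not the CNN/RNN or isolation-forest models the team used. All function names, thresholds, and sample values are hypothetical.

```python
import statistics


def three_sigma(history, value):
    """3-sigma rule: flag values more than 3 standard deviations from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return std > 0 and abs(value - mean) > 3 * std


def mad_detector(history, value, cutoff=3.5):
    """Modified z-score based on the median absolute deviation (stand-in)."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    return mad > 0 and 0.6745 * abs(value - med) / mad > cutoff


def iqr_detector(history, value, k=1.5):
    """Tukey fences built from the interquartile range (stand-in)."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    spread = q3 - q1
    return value < q1 - k * spread or value > q3 + k * spread


DETECTORS = (three_sigma, mad_detector, iqr_detector)


def is_anomaly(history, value, detectors=DETECTORS):
    """Raise an alarm only when a strict majority of detectors agree."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes * 2 > len(detectors)


traffic = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
print(is_anomaly(traffic, 150), is_anomaly(traffic, 101))
```

Voting trades some sensitivity for fewer false alarms: a single noisy detector cannot page anyone on its own.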
6. Alarm Convergence
The Apriori algorithm extracts frequent itemsets from historical alarms and turns them into association rules of the form A→B (when alarm A fires, alarm B usually accompanies it). Combined with expert knowledge, these rules reduce duplicate alerts by 60-80%.
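A minimal sketch of this rule mining is shown below. It assumes alarms have already been grouped into time windows (each window is one transaction) and only derives rules between single alarms. The alarm names, the windowing, and the support and confidence thresholds are hypothetical and are not taken from the talk.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: return {itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: cnt / n for c, cnt in counts.items()
                    if cnt / n >= min_support}
        result.update(frequent)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))}
    return result


def pair_rules(transactions, min_support, min_confidence):
    """Derive A -> B rules between single alarms from frequent pairs."""
    support = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, sup in support.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        for ante, cons in ((a, b), (b, a)):
            confidence = sup / support[frozenset([ante])]
            if confidence >= min_confidence:
                rules.append((ante, cons, confidence))
    return sorted(rules)


# Hypothetical alarm windows; each set holds the alarms seen together.
windows = [
    {"disk_full", "mysql_down"},
    {"disk_full", "mysql_down", "high_load"},
    {"disk_full", "mysql_down"},
    {"high_load"},
    {"mysql_down"},
]
rules = pair_rules(windows, min_support=0.5, min_confidence=0.9)
print(rules)
```

A rule such as disk_full → mysql_down suggests the second alarm could be folded into the first when both fire in the same window; in practice an operator would presumably review each rule before it suppresses anything, which is where the expert knowledge comes in.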
7. Root‑Cause Analysis
For each alarm event, the team applies the method from the 2014 SIGKDD paper "Correlating Events with Time Series for Incident Diagnosis" to filter relevant metrics, selects top‑k features by information‑gain ratio, and classifies using XGBoost, achieving high accuracy.
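The feature-selection step can be illustrated with a small information-gain-ratio ranking. The sketch assumes each candidate metric has already been discretized into a flag per alarm event (for example, "was this metric abnormal around the event"); the metric names, labels, and data are hypothetical. The correlation test from the cited paper and the XGBoost classifier that would consume the selected features are not reproduced here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def top_k_features(table, labels, k):
    """Rank feature columns by gain ratio and keep the k best names."""
    scores = {name: gain_ratio(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical events: the root-cause label and three discretized metrics.
root_causes = ["db", "db", "net", "net"]
metrics = {
    "cpu_high": [1, 1, 0, 0],
    "disk_busy": [1, 0, 1, 0],
    "pkt_loss": [0, 0, 0, 1],
}
selected = top_k_features(metrics, root_causes, k=2)
print(selected)
```

Dividing by the split information penalizes features with many distinct values, which is why the gain ratio is often preferred over raw information gain for this kind of ranking.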
Experience and Summary
After nearly a year of effort, 360 has realized tangible benefits in single‑point applications. Planned future work includes alarm‑level root‑cause localization; open‑source components for capacity forecasting, anomaly detection, and alarm correlation; and an operations chatbot that closes the detection‑analysis‑resolution loop.
Live replay link: https://m.qlchat.com/topic/details?topicId=2000002350036659&tracePage=liveCenter
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.
