How 360 Leverages AIOps to Transform Operations and Cut Costs
The article describes 360's evolution of its operations platform, the integration of AIOps for efficiency, cost reduction, and stability, and details practical implementations such as resource recycling, intelligent MySQL scheduling, anomaly detection, and lessons learned for sustainable AI‑driven operations.
Background
360's operations system consists of a foundational platform that includes resource management and workflow tickets, built on the private cloud HULK which provides virtualization, databases, middleware, and SaaS services. Above this, the cloud platform supports two main business lines: the security line and the revenue‑generating line.
360 Operations System Changes
Standardization began in 2012; without good standardization, AIOps data reduction is impossible. From 2012 to 2016 the focus was on fine‑grained, platform‑based data collection on the private cloud. In 2016 and 2017 visualization work started, and a big‑data operations platform was created by aggregating monitoring and business‑level metrics. By 2018 the single‑point applications had been linked into an intelligent closed loop, in which AI‑assisted decision making improves developer and operations productivity.
Overview of 360 Operations
360 manages hundreds of IDC sites, over 100,000 servers, and EB‑level security data with billions of requests. The architecture is decoupled and layered, offering many PaaS and SaaS services and supporting diverse business systems such as search, video, information flow, e‑commerce, and finance.
360's Thinking on AIOps
The three focus areas are efficiency, cost, and stability. For efficiency, the team pursues intelligent operations, baseline anomaly detection, alarm convergence, disk‑failure prediction, auto‑healing, and automated restarts. For cost, they target budget control, server acceleration, resource reclamation, and intelligent MySQL port scheduling.
The AIOps team is organized like a medical team: diagnosticians who understand business and data, developers with big‑data and programming skills, and algorithm experts with engineering experience to build platforms that fuse business, data, and AI.
Intelligent Operations Practice
Practice Background
Many internal teams have unique data and want to explore intelligent operations but lack visibility into others' data. 360 provides a shared data platform and algorithm support to accelerate AIOps development across the company.
Resource Recycling System
The process includes six steps: data governance of monitoring items, capacity forecasting using intelligent methods, machine profiling, manual verification of the profiles, notification to business lines, and a feedback loop for model improvement. Capacity forecasts combine machine‑level monitoring predictions with quantitative historical analysis, and several algorithms are compared for each metric.
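The idea of combining a forward-looking prediction with historical analysis can be sketched as follows. This is a minimal illustration, not 360's implementation: the least-squares trend, the 95th-percentile check, the 30-day horizon, the 20% CPU limit, and all function names are assumptions chosen for the example.

```python
from statistics import quantiles

def linear_forecast(series, horizon):
    """Fit a least-squares line to the series and extrapolate `horizon` steps past the last point."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + horizon)

def is_reclaim_candidate(daily_peak_cpu, horizon=30, limit=20.0):
    """Flag a machine only if both the forecast and the historical p95 stay under the limit."""
    forecast = linear_forecast(daily_peak_cpu, horizon)   # prediction part
    p95 = quantiles(daily_peak_cpu, n=20)[-1]             # historical part (95th percentile)
    return forecast < limit and p95 < limit

idle_host = [5.0, 6.0, 5.5, 6.5, 5.0, 6.0, 5.5]      # made-up daily peak CPU, in percent
growing_host = [5.0, 8.0, 11.0, 14.0, 17.0]          # low today, but trending upward
print(is_reclaim_candidate(idle_host), is_reclaim_candidate(growing_host))
```

Requiring both conditions means a machine that looks idle today but is trending upward is not recommended for reclamation.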
Machine classification uses models such as BPANN, SVM, and decision trees, with the decision tree achieving 99% accuracy. Classified hosts are stored in a database for reclamation recommendations, and a labeling system allows operations engineers to refine training data.
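To make the decision-tree step concrete, here is a toy CART-style classifier built from scratch with Gini impurity. It is a sketch under stated assumptions: the article does not say which tree algorithm, features, or labels 360 used, so the feature tuple (CPU %, memory %, network Mbps), the labels, and the sample hosts are all invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split; a leaf is the majority label (a string), a node is a tuple."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical labeled hosts: (cpu_pct, mem_pct, net_mbps)
hosts = [(3, 10, 1), (5, 12, 2), (4, 8, 1), (60, 70, 200), (75, 65, 300), (55, 80, 150)]
labels = ["idle", "idle", "idle", "busy", "busy", "busy"]
tree = build_tree(hosts, labels)
print(predict(tree, (4, 9, 1)), predict(tree, (70, 60, 250)))
```

In this picture, the labeling system mentioned above would feed corrected labels back into `labels` before the tree is rebuilt.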
To improve email response rates, the team personalizes reclamation notifications with gender‑specific sender names and tone, making messages more engaging for business owners.
MySQL Intelligent Scheduling System
Many internet companies waste MySQL resources. 360 monitors database metrics to profile ports and classifies them into four categories: low I/O, compute‑intensive, disk‑intensive, and mixed. A training set of over 900 instances was manually labeled, and a neural network with 7 input features, 14 hidden units, and 4 output classes achieved 95% test accuracy.
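The 7‑14‑4 network shape can be illustrated with a plain-Python forward pass. Only the layer sizes and the four class names come from the article; the tanh and softmax activations, the random untrained weights, and the feature values are assumptions for the sketch, so the predicted class here is meaningless until the weights are trained on labeled instances.

```python
import math
import random

CLASSES = ["low_io", "compute_intensive", "disk_intensive", "mixed"]

def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, layer):
    """Affine transform: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_layer, output_layer):
    """7 inputs -> 14 tanh hidden units -> 4 class probabilities."""
    hidden = [math.tanh(v) for v in dense(features, hidden_layer)]
    return softmax(dense(hidden, output_layer))

rng = random.Random(0)
hidden_layer = init_layer(7, 14, rng)
output_layer = init_layer(14, 4, rng)

features = [0.2, 0.1, 0.7, 0.05, 0.3, 0.9, 0.4]  # seven made-up normalized metrics
probs = forward(features, hidden_layer, output_layer)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

With biases included, a network of this shape has 7×14 + 14 + 14×4 + 4 = 172 trainable parameters, which is small enough to train on a few hundred labeled instances.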
The scheduling aims to minimize migration count, reduce impact on services, keep primary databases stable, and respect black‑list constraints. Early results show that 14 out of 30 high‑spec database machines can be reclaimed.
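One way to picture these constraints is a greedy consolidation plan. The sketch below is hypothetical and is not 360's scheduler: it assumes that "keeping primaries stable" means a primary is never migrated, treats blacklisted instances the same way, and uses a single scalar `load` per instance. All class and function names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    port: int
    load: float
    is_primary: bool = False
    blacklisted: bool = False

@dataclass
class Host:
    name: str
    capacity: float
    instances: list = field(default_factory=list)

    @property
    def used(self):
        return sum(i.load for i in self.instances)

def plan_reclaim(hosts):
    """Greedily empty hosts, trying those with the fewest instances first."""
    migrations, reclaimed, receivers = [], [], set()
    by_name = {h.name: h for h in hosts}
    # Fewer instances on a host means fewer migrations are needed to empty it.
    for donor in sorted(hosts, key=lambda h: len(h.instances)):
        if donor.name in receivers:
            continue  # never move an instance twice
        if any(i.is_primary or i.blacklisted for i in donor.instances):
            continue  # pinned instances stay put, so this host cannot be emptied
        free = {h.name: h.capacity - h.used
                for h in hosts if h is not donor and h.name not in reclaimed}
        moves = []
        for inst in sorted(donor.instances, key=lambda i: -i.load):
            fits = [n for n, f in free.items() if f >= inst.load]
            if not fits:
                break  # cannot empty this host; discard the tentative moves
            dest = min(fits, key=lambda n: free[n])  # best fit: tightest host that still fits
            free[dest] -= inst.load
            moves.append((inst, dest))
        else:
            # Every instance found a destination, so commit the plan for this host.
            for inst, dest in moves:
                by_name[dest].instances.append(inst)
                receivers.add(dest)
                migrations.append((inst.port, donor.name, dest))
            donor.instances = []
            reclaimed.append(donor.name)
    return migrations, reclaimed

hosts = [
    Host("A", 10, [Instance(3306, 2, is_primary=True)]),
    Host("B", 10, [Instance(3307, 1), Instance(3308, 2)]),
    Host("C", 10, [Instance(3309, 3), Instance(3310, 2), Instance(3311, 1)]),
]
migrations, reclaimed = plan_reclaim(hosts)
print(migrations, reclaimed)
```

A production scheduler would also need to weigh the I/O, compute, and disk profiles from the classification step, so that complementary instances are packed together rather than by load alone.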
Anomaly Detection and Alarm Convergence
Traditional detection uses static thresholds; 360 adds multi‑trigger conditions, alarm convergence, and severity weighting based on business importance. Using Apriori‑style association rules, they reduced correlated alarms by 60‑80%.
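Association-rule mining over alarm windows can be sketched as below. This is a simplified, pair-only variant of Apriori, written as an illustration rather than as 360's method: the alarm names, the thresholds, and the rule that a consequent alarm is suppressed whenever its antecedent fires in the same window are all assumptions.

```python
from collections import Counter
from itertools import combinations

def mine_rules(windows, min_support=0.4, min_confidence=0.8):
    """Return {(antecedent, consequent): confidence} for frequent alarm pairs."""
    n = len(windows)
    item_counts = Counter(a for w in windows for a in w)
    # Apriori pruning: only alarms that are frequent on their own can form frequent pairs.
    frequent = {a for a, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair for w in windows for pair in combinations(sorted(w & frequent), 2)
    )
    rules = {}
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = c / item_counts[antecedent]
            if confidence >= min_confidence:
                rules[(antecedent, consequent)] = confidence
    return rules

def converge(window, rules):
    """Drop alarms that are implied by another alarm present in the same window."""
    suppressed = set()
    for antecedent, consequent in rules:
        if antecedent in window and consequent in window:
            mutual = (consequent, antecedent) in rules
            # For mutually implied alarms, keep the lexicographically smaller one.
            if not mutual or antecedent < consequent:
                suppressed.add(consequent)
    return window - suppressed

# Each set holds the alarm types that fired within one time window (made-up data).
windows = [
    {"switch_down", "host_unreachable"},
    {"switch_down", "host_unreachable"},
    {"switch_down", "host_unreachable", "disk_full"},
    {"host_unreachable"},
    {"disk_full"},
]
rules = mine_rules(windows)
print(rules)
print(converge({"switch_down", "host_unreachable", "disk_full"}, rules))
```

Here every `switch_down` window also contains `host_unreachable`, so the second alarm adds no information when both fire and can be folded into the first.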
Root‑cause analysis combines multiple monitoring items within a time window and applies information‑gain metrics to identify key factors, achieving around 80% accuracy in online validation.
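Information gain for ranking candidate factors can be computed as shown below. The sketch assumes each time window has been reduced to boolean "abnormal" flags per monitoring item plus a fault label; the item names and sample data are invented, and the article does not describe how 360 encodes its windows.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, item):
    """Reduction in label entropy after splitting the windows on one monitoring item."""
    gain = entropy(labels)
    for value in {r[item] for r in records}:
        subset = [y for r, y in zip(records, labels) if r[item] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def rank_items(records, labels):
    """Monitoring items sorted from most to least informative."""
    return sorted(records[0], key=lambda i: information_gain(records, labels, i), reverse=True)

# One record per time window: was each monitoring item abnormal in that window?
records = [
    {"cpu_high": True,  "disk_io_high": True},
    {"cpu_high": False, "disk_io_high": True},
    {"cpu_high": True,  "disk_io_high": False},
    {"cpu_high": False, "disk_io_high": False},
]
labels = ["fault", "fault", "ok", "ok"]
print(rank_items(records, labels))
```

In this toy data `disk_io_high` separates faulty windows perfectly (gain of 1 bit) while `cpu_high` carries no information (gain of 0), so the former is ranked as the key factor.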
Experience and Summary
Key lessons include selecting an appropriate data store for each workload (InfluxDB for time‑series metrics, MongoDB for logs, MySQL for relational data, Redis for counters, Elasticsearch for search). Project management follows the PDCA cycle (plan, do, check, act), with continuous iteration to keep the project failure rate low.
The operations big‑data platform is layered: raw layer (raw metrics), detail layer (standardized, business‑tagged data), aggregation layer (domain‑specific AIOps data), and application layer (use‑case implementations). This architecture stems from 360's big‑data platform department.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
