How Ant Group Leverages SLO and AIOps for Fine‑Grained Operations
This article details Ant Group's practical implementation of Service Level Objectives (SLO) and AIOps to achieve fine‑grained operations, covering SLO fundamentals, health‑score architecture, GitOps‑based data pipelines, error‑budget alerting, AI‑driven anomaly detection, fault localization techniques, and real‑world case studies on dashboards, Kubernetes SLOs, and emergency response workflows.
SLO Fundamentals
A Service Level Indicator (SLI) is an observable metric for one dimension of a service, and a Service Level Objective (SLO) sets a quantitative target for that SLI. A Service Level Agreement (SLA) is a contract built on SLOs that specifies penalties for breach. Example: a monthly availability SLO of 99.9% permits about 43 minutes of downtime (0.1% of a 30‑day month).
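The downtime figure follows directly from the target and the window length. A minimal sketch of that arithmetic, assuming a 30-day month; the function name `allowed_downtime_minutes` is illustrative and does not come from the source:

```python
def allowed_downtime_minutes(target: float, window_days: float = 30.0) -> float:
    """Return the downtime, in minutes, that an availability target tolerates.

    target is a fraction such as 0.999; window_days is the SLO window length.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


if __name__ == "__main__":
    # 99.9% over 30 days: 0.001 * 43,200 minutes
    print(round(allowed_downtime_minutes(0.999), 1))
```

Tightening the target to 99.99% over the same window shrinks the allowance to roughly 4.3 minutes, which is why each extra "nine" is costly.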
Key Extensions
SLF (Service Level Factor): additional dimensions injected into SLO data to enable fine‑grained drill‑down during alerts.
SLD (Service Level Dependency): a graph of dependency relationships between SLOs, used for fault‑propagation analysis.
Error Budget: the allowable amount of error (e.g., 0.1% of requests) before the SLO is considered breached. The rate at which the budget is consumed is called the burn rate.
Error‑Budget Alerting: combines fixed‑threshold alerts with error‑budget consumption to produce multi‑severity warnings.
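The relationship between error budget and burn rate can be sketched in a few lines. This is an illustration under stated assumptions rather than Ant Group's implementation: the function names are hypothetical, and budget consumption assumes the error ratio has been constant over the elapsed part of the window.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Return how fast the error budget is being spent.

    A burn rate of 1.0 means the budget lasts exactly one SLO window;
    2.0 means it would be exhausted halfway through.
    """
    budget = 1.0 - target  # e.g. 0.001 for a 99.9% target
    if budget <= 0.0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def budget_consumed(error_ratio: float, target: float,
                    elapsed_fraction: float) -> float:
    """Return the fraction of the budget used after part of the window.

    Assumes (for illustration) a constant error ratio so far.
    """
    if not 0.0 <= elapsed_fraction <= 1.0:
        raise ValueError("elapsed_fraction must be in [0, 1]")
    return burn_rate(error_ratio, target) * elapsed_fraction


if __name__ == "__main__":
    # 0.2% errors against a 99.9% target, one quarter into the window
    print(burn_rate(0.002, 0.999), budget_consumed(0.002, 0.999, 0.25))
```

With a 0.2% error ratio against a 99.9% target, the budget burns at about twice the sustainable rate, so roughly half of it is gone a quarter of the way through the window.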
SLO Health‑Score Architecture
The system is organized into four layers, all built on GitOps and Prometheus:
Target System Layer – the actual services and applications.
Data Layer – collection, processing, storage, and modeling of SLI/SLO data.
Scenario‑Analysis Layer – anomaly detection, fault discovery, root‑cause analysis, and remediation recommendation.
Application Layer – dashboards, emergency pipelines, cost allocation, and downstream integrations.
Implementation Details
SLO definitions are stored as YAML files in a Git repository and deployed to Prometheus via ArgoCD.
Prometheus recording rules compute SLO metrics from raw time‑series data.
Grafana visualizes health dashboards; alerts are routed through phone, email, and DingTalk.
Data Model
Prometheus stores each time series as a stream of single‑value samples (timestamp, value) identified by a set of labels. SLO aggregation collapses label dimensions to produce macro‑level health scores.
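Collapsing label dimensions amounts to summing counters over the labels that are dropped and then taking the ratio. The sketch below illustrates the idea in plain Python; the `(labels, good, total)` sample shape and the function name `collapse` are assumptions for this example, not part of the described system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One sample: (labels, good request count, total request count)
Sample = Tuple[Dict[str, str], float, float]


def collapse(samples: Iterable[Sample],
             keep: Tuple[str, ...]) -> Dict[Tuple[str, ...], float]:
    """Aggregate samples over every label not listed in keep.

    Returns the good/total ratio per remaining label combination.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for labels, good, total in samples:
        key = tuple(labels.get(name, "") for name in keep)
        sums[key][0] += good
        sums[key][1] += total
    # Divide summed counters; groups without traffic are skipped.
    return {key: g / t for key, (g, t) in sums.items() if t > 0}


if __name__ == "__main__":
    data = [
        ({"service": "pay", "zone": "a"}, 990, 1000),
        ({"service": "pay", "zone": "b"}, 95, 100),
        ({"service": "auth", "zone": "a"}, 500, 500),
    ]
    print(collapse(data, ("service",)))  # per-service health
    print(collapse(data, ()))            # single macro-level score
```

Summing counters before dividing weights each group by its traffic; averaging per-zone ratios instead would let a low-traffic zone distort the score.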
AIOps Integration
Machine‑learning techniques are applied to SLO time‑series to improve detection, prediction, and remediation.
Anomaly Detection
Statistical rules (3‑sigma, box‑plot, Tukey, Grubbs, Dixon, t‑test) and ML models flag outliers in SLO streams.
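Two of the statistical rules named above, 3‑sigma and Tukey's box‑plot fences, can be sketched with the Python standard library. The function names and default multipliers are illustrative choices, not values taken from the source.

```python
import statistics
from typing import List, Sequence


def three_sigma_outliers(values: Sequence[float], k: float = 3.0) -> List[int]:
    """Return indices of points more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]


def tukey_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Return indices of points outside the Tukey fences Q1 - k*IQR, Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    availability = [0.998, 0.999, 0.997, 0.999, 0.998, 0.999, 0.998,
                    0.997, 0.999, 0.998, 0.999, 0.998, 0.997, 0.999,
                    0.998, 0.999, 0.998, 0.999, 0.997, 0.998, 0.90]
    print(three_sigma_outliers(availability), tukey_outliers(availability))
```

The 3‑sigma rule is sensitive to the outlier itself, because a large spike inflates the standard deviation. The quartile-based fences are more robust on short windows, which is one reason to run several detectors side by side.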
Time‑Series Prediction
Models such as ARIMA, EMA, Holt‑Winters, STL, RNN, and LSTM forecast future SLO trends to anticipate violations.
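As a small illustration of trend forecasting, the sketch below uses Holt's linear (double exponential) smoothing, the non-seasonal relative of Holt‑Winters, to project an SLO series forward and report when it would cross the target. The smoothing parameters and function names are assumptions chosen for the example; a production system would tune them or use one of the richer models listed above.

```python
from typing import List, Optional, Sequence


def holt_forecast(series: Sequence[float], horizon: int,
                  alpha: float = 0.5, beta: float = 0.3) -> List[float]:
    """Forecast `horizon` steps ahead with Holt's linear trend method."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest level change with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


def first_violation(forecast: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based step at which the forecast first drops below target."""
    for step, value in enumerate(forecast, start=1):
        if value < target:
            return step
    return None


if __name__ == "__main__":
    history = [0.9995, 0.9994, 0.9993, 0.9992, 0.9991]
    projected = holt_forecast(history, horizon=6)
    print(projected, first_violation(projected, target=0.9988))
```

Forecasting the step at which a violation would occur gives operators lead time that a threshold alert on the current value cannot.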
Fault Localization Techniques
Temporal correlation: computes Pearson or cosine similarity on aligned time‑series.
Dependency correlation: traverses the SLD graph to trace fault propagation.
Spatial correlation: uses CMDB or topology data stored in a graph database.
Historical correlation: matches current anomalies against a knowledge base of past incidents via Bayesian inference.
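The first two techniques lend themselves to a compact sketch: rank candidate metrics by Pearson correlation with the failing SLO, and walk the SLD graph to list what the failing SLO depends on. The data shapes (a dict of series, an adjacency dict for the SLD) and all names below are assumptions made for illustration.

```python
import math
from collections import deque
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two aligned, equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must be aligned and have at least two points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(vx * vy)


def rank_by_correlation(target: Sequence[float],
                        candidates: Dict[str, Sequence[float]]
                        ) -> List[Tuple[str, float]]:
    """Order candidate series by absolute correlation with the target."""
    scores = {name: pearson(target, s) for name, s in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


def upstream_dependencies(sld: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk of the SLD graph from a failing SLO.

    sld maps each SLO to the SLOs it directly depends on.
    """
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for dep in sld.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


if __name__ == "__main__":
    sld = {"api": ["scheduler", "etcd"], "scheduler": ["etcd"]}
    print(upstream_dependencies(sld, "api"))
```

In practice the two can be combined: restrict the correlation ranking to the nodes returned by the graph walk, so that only plausible upstream causes are scored.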
Intelligent Decision Making
Localization results are combined with expert knowledge and trained models to automatically select remediation playbooks. Low‑impact actions are executed automatically; high‑impact actions require manual confirmation.
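The split between automatic and confirmed actions can be expressed as a simple gate. The sketch below is hypothetical: the `Playbook` type, the "low"/"high" impact labels, and the `confirm` callback (standing in for a human approval step) are assumptions, not the source's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Playbook:
    name: str
    impact: str                 # "low" or "high" (assumed labels)
    run: Callable[[], None]     # the remediation action itself


def dispatch(playbooks: Iterable[Playbook],
             confirm: Callable[[Playbook], bool]) -> List[str]:
    """Run low-impact playbooks immediately; gate everything else on confirm.

    Returns the names of the playbooks that were executed.
    """
    executed = []
    for playbook in playbooks:
        # Anything not explicitly marked "low" needs a human decision.
        if playbook.impact == "low" or confirm(playbook):
            playbook.run()
            executed.append(playbook.name)
    return executed


if __name__ == "__main__":
    actions = [
        Playbook("restart-cache", "low", lambda: print("cache restarted")),
        Playbook("failover-zone", "high", lambda: print("zone failed over")),
    ]
    # Deny every confirmation request: only the low-impact action runs.
    print(dispatch(actions, confirm=lambda pb: False))
```

Treating any unrecognized impact level as high keeps the gate fail-safe: a mislabeled playbook waits for approval instead of running unattended.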
Building SLO Health from Scratch
Steps to construct the system:
Define SLOs in YAML (name, template, SLI, window, target, error‑budget parameters) and commit to a Git repo.
Configure Prometheus recording rules that calculate the SLO metric using PromQL. Example:
    - record: slo:availability:ratio
      # Fraction of requests that did not return a 5xx status
      expr: |
        sum(rate(http_requests_total{status!~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      labels:
        slo: "availability"
        window: "1h"
Deploy the rules with ArgoCD; Prometheus reloads them automatically.
Expose the SLO series to Grafana for dashboards and to Alertmanager for alert routing.
Implement SLF and SLD extensions by adding extra label dimensions and a dependencies field in the YAML, e.g.:
    dependencies:
      - service: scheduler
        metric: scheduler_latency_seconds
      - service: etcd
        metric: etcd_db_size_bytes
Integrate AIOps modules: statistical anomaly detectors, ML‑based predictors, and fault‑localization engines that consume the SLO series.
Configure multi‑window error‑budget alerts (short‑window error‑rate + burn‑rate thresholds) and map severity to notification channels.
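The last step, multi‑window error‑budget alerting, can be sketched as follows. The window pairs and burn‑rate thresholds are the commonly cited values from the Google SRE Workbook for a 30‑day SLO, used here as placeholders; the severity names and channel mapping are assumptions, since the source does not give Ant Group's actual values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float   # minimum burn rate in both windows
    severity: str      # assumed severity names
    channel: str       # assumed severity-to-channel mapping


# Ordered from most to least severe; values are illustrative defaults.
RULES = [
    BurnRule("1h", "5m", 14.4, "critical", "phone"),
    BurnRule("6h", "30m", 6.0, "major", "dingtalk"),
    BurnRule("3d", "6h", 1.0, "minor", "email"),
]


def evaluate(error_ratios: Dict[str, float], target: float) -> Optional[BurnRule]:
    """Return the most severe rule whose long AND short windows both burn too fast.

    error_ratios maps a window name (e.g. "1h") to the observed error ratio.
    """
    budget = 1.0 - target
    if budget <= 0.0:
        raise ValueError("target must be below 1.0")
    for rule in RULES:
        long_burn = error_ratios[rule.long_window] / budget
        short_burn = error_ratios[rule.short_window] / budget
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule
    return None


if __name__ == "__main__":
    observed = {"5m": 0.03, "30m": 0.02, "1h": 0.02, "6h": 0.01, "3d": 0.002}
    hit = evaluate(observed, target=0.999)
    print(hit.severity if hit else "ok", hit.channel if hit else "-")
```

Requiring the short window to exceed the threshold as well means an alert clears soon after the problem stops, while the long window keeps brief spikes from paging anyone.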
Case Studies
SLO Quality Dashboard
Aggregates compliance across product groups with monthly, weekly, and daily views. Red cells indicate violations; clicking a product expands detailed SLO metrics and links to underlying resources.
Kubernetes SLO System
Defines SLOs for API‑server latency, pod‑creation success rate, controller/scheduler availability, node health, and downstream services (DNS, network, storage). Both PaaS‑level and pod‑level metrics are tracked.
SLO‑Driven Emergency Workflow
Alert delivery with priority‑based routing (DingTalk group, phone call).
Automated fault localization using temporal, dependency, spatial, and historical methods.
Generation of a fault report containing affected SLFs/SLDs and root cause.
Recommendation of remediation playbooks; low‑impact actions are auto‑executed, high‑impact actions require manual confirmation via DingTalk links.
Post‑remediation verification through SLO recovery and anomaly‑detection checks.
Conclusion
SLOs provide a transparent, quantitative framework for service‑quality management. Publishing SLAs aligns expectations across teams, while error‑budget‑driven alerts shift operations from fault‑driven to SLO‑driven. Enriching SLOs with AIOps creates a unified emergency pipeline that improves reliability, operational efficiency, and cost control. The described architecture has been productized and is being extended as SaaS and open‑source offerings.
dbaplus Community