How Ant Group Leverages SLO and AIOps for Fine‑Grained Operations

This article details Ant Group's practical implementation of Service Level Objectives (SLO) and AIOps to achieve fine‑grained operations, covering SLO fundamentals, health‑score architecture, GitOps‑based data pipelines, error‑budget alerting, AI‑driven anomaly detection, fault localization techniques, and real‑world case studies on dashboards, Kubernetes SLOs, and emergency response workflows.

dbaplus Community

Feb 4, 2024

SLO Fundamentals

A Service Level Objective (SLO) sets a quantitative target for a Service Level Indicator (SLI), an observable metric of one dimension of a service. A Service Level Agreement (SLA) is a contract that commits to an SLO and specifies penalties for breaching it. Example: a monthly availability SLO of 99.9% permits about 43 minutes of downtime (0.1% of a 30-day month is 43.2 minutes).
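The downtime figure in the example follows directly from the target and the window length. A minimal sketch of that arithmetic, assuming a 30-day month (the function name and the default window are illustrative, not from the original talk):

```python
def allowed_downtime_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    # The tolerated unavailability is simply the complement of the target.
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days allows {minutes:.1f} min of downtime")
```

Each additional "nine" divides the allowance by ten, which is why the choice of target has such a large effect on how much room operations teams have for change.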

Key Extensions

SLF (Service Level Factor) : additional dimensions injected into SLO data to enable fine‑grained drilling during alerts.

SLD (Service Level Dependency) : a graph of dependency relationships between SLOs, used for fault‑propagation analysis.

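Error Budget : the amount of error the SLO tolerates over its window (e.g., 0.1 % of requests for a 99.9 % target). The burn rate is the speed at which that budget is consumed; a burn rate of 1 uses up the budget exactly at the end of the window, and higher values exhaust it proportionally sooner.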

Error‑Budget Alerting : combines fixed‑threshold alerts with error‑budget consumption to produce multi‑severity warnings.
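The relationship between error budget and burn rate can be made concrete with a little arithmetic. The sketch below uses the common definition of burn rate as observed error rate divided by the budgeted error rate; the function names and the 720-hour (30-day) window are assumptions for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 spends the budget exactly over the SLO window;
    a value of 5.0 spends it five times as fast.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = 720.0) -> float:
    """Hours until a full budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    rate = burn_rate(failed=5, total=1000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"budget exhausted in about {hours_to_exhaustion(rate):.0f} hours")
```

Alerting on burn rate rather than on the raw error rate ties severity to how quickly the SLO is at risk, which is the idea behind error-budget alerting.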

SLO Health‑Score Architecture

The system is organized into four layers, all built on GitOps and Prometheus:

Target System Layer – the actual services and applications.

Data Layer – collection, processing, storage, and modeling of SLI/SLO data.

Scenario‑Analysis Layer – anomaly detection, fault discovery, root‑cause analysis, and remediation recommendation.

Application Layer – dashboards, emergency pipelines, cost allocation, and downstream integrations.

Implementation Details

SLO definitions are stored as YAML files in a Git repository and deployed to Prometheus via ArgoCD.

Prometheus recording rules precompute SLO metrics from raw time‑series data.

Grafana visualizes health dashboards; alerts are routed through phone, email, and DingTalk.

Data Model

Prometheus stores each time series as a stream of single‑value samples (timestamp, value) identified by a metric name and a set of labels. SLO aggregation collapses label dimensions to produce macro‑level health scores.
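Collapsing label dimensions is what a PromQL `sum by (...)` does. The following sketch reproduces the idea in plain Python on a single scrape of labeled samples; the label names (`cluster`, `zone`) and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]   # (labels, value) at one timestamp
GroupKey = Tuple[Tuple[str, str], ...]


def aggregate_by(samples: Iterable[Sample], keep: Sequence[str]) -> Dict[GroupKey, float]:
    """Sum values after dropping every label not listed in `keep`."""
    totals: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


def availability(good: Iterable[Sample], total: Iterable[Sample],
                 keep: Sequence[str]) -> Dict[GroupKey, float]:
    """Ratio of good to total requests per remaining label combination."""
    good_by = aggregate_by(good, keep)
    total_by = aggregate_by(total, keep)
    return {key: good_by.get(key, 0.0) / value
            for key, value in total_by.items() if value > 0}


if __name__ == "__main__":
    total = [({"cluster": "a", "zone": "z1"}, 100.0), ({"cluster": "a", "zone": "z2"}, 100.0)]
    good = [({"cluster": "a", "zone": "z1"}, 99.0), ({"cluster": "a", "zone": "z2"}, 97.0)]
    print(availability(good, total, keep=["cluster"]))            # cluster-level score
    print(availability(good, total, keep=["cluster", "zone"]))    # drill-down by zone
```

Keeping the extra labels available (the SLF idea) is what lets an alert on the cluster-level score be drilled down to the zone that is actually failing.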

AIOps Integration

Machine‑learning techniques are applied to SLO time‑series to improve detection, prediction, and remediation.

Anomaly Detection

Statistical rules (3‑sigma, box‑plot, Tukey, Grubbs, Dixon, t‑test) and ML models flag outliers in SLO streams.
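Two of the statistical rules named above are simple enough to sketch with the Python standard library. This is an illustration of the 3-sigma rule and Tukey's box-plot fences on a static sample, not the detector used in production; the function names and default thresholds are assumptions.

```python
import statistics
from typing import List, Sequence


def three_sigma_outliers(values: Sequence[float], k: float = 3.0) -> List[int]:
    """Indices of points more than k population standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a constant series has no outliers under this rule
    return [i for i, v in enumerate(values) if abs(v - mean) > k * stdev]


def tukey_outliers(values: Sequence[float], whisker: float = 1.5) -> List[int]:
    """Indices of points outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    # Twenty healthy availability samples followed by one sharp drop.
    healthy = [0.999, 0.998, 0.9985, 0.9992, 0.9988,
               0.9991, 0.9987, 0.9993, 0.9989, 0.999]
    series = healthy * 2 + [0.95]
    print("3-sigma:", three_sigma_outliers(series))
    print("Tukey:  ", tukey_outliers(series))
```

The 3-sigma rule is sensitive to the outlier itself inflating the standard deviation, especially on short windows, while quartile-based fences are more robust; running several rules side by side is one way to trade off recall against false alarms.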

Time‑Series Prediction

Models such as ARIMA, EMA, Holt‑Winters, STL, RNN, and LSTM forecast future SLO trends to anticipate violations.
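To show how a forecast can anticipate a violation, the sketch below uses Holt's linear-trend exponential smoothing, a simpler relative of Holt-Winters without the seasonal component. It is a stand-in chosen for brevity, not the model the article's system uses; the smoothing parameters and function names are assumptions.

```python
from typing import List, Optional, Sequence


def holt_forecast(series: Sequence[float], horizon: int,
                  alpha: float = 0.5, beta: float = 0.3) -> List[float]:
    """Forecast `horizon` future points with double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead prediction.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Blend the newly observed slope with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


def steps_until_violation(series: Sequence[float], target: float,
                          horizon: int) -> Optional[int]:
    """First forecast step at which the SLO metric is predicted to fall below target."""
    for step, predicted in enumerate(holt_forecast(series, horizon), start=1):
        if predicted < target:
            return step
    return None


if __name__ == "__main__":
    availability = [0.9995, 0.9993, 0.9992, 0.9990, 0.9988, 0.9987]
    print(steps_until_violation(availability, target=0.998, horizon=24))
```

A result of `None` means no violation is predicted within the horizon; any other value gives operators a rough lead time, subject to how well a linear trend describes the series.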

Fault Localization Techniques

Temporal correlation : Pearson or cosine similarity on aligned time‑series.

Dependency correlation : Traverses the SLD graph to trace fault propagation.

Spatial correlation : Uses CMDB or topology data stored in a graph database.

Historical correlation : Matches current anomalies against a knowledge base of past incidents via Bayesian inference.
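The first two techniques can be sketched in a few lines: Pearson correlation ranks candidate metrics by how closely they move with the degraded SLO, and a breadth-first walk of the SLD graph lists the dependencies worth checking. The graph shape (a dict from SLO to its direct dependencies), the metric names, and the sample values are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two aligned series; 0.0 if either is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must be aligned and have at least two points")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(slo_series: Sequence[float],
                  candidates: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Order candidate metrics by absolute correlation with the SLO series."""
    scores = [(name, abs(pearson(slo_series, series)))
              for name, series in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def upstream_dependencies(sld: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first list of everything `start` depends on, nearest first."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependency in sld.get(node, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


if __name__ == "__main__":
    sld = {"pod-creation": ["scheduler", "apiserver"],
           "scheduler": ["apiserver"], "apiserver": ["etcd"]}
    print(upstream_dependencies(sld, "pod-creation"))
    slo = [1.0, 1.0, 0.9, 0.5, 0.4]
    metrics = {"etcd_latency": [10, 11, 20, 60, 70], "node_cpu": [5, 3, 6, 4, 5]}
    print(rank_suspects(slo, metrics))
```

Correlation only indicates that two series move together, so in practice the ranked list narrows the search rather than proving a root cause; combining it with the dependency walk keeps the candidate set to metrics that can plausibly affect the SLO.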

Intelligent Decision Making

Localization results are combined with expert knowledge and trained models to automatically select remediation playbooks. Low‑impact actions are executed automatically; high‑impact actions require manual confirmation.
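The gating rule described here, automatic execution for low-impact actions and manual confirmation for high-impact ones, can be expressed as a small decision function. The playbook catalog, root-cause keys, and the "escalate" fallback for unknown causes are hypothetical and only illustrate the shape of the logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Playbook:
    name: str
    impact: str  # "low" or "high"


@dataclass(frozen=True)
class Decision:
    playbook: Optional[Playbook]
    action: str  # "auto_execute", "await_confirmation", or "escalate"


# Hypothetical catalog mapping a localized root cause to a remediation playbook.
CATALOG: Dict[str, Playbook] = {
    "pod_crash_loop": Playbook("restart_pod", impact="low"),
    "etcd_db_full": Playbook("compact_and_defrag_etcd", impact="high"),
}


def decide(root_cause: str, catalog: Dict[str, Playbook] = CATALOG) -> Decision:
    """Pick a playbook for a root cause and gate it by impact."""
    playbook = catalog.get(root_cause)
    if playbook is None:
        # Assumption: with no known playbook, hand the incident to a human.
        return Decision(None, "escalate")
    if playbook.impact == "low":
        return Decision(playbook, "auto_execute")
    return Decision(playbook, "await_confirmation")


if __name__ == "__main__":
    for cause in ("pod_crash_loop", "etcd_db_full", "unknown_cause"):
        print(cause, "->", decide(cause))
```

Keeping the impact level on the playbook rather than on the root cause means the same remediation is gated consistently wherever it is recommended.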

Building SLO Health from Scratch

Steps to construct the system:

Define SLOs in YAML (name, template, SLI, window, target, error‑budget parameters) and commit to a Git repo.

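Configure Prometheus recording rules that calculate the SLO metric using PromQL. The window label should match the range used in rate(). Example rule file:

groups:
  - name: slo-availability  # group name is illustrative
    rules:
      - record: slo:availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        labels:
          slo: "availability"
          window: "5m"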

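Deploy the rules with ArgoCD. Prometheus loads them on its next configuration reload, which is typically triggered automatically (for example by a config‑reloader sidecar, or by a POST to /-/reload when --web.enable-lifecycle is set).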

Expose the SLO series to Grafana for dashboards and to Alertmanager for alert routing.

Implement SLF and SLD extensions by adding extra label dimensions and a dependencies field in the YAML, e.g.:

dependencies:
  - service: scheduler
    metric: scheduler_latency_seconds
  - service: etcd
    metric: etcd_db_size_bytes

Integrate AIOps modules: statistical anomaly detectors, ML‑based predictors, and fault‑localization engines that consume the SLO series.

Configure multi‑window error‑budget alerts (short‑window error‑rate + burn‑rate thresholds) and map severity to notification channels.
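The last step, multi-window error-budget alerting, can be sketched as a rule table evaluated against precomputed burn rates. The window pairs and thresholds (14.4, 6, 1) are the commonly cited values from Google's SRE Workbook for a 30-day SLO, not figures from the article, and the mapping of severities to phone, DingTalk, and email is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class BurnRateRule:
    severity: str
    long_window: str
    short_window: str
    threshold: float
    channel: str


# Ordered from most to least severe; the first matching rule wins.
RULES: Sequence[BurnRateRule] = (
    BurnRateRule("critical", "1h", "5m", 14.4, "phone"),
    BurnRateRule("warning", "6h", "30m", 6.0, "dingtalk"),
    BurnRateRule("info", "3d", "6h", 1.0, "email"),
)


def evaluate(burn_rates: Dict[str, float],
             rules: Sequence[BurnRateRule] = RULES) -> Optional[BurnRateRule]:
    """Return the most severe rule whose long AND short windows both exceed its threshold.

    `burn_rates` maps a window name such as "1h" to the burn rate measured over it.
    The short window confirms the problem is still happening, which lets the
    alert clear quickly after recovery.
    """
    for rule in rules:
        long_rate = burn_rates.get(rule.long_window, 0.0)
        short_rate = burn_rates.get(rule.short_window, 0.0)
        if long_rate > rule.threshold and short_rate > rule.threshold:
            return rule
    return None


if __name__ == "__main__":
    observed = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 7.0, "3d": 1.2}
    fired = evaluate(observed)
    print(fired.severity, "->", fired.channel)
```

In a Prometheus deployment the burn rates themselves would come from recording rules over each window, with this kind of table expressed as alerting rules routed by Alertmanager.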

Case Studies

SLO Quality Dashboard

Aggregates compliance across product groups with monthly, weekly, and daily views. Red cells indicate violations; clicking a product expands detailed SLO metrics and links to underlying resources.

Kubernetes SLO System

Defines SLOs for API‑server latency, pod‑creation success rate, controller/scheduler availability, node health, and downstream services (DNS, network, storage). Both PaaS‑level and pod‑level metrics are tracked.

SLO‑Driven Emergency Workflow

Alert delivery with priority‑based routing (DingTalk group, phone call).

Automated fault localization using temporal, dependency, spatial, and historical methods.

Generation of a fault report containing affected SLFs/SLDs and root cause.

Recommendation of remediation playbooks; low‑impact actions are auto‑executed, high‑impact actions require manual confirmation via DingTalk links.

Post‑remediation verification through SLO recovery and anomaly‑detection checks.

Conclusion

SLOs provide a transparent, quantitative framework for service‑quality management. Publishing SLAs aligns expectations across teams, while error‑budget‑driven alerts shift operations from fault‑driven to SLO‑driven. Enriching SLOs with AIOps creates a unified emergency pipeline that improves reliability, operational efficiency, and cost control. The described architecture has been productized and is being extended as SaaS and open‑source offerings.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
