
Mastering SLOs: From Theory to Practical SRE Operations at Bilibili

This article outlines Bilibili's end‑to‑end SLO framework, covering metric selection, SLO definition, error‑budget calculation, alerting strategies, operational workflows, and lessons learned from real‑world deployments.

dbaplus Community

May 22, 2023


Background

Bilibili’s large-scale microservice environment suffered from unclear availability definitions, noisy alerts, and fragmented operational processes. To obtain trustworthy reliability data and reduce alert fatigue, the SRE team built a full SLO lifecycle covering metric selection, measurement, alerting, error-budget consumption, dashboards, and integration with CI/CD and multi-active deployment platforms.

Availability Metric Challenges

Unclear target objects (service, interface, module, team).

Difficulty defining “unavailable” and standardising error codes.

Operational needs such as alerting, loss calculation, and reporting add complexity.

SLI / SLO Basics

SLI (Service Level Indicator): a quantitative metric of service behavior (e.g., latency, error rate, request success rate).

SLO (Service Level Objective): the target value for an SLI.

SLA (Service Level Agreement): a contract that attaches consequences to missing the objective; rarely used internally.

The methodology is to choose an SLI, define error conditions (HTTP 5xx, custom business codes), collect data from load balancers, Prometheus, or logs, and compute the metric.
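As a minimal sketch of that methodology, the snippet below classifies each request as good or bad and computes a success-rate SLI. The `Request` type, the `BUSINESS_ERROR_CODES` set, and the specific code values are hypothetical and chosen only for illustration; a real deployment would take its error definitions from the service's own code table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical business-level codes that count as failures (illustrative only).
BUSINESS_ERROR_CODES = {-500, -503, -504}


@dataclass
class Request:
    http_status: int
    business_code: Optional[int] = None  # None when the response carries no business code


def is_error(req: Request) -> bool:
    """A request is bad if it is an HTTP 5xx or carries a listed business error code."""
    if 500 <= req.http_status <= 599:
        return True
    return req.business_code in BUSINESS_ERROR_CODES


def success_rate(requests: Iterable[Request]) -> float:
    """Success-rate SLI: good requests divided by total requests."""
    total = 0
    good = 0
    for req in requests:
        total += 1
        if not is_error(req):
            good += 1
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good / total


sample = [
    Request(200, 0),     # good
    Request(200, -500),  # HTTP 200 but a business-level failure
    Request(502),        # load-balancer level failure
    Request(200, 0),     # good
]
print(success_rate(sample))
```

Note that the second request would look healthy to a load balancer that only sees HTTP status, which is why the business-level codes need their own data source.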

Ideal SLO Implementation Process

Choose an SLI and compute it, e.g., success_rate = successful_requests / total_requests.

Define error conditions (HTTP 500/504, business‑level error codes).

Collect data via load balancer, Prometheus, or application logs.

Set SLO targets and calculate the error budget (e.g., a 99.99 % yearly availability target allows about 52 minutes of downtime per year).

Build dashboards and reports for visibility.
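The error-budget arithmetic in the steps above can be sketched as follows. The function name and the choice of a 365-day year are assumptions made for this example; the budget is simply the allowed failure fraction multiplied by the length of the SLO period.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# 99.99 % over a 365-day year: about 52.56 minutes of allowed downtime.
yearly_budget = error_budget_minutes(0.9999, 365)
print(round(yearly_budget, 2))
```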

First SLO Attempt (2019) and Lessons Learned

The initial system aimed for precise availability measurement but proved unsustainable due to metric noise, high operational overhead, and excessive alerting.

Revised SLO Lifecycle

Measure SLI and set SLO.

Introduce indicator governance to improve data accuracy.

Separate SLO‑based alerts from traditional symptom alerts.

Combine error‑budget consumption with alerting (e.g., a 1‑hour window consuming 2 % of the monthly budget triggers a critical alert).

Provide dashboards, reports, and ecosystem integration (CI/CD, chaos testing, multi‑active deployment platforms).
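The budget-consumption rule in the lifecycle above (a 1-hour window consuming 2 % of the monthly budget) can be restated as a burn rate, meaning how many times faster than "exactly on budget" errors are arriving. The sketch below assumes a 30-day SLO period and a 99.9 % target; both values and the function names are illustrative, not taken from the article.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_period_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the period's budget is spent
    within `alert_window_hours`."""
    return budget_fraction * slo_period_hours / alert_window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


# Spending 2 % of a 30-day budget within 1 hour corresponds to a burn rate of about 14.4.
threshold = burn_rate_threshold(0.02, alert_window_hours=1, slo_period_hours=30 * 24)

# With an assumed 99.9 % target, a 1.5 % error ratio over the last hour burns
# at roughly 15x and would cross that threshold.
observed = burn_rate(error_ratio=0.015, slo_target=0.999)
print(observed >= threshold)
```

Expressing the rule as a burn rate makes the same threshold reusable across services with different SLO targets.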

SLO Alerting Practices

Use SLO alerts as the primary symptom signal; treat other alerts as secondary.

Focus developer attention on application‑level alerts (SLO breaches, dependency errors, capacity issues).

Layer alerts (SLB HTTP 5xx, service‑level -5xx codes) to reduce noise.

Adopt a “1‑5‑10” strategy: thresholds for critical, moderate, and minor alerts based on error‑budget consumption.

Employ multi‑window and multi‑rate strategies to avoid flapping alerts.

Distribute alerts via an event‑center to precise on‑call groups, ensuring transparent collaboration.
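One way to combine severity tiers with multi-window, multi-rate checks is sketched below. Each rule fires only when both a long window and a much shorter window exceed the same burn rate, so an alert clears soon after the problem stops instead of flapping. The window lengths and burn rates are illustrative assumptions, not Bilibili's actual thresholds, and `error_ratio_for` is a hypothetical callback standing in for a query against the metrics store.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class BurnRule:
    severity: str
    long_window_hours: float
    short_window_hours: float
    burn_rate: float


# Illustrative tiers only; ordered from most to least severe.
RULES = (
    BurnRule("critical", long_window_hours=1, short_window_hours=5 / 60, burn_rate=14.4),
    BurnRule("moderate", long_window_hours=6, short_window_hours=0.5, burn_rate=6.0),
    BurnRule("minor", long_window_hours=72, short_window_hours=6, burn_rate=1.0),
)


def evaluate(error_ratio_for: Callable[[float], float],
             slo_target: float,
             rules: Sequence[BurnRule] = RULES) -> Optional[str]:
    """Return the most severe rule whose long AND short windows both exceed
    its burn rate, or None when no rule fires."""
    allowed = 1.0 - slo_target
    for rule in rules:
        long_burn = error_ratio_for(rule.long_window_hours) / allowed
        short_burn = error_ratio_for(rule.short_window_hours) / allowed
        if long_burn >= rule.burn_rate and short_burn >= rule.burn_rate:
            return rule.severity
    return None


# A sustained 2 % error ratio against a 99.9 % target burns at about 20x in every window.
print(evaluate(lambda hours: 0.02, slo_target=0.999))
```

Requiring the short window to agree is what suppresses alerts for incidents that have already recovered but still dominate the long window.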

Operational Workflow and Change Detection

SLO metrics are exposed to higher‑level platforms (CI/CD, multi‑active control) so that deployments can be gated by real‑time SLO health. When degradation is detected, automated pre‑checks or smart blocking prevent risky releases.
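A deployment gate of this kind might look like the sketch below. The `SloHealth` structure, the `release_allowed` function, and both thresholds are hypothetical; the article does not describe the actual interface between the SLO system and the CI/CD platform.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SloHealth:
    service: str
    budget_remaining: float  # fraction of the period's error budget still unspent (0..1)
    burn_rate: float         # current burn rate over a recent window


def release_allowed(health: SloHealth,
                    min_budget_remaining: float = 0.10,
                    max_burn_rate: float = 1.0) -> Tuple[bool, str]:
    """Pre-check run before a deployment: block when the budget is nearly
    spent or is currently being consumed faster than the SLO allows."""
    if health.budget_remaining < min_budget_remaining:
        return False, f"{health.service}: error budget nearly exhausted"
    if health.burn_rate > max_burn_rate:
        return False, f"{health.service}: budget is burning faster than the SLO allows"
    return True, f"{health.service}: SLO healthy"


allowed, reason = release_allowed(SloHealth("feed", budget_remaining=0.6, burn_rate=3.0))
print(allowed, reason)
```

Returning a reason alongside the decision lets the pipeline show developers why a release was blocked rather than failing silently.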

Key Technical Details

Aggregate metrics instead of averaging them, because averaging distorts the result.

Traffic coloring to exclude test traffic from production SLO calculations.

Error-budget calculation: e.g., 99.99 % yearly availability allows about 52 minutes of downtime per year; 99.9 % monthly availability allows about 43 minutes per month.

Multi-dimensional SLI: combine SLB HTTP 5xx, service-level -5xx (business codes), and latency metrics.

Data sources: load-balancer metrics (QPS, 5xx count, latency), service-level Prometheus metrics, and logs for business-level error codes.

Alert suppression using error‑budget windows, consumption rates, and multi‑window checks to reduce flapping.
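To illustrate two of the details above, the sketch below contrasts averaging per-instance success rates with aggregating raw counts, and drops test traffic before either calculation. Reading "aggregation" as "sum the counts, then divide" is an interpretation, and the `InstanceSample` type with its `is_test_traffic` flag is a hypothetical stand-in for traffic coloring.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class InstanceSample:
    instance: str
    total: int
    errors: int
    is_test_traffic: bool = False  # hypothetical "color" marking non-production traffic


def production_only(samples: Iterable[InstanceSample]) -> List[InstanceSample]:
    """Exclude colored (test) traffic so it cannot affect the production SLI."""
    return [s for s in samples if not s.is_test_traffic]


def averaged_success_rate(samples: Iterable[InstanceSample]) -> float:
    """Distorting approach: every instance weighs the same regardless of traffic."""
    rates = [1 - s.errors / s.total for s in production_only(samples) if s.total > 0]
    return sum(rates) / len(rates) if rates else 1.0


def aggregated_success_rate(samples: Iterable[InstanceSample]) -> float:
    """Sum raw counts first, then divide, so each request weighs the same."""
    prod = production_only(samples)
    total = sum(s.total for s in prod)
    errors = sum(s.errors for s in prod)
    return 1.0 if total == 0 else 1 - errors / total


samples = [
    InstanceSample("a", total=10_000, errors=10),   # busy instance, 99.9 % success
    InstanceSample("b", total=10, errors=5),        # nearly idle instance, 50 % success
    InstanceSample("c", total=1_000, errors=1_000, is_test_traffic=True),  # ignored
]
print(round(averaged_success_rate(samples), 4))    # about 0.7495
print(round(aggregated_success_rate(samples), 4))  # about 0.9985
```

The nearly idle instance drags the averaged figure down to roughly 75 %, even though only 15 of 10,010 production requests failed.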

Root‑Cause Analysis

SLO dashboards are linked with observability data (logs, traces) to pinpoint failing components. An intelligent diagnosis platform enables developers to locate the fault within ~5 minutes.

Conclusion

The refined Bilibili SLO system demonstrates how a disciplined, data‑driven approach can turn reliability theory into actionable practice, reduce alert noise, improve developer‑SRE collaboration, and provide clear visibility into service health.
