SLO Implementation and Alerting Strategies – Bilibili SRE Practices
This article presents Bilibili's refined SLO framework: services are categorized into four business tiers; availability, latency, and freshness SLIs are selected; concrete SLO targets are set; and multi-window error-budget burn-rate alerting is employed to improve stability, complemented by comprehensive quality dashboards.
Author: Wu Anchu, Head of Business SRE at Bilibili. He joined Bilibili in 2016 and has participated in microservice decomposition, cloud-native transformation, high-availability construction, the SRE transition, and the rollout of the stability system.
01 Background
The previous article introduced the initial SLO practice at Bilibili and the problems encountered. After reflecting on the value of SLO, this article presents a revised SLO construction approach.
02 Business Tier Optimization
Business levels are divided into four tiers (L0‑L3):
L0 – Company-level core services (e.g., recommendation, video playback, payment), admitted by criteria such as DAU > xx W (W = 万, i.e. tens of thousands), daily revenue > xx W, or strategic importance.
L1 – Department‑level core services that depend on L0 (e.g., one‑click‑like on video page).
L2 – User‑facing services (e.g., playlists, sharing, columns).
L3 – Backend or non‑user‑impacting services.
The model is simplified: when creating an application, the user tags its importance (core, important, normal) once, eliminating separate application and API grading.
03 SLI Selection and Calculation
Typical microservice call-chain metrics include availability and latency, measured either from SLB (load-balancer) metrics or from HTTP/gRPC server metrics. Additional freshness metrics are collected from middleware (MQ, DB replication, Canal, DTS) to capture data staleness.
Application‑level SLI categories:
Availability – error count (HTTP 5xx) and success rate from SLB and server metrics.
Latency – percentile latency from SLB and server metrics.
Freshness – MQ write/consume delay, DB master‑slave sync delay, Canal sync delay, DTS sync delay.
Business‑core‑function SLI focuses on the same availability, latency, and throughput metrics for specific APIs.
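To make the SLI definitions concrete, here is a minimal sketch of how availability and percentile latency could be computed from raw window data; the counters, the toy sample window, and the nearest-rank percentile method are illustrative assumptions, not Bilibili's actual pipeline:

```python
import math

def availability_sli(total_requests: int, server_errors: int) -> float:
    """Success-rate SLI: the share of requests that did not return HTTP 5xx."""
    if total_requests == 0:
        return 1.0  # no traffic in the window: treat as fully available
    return 1.0 - server_errors / total_requests

def percentile_latency(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile over a window of latency samples (ms)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# A 10-minute window: 100 000 requests, 7 of them HTTP 5xx.
print(availability_sli(100_000, 7))   # ≈ 0.99993
# 99th-percentile latency over a (toy) sample window, in milliseconds.
print(percentile_latency([120, 340, 95, 980, 210, 60, 450, 1300, 80, 150], 99))
```

In production these values would of course come from SLB or server metric time series rather than in-process lists, but the arithmetic is the same.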
04 SLO and Alerting
Application SLO examples:
Availability ≥ 99.99 %
99th‑percentile latency ≤ 1 s
Data‑update freshness ≤ 5 min
Business‑core‑function SLO mirrors the availability and latency thresholds; throughput is monitored via dashboards rather than SLO.
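As a worked example of what such targets imply (standard error-budget arithmetic, not Bilibili-specific numbers): a 99.99 % availability SLO over a 30-day window leaves a 0.01 % error budget, i.e. roughly 4.3 minutes of total downtime per month:

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Full-outage minutes the error budget allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.9999))  # ≈ 4.32 minutes per 30 days
print(error_budget_minutes(0.999))   # ≈ 43.2 minutes per 30 days
```

This small budget is what makes the choice of alerting window so delicate in the next section: a short window reacts fast but pages on spikes that barely dent the budget.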
Alerting strategies discussed:
Alert when the error rate reaches the SLO threshold within a short window (e.g., 10 min): high recall, but low precision, since brief spikes page even when little budget is spent.
Widen the alert window (e.g., alert once 5 % of the 30-day budget is consumed): better precision, but the alert is slow to reset after the service recovers.
Require a sustained breach (e.g., 10 min or 1 h) before alerting: higher precision, lower recall.
Alert on the error-budget burn rate, with multiple thresholds over multiple windows (1 h, 6 h, 3 d) to balance precision and recall.
Combine short and long windows so that alerts reset quickly after recovery (fewer false positives) while slow, low-rate budget consumption is still caught.
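The combined strategy is the multi-window, multi-burn-rate approach from the Google SRE Workbook that the article references. A minimal sketch follows; the thresholds (14.4x, 6x, 1x), the window pairs, and the severities are the Workbook's recommended values, not necessarily Bilibili's exact configuration:

```python
# Multi-window, multi-burn-rate alerting, per the Google SRE Workbook.
# Burn rate = observed error rate / (1 - SLO): a rate of 1 exhausts the
# 30-day budget in exactly 30 days; 14.4 exhausts it in 50 hours.
SLO = 0.9999
BUDGET = 1.0 - SLO

# (severity, long window, short window, burn-rate threshold).
# Workbook defaults: 14.4x over 1 h burns 2 % of the monthly budget,
# 6x over 6 h burns 5 %, 1x over 3 d burns 10 %.
RULES = [
    ("page", "1h", "5m", 14.4),
    ("page", "6h", "30m", 6.0),
    ("ticket", "3d", "6h", 1.0),
]

def burn_rate(error_rate: float) -> float:
    return error_rate / BUDGET

def evaluate(error_rates: dict[str, float]) -> list[str]:
    """Fire a rule only when BOTH its long and short windows exceed the
    threshold; the short window makes the alert reset soon after recovery."""
    alerts = []
    for severity, long_w, short_w, threshold in RULES:
        if (burn_rate(error_rates[long_w]) >= threshold
                and burn_rate(error_rates[short_w]) >= threshold):
            alerts.append(f"{severity}: burn rate over {long_w} >= {threshold}x")
    return alerts

# A fast burn: 0.2 % errors (a 20x burn rate) in both the last hour and
# the last 5 minutes, while the slower windows are still quiet.
print(evaluate({"1h": 0.002, "5m": 0.002, "6h": 0.0005,
                "30m": 0.0005, "3d": 0.00005}))
# → ['page: burn rate over 1h >= 14.4x']
```

Each short window is 1/12 of its long window, which is why the alert stops firing soon after the error rate drops, while the 3-day/1x rule still catches slow leaks that would silently exhaust the budget.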
Images illustrate the alerting windows, consumption‑rate calculations, and Bilibili’s current SLO alert configuration.
05 Quality Operations
With SLI and SLO alerts in place, Bilibili builds comprehensive quality dashboards covering business SLI, core‑function health, application health, and cross‑layer health matrices.
06 Conclusion
SLOs are essential for SRE: they define service capabilities, trigger timely alerts, drive stability improvements, and enable collaborative quality enhancement between SRE and product teams.
References: the Bilibili SRE article series; Google, The Site Reliability Workbook, chapters "Implementing SLOs" and "Alerting on SLOs".