SLO Implementation and Alerting Strategies – Bilibili SRE Practices
This article presents Bilibili's refined SLO framework: services are categorized into four business tiers; availability, latency, and freshness SLIs are selected; concrete SLO targets are set; and multi-window error-budget burn-rate alerting is employed to improve stability, complemented by comprehensive quality dashboards.
Author: Wu Anchu, Head of Business SRE at Bilibili. He joined Bilibili in 2016 and has participated in microservice decomposition, cloud-native transformation, high-availability construction, the SRE transition, and the rollout of the stability system.
01 Background
The previous article introduced the initial SLO practice at Bilibili and the problems encountered. After reflecting on the value of SLO, this article presents a revised SLO construction approach.
02 Business Tier Optimization
Business levels are divided into four tiers (L0‑L3):
L0 – Company-level core services (e.g., recommendation, video playback, payment), admitted by criteria such as DAU > xx W (W = 万, i.e. tens of thousands), daily revenue > xx W, or strategic importance.
L1 – Department‑level core services that depend on L0 (e.g., one‑click‑like on video page).
L2 – User‑facing services (e.g., playlists, sharing, columns).
L3 – Backend or non‑user‑impacting services.
The model is simplified: when creating an application, the user tags its importance (core, important, normal) once, eliminating separate application and API grading.
03 SLI Selection and Calculation
Typical microservice call-chain metrics include availability and latency, measured either from SLB (load-balancer) metrics or from HTTP/gRPC server metrics. Additional freshness metrics are collected from middleware (MQ, DB replication, Canal, DTS) to capture data staleness.
Application‑level SLI categories:
Availability – error count (HTTP 5xx) and success rate from SLB and server metrics.
Latency – percentile latency from SLB and server metrics.
Freshness – MQ write/consume delay, DB master‑slave sync delay, Canal sync delay, DTS sync delay.
Business‑core‑function SLI focuses on the same availability, latency, and throughput metrics for specific APIs.
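To make the SLI definitions concrete, here is a minimal sketch of how availability and percentile latency could be computed from raw window data; the counters, the toy sample window, and the nearest-rank percentile method are illustrative assumptions, not Bilibili's actual pipeline:

```python
import math

def availability_sli(total_requests: int, server_errors: int) -> float:
    """Success-rate SLI: the share of requests that did not return HTTP 5xx."""
    if total_requests == 0:
        return 1.0  # no traffic in the window: treat as fully available
    return 1.0 - server_errors / total_requests

def percentile_latency(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile over a window of latency samples (ms)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# A 10-minute window: 100 000 requests, 7 of them HTTP 5xx.
print(availability_sli(100_000, 7))   # ≈ 0.99993
# 99th-percentile latency over a (toy) sample window, in milliseconds.
print(percentile_latency([120, 340, 95, 980, 210, 60, 450, 1300, 80, 150], 99))
```

In production these values would of course come from SLB or server metric time series rather than in-process lists, but the arithmetic is the same.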
04 SLO and Alerting
Application SLO examples:
Availability ≥ 99.99 %
99th‑percentile latency ≤ 1 s
Data‑update freshness ≤ 5 min
Business‑core‑function SLO mirrors the availability and latency thresholds; throughput is monitored via dashboards rather than SLO.
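As a worked example of what such targets imply (standard error-budget arithmetic, not Bilibili-specific numbers): a 99.99 % availability SLO over a 30-day window leaves a 0.01 % error budget, i.e. roughly 4.3 minutes of total downtime per month:

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Full-outage minutes the error budget allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.9999))  # ≈ 4.32 minutes per 30 days
print(error_budget_minutes(0.999))   # ≈ 43.2 minutes per 30 days
```

This small budget is what makes the choice of alerting window so delicate in the next section: a short window reacts fast but pages on spikes that barely dent the budget.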
Alerting strategies discussed:
Alert when the error rate reaches the SLO threshold within a short window (e.g., 10 min): high recall, but low precision, since brief spikes page even when little budget is spent.
Widen the alert window (e.g., alert once 5 % of the 30-day budget is consumed): better precision, but the alert is slow to reset after the service recovers.
Require a sustained breach (e.g., 10 min or 1 h) before alerting: higher precision, lower recall.
Alert on the error-budget burn rate, with multiple thresholds over multiple windows (1 h, 6 h, 3 d) to balance precision and recall.
Combine short and long windows so that alerts reset quickly after recovery (fewer false positives) while slow, low-rate budget consumption is still caught.
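The combined strategy is the multi-window, multi-burn-rate approach from the Google SRE Workbook that the article references. A minimal sketch follows; the thresholds (14.4x, 6x, 1x), the window pairs, and the severities are the Workbook's recommended values, not necessarily Bilibili's exact configuration:

```python
# Multi-window, multi-burn-rate alerting, per the Google SRE Workbook.
# Burn rate = observed error rate / (1 - SLO): a rate of 1 exhausts the
# 30-day budget in exactly 30 days; 14.4 exhausts it in 50 hours.
SLO = 0.9999
BUDGET = 1.0 - SLO

# (severity, long window, short window, burn-rate threshold).
# Workbook defaults: 14.4x over 1 h burns 2 % of the monthly budget,
# 6x over 6 h burns 5 %, 1x over 3 d burns 10 %.
RULES = [
    ("page", "1h", "5m", 14.4),
    ("page", "6h", "30m", 6.0),
    ("ticket", "3d", "6h", 1.0),
]

def burn_rate(error_rate: float) -> float:
    return error_rate / BUDGET

def evaluate(error_rates: dict[str, float]) -> list[str]:
    """Fire a rule only when BOTH its long and short windows exceed the
    threshold; the short window makes the alert reset soon after recovery."""
    alerts = []
    for severity, long_w, short_w, threshold in RULES:
        if (burn_rate(error_rates[long_w]) >= threshold
                and burn_rate(error_rates[short_w]) >= threshold):
            alerts.append(f"{severity}: burn rate over {long_w} >= {threshold}x")
    return alerts

# A fast burn: 0.2 % errors (a 20x burn rate) in both the last hour and
# the last 5 minutes, while the slower windows are still quiet.
print(evaluate({"1h": 0.002, "5m": 0.002, "6h": 0.0005,
                "30m": 0.0005, "3d": 0.00005}))
# → ['page: burn rate over 1h >= 14.4x']
```

Each short window is 1/12 of its long window, which is why the alert stops firing soon after the error rate drops, while the 3-day/1x rule still catches slow leaks that would silently exhaust the budget.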
Images illustrate the alerting windows, consumption‑rate calculations, and Bilibili’s current SLO alert configuration.
05 Quality Operations
With SLI and SLO alerts in place, Bilibili builds comprehensive quality dashboards covering business SLI, core‑function health, application health, and cross‑layer health matrices.
06 Conclusion
SLOs are essential for SRE: they define service capabilities, trigger timely alerts, drive stability improvements, and enable collaborative quality enhancement between SRE and product teams.
References: the Bilibili SRE article series; Google, The Site Reliability Workbook, chapters "Implementing SLOs" and "Alerting on SLOs".