Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections
Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but ran into costly grading inconsistencies, errors hidden behind successful HTTP responses, and inaccurate business‑level metrics, leading them to conclude that SLOs are chiefly valuable for early alerting rather than exhaustive reporting.
01 Background
In recent years, Google SRE has become very popular in China. The Google SRE methodology emphasizes that SLO is the core of SRE practice; SLO sets a target level for service reliability and is a key factor for reliability decisions. This article discusses how Bilibili chooses and calculates SLI, sets SLO, and puts the practice into production, sharing the pitfalls and experience.
02 Definition of SLO in Google SRE
Service Level Objective (SLO) specifies the target level of service reliability. Because SLOs are the key for data‑driven reliability decisions, they are the core of SRE practice.
Google's "The Site Reliability Workbook" states that SLOs are needed because:
Engineers are scarce; time should be spent on core problems of important services.
SLOs are crucial for prioritising work and reliability‑related tasks.
SRE’s core responsibilities include automation and incident handling, but daily work must follow SLOs.
Without SLOs, there is no SRE.
For error‑budget‑based reliability engineering, the handbook also stresses:
Stakeholders must recognise the SLO.
The service can meet the SLO under normal conditions.
The organisation must accept the error budget and use it in decision‑making.
A complete SLO process must exist.
Otherwise, SLO compliance becomes a KPI or reporting metric rather than a decision‑making tool.
03 Implementation of SLO in Google SRE
The second chapter of the workbook, "Implementing SLOs", outlines the SLO implementation process:
1. SLI Selection
For request‑driven services, typical SLIs are availability (success‑response ratio), latency, and quality.
2. SLI Calculation
SLI can be calculated from application server logs, load‑balancer monitoring, black‑box monitoring, client plugins, etc.
Load‑balancer metrics are usually chosen because they represent the total request processing time across all modules and network hops, and they are cheaper to implement than client plugins.
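As an illustration, the two main SLIs can be computed from per‑request load‑balancer records roughly as follows. This is only a sketch; the record fields are assumptions, not any actual log schema:

```python
# Sketch: computing availability and latency SLIs from per-request
# load-balancer access records (hypothetical record format).
from dataclasses import dataclass

@dataclass
class AccessRecord:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # total request processing time in milliseconds

def availability_sli(records):
    """Fraction of requests that did not fail with a server error (5xx)."""
    total = len(records)
    good = sum(1 for r in records if r.status < 500)
    return good / total if total else 1.0

def latency_sli(records, threshold_ms):
    """Fraction of requests completing within the latency threshold."""
    total = len(records)
    fast = sum(1 for r in records if r.latency_ms <= threshold_ms)
    return fast / total if total else 1.0

records = [AccessRecord(200, 35.0), AccessRecord(200, 180.0),
           AccessRecord(502, 900.0), AccessRecord(200, 60.0)]
print(availability_sli(records))   # 0.75
print(latency_sli(records, 200))   # 0.75
```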
3. SLO Definition
Based on the calculated availability and latency data, define appropriate service SLOs.
Example: annual availability ≥ 99.99%.
Example: 99% of requests ≤ 200 ms, 90% ≤ 100 ms.
SLOs can be defined for different time windows (monthly, quarterly, etc.).
Obtain stakeholder approval.
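A latency SLO of the form above can be checked against a window of observed latencies with a simple nearest‑rank percentile. A minimal sketch with invented sample data:

```python
# Sketch: checking latency SLOs such as "99% of requests <= 200 ms,
# 90% <= 100 ms" against a window of observed latencies.
def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]

def meets_latency_slo(latencies_ms, slo):
    """slo: list of (percentile, max_ms) pairs, e.g. [(99, 200), (90, 100)]."""
    return all(percentile(latencies_ms, p) <= max_ms for p, max_ms in slo)

window = [20, 35, 50, 60, 80, 90, 95, 120, 150, 190]
# Here p99 is 190 ms (within 200 ms) but p90 is 150 ms (above 100 ms),
# so the window fails the SLO.
print(meets_latency_slo(window, [(99, 200), (90, 100)]))  # False
```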
4. Error Budget
With SLI and SLO, the allowed number of failures in a time window is known.
If the error budget is exhausted, adopt mitigation strategies such as freezing feature releases or redirecting engineering effort toward reliability work until the budget recovers.
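The error‑budget arithmetic itself is straightforward; a minimal sketch for a request‑based availability SLO:

```python
# Sketch: error-budget arithmetic for a request-based availability SLO.
def error_budget(slo_target, total_requests):
    """Allowed number of failed requests in the window."""
    return (1.0 - slo_target) * total_requests

def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget

# A 99.99% SLO over 10 million requests allows roughly 1,000 failures;
# after 250 failures, about 75% of the budget remains.
print(error_budget(0.9999, 10_000_000))
print(budget_remaining(0.9999, 10_000_000, 250))
```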
5. Record SLO and Error Budget
Document author, reviewer, approval date, next review date, background, etc.
Track platform, process, policy, and change events for traceability.
Detail SLI implementation, calculation, and error‑budget usage.
6. Dashboards and Reports
Provide published SLOs, error budgets, and visual dashboards/reports.
7. Continuous Improvement of SLOs
04 Our SLO Practice
From Google’s SLO description we extracted key information to guide our own construction.
Service Grading
Application (Technical View) – one appid per application, includes front‑end and back‑end, can be built and deployed independently.
Business (Product View) – a set of related product functions, relatively independent business modules, contains a group of related applications.
Grading Levels (L0‑L3)
The grading is applied first to the business, then to applications under the business, and finally to APIs, ensuring that an API’s grade never exceeds its application’s grade.
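The top‑down constraint can be sketched as a simple clamp, assuming grades are encoded as integers with L0 the most critical (the encoding is an assumption for illustration):

```python
# Sketch: propagating grades top-down so a child's grade is never more
# critical than its parent's. Lower number = more critical (L0 highest).
def effective_grade(parent_grade: int, own_grade: int) -> int:
    """Clamp a child's declared grade to be no more critical than its parent's."""
    return max(parent_grade, own_grade)

business = 0                          # an L0 business
app = effective_grade(business, 1)    # application declared L1 -> stays L1
api = effective_grade(app, 0)         # API declared L0 is clamped to L1
print(app, api)  # 1 1
```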
SLO System
1. SLI Selection
For online services we choose availability, latency, and throughput.
Availability is measured by error count and request success rate.
Latency is measured by p90 and p99.
Throughput is measured by total daily requests and request rate.
2. SLI Model
API metrics reflect business functionality; we measure API availability, latency, and throughput.
Only L0 and L1 APIs are measured.
Business SLI aggregates the selected API SLIs, focusing on availability.
3. SLI Calculation
Use load‑balancer (SLB) metrics for all public‑facing services.
Internal services are measured indirectly via public‑facing services.
API availability: per‑minute error count (HTTP 5XX), total requests, success rate, latency percentiles, throughput; aggregated daily.
Business availability: aggregate error count and success rate across L0 and L1 APIs, weighted by level.
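The level‑weighted aggregation might look like the following sketch. The per‑grade weights (L0 = 2, L1 = 1) are hypothetical, since the weighting scheme is not specified here:

```python
# Sketch: aggregating per-API counters into a business availability SLI,
# weighting APIs by grade. The weights below are hypothetical.
def business_availability(apis):
    """apis: list of (grade, total_requests, error_requests) tuples."""
    weights = {0: 2.0, 1: 1.0}  # hypothetical per-grade weights (L0, L1)
    weighted_total = weighted_good = 0.0
    for grade, total, errors in apis:
        w = weights[grade]
        weighted_total += w * total
        weighted_good += w * (total - errors)
    return weighted_good / weighted_total if weighted_total else 1.0

apis = [(0, 1_000_000, 100),    # an L0 API: 100 errors in 1M requests
        (1, 500_000, 5_000)]    # an L1 API: 5,000 errors in 500k requests
print(business_availability(apis))
```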
4. SLO Definition
Define SLOs only for availability and latency; throughput is shown in dashboards.
Default API SLOs based on grade (e.g., L0 API availability ≥ 99.99%, L0 API 99th‑percentile latency ≤ 200 ms).
Business‑level SLOs target annual availability ≥ 99.99%.
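Grade‑based defaults can be kept in a simple lookup table. Only the L0 values below come from the text; the L1–L3 entries are illustrative placeholders:

```python
# Sketch: default SLO targets keyed by grade. L0 values are from the
# article; L1-L3 values are illustrative placeholders, not real defaults.
DEFAULT_SLOS = {
    "L0": {"availability": 0.9999, "p99_latency_ms": 200},
    "L1": {"availability": 0.999,  "p99_latency_ms": 300},   # placeholder
    "L2": {"availability": 0.99,   "p99_latency_ms": 500},   # placeholder
    "L3": {"availability": 0.99,   "p99_latency_ms": 1000},  # placeholder
}

def default_slo(grade: str) -> dict:
    """Look up the default SLO for an API of the given grade."""
    return DEFAULT_SLOS[grade]

print(default_slo("L0"))
```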
5. Error Budget
Initially we did not focus on error‑budget construction; we only used SLO‑based alerts.
6. Record SLO and Error Budget
Provide a platform to support SLO definition, review, and recording.
7. Dashboards and Reports
API dashboards and reports.
05 Problems Encountered
1. Business Grading
Inconsistent business abstraction; product scope can be large or small.
L0 and core L1 grading is fairly accurate, but other business grades are uncertain.
2. Application Grading
Number of applications far exceeds number of businesses, making grading difficult.
Legacy applications lag behind business refactoring.
Grades for non‑core applications have low accuracy.
3. Interface Grading
Even more interfaces than applications; grading is time‑consuming and costly.
Interface grades lag behind business changes.
4. SLI Calculation
When upstream/downstream failures cause HTTP 200 responses, real errors are hidden in business error codes; SLB cannot parse the body.
This can lead to a half‑hour outage being reported as 100% availability.
Switching to application‑reported metrics (Prometheus) solves the issue, but when the app is down it cannot report metrics.
Changing business error codes to HTTP codes is costly and conflicts with micro‑service standards.
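To make these hidden failures visible, the availability SLI has to look inside the response rather than at the HTTP status alone. A sketch, assuming a hypothetical `{"code": ...}` response‑body convention:

```python
# Sketch: classifying requests as failed using both the HTTP status and
# a business error code carried in the body. The {"code": ...} body
# convention is hypothetical.
import json

def is_failed(http_status: int, body: str) -> bool:
    """A request fails on a 5xx, or on a non-zero business error code."""
    if http_status >= 500:
        return True
    try:
        return json.loads(body).get("code", 0) != 0
    except ValueError:
        return True  # an unparseable body also counts as a failure

responses = [
    (200, '{"code": 0, "data": {}}'),       # genuine success
    (200, '{"code": -500, "data": null}'),  # failure hidden behind HTTP 200
    (502, ''),                              # the only one the SLB would catch
]
failed = sum(is_failed(s, b) for s, b in responses)
print(1 - failed / len(responses))  # availability including hidden errors
```

An SLB‑only view of the same three responses would count a single failure; body‑aware classification counts two.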
5. Business SLI
Need to filter APIs that affect core business functions; other APIs should not impact business SLI.
API changes during product iteration can render previous SLI data irrelevant.
6. Summary of Problems
Grading model is idealistic and costly.
Business‑API SLI relationships and metadata updates are delayed, reducing data accuracy.
Single‑metric availability SLI cannot cover all failure scenarios.
Some departments only provide internal services, lacking SLI data.
SLI data is mainly used for reports and provides little other value.
We attempted to use business availability SLI as the annual availability report, but the calculation diverged significantly from fault‑time‑based availability, leading to a costly data‑compensation mechanism that required manual incident review and SLI correction. Ultimately, the obsession with perfect SLI accuracy caused the entire SLO system to stall.
06 Reflection
We asked ourselves where the problem lay and what the true value of an SLO is. An SLO's value does not come from mandating a number; its core value is enabling timely alerts when an SLI degrades. Without proper alerting, an SLO is merely a report.
Key reflections:
SLO is a decision‑making factor, not an absolute requirement.
The main value of an SLO is early detection of incidents that degrade the availability SLI.
An error‑budget policy should guide short‑term priorities after an incident exhausts a quarter's budget, but freezing product iteration for months is unrealistic.
SLI measurement should focus on business functions and applications, not on aggregated business SLI.
07 Conclusion
This article detailed our thought process, implementation steps, and the problems we encountered while building an SLO system without fully understanding its value. In the next article we will share the revised SLO construction approach based on our new insights.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.